Map the distribution of your sample by geolocating ip addresses or zip codes

Using the maps package to plot your geolocated sample


Yesterday I wanted to create a map of participants from a study on social media and partisan selective exposure that Sean Westwood and I ran recently, with participants from Amazon’s Mechanical Turk.  We recorded ip addresses for each Turker participant, so I used a geolocation service to lookup the latitude and longitude associated with each participant’s ip address, and then created a nice little map in R.  This can also be done if you have each participant’s zip+4 code, which is probably more accurate as well.

At any rate, we made a map that looks something like the picture above.

In this post I’ll illustrate how to make a map like this first using zip+4 codes and then using ip addresses.  We’ll use RCurl to send an html POST command to some website that will give us the relevant geolocation data, and then extract the relevant info by parsing JSON or html output (using regular expressions).

First we’ll pass zip+4 codes to the Yahoo! PlaceFinder Application Programming Interface (API) to return the state, latitude and longitude.  We’ll use their JSON output, because it’s easier to parse in R than the alternative (XML).

The first thing you’ll need to do is to get yourself a Yahoo! Application ID, so that you can use Yahoo’s API. This should only take about ten minutes. When you’ve got it, replace your appid where it says “[INSERT YOUR YAHOO ID HERE]” in the code below.

Here’s the R code:

###############################################################################
################## Example 1: Geolocating Zip +4 Codes ########################
###############################################################################

zips <- c("48198-6275", "75248-5000", "27604-1520", "53010-3212", "95453-9630",
		"10548-1241", "70582-6107", "84106-2494", "78613-3249", "18705-3906",
		"23108-0274", "23108-0274", "67361-1732", "28227-1454", "37027-5902",
		"75067-4208", "32505-2315", "80905-4449", "49008-1818", "08034-2616",
		"04347-1204", "94703-1708", "32822-4969", "44035-6205", "60192-1114",
		"45030-4933", "32210-3625", "80439-8782", "68770-7013", "42420-8832",
		"83843-2219", "33064-2555", "75067-6605", "22301-1266", "78757-1652",
		"85140-5232", "19067-6225", "06410-4210", "63109-3347", "89103-4803",
		"99337-5714", "37067-8123", "94582-4921", "65401-3065", "46614-1320",
		"29732-3810", "79412-1142", "23226-3006", "78217-1542", "10128-4923",
		"15145-1635", "48135-1093", "50621-0683", "32713-3803", "08251-1332",
		"14445-2024", "83210-0054", "27525-8790", "32609-8860", "53711-3548",
		"49601-2214", "15324-0010", "37604-3508", "85041-7739", "01906-1317",
		"10552-2605", "20120-6322", "19934-1263", "87124-2735", "22192-4415",
		"45344-9130", "08618-1464", "83646-3473", "80911-9007", "13057-2023",
		"30033-2016", "30039-6323", "20109-8120", "98043-2105", "34655-3129",
		"60465-1326", "33433-5746", "30707-1636", "98102-4432", "92037-7407",
		"95054-4144", "94703-1708", "94110-2712", "92562-5073", "31418-0341",
		"10009-1819", "73703-6736", "98554-0175", "65649-8267", "21704-7864",
		"15209-2420", "61550-2715", "92506-5317", "60611-3801", "48161-4064",
		"77365-3268", "50022-1388", "63841-1227", "36207-5817", "22121-0232",
		"73115-4446", "85340-7341", "44420-3415", "07663-4913", "10065-6716",
		"91606-2213", "19120-9237", "80015-2718", "97401-4448", "46637-5424",
		"29609-5425", "20815-6414", "45373-1731", "57106-6205", "71744-9486",
		"14222-1922", "28306-3901", "84074-2684", "28605-8292", "59405-8649")
 
###############################################################################
############################# Geolocate data ##################################
###############################################################################

# First load RCurl, which we'll use to send HTML POST commands to the Yahoo! API
library(RCurl)

# Next load RJSONIO, an R library that deals with JavaScript Object Notation (JSON) 
# Input/Output.  In my experience this library is a bit faster and more robust than RJSON.   
library(RJSONIO)

# create variables to store output:
state <- character(length(zips))
lat <- character(length(zips))
lon <- character(length(zips))
appid <- "&appid=[INSERT YOUR YAHOO ID HERE]"

# try a sample query
ipinfo <- fromJSON(getURL(paste("http://where.yahooapis.com/geocode?flags=J&country=USA&q=", zips[1], appid, sep="")))

# the good stuff is here:
ipinfo$ResultSet$Results[[1]]

for(i in 1:length(zips)){
	#i=1
	print(paste("i = ", i, "  zip = ", zips[i] ))
	
	# Read in geolocation data
	ipinfo <- fromJSON(getURL(paste("http://where.yahooapis.com/geocode?flags=J&country=USA&q=", zips[i], appid, sep="")))
	
	if(as.numeric(ipinfo$ResultSet[6])>0){
		# state
		try(suppressWarnings(state[i] <- ipinfo$ResultSet$Results[[1]][24])) 
		print(ipinfo$ResultSet$Results[[1]][24])

		# get lattitude
		try(suppressWarnings(lat[i] <- ipinfo$ResultSet$Results[[1]][2])) 
		print(ipinfo$ResultSet$Results[[1]][2])
		
		# get longitude
		try(suppressWarnings(lon[i] <- ipinfo$ResultSet$Results[[1]][3])) 
		print(ipinfo$ResultSet$Results[[1]][3])
		
	}
	
}

###############################################################################
####################### draw maps based on ip addresses #######################
###############################################################################

# We'll draw points for each observation of lattitude and longitude, and
# we'll shade in each state according to how many cases we have in each state

# first count the number of cases in each state:
mtstates <- table(state)
mtstates

# Convert state abbreviations to full state names: 
mtfullstatenames <- tolower(state.name[match(names(mtstates), state.abb)])
mtfullstatenames

mtfullstatenames[names(mtstates)=="DC"] <- "district of columbia"
names(mtstates) <- mtfullstatenames

# use "cut" to group the count data into several groups
mtstatesb <- cut(mtstates, breaks= c(-Inf, 0, 1, 4, 7, Inf), labels=c("0", "1", "2-4", "5-7", "8+")  )  
mtstatesb 

library(maps)
# generate a map object so that we can match names to the state level shape files
usmap <- map("state")


# Some of the state level shape files are broken down by sub-state region.  We
# just care about the political boundaries for the purposes of this project, 
# so we will not differentiate.  Thus, we need to make all names within a 
# state the same, which will mean some names are repeated.  
# Luckily the usmap$names are as follows: state:subregion
usmap$names

# so we can just split the strings based on the ":" character.  We'll use
# sapply( ..., "[[", 1) which extracts the first element when a list is returned:
mapstates <- sapply(strsplit(usmap$names, ":"), "[[", 1)
mapstates

# Now we need to generate a color for each item in mapstates so we can color in the map. 
# We first use "match()" to index our count of observations per state in mtstatesb by mapstates.  
# shapeFileN will contain the numeric range specified in mstatesb, but indexed by the 
# the names in mapstates, some of which will of course be repeated as discussed above. 

shapeFileN <- mtstatesb[match(mapstates,names(mtstates))]

# Now we've can generate a series of grey colors indexed by the count of observations per state,
# indexed so that the the observations match the shape files that come with the "state" object in
# the maps package:
cols <- rev(grey.colors(5))[shapeFileN]

# now let's draw the map:
png("demosamplemap.png", width=1000, height=700)

# draw the map object using the colors specified above
map("state", col= cols, fill=T)

# draw lattitude and longitude points for each observation:
points(lon, lat, cex = .3, col = rgb(0,0,0,.5))

# draw a legend
legend("bottomright", levels(mtstatesb), fill = c("white", rev(grey.colors(5))[2:5]), bty="n")

# put a title on it:
title("Geographic distribution of Zip+4 codes")
dev.off()

Note that for publication purposes, you’ll want to create a pdf instead of a png file.

OK, now for our second example, which will geolocate using ip addresses.  I found an excellent ip database website, it’s pretty current and has nicely formatted html: http://www.ip-db.com/.  You can lookup an ip address as follows: http://www.ip-db.com/%5Bipaddress%5D.  Take a look at the html for http://www.ip-db.com/64.90.182.55, which is one of the NIST Internet Time Servers, which I’ll be using to create example code.  If you view the html source, you’ll see that the relevant state is embedded in this line:

Region: <span style="font-family: arial; font-size: x-small;">New York

Also embedded is the lattitude and longitude, which we will need to create the map:

Lat/Long:
<span style="font-family: arial; font-size: x-small;"><a href="http://maps.google.com/maps?om=1&q=40.7089+-74.0012&z=10" target="_blank">40.7089 -74.0012</a>

Now we’ll just use RCurl to send a set of ip addresses to this website, then use Regular Expressions to extract the relevant data from the html that the website returns. Code below:

###############################################################################
################## Example 2: Geolocating IP Addresses ########################
###############################################################################

ips <- c("64.90.182.55", "96.47.67.105", "206.246.122.250", "129.6.15.28", "64.236.96.53", 
		"216.119.63.113", "64.250.177.145", "208.66.175.36", "173.14.55.9", "64.113.32.5", 
		"66.219.116.140", "24.56.178.140", "132.163.4.101", "132.163.4.102", "132.163.4.103",
		"192.43.244.18", "128.138.140.44", "128.138.188.172", "198.60.73.8", "64.250.229.100",
		"131.107.13.100", "207.200.81.113", "69.25.96.13", "216.171.124.36", "64.147.116.229")

# create variables to store output:
state <- character(length(ips))
lat <- character(length(ips))
lon <- character(length(ips))

# iterate over ip addresses:
for(i in 18:length(ips)){
	#i=5	
	print(paste("i = ", i, " ip: ", i))

	ipinfo <- getURL(paste("http://www.ip-db.com/", ips[i], sep=""))
	
	# use RegEx to extract key info:
	library(stringr)
	
	# Get state
	# We use the regex (\\w\\s?)+ where \\w = word character, \\s? = possible space character, 
	# () groups the pattern, and + means one or more.  
	# note that we must escape " characters by \" to include them in the pattern
	pattrn <- "Region:</b></font></td><td width=\"10\">&nbsp;</td><td><font face=\"arial\" size=\"2\">(\\w\\s?)+</td>"
	( results <- str_extract(ipinfo, pattrn))
	
	# Extract the actual text of the state and clean it up so we can actually use it: 
	state[i] <- str_extract(results, ">(\\w\\s?)+<")
	state[i] <- gsub(">", "", state[i])
	state[i] <- gsub("<", "", state[i])
	
	print(state[i])
	
	# Now get lattitude and longitude:
	# We can limit our pattern to target="_blank" because it's a unique pattern in the html
	# An example of our substring of interest follows: "target=\"_blank\">40.7089 -74.0012</a></td>"
	# So we use this pattern \\-?\\d\\d(\\.\\d+)? \\-?\\d\\d(\\.\\d+)? where \\d means digit,
	# means optional, () groups the pattern, and \\- is an escaped dash
	
	pattrn <- "target=\"_blank\">\\-?\\d+(\\.\\d+)? \\-?\\d+(\\.\\d+)?</a></td>"
	( results <- str_extract(ipinfo, pattrn))
		
	# Extract both lattitude an longitude and clean it up so we can save it as numeric in the data: 
	latlon <- str_extract_all(results, "\\-?\\d+(\\.\\d+)?")
	
	# the code above returns a list in case results has more than one set of characters in the character
	# vector, so we need to unlist it to make it into a normal vector:
	latlon <- unlist(latlon)
	
	# since there are two results, we return one for each for lat and lon:
	lat[i] <- latlon[1] 
	lon[i] <- latlon[2] 
	print(lat[i])
	print(lon[i])
	
	# Put a delay between entries so there's no danger of overwhelming the server
	Sys.sleep(sample(15:25,1)/10) 
}

###############################################################################
####################### draw maps based on ip addresses #######################
###############################################################################

# We'll draw points for each observation of lattitude and longitude, and
# we'll shade in each state according to how many cases we have in each state

# first count the number of cases in each state:
mtstates <- table(state)

# then match to the state.name object (which the maps object also uses)
mtfullstatenames <- tolower(state.name[match(names(mtstates), state.name)])

# extract the state names to a separate object:
names(mtstates) <- mtfullstatenames

# use "cut" to group the count data into several groups
mtstatesb <- cut(mtstates, breaks= c(-Inf, 0, 1, 2, 4, Inf), labels=c("0", "1", "2", "3-4", "4+")  )  

library(maps)
# generate a map object so that we can match names to the state level shape files
usmap <- map("state")


# Some of the state level shape files are broken down by sub-state region.  We
# just care about the political boundaries for the purposes of this project, 
# so we will not differentiate.  Thus, we need to make all names within a 
# state the same, which will mean some names are repeated.  
# Luckily the usmap$names are as follows: state:subregion
usmap$names

# so we can just split the strings based on the ":" character.  We'll use
# sapply( ..., "[[", 1) which extracts the first element when a list is returned:
mapstates <- sapply(strsplit(usmap$names, ":"), "[[", 1)
mapstates

# Now we need to generate a color for each item in mapstates so we can color in the map. 
# We first use "match()" to index our count of observations per state in mtstatesb by mapstates.  
# shapeFileN will contain the numeric range specified in mstatesb, but indexed by the 
# the names in mapstates, some of which will of course be repeated as discussed above. 

shapeFileN <- mtstatesb[match(mapstates,names(mtstates))]

# Now we've can generate a series of grey colors indexed by the count of observations per state,
# indexed so that the the observations match the shape files that come with the "state" object in
# the maps package:
cols <- rev(grey.colors(5))[shapeFileN]

# now let's draw the map:
png("demosamplemap.png", width=1000, height=700)

# draw the map object using the colors specified above
map("state", col= cols, fill=T)

# draw lattitude and longitude points for each observation:
points(lon, lat, cex = .3, col = rgb(0,0,0,.5))

# draw a legend
legend("bottomright", levels(mtstatesb), fill = c("white", rev(grey.colors(5))[2:5]), bty="n")

# put a title on it:
title("Geographic distribution of NIST Internet Time Servers")
dev.off()
About these ads

About Solomon

Political Scientist, Facebook Data Science
This entry was posted in R. Bookmark the permalink.

2 Responses to Map the distribution of your sample by geolocating ip addresses or zip codes

  1. Gerald says:

    I have two questions:
    1) how many levels of administrative boundaries does the maps package in library(maps) have? For example, are there county shapefiles and are there zipcode shapefiles?
    2) can the map be drawn as an SVG file rather than a PNG file?

    Thanks

  2. Solomon says:

    Hi Gerald, (1) you’ll probably want to use some other package besides maps for that. Take a look at http://www.r-bloggers.com/plot-maps-like-a-boss/ for starters. (2) You can draw the map via pdf() which will produce a vector based file rather than a bitmap. You could also use the devSVG package, see http://svitsrv25.epfl.ch/R-doc/library/RSvgDevice/html/RSvgDevice.html for details.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s