Visualization series: Insight from Cleveland and Tufte on plotting numeric data by groups

After my post on making dotplots with concise code using plyr and ggplot, I got an email from my dad who practices immigration law and runs a website with a variety of immigration resources and tools.  He pointed out that the post was written for folks who already know that they want to make dot plots, and who already know about bootstrapped standard errors.  That’s not many people.

In an attempt to appeal to a broader audience, I’m starting a series in which I’ll outline the key principles I use when developing a visualization.  In this post, I’ll articulate these principles, which combine some of Tufte’s aesthetic guidelines with Cleveland’s scientific approach to visualization–an approach grounded in the psychological processes involved in making sense of visualizations and rigorously tested via randomized controlled experiments.  Based on these principles, I’ll argue that dotplots and scatterplots are better than other types of plots (especially pie charts) in most situations.  In later posts, I’ll demonstrate another innovation whose widespread use I’ll credit to Cleveland and Tufte: the use of multiple panels (aka small multiples, trellis graphics, facets, generalized draftsman’s displays, multivar charts) to clearly convey the same information embedded in more complex and difficult-to-read visualizations, including multiple line plots and mosaic plots. In future posts I’ll also emphasize why it is important to provide some indication of the noise present in the underlying data using error bars or bands.  Along the way, I’ll put you to the test–I’ll present some visualizations of the same data using different visualization techniques and ask you to try to get as much information as you can in 2 seconds from each type of visualization.

A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below).  Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points.  Finally, visualizations distill key information from massive amounts of data, which can aid your own exploratory analysis no matter how large the data set.

Yet most visualizations are flawed, drawn using elements that make it unnecessarily difficult for the human visual system to make sense of things.  I see a lot of these visualizations when attending research presentations, when screening incoming draft manuscripts as the assistant editor for Political Communication, and when consuming media info-graphics (CNN is especially bad; have a look at this monstrosity).  Kevin Fox has an especially compelling visual that speaks to this here. A big part of the problem is that Microsoft makes it easy to draw flashy but ultimately confusing visualizations in Excel.  If you are too busy to read this post in full, follow this short list of guidelines and you’ll be on your way to producing elegant visualizations that impose a minimal cognitive burden on your audience:

  1. Never represent something in 2 or worse yet 3 dimensions if it can be represented in one—NEVER use pie charts, 3-D pie charts, stacked bar charts, or 3-D bar charts.
  2. Remove as much chart junk as possible–unnecessary gridlines, shading, borders, etc.
  3. Give your audience a sense of the noise present in your data–draw error bars or confidence bands if you are plotting estimates.
  4. If you want to plot multiple types of groups on a single outcome (the visual analog of cross-tabulations/marginals), use multi-paneled plots. These can also help if overplotting makes a single panel look too cluttered.
  5. Avoid mosaic plots. Instead use paneled histograms.
  6. Ditch the legend if you can (you almost always can).

The rest of the content in this series emphasizes why it makes sense to follow these guidelines. In this post I’ll look at the first point in detail and touch on the sixth. These two guidelines are most relevant when you want to look at a quantitative variable (e.g., earnings, vote-share, temperature, etc.) across different qualitative groupings (e.g., industry segment, candidate, party, racial group, season, etc.).  This is one of the most common visualization tasks in business, media, and social science, and for this task people often use pie charts and/or bar charts, and occasionally dot plots.

The science of graphical perception

When most people think about visualization, they think first of Edward Tufte.  Tufte emphasizes integrity to the data, showing relationships between phenomena, and above all else aesthetic minimalism.  I appreciate his ruthless crusade against chart junk and pie charts (nice quote from Data without Borders). We share an affinity for multipanel plotting approaches, which he calls “small multiples” (thanks to Rebecca Weiss for pointing this out), though I think people give Tufte too much credit for their invention–both juiceanalytics and infovis-wiki write that Cleveland introduced the concept/principle. However, both Cleveland and Tufte published books in 1983 discussing the use of multipanel displays; David Smith over at Revolutions writes that “the “small-multiples” principle of data visualization [was] pioneered by Cleveland and popularized in Tufte’s first book”; and the earliest reference to a work containing multipanel displays I could find was published *long* before Tufte’s 1983 work–Seder, Leonard (1950), “Diagnosis with Diagrams—Part I”, Industrial Quality Control (New York, New York: American Society for Quality Control) 7 (1): 11–19.

I’m less sure about Tufte’s advice to always show axes starting at zero, which can make comparison between two groups difficult, and to “show causality,” which can end up misleading your readers.  Of course, the visualizations on display in the glossy pages of Tufte’s books are beautiful–they belong  in a museum.  But while his books are full of general advice that we should all keep in mind when creating plots, he does not put forth a theory of what works and what doesn’t when trying to visualize data.

Cleveland (with Robert McGill) develops such a theory and subjects it to rigorous scientific testing. In my last post I linked to one of Cleveland’s studies showing that dots (or bars) aligned on the same scale are indeed the best visualization to convey a series of numerical estimates.  In this work, Cleveland examines how accurately our visual system can process visual elements or “perceptual units” representing underlying data.  These elements include markers aligned on the same scale (e.g., dot plots, scatterplots, ordinary bar charts), the length of lines that are not aligned on the same scale (e.g., stacked bar plots), area (pie charts and mosaic plots), angles (also pie charts), shading/color, volume, curvature, and direction.

He runs two experiments: the first compares judgements about relative position (grouped bar charts) to judgements based only on length (stacked bar charts); the second compares judgements about relative position (ordinary bar charts) to judgements about angles/area (pie charts).  Here are the materials he uses, courtesy of the Stanford Computer Graphics Lab:

The results are resoundingly clear–judgements about position relative to a common baseline are dramatically more accurate than judgements about angles, area, or length (with no baseline).  Hence, he suggests that we replace pie charts with bar charts or dot plots, and that we replace stacked bar charts with grouped bar charts.

A striking and often overlooked finding in this work is the fact that the group of participants without technical training, “mostly ordinary housewives” as Cleveland describes them, performed just as well as the group of mostly men with substantial technical training and experience.   This finding provides evidence for something that I’ve long suspected: that visualizations make it easier for people lacking quantitative experience to understand your results, serving to level the playing field.  If you want your findings to be broadly accessible, it’s probably better to present a visualization rather than a bunch of numbers.  It also suggests that if someone is having trouble interpreting your visualizations, it’s probably your fault.

Dotplots versus pie charts and stacked barplots

Now let’s put this to the test.  Take a look at each visualization below for two seconds, looking for the percent of the vote that Mitt Romney, Ron Paul, and Jon Huntsman got.

Which is easiest to read? Which conveys information most accurately? Let’s first take a look at the most critical information–the order in which the candidates placed.  In all plots, the candidates are arrayed from highest to lowest vote share, and it’s easy to see that Mitt won.  But once we start looking at who came in second, third, and so on, differences emerge.  It’s slightly harder to process order in the pie chart because your eye has to travel around the plot rather than up and down a straight line.  In the stacked bar chart, we need to look up which color corresponds to which candidate in the legend (which Tufte told us to ditch), adding a layer of cognitive processing.

Second, which conveys estimates most accurately? The dot plot is the clear winner here.  We can quickly see that Romney got about 37%, Paul got about 24%, and Huntsman got about 16%, just by looking at dots relative to the axis.  When we look at the pie chart, it’s really tough to estimate the exact percent each candidate got.  Same with the stacked bar chart. We could add numbers to the pie and bar charts, which would even things out to some extent, but then why not just display a table with exact percents?

One argument I used to hear all the time when I worked in industry is that pie charts “convey a sense of proportion.”  Well, sure, I guess I can kind of guesstimate that Ron Paul’s vote share is about 1/4.  What about Jon Huntsman? Hmm, it looks like about 15 percent, which is 3/20.  But wait, why do I want to convert things into fractions anyway? I don’t think in terms of fractions; I think in terms of percents.  And if I really care about proportion, I suppose I could extend the axis from 0 to 100.

What if I want to plot results for the top 15 candidates, not just the top 6?  Here’s what happens:

No contest: the pie chart fails completely.  We’d need to add a legend with colors for each candidate, which adds another layer of cognitive processing–we’d need to look up each color in the legend as we go.  And even after adding the legend, you wouldn’t be able to distinguish the lower-performing candidates from, say, write-in votes, because the pie slices would be too small.  The stacked bar chart fails for the same reasons, so I’ve excluded it in the interest of brevity.  Note that we don’t need to add colors to the dotplot to convey the same information, which saves an extra plotting element that we can use to represent something else (say, each candidate’s campaign funds or total assets).  And, on top of it all, the dot plot takes up less screen/page real estate!

Why do I use dot plots instead of ordinary bar charts? A nice visualization guide from perceptualedge.com points out that we often want to visualize differences between groups that fall within a narrow range (they use an example in which monthly expenses vary from $4,250 to $5,500). But the length of a bar is supposed to facilitate accurate comparisons between values, so when you start a bar plot at $4,250, the relative lengths of the bars dramatically exaggerate the actual differences. Dot plots do not have this problem because dots encode values using position alone, so the reader references the axis to interpret each value.
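To make the contrast concrete, here’s a minimal sketch with made-up monthly expense numbers in the spirit of the perceptualedge example–a bar chart zoomed so the axis starts at $4,250, followed by a dot plot of the same values (run each ggplot call separately to see each plot):

library(ggplot2)

# hypothetical monthly expenses confined to a narrow range (made-up numbers)
expenses <- data.frame(month = factor(month.abb[1:6], levels = month.abb[1:6]),
                       dollars = c(4400, 5100, 4700, 5500, 4250, 4900))

# bars with the axis zoomed to start at $4,250: the visible bar lengths
# exaggerate the true differences
ggplot(expenses, aes(x = month, y = dollars)) +
  geom_bar(stat = "identity") +
  coord_cartesian(ylim = c(4250, 5500)) +
  theme_bw() + ggtitle("Bars on a truncated axis")

# dots for the same data: only position encodes the value, so the reader
# consults the axis rather than comparing lengths
ggplot(expenses, aes(x = dollars, y = month)) +
  geom_point() +
  theme_bw() + xlab("Monthly expenses ($)") + ylab("Month") +
  ggtitle("Dots on the same truncated scale")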

A related point is that bars are often used to convey counts–we use them in histograms to represent frequencies and in bar charts to track, say, dollars earned or raised.  In fact, a team of doctors I work with at the med school recently sent a manuscript to Radiology containing a bar chart plotting mean values between groups; they got back the following comment from the statistical reviewer: “the y-axis is quantitative but the data are represented using bars as if the data were counts.”  People often use bar plots to convey estimates of means (and I’ve certainly done this), which can exaggerate differences in means, and hence effect sizes, if you do not plot the bars from zero.

In addition, dot plots have aesthetic advantages.  They convey the numerical estimate in question with a single one-dimensional point, rather than a two-dimensional bar.  There’s simply less that the eye needs to process.  Accordingly, if a pattern across qualitative groupings exists, it’s often easier to see with a dot plot.  For example, below I plot the average user ratings for each article to which Sean Westwood and I exposed subjects in a news reading experiment.  The pattern that emerges is an “S” curve in which one or two stories dominate the ratings, most are sort of average, and a few are uniformly terrible. Note that you’d probably want to use something like this more for yourself than to communicate your results to others, as it might overload your audience with too much information–you’d do better to select a subset of these articles or remove some of the ones in the middle (thanks to Yph Lelkes for making this point).

One question remains: if pie charts are so bad, why are they so common? Perhaps we like them because we find them comforting, just as we find pies and pizza comforting. If so, we’d expect pie charts to be less common in places like Japan and China, where people grow up eating different food.  Consider info-graphics in newspapers: I haven’t yet done a systematic content analysis, but I was unable to find a single pie chart in Japan’s Yomiuri Shimbun or Asahi Shimbun, nor in China’s Beijing Daily or Sing Tao Daily.  I did see plenty of maps, however, which I suppose one could argue are reminiscent of noodles.

Implementation

The most efficient way to produce solid visualizations with multiple panels, proper standard error estimates, and dot plots is probably R with the ggplot2 package.  If you do not have time to learn R and remain tied to MS-Excel, stick to ordinary bar charts to visualize quantitative variables across multiple groups (not recommended).

Otherwise, if you don’t already use it, download R and a decent editor like Rstudio.  Then get started with ggplot2 and dot plots by running the following code chunk which will replicate the election figure above:


pres <- read.csv("https://dl.dropboxusercontent.com/u/25710348/Blogposts/data/primaryres.csv", as.is=T)

# sort data in order of percent of vote:
pres <- pres[order(pres$Percentage, decreasing=T), ]

# only show top 15 candidates:
pres <- pres[1:15,]

# convert the proportion of the vote to a percentage:
pres$Percentage <- pres$Percentage*100

# reorder the Candidate factor by percentage for plotting purposes:
pres$Candidate <- reorder(pres$Candidate, pres$Percentage)

# To install ggplot2, run the following line after deleting the #
#install.packages("ggplot2")

library(ggplot2)
ggplot(pres, aes(x = Percentage, y = factor(Candidate) )) +
  geom_point() +
  theme_bw() +  xlab("Percent of Vote") + ylab("Candidate") +
  ggtitle("New Hampshire Primary 2012")

After loading our data and running a few preliminary data processing operations, we pass ggplot our data set, “pres,” and then specify the aesthetic elements we want to use–in this case, that x maps to our “Percentage” variable and y to our “Candidate” variable. We tell ggplot to display a point for each x-y pair and to use the black and white theme. Then we tell it what to label the x and y axes, and give it a title.
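As an aside, if you do want the sense of proportion mentioned earlier, a small tweak (not part of the original figure) stretches the axis to the full 0–100 percent range:

# same dot plot, but with the x axis running from 0 to 100 percent
ggplot(pres, aes(x = Percentage, y = Candidate)) +
  geom_point() +
  xlim(0, 100) +
  theme_bw() + xlab("Percent of Vote") + ylab("Candidate") +
  ggtitle("New Hampshire Primary 2012")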

We can also reproduce the article ratings by story plot above using ggplot2 (even though I originally produced the plot using the lattice package).

# To install ggplot2, run the following line after deleting the #
#install.packages("ggplot2")

library(ggplot2)

load(file("https://dl.dropboxusercontent.com/u/25710348/Blogposts/data/db.Rda"))

# if you haven't installed dplyr, delete the # and run this line:
# install.packages("dplyr")

library(dplyr)

table(db$story)

# first we use dplyr to calculate the mean rating and SE for each story
ratingdat <- db %>% group_by(story) %>%
  summarise(M = mean(rating, na.rm=T),
            SE = sd(rating, na.rm=T)/sqrt(length(na.omit(rating))),
            N = length(na.omit(rating)))

# make story into an ordered factor, ordering by mean rating:
ratingdat$story <- factor(ratingdat$story)
ratingdat$story <- reorder(ratingdat$story, ratingdat$M)

# take a look at our handiwork:
ratingdat

ggplot(ratingdat, aes(x = M, xmin = M-SE, xmax = M+SE, y = story )) +
  geom_point() + geom_segment( aes(x = M-SE, xend = M+SE,
                                   y = story, yend=story)) +
  theme_bw() + xlab("Mean rating") + ylab("Story") +
  ggtitle("Rating article by Story, with SE")

# Now save
ggsave(file="plots/dotplot-story-rating.pdf", height=14, width=8.5)


Putting it all together: concise code to make dotplots with weighted bootstrapped standard errors

Dotplots with plyr, boot, and ggplot2

I analyze a lot of experiments and there are many times when I want to quickly look at means and standard errors for each cell (experimental condition), or the same for each cell and individual-level attribute level (e.g., Democrat, Independent, Republican).  Many of these experiments are embedded in national or state-level surveys, for which each respondent is weighted so that the sample means for gender, age, and political affiliation approximate those of a probability sample. But if you use weights, it’s difficult to get unbiased estimates of the standard error of that weighted mean.

In this post, I’ll walk through how to use the powerful plyr package to summarize your weighted data, estimating weighted means and standard errors for each experimental cell using the bootstrap, and then generate dotplots to visualize the result using the ggplot2 package. Note that these are both Hadley Wickham’s packages; he’s also responsible for the immensely useful packages reshape and stringr, which I hope to write about in future posts.

First, I’ll describe the data we’re going to use. Just before the 2010 elections, my lab hired YouGov/Polimetrix to survey Illinois voters about a series of candidates for statewide office, most notably those for Secretary of State.  The major party candidates included the Black Democratic incumbent, Jesse White, and a Hispanic Republican challenger, Robert Enriquez. We exposed respondents to different images of Jesse White, according to three experimental conditions: light, dark, and control (no picture). Our dependent measures included a question about how each respondent intended to vote and a feeling thermometer measuring affect toward each candidate. I use the net thermometer rating (“netther”), the thermometer rating for White minus the thermometer rating for Enriquez. For more on this type of measure, here’s a link to a study on feeling thermometers and their precision compared to questionnaire measures with fewer categories.

Even before we have to deal with the issue of weighted standard errors, it would be nice to find an elegant way to extract weighted statistics on subsets of our data corresponding to our experimental cells (or other qualitative groupings of our data, such as respondent party or ethnicity). It’s rather clunky to do this with other packages/functions, such as base::by or doBy::summaryBy, because it is hard to pass more than one variable–say, the values and the weights needed by “weighted.mean()”–to the function being applied to each subset of your data (see this discussion for details; thanks to David Freedman for the pointer to plyr::ddply). The survey::svymean function is another potential alternative, but it is nowhere near as flexible as ddply.

We’ll use plyr::ddply, which handles this problem quite well. In the code below, we use “ddply(),” which takes a data frame for input and returns a data frame for output. The “.data” parameter specifies the data frame and the “.variables” parameter specifies the qualitative variable used to subset the data. We use “summarise()” as the function, which allows us to specify not only the function we want to pass over the subsets of our data (experimental cells), but also to tell ddply which variables in our data subset to pass to the function. For example, below we run the function “mean()” on the variable “netther” for each treatment group.

install.packages("plyr")
library("plyr")

# load the Illinois survey data
load(file("dl.dropboxusercontent.com/u/25710348/Blogposts/data/IL2010.Rda"))

# look at ddply:
ddply(.data = IL2010, .variables= .(treat), .fun = summarise,
		mean = mean(x=netther, na.rm=T),
		se = sd(netther, na.rm=T)/sqrt(length(netther)) )
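
For comparison, here’s a sketch of the clunkier base-R route alluded to above: split() hands each sub-data-frame to a function, so both the values and the weights reach weighted.mean(), but you only get back a single statistic as a named vector–once you want a data frame with several summary columns per group, summarise() within ddply is much more convenient.

# base-R sketch: weighted mean of netther by treatment group
sapply(split(IL2010, IL2010$treat),
       function(d) weighted.mean(d$netther, w = d$weight, na.rm = TRUE))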

Now that we have some nice machinery to quickly summarize weighted data in hand, let’s get into bootstrapping. The standard error of a weighted mean has no straightforward closed-form solution, so bootstrapping is not a bad way to go. Bootstrapping consists of computing a statistic over and over by resampling the data with replacement. For more, see the UCLA stats webpage on bootstrapping in R.
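Before turning to the boot package, here’s a toy sketch of that idea on simulated data (made-up numbers, plain unweighted mean) just to show the resampling mechanics:

# toy bootstrap: resample the data with replacement many times, recomputing
# the mean each time, and use the spread of those estimates as the SE
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)
boot.means <- replicate(1000, mean(sample(x, replace = TRUE)))

sd(boot.means)          # bootstrap estimate of the standard error of the mean
sd(x)/sqrt(length(x))   # the usual analytic standard error, for comparison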

I’ll use the boot package to implement bootstrapping, which requires us to specify a bootstrapping function that returns a statistic.

install.packages("boot")
library("boot")

sample.wtd.mean <- function(x, w, d) {
	return(weighted.mean(x = x[d], w = w[d], na.rm=T ))
}

Why do we need this function? It allows the boot function to resample our data many times (with replacement), each time computing the statistic we want to estimate, as mentioned above. Specifying the function this way makes the process quite elegant because it takes advantage of R’s indexing capabilities: boot passes the function an index (d) of the items from our original data to include in each resample. To illustrate this, run the following code in R:

playdata <- runif(20, 1, 10)
playdata
d <- sample(20, replace = TRUE)
d
playdata[d]

The code “playdata[d]” returns the elements of playdata indexed by d. This is exactly how the boot function works. It passes our “sample.wtd.mean()” function a fresh index of items (d), drawn by sampling with replacement, over and over again–taking advantage of the fast C code R uses for indexing rather than slow R for-loops.
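To connect this back to our statistic function, here’s what a single bootstrap replicate looks like by hand (assuming the IL2010 data and sample.wtd.mean() defined above are loaded); boot simply repeats this R times and collects the results:

# one bootstrap replicate by hand: draw a resample index with replacement,
# then hand it to our statistic function just as boot does internally
d <- sample(nrow(IL2010), replace = TRUE)
sample.wtd.mean(x = IL2010$netther, w = IL2010$weight, d = d)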

When boot is done, we take the mean or median of these statistics as our estimate, which is a loose definition of the bootstrap. We also use the standard deviation of these estimates as the estimate of the standard error of the statistic. Here’s what happens when we pass our function over our data:

b <- boot( IL2010$netther, statistic = sample.wtd.mean, R=1000, w=IL2010$weight)
b
sd(b$t)

[UPDATE 26 JAN 2012: previously this was coded incorrectly.  The “weights” parameter has been changed to “w.” Thanks to Gaurav Sood and Yph Lelkes for pointing this out!]

Now let’s put it all together using “plyr” and “ggplot2.” I use dotplots because they convey numbers more accurately than other types of plots, such as bar plots or (worst of all) pie charts. William Cleveland has conducted research showing that dots aligned on the same scale are indeed the best visualization to convey a series of numerical estimates (Correction, the actual study is here). His definition of “best” is based on the accuracy with which people interpret various graphical elements, or as he calls them, “perceptual units.”

We’ll first just visualize results by party:


install.packages("ggplot2")
library("ggplot2")

# Create nice party ID variable:
IL2010$pid[IL2010$pid==8] <- NA
IL2010$pidcut <- cut(IL2010$pid, c(-Inf, 1, 3, 4, 6, Inf) , 
		c("Strong Democrat", "Democrat", "Independent", "Republican", "Strong Republican"))
# clean up treatment variable
IL2010$treat <- factor(IL2010$treat, labels = c("Dark", "Light", "Control" ))
# Use plyr::ddply to produce estimates based on PID
pid.therm <- ddply(.data = IL2010, .variables= .(pidcut), .fun = summarise,		
		mean = weighted.mean(x=netther, w=weight, na.rm=T),
		se = sd(boot(netther, sample.wtd.mean, R = 1000, w = weight)$t))
# plot it
ggplot(na.omit(pid.therm), aes(x = mean, xmin = mean-se, xmax = mean+se, y = factor(pidcut))) +
		geom_point() + geom_segment( aes(x = mean-se, xend = mean+se,
						y = factor(pidcut), yend=factor(pidcut))) +
		theme_bw() + theme(axis.title.x = element_text(size = 12, vjust = .25)) +
		xlab("Mean thermometer rating") + ylab("Party identification") +
		ggtitle("Thermometer ratings for Jesse White by party")

That’s a pretty big ggplot command, so let’s discuss each element. The first is the data to plot, which is the object produced via ddply. The second specifies the aesthetic elements of the plot, including the x and y values and the endpoints of the error bars (“xmin = mean-se, xmax = mean+se”). Next, we add points to the plot by invoking “geom_point().” If we only wanted points, we could stop here. We plot lines representing standard errors with “geom_segment(),” specifying x and xend. Next, we specify that we want to use theme_bw(), which I find cleaner than the default parameters. Then we make some minor adjustments to the axis title via theme(). Lastly, we set the x and y labels and the title.

Well, it’s pretty clear from our plot that there’s a lot of action on partisan ID. Let’s look at how each group is affected by the treatment. To do so, we’ll produce the plot at the top of this article, putting each partisan group in a panel via “facet_wrap()” and plotting each treatment group. First, we’ll generate the summary data using plyr; all we need to do is add another variable (“treat”) to the grouping.

# PID x Treatment
trtbypid.therm <- ddply(.data = IL2010, .variables= .(treat, pidcut), .fun = summarise,		
		mean = weighted.mean(x=netther, w=weight, na.rm=T),
		se = sd(boot(netther, sample.wtd.mean, R = 1000, w = weight)$t))

ggplot(na.omit(trtbypid.therm), aes(x = mean, xmin = mean-se, xmax = mean+se, y = factor(treat))) + 
		geom_point() + geom_segment( aes(x = mean-se, xend = mean+se, 
						y = factor(treat), yend=factor(treat))) +
		facet_wrap(~pidcut, ncol=1) +
		theme_bw() + theme(axis.title.x = element_text(size = 12, vjust = .25)) +
		xlab("Mean thermometer rating") + ylab("Treatment") +
		ggtitle("Thermometer ratings for Jesse White by treatment condition and PID")

It looks like Republicans generally don’t like the dark image, while Democrats do. But what if we want to look at the way that racial attitudes interact with the treatments? We measured implicit racial attitudes via the IAT (Implicit Association Test). Let’s take a look.

# Fix up the data to remove the "Control" condition and all "na's"
ilplot <- IL2010[-which(is.na(IL2010$pidcut)),]
ilplot$treat[ilplot$treat=="Control"] <- NA
ilplot$treat <- ilplot$treat[drop=T]
ilplot <- ilplot[-which(is.na(ilplot$treat)),]
ilplot$dscore[which(ilplot$dscore< -1)] <- NA

# Plot it
ggplot(ilplot, aes(x = dscore, y = netther, colour=treat)) +
		geom_point() + geom_smooth(method = "loess", size = 1.5) +
		facet_wrap(~pidcut, nrow=1) +
		theme_bw() + theme(axis.title.x = element_text(size = 12, vjust = .25)) +
		xlab("IAT d-score (- pro-black, + anti-black)") + ylab("Thermometer rating") +
		ggtitle("Thermometer ratings for Jesse White by treatment condition and IAT")

Visualization of IAT x skin complexion manipulation (conditional on party ID) via ggplot2

With this plot, we don’t really need to subset the data, because we are using all of the points and letting the loess smoother draw lines representing the conditional means across every value of IAT score. The plot shows what seems to be an interaction between IAT score and treatment, such that those with high (anti-black) IAT scores generally favor the lighter-skinned image.


Map the distribution of your sample by geolocating ip addresses or zip codes

Using the maps package to plot your geolocated sample


Yesterday I wanted to create a map of participants from a study on social media and partisan selective exposure that Sean Westwood and I ran recently with participants from Amazon’s Mechanical Turk.  We recorded ip addresses for each Turker participant, so I used a geolocation service to look up the latitude and longitude associated with each participant’s ip address, and then created a nice little map in R.  This can also be done if you have each participant’s zip+4 code, which is probably more accurate as well.

At any rate, we made a map that looks something like the picture above.

In this post I’ll illustrate how to make a map like this, first using zip+4 codes and then using ip addresses.  We’ll use RCurl to send an HTTP request to a service that returns the relevant geolocation data, and then extract the relevant info by parsing the JSON or html output (using regular expressions).

First we’ll pass zip+4 codes to the Yahoo! PlaceFinder Application Programming Interface (API) to return the state, latitude and longitude.  We’ll use their JSON output, because it’s easier to parse in R than the alternative (XML).

The first thing you’ll need to do is to get yourself a Yahoo! Application ID, so that you can use Yahoo’s API. This should only take about ten minutes. When you’ve got it, replace your appid where it says “[INSERT YOUR YAHOO ID HERE]” in the code below.

Here’s the R code:

###############################################################################
################## Example 1: Geolocating Zip +4 Codes ########################
###############################################################################

zips <- c("48198-6275", "75248-5000", "27604-1520", "53010-3212", "95453-9630",
		"10548-1241", "70582-6107", "84106-2494", "78613-3249", "18705-3906",
		"23108-0274", "23108-0274", "67361-1732", "28227-1454", "37027-5902",
		"75067-4208", "32505-2315", "80905-4449", "49008-1818", "08034-2616",
		"04347-1204", "94703-1708", "32822-4969", "44035-6205", "60192-1114",
		"45030-4933", "32210-3625", "80439-8782", "68770-7013", "42420-8832",
		"83843-2219", "33064-2555", "75067-6605", "22301-1266", "78757-1652",
		"85140-5232", "19067-6225", "06410-4210", "63109-3347", "89103-4803",
		"99337-5714", "37067-8123", "94582-4921", "65401-3065", "46614-1320",
		"29732-3810", "79412-1142", "23226-3006", "78217-1542", "10128-4923",
		"15145-1635", "48135-1093", "50621-0683", "32713-3803", "08251-1332",
		"14445-2024", "83210-0054", "27525-8790", "32609-8860", "53711-3548",
		"49601-2214", "15324-0010", "37604-3508", "85041-7739", "01906-1317",
		"10552-2605", "20120-6322", "19934-1263", "87124-2735", "22192-4415",
		"45344-9130", "08618-1464", "83646-3473", "80911-9007", "13057-2023",
		"30033-2016", "30039-6323", "20109-8120", "98043-2105", "34655-3129",
		"60465-1326", "33433-5746", "30707-1636", "98102-4432", "92037-7407",
		"95054-4144", "94703-1708", "94110-2712", "92562-5073", "31418-0341",
		"10009-1819", "73703-6736", "98554-0175", "65649-8267", "21704-7864",
		"15209-2420", "61550-2715", "92506-5317", "60611-3801", "48161-4064",
		"77365-3268", "50022-1388", "63841-1227", "36207-5817", "22121-0232",
		"73115-4446", "85340-7341", "44420-3415", "07663-4913", "10065-6716",
		"91606-2213", "19120-9237", "80015-2718", "97401-4448", "46637-5424",
		"29609-5425", "20815-6414", "45373-1731", "57106-6205", "71744-9486",
		"14222-1922", "28306-3901", "84074-2684", "28605-8292", "59405-8649")
 
###############################################################################
############################# Geolocate data ##################################
###############################################################################

# First load RCurl, which we'll use to send HTML POST commands to the Yahoo! API
library(RCurl)

# Next load RJSONIO, an R library that deals with JavaScript Object Notation (JSON) 
# Input/Output.  In my experience this library is a bit faster and more robust than RJSON.   
library(RJSONIO)

# create variables to store output:
state <- character(length(zips))
lat <- character(length(zips))
lon <- character(length(zips))
appid <- "&appid=[INSERT YOUR YAHOO ID HERE]"

# try a sample query
ipinfo <- fromJSON(getURL(paste("http://where.yahooapis.com/geocode?flags=J&country=USA&q=", zips[1], appid, sep="")))

# the good stuff is here:
ipinfo$ResultSet$Results[[1]]

for(i in 1:length(zips)){
	#i=1
	print(paste("i = ", i, "  zip = ", zips[i] ))
	
	# Read in geolocation data
	ipinfo <- fromJSON(getURL(paste("http://where.yahooapis.com/geocode?flags=J&country=USA&q=", zips[i], appid, sep="")))
	
	if(as.numeric(ipinfo$ResultSet[6])>0){
		# state
		try(suppressWarnings(state[i] <- ipinfo$ResultSet$Results[[1]][24])) 
		print(ipinfo$ResultSet$Results[[1]][24])

		# get latitude
		try(suppressWarnings(lat[i] <- ipinfo$ResultSet$Results[[1]][2])) 
		print(ipinfo$ResultSet$Results[[1]][2])
		
		# get longitude
		try(suppressWarnings(lon[i] <- ipinfo$ResultSet$Results[[1]][3])) 
		print(ipinfo$ResultSet$Results[[1]][3])
		
	}
	
}

###############################################################################
####################### draw maps based on zip+4 codes ########################
###############################################################################

# We'll draw points for each observation of latitude and longitude, and
# we'll shade in each state according to how many cases we have in each state

# first count the number of cases in each state:
mtstates <- table(state)
mtstates

# Convert state abbreviations to full state names: 
mtfullstatenames <- tolower(state.name[match(names(mtstates), state.abb)])
mtfullstatenames

mtfullstatenames[names(mtstates)=="DC"] <- "district of columbia"
names(mtstates) <- mtfullstatenames

# use "cut" to group the count data into several groups
mtstatesb <- cut(mtstates, breaks= c(-Inf, 0, 1, 4, 7, Inf), labels=c("0", "1", "2-4", "5-7", "8+")  )  
mtstatesb 

library(maps)
# generate a map object so that we can match names to the state level shape files
usmap <- map("state")


# Some of the state level shape files are broken down by sub-state region.  We
# just care about the political boundaries for the purposes of this project, 
# so we will not differentiate.  Thus, we need to make all names within a 
# state the same, which will mean some names are repeated.  
# Luckily the usmap$names are as follows: state:subregion
usmap$names

# so we can just split the strings based on the ":" character.  We'll use
# sapply( ..., "[[", 1) which extracts the first element when a list is returned:
mapstates <- sapply(strsplit(usmap$names, ":"), "[[", 1)
mapstates

# Now we need to generate a color for each item in mapstates so we can color in the map. 
# We first use "match()" to index our count of observations per state in mtstatesb by mapstates.  
# shapeFileN will contain the binned counts specified in mtstatesb, but indexed by
# the names in mapstates, some of which will of course be repeated as discussed above. 

shapeFileN <- mtstatesb[match(mapstates,names(mtstates))]

# Now we can generate a series of grey colors indexed by the count of observations per state,
# indexed so that the observations match the shape files that come with the "state" object in
# the maps package:
cols <- rev(grey.colors(5))[shapeFileN]

# now let's draw the map:
png("demosamplemap.png", width=1000, height=700)

# draw the map object using the colors specified above
map("state", col= cols, fill=T)

# draw latitude and longitude points for each observation:
points(as.numeric(lon), as.numeric(lat), cex = .3, col = rgb(0,0,0,.5))

# draw a legend
legend("bottomright", levels(mtstatesb), fill = c("white", rev(grey.colors(5))[2:5]), bty="n")

# put a title on it:
title("Geographic distribution of Zip+4 codes")
dev.off()

Note that for publication purposes, you’ll want to create a pdf instead of a png file.
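For instance, you could swap the png() device for pdf(), keeping the same plotting calls (a sketch; pdf() takes sizes in inches rather than pixels):

# same map, written to a vector PDF for publication
pdf("demosamplemap.pdf", width = 10, height = 7)
map("state", col = cols, fill = TRUE)
points(as.numeric(lon), as.numeric(lat), cex = .3, col = rgb(0, 0, 0, .5))
legend("bottomright", levels(mtstatesb), fill = c("white", rev(grey.colors(5))[2:5]), bty = "n")
title("Geographic distribution of Zip+4 codes")
dev.off()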

OK, now for our second example, which will geolocate using ip addresses.  I found an excellent ip database website–it’s pretty current and has nicely formatted html: http://www.ip-db.com/.  You can look up an ip address as follows: http://www.ip-db.com/[ipaddress].  Take a look at the html for http://www.ip-db.com/64.90.182.55, one of the NIST Internet Time Servers that I’ll be using to create example code.  If you view the html source, you’ll see that the relevant state is embedded in this line:

Region: <span style="font-family: arial; font-size: x-small;">New York

Also embedded are the latitude and longitude, which we will need to create the map:

Lat/Long:
<span style="font-family: arial; font-size: x-small;"><a href="http://maps.google.com/maps?om=1&q=40.7089+-74.0012&z=10" target="_blank">40.7089 -74.0012</a>

Now we’ll just use RCurl to send a set of ip addresses to this website, then use Regular Expressions to extract the relevant data from the html that the website returns. Code below:

###############################################################################
################## Example 2: Geolocating IP Addresses ########################
###############################################################################

ips <- c("64.90.182.55", "96.47.67.105", "206.246.122.250", "129.6.15.28", "64.236.96.53", 
		"216.119.63.113", "64.250.177.145", "208.66.175.36", "173.14.55.9", "64.113.32.5", 
		"66.219.116.140", "24.56.178.140", "132.163.4.101", "132.163.4.102", "132.163.4.103",
		"192.43.244.18", "128.138.140.44", "128.138.188.172", "198.60.73.8", "64.250.229.100",
		"131.107.13.100", "207.200.81.113", "69.25.96.13", "216.171.124.36", "64.147.116.229")

# create variables to store output:
state <- character(length(ips))
lat <- character(length(ips))
lon <- character(length(ips))

# iterate over ip addresses:
for(i in 1:length(ips)){
	#i=5	
	print(paste("i = ", i, " ip: ", i))

	ipinfo <- getURL(paste("http://www.ip-db.com/", ips[i], sep=""))
	
	# use RegEx to extract key info:
	library(stringr)
	
	# Get state
	# We use the regex (\\w\\s?)+ where \\w = word character, \\s? = possible space character, 
	# () groups the pattern, and + means one or more.  
	# note that we must escape " characters by \" to include them in the pattern
	pattrn <- "Region:</b></font></td><td width=\"10\">&nbsp;</td><td><font face=\"arial\" size=\"2\">(\\w\\s?)+</td>"
	( results <- str_extract(ipinfo, pattrn))
	
	# Extract the actual text of the state and clean it up so we can actually use it: 
	state[i] <- str_extract(results, ">(\\w\\s?)+<")
	state[i] <- gsub(">", "", state[i])
	state[i] <- gsub("<", "", state[i])
	
	print(state[i])
	
	# Now get latitude and longitude:
	# We can limit our pattern to target="_blank" because it's a unique pattern in the html
	# An example of our substring of interest follows: "target=\"_blank\">40.7089 -74.0012</a></td>"
	# So we use the pattern \\-?\\d+(\\.\\d+)? \\-?\\d+(\\.\\d+)? where \\d means digit, + means
	# one or more, ? means optional, () groups the pattern, and \\- is an escaped dash
	
	pattrn <- "target=\"_blank\">\\-?\\d+(\\.\\d+)? \\-?\\d+(\\.\\d+)?</a></td>"
	( results <- str_extract(ipinfo, pattrn))
		
	# Extract both latitude and longitude and clean them up so we can store them in the data: 
	latlon <- str_extract_all(results, "\\-?\\d+(\\.\\d+)?")
	
	# the code above returns a list in case results has more than one set of characters in the character
	# vector, so we need to unlist it to make it into a normal vector:
	latlon <- unlist(latlon)
	
	# since there are two results, we return one for each for lat and lon:
	lat[i] <- latlon[1] 
	lon[i] <- latlon[2] 
	print(lat[i])
	print(lon[i])
	
	# Put a delay between entries so there's no danger of overwhelming the server
	Sys.sleep(sample(15:25,1)/10) 
}

###############################################################################
####################### draw maps based on ip addresses #######################
###############################################################################

# We'll draw points for each observation of latitude and longitude, and
# we'll shade in each state according to how many cases we have in each state

# first count the number of cases in each state:
mtstates <- table(state)

# then match to the state.name object (which the maps object also uses)
mtfullstatenames <- tolower(state.name[match(names(mtstates), state.name)])

# assign the lowercase full state names back to our state counts:
names(mtstates) <- mtfullstatenames

# use "cut" to group the count data into several groups
mtstatesb <- cut(mtstates, breaks= c(-Inf, 0, 1, 2, 4, Inf), labels=c("0", "1", "2", "3-4", "5+")  )  

library(maps)
# generate a map object so that we can match names to the state level shape files
usmap <- map("state")


# Some of the state level shape files are broken down by sub-state region.  We
# just care about the political boundaries for the purposes of this project, 
# so we will not differentiate.  Thus, we need to make all names within a 
# state the same, which will mean some names are repeated.  
# Luckily the usmap$names are as follows: state:subregion
usmap$names

# so we can just split the strings based on the ":" character.  We'll use
# sapply( ..., "[[", 1) which extracts the first element when a list is returned:
mapstates <- sapply(strsplit(usmap$names, ":"), "[[", 1)
mapstates

# Now we need to generate a color for each item in mapstates so we can color in the map. 
# We first use "match()" to index our count of observations per state in mtstatesb by mapstates.  
# shapeFileN will contain the numeric range specified in mstatesb, but indexed by the 
# the names in mapstates, some of which will of course be repeated as discussed above. 

shapeFileN <- mtstatesb[match(mapstates,names(mtstates))]

# Now we can generate a series of grey colors indexed by the count of observations per state,
# indexed so that the observations match the shape files that come with the "state" object in
# the maps package:
cols <- rev(grey.colors(5))[shapeFileN]

# now let's draw the map:
png("demosamplemap.png", width=1000, height=700)

# draw the map object using the colors specified above
map("state", col= cols, fill=T)

# draw latitude and longitude points for each observation:
points(as.numeric(lon), as.numeric(lat), cex = .3, col = rgb(0,0,0,.5))

# draw a legend
legend("bottomright", levels(mtstatesb), fill = c("white", rev(grey.colors(5))[2:5]), bty="n")

# put a title on it:
title("Geographic distribution of NIST Internet Time Servers")
dev.off()