Putting it all together: concise code to make dotplots with weighted bootstrapped standard errors

Dotplots with plyr, boot, and ggplot2

I analyze a lot of experiments, and there are many times when I want to quickly look at means and standard errors for each cell (experimental condition), or the same broken out by an individual-level attribute (e.g., Democrat, Independent, Republican).  Many of these experiments are embedded in national or state-level surveys, in which each respondent is weighted so that the sample distributions of gender, age, and political affiliation approximate those of a probability sample. But once you use weights, it’s difficult to get unbiased estimates of the standard error of the weighted mean.

In this post, I’ll walk through how to use the powerful plyr package to summarize your weighted data, estimating weighted means and standard errors for each experimental cell using the bootstrap, and then generate dotplots to visualize the result using the ggplot2 package. Note that these are both Hadley Wickham’s packages; he’s also responsible for the immensely useful packages reshape and stringr, which I hope to write about in future posts.

First, I’ll describe the data we’re going to use. Just before the 2010 elections, my lab hired YouGov/Polimetrics to survey Illinois voters about a series of candidates for statewide office, most notably those for Secretary of State.  The major party candidates included the Black Democrat incumbent, Jesse White, and a Hispanic Republican challenger, Robert Enriquez. We exposed respondents to different images of Jesse White, according to three experimental conditions: light, dark, and control (no picture). Our dependent measures included a question about how each respondent intended to vote and a feeling thermometer measuring affect toward each candidate. I use the net of the thermometer rating for Obama minus the thermometer rating for McCain (“netther”). For more on this type of measure, here’s a link to a study on feeling thermometers and their precision compared to questionnaire measures with fewer categories.

Even before we deal with the issue of weighted standard errors, it would be nice to have an elegant way to extract weighted statistics on subsets of our data corresponding to our experimental cells (or to another qualitative grouping of our data, such as respondent party or ethnicity). Doing this with other packages/functions, such as base::by or doBy::summaryBy, is rather clunky, because it’s difficult to pass more than one variable (say, the data and the weights for “weighted.mean()”) to the function applied to each subset (see this discussion for details; thanks to David Freedman for the pointer to plyr::ddply). The survey::svymean function is another potential alternative, but it is nowhere near as flexible as ddply.

We’ll use plyr::ddply, which handles this problem quite well. In the code below, we use “ddply(),” which takes a data frame for input and returns a data frame for output. The “.data” parameter specifies the data frame and the “.variables” parameter specifies the qualitative variable used to subset the data. We use “summarise()” as the function, which allows us to specify not only the function we want to pass over the subsets of our data (experimental cells), but also to tell ddply which variables in our data subset to pass to the function. For example, below we run the function “mean()” on the variable “netther” for each treatment group.

install.packages("plyr")
library("plyr")

# load the Illinois survey data
load(file("http://dl.dropboxusercontent.com/u/25710348/Blogposts/data/IL2010.Rda"))

# look at ddply:
ddply(.data = IL2010, .variables = .(treat), .fun = summarise,
		mean = mean(x=netther, na.rm=T),
		se = sd(netther, na.rm=T)/sqrt(sum(!is.na(netther))) )  # n counts non-missing values only

Now that we have some nice machinery for quickly summarizing weighted data, let’s get into bootstrapping. The standard error of a weighted mean has no straightforward closed-form solution, so bootstrapping is not a bad way to go. Bootstrapping consists of computing a statistic over and over, each time resampling the data with replacement. For more, see the UCLA stats webpage on bootstrapping in R.
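To make the idea concrete before reaching for the boot package, here’s a minimal hand-rolled sketch on made-up data (the sample here is purely illustrative): resample with replacement, recompute the statistic each time, and look at the spread of the results.

# Hand-rolled bootstrap of an (unweighted) mean on toy data
set.seed(42)
x <- rnorm(200, mean = 5, sd = 2)                        # made-up sample
boot.means <- replicate(1000, mean(sample(x, replace = TRUE)))
mean(boot.means)             # bootstrap estimate of the mean
sd(boot.means)               # bootstrap estimate of its standard error
sd(x) / sqrt(length(x))      # compare to the analytic standard error

For the unweighted mean the bootstrap and analytic standard errors should be close; with weights in the picture, the analytic formula is no longer available, which is exactly why we bootstrap below.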

I’ll use the boot package to implement bootstrapping, which requires us to specify a bootstrapping function that returns a statistic.

install.packages("boot")
library("boot")

sample.wtd.mean <- function(x, w, d) {
	return(weighted.mean(x = x[d], w = w[d], na.rm=T ))
}

Why do we need this function? It allows the boot function to resample our data many times (with replacement), each time computing the statistic we want to estimate, as mentioned above. Writing the function this way makes the process quite elegant by exploiting R’s indexing capabilities: boot passes the function an index (d) of the items from our original data to include in each resample. To illustrate, run the following code in R:

playdata <- runif(20, 1, 10)
playdata
d <- sample(20, replace = TRUE)
d
playdata[d]

The code “playdata[d]” returns the elements of playdata at the indices in d. This is exactly how the boot function works. It passes our “sample.wtd.mean()” function a new index of items (d), drawn by sampling with replacement, over and over again, utilizing the fast C code that R uses for indexing rather than slow R for-loops.

When boot is done, we take the mean or median of these statistics as our estimate (this is, loosely speaking, the bootstrap estimate), and we use the standard deviation of these statistics as our estimate of the standard error. Here’s what happens when we pass our function over our data:

b <- boot( IL2010$netther, statistic = sample.wtd.mean, R=1000, w=IL2010$weight)
b
sd(b$t)

[UPDATE 26 JAN 2012: previously this was coded incorrectly.  The “weights” parameter has been changed to “w.” Thanks to Gaurav Sood and Yph Lelkes for pointing this out!]

Now let’s put it all together using “plyr” and “ggplot2.” I use dotplots because they convey numbers more accurately than other types of plots, such as bar plots or (worst of all) pie charts. William Cleveland has conducted research showing that dots aligned on the same scale are indeed the best visualization to convey a series of numerical estimates (Correction, the actual study is here). His definition of “best” is based on the accuracy with which people interpret various graphical elements, or as he calls them, “perceptual units.”

We’ll first just visualize results by party:


install.packages("ggplot2")
library("ggplot2")

# Create nice party ID variable:
IL2010$pid[IL2010$pid==8] <- NA
IL2010$pidcut <- cut(IL2010$pid, c(-Inf, 1, 3, 4, 6, Inf) , 
		c("Strong Democrat", "Democrat", "Independent", "Republican", "Strong Republican"))
# clean up treatment variable
IL2010$treat <- factor(IL2010$treat, labels = c("Dark", "Light", "Control" ))
# Use plyr::ddply to produce estimates based on PID
pid.therm <- ddply(.data = IL2010, .variables= .(pidcut), .fun = summarise,		
		mean = weighted.mean(x=netther, w=weight, na.rm=T),
		se = sd(boot(netther, sample.wtd.mean, R = 1000, w = weight)$t))
# plot it
ggplot(na.omit(pid.therm), aes(x = mean, xmin = mean-se, xmax = mean+se, y = factor(pidcut))) +
		geom_point() + geom_segment( aes(x = mean-se, xend = mean+se,
						y = factor(pidcut), yend=factor(pidcut))) +
		theme_bw() + theme(axis.title.x = element_text(size = 12, vjust = .25)) +
		xlab("Mean thermometer rating") + ylab("Party identification") +
		ggtitle("Thermometer ratings for Jesse White by party")

That’s a pretty big ggplot command, so let’s discuss each element. The first is the data to plot, the object produced via ddply. The second specifies the aesthetic elements of the plot, including the x and y values and the endpoints of the standard-error bars (“xmin = mean-se, xmax = mean+se”). Next, we add points to the plot by invoking “geom_point().” If we only wanted points, we could stop here. We plot lines representing standard errors with “geom_segment(),” specifying x and xend. Next, we apply theme_bw(), which I find cleaner than the default theme. Then we make some minor theme adjustments for aesthetics. Lastly, we set the x and y labels and the title.

Well, it’s pretty clear from our plot that there’s a lot of action on partisan ID. Let’s look at how each group is affected by the treatment. To do so, we’ll produce the plot at the top of this article, putting each partisan group in a panel via “facet_wrap()” and plotting each treatment group within it. First, we generate the summary data using plyr; all we need to do is add another grouping variable (“treat”).

# PID x Treatment
trtbypid.therm <- ddply(.data = IL2010, .variables= .(treat, pidcut), .fun = summarise,		
		mean = weighted.mean(x=netther, w=weight, na.rm=T),
		se = sd(boot(netther, sample.wtd.mean, R = 1000, w = weight)$t))

ggplot(na.omit(trtbypid.therm), aes(x = mean, xmin = mean-se, xmax = mean+se, y = factor(treat))) + 
		geom_point() + geom_segment( aes(x = mean-se, xend = mean+se, 
						y = factor(treat), yend=factor(treat))) +
		facet_wrap(~pidcut, ncol=1) +
		theme_bw() + theme(axis.title.x = element_text(size = 12, vjust = .25)) + 
		xlab("Mean thermometer rating") + ylab("Treatment") +
		ggtitle("Thermometer ratings for Jesse White by treatment condition and party ID")

It looks like Republicans generally don’t like the dark image, while Democrats do. But what if we want to look at the way racial attitudes interact with the treatments? We measured implicit racial attitudes via the IAT. Let’s take a look.

# Fix up the data to remove the "Control" condition and all "na's"
ilplot <- IL2010[-which(is.na(IL2010$pidcut)),]
ilplot$treat[ilplot$treat=="Control"] <- NA
ilplot$treat <- droplevels(ilplot$treat)  # drop the now-unused "Control" level
ilplot <- ilplot[-which(is.na(ilplot$treat)),]
ilplot$dscore[which(ilplot$dscore < -1)] <- NA  # treat extreme low d-scores as missing

# Plot it
ggplot(ilplot, aes(x = dscore, y = netther, colour=treat)) +
		geom_point() + geom_smooth(method = "loess", size = 1.5) +
		facet_wrap(~pidcut, nrow=1) +
		theme_bw() + theme(axis.title.x = element_text(size = 12, vjust = .25)) +
		xlab("IAT d-score (- pro-black, + anti-black)") + ylab("Thermometer rating") +
		ggtitle("Thermometer ratings for Jesse White by treatment condition and IAT")

Visualization of IAT x skin-complexion manipulation (conditional on party ID) via ggplot2

With this plot, we don’t need to subset the data: we use all of the points and let loess draw lines representing the conditional means across the range of IAT scores. The plot suggests an interaction between IAT score and treatment, such that those with high (anti-black) IAT scores generally favor the lighter-skinned image.

About Solomon

Political Scientist, Facebook Data Science
This entry was posted in R.

13 Responses to Putting it all together: concise code to make dotplots with weighted bootstrapped standard errors

  1. Yph says:

    this is helpful thanks sol. i’m confused about the se’s around the lowess curves in the bottom graph. it seems like the bands are too tight for “pro-black” republicans, given the number of data points. since pro-black reps look like anti-black reps, I wonder about the validity of the measure.

    • Solomon says:

      Hey Yph, thanks for the comment. Regarding the SEs on the curve, lowess is smoothing things a lot here, and also keep in mind that the SEs are strictly vertical (rather than 90 degrees from the curve) which can make them appear narrower than they actually are when the curve is steep. Regarding the validity of the measure, yes, dscore is a noisy indicator of implicit attitudes–it is certainly not perfect. But remember that party affiliation is the primary driver of affect here–implicit attitudes are secondary–so even if we measured implicit attitudes perfectly we would have no way of knowing from this data whether a particular respondent is relying primarily on their political affiliation or their implicit attitudes to form their judgement. All we can say is that in general respondents with high anti-black bias as measured on the IAT seem to be affected by the skin comp manipulation (see study 3 in this manuscript on the validity of implicit versus explicit measures of race bias).

  2. solipscience says:

    What does the variable “netther” refer to in the data set?

    • Solomon says:

      Hi John, thanks for the question: netther is the net thermometer rating: feeling thermometer toward Obama minus feeling thermometer toward McCain. I’ll update the post.

  3. solipscience says:

    Hello Solomon,

    Thank you for the prompt, clear reply. This is an excellent post on boot-strapping (a topic I have been trying to familiarize myself with). I apologize in advance, but I have a code question. When I run the final ggplot2 graph code I get the error “Error in eval(expr, envir, enclos) : object ‘pidcut’ not found”

    I can see in the preceding code where pidcut is referred too, but I cannot find it in the workspace (I believe it should be created in the IL2010 data set, I could be mistaken). Do you have any thoughts.

    Thanks,
    John

    • Solomon says:

      Hi John, thanks for your comment. I thought I relabeled the treatment variable in the data set I made available online. I’ve added the appropriate line of code to the post. If you run the code below, you should be able to produce the last plot directly:

      # load the Illinois survey data
      load(file("http://www.stanford.edu/~messing/data/IL2010.Rda"))

      # clean up treatment variable
      IL2010$treat <- factor(IL2010$treat, labels = c("Dark", "Light", "Control" ))

      # Create nice party ID variable:
      IL2010$pid[IL2010$pid==8] <- NA
      IL2010$pidcut <- cut(IL2010$pid, c(-Inf, 1, 3, 4, 6, Inf) ,
      c("Strong Democrat", "Democrat", "Independent", "Republican", "Strong Republican"))

      library("ggplot2")

      # Fix up the data to remove the "Control" condition and all "na’s"
      ilplot <- IL2010[-which(is.na(IL2010$pidcut)),]
      ilplot$treat[ilplot$treat=="Control"] <- NA
      ilplot$treat <- ilplot$treat[drop=T]
      ilplot <- ilplot[-which(is.na(ilplot$treat)),]

      # Plot it
      ggplot(ilplot, aes(x = dscore, y = netther, colour=treat)) +
      geom_point() + geom_smooth(method = "loess", size = 1.5) +
      facet_wrap(~pidcut, nrow=1) +
      theme_bw() + opts(axis.title.x = theme_text(size = 12, vjust = .25))+
      xlab("IAT d-score (- pro black, + anti-black)") + ylab("Thermometer rating") +
      opts(title = expression("Thermometer ratings for Jessie White by treatment condition and iat"))

  4. solipscience says:

    Hello Solomon,

    Your revised code works perfectly. Thank you for your help and patience. Great post!

    Best,
    John

  5. Pingback: Visualization series: Insight from Cleveland and Tufte on plotting numeric data by groups | Solomon Messing

  6. George Ellis says:

    Hi Soloman,
    This is a very impressive. I stumbled across your blog on R-bloggers. I’m a hobbyist R and statistical analysis user and I really appreciate blogs such as this, which are topical and educational. I usually try to solve my own problems before bothering others unnecessarily. However, I though you might want to know that I receive the following error when running some of your functions.

    Warning messages:
    1: sd() is deprecated.
    Use apply(*, 2, sd) instead.

    • Solomon says:

      Hi George, thanks very much. The message you append is just a warning message and should not affect things. I don’t get that message when I run the code, but perhaps you’ve upgraded to a newer version of R?

  7. Pingback: Putting it all together: concise code to make dotplots with weighted bootstrapped standard errors | Estadística y R | Scoop.it

  8. Vania says:

    Hi, I am trying to load the data file to run your example but i get a “access forbidden” message. Is the file available somewhere else?
    Thanks in advance.
