Response to FiveThirtyEight’s Podcast about our paper, “Projecting confidence”


Do you remember the night of Nov 8, 2016? I was glued to election coverage and obsessively checking probabilistic forecasts, wondering whether Clinton might do so well that she’d win in places like my home state of Arizona. Although FiveThirtyEight put Clinton’s chances of beating Trump at around 70%, most other forecasters had her at around 90%.

When she lost, many on both sides of the aisle were shocked. My co-authors and I wondered if America’s seeming confidence in a Clinton victory wasn’t driven in part by increasing coverage of probabilistic forecasts. And, if a Clinton victory looked inevitable, what did that do to turnout?


We weren’t alone. Clinton herself was quoted in New York Magazine after the election:

I had people literally seeking absolution… ‘I’m so sorry I didn’t vote. I didn’t think you needed me.’ I don’t know how we’ll ever calculate how many people thought it was in the bag, because the percentages kept being thrown at people — ‘Oh, she has an 88 percent chance to win!’

Enter our recent blog post and paper released on SSRN, “Projecting confidence: How the probabilistic horse race confuses and de-mobilizes the public,” by Sean Westwood, Solomon Messing, and Yphtach Lelkes.

While our work cannot definitively say whether probabilistic forecasts played a decisive role in the 2016 election, it does show that, compared with more conventional vote share projections, probabilistic forecasts can confuse people, can give them more confidence that the candidate depicted as ahead will win, and may decrease turnout; it also shows that liberals in the U.S. are more likely to encounter these forecasts.

We appreciate the media attention to this work, including coverage by the Washington Post, New York Magazine, and the Political Wire.

What’s more, FiveThirtyEight devoted much of their Feb. 12 Politics Podcast to a spirited, and at points critical, discussion of our work. We are open to criticism and respond below to some of the questions the hosts raised.

Below, we’ll show that the evidence in our study and in other research is consistent with our headline, contrary to what the hosts suggest—we’ll detail the evidence that probabilistic forecasts confuse people, irrespective of their technical accuracy. We’ll also discuss where we agree with the podcast hosts.

Furthermore, we’ll discuss a few topics that, judging from the hosts’ discussion, may not have come through clearly enough in our paper. We’ll reiterate what this work contributes to social science—how the paper adds to our understanding of how people think about probabilistic forecasts and how those forecasts may decrease voting, particularly among the leading candidate’s supporters and among liberals in the U.S. We’ll then walk readers through the way we mapped vote share projections to probabilities in the study.

Finally we’ll discuss why this work matters, and conclude by pointing out future research we’d like to see in this area.

What’s new here?

The research contains a number of findings that are new to social science:

  • Presenting forecasted win-probabilities gives potential voters the impression that one candidate will win more decisively, compared with vote share projections (Study 1).
  • Higher win probabilities, but not vote share estimates, decrease voting in the face of the trade-offs embedded in our election simulation (Study 2). This helps confirm the findings in Study 1 and adds to the evidence from past research that people vote at lower rates when they perceive an election to be uncompetitive.
  • In 2016, probabilistic forecasts were covered more extensively than in the past and tended to be covered by outlets with more liberal audiences.

Where we agree

If what you care about is conveying an accurate sense of whether one candidate will win, probabilistic forecasts do this slightly better than vote share projections. And they seem to give people an edge on accuracy when interpreting the vote share if their candidate is behind. Of course, people can be confused and still end up being accurate, as we’ll discuss below.

We also agree that people often do not accurately judge the likelihood of victory after seeing a vote share projection. That makes sense because, as the study shows, people appear to largely ignore the margin of error, which they’d need to map between vote share estimates and win probabilities.

We also agree that a lot of past work shows that people stay home when they think an election isn’t close. What we’re adding to that body of work is evidence that compared with vote share projections, probabilistic forecasts give people the impression that one candidate will win more decisively, and may thus more powerfully affect turnout.

Does the evidence in our study contradict our headline?

Our headline isn’t about accuracy, it’s about confusion. And the evidence from this research and past work, taken as a whole, suggests that probabilistic forecasts confuse people — something that came up at the end of the segment — even if the result is sometimes technically higher accuracy.

1. People in the study who saw only probabilistic forecasts were more likely to confuse probability and vote share. After seeing probabilistic forecasts, 8.6% of respondents mixed up vote share and probability, while only 0.6% of respondents did so after seeing vote share projections. We’re defining “mixed-up” as reporting the win-probability we provided as the vote share and vice-versa.

2. Figure 2B (Study 1) shows that people get their candidate’s likelihood of winning very wrong, even when we explicitly told them the probability a candidate will win. It’s true that they got slightly closer with a probability forecast, but they are still far off.

Why might this be? A lot of past research and evidence suggests that people have trouble understanding probabilities, as noted at the end of the podcast. People have a tendency to think about probabilities in subjective terms, so they have trouble understanding medical risks and even weather forecasts.

Nate Silver has himself made the argument that the backlash we saw to data and analytics in the wake of the 2016 election is due in part to the media misunderstanding probabilistic forecasts.

As the podcast hosts pointed out, people underestimated the true likelihood of winning after seeing both probabilistic forecasts and vote share projections. It’s possible that people are skeptical of any probabilistic forecast in light of the 2016 election. It’s possible they interpreted the likelihood not as hard-nosed odds, but in rather subjective terms — what might happen, consistent with past research. Regardless, they do not appear to reason about probability in a way that is consistent with how election forecasters define the probability of winning.

3. Looking at how people reason about vote share — the way people have traditionally encountered polling data — it’s clear from our results that when a person’s candidate is ahead and they see a probabilistic forecast, they rather dramatically overestimate the vote share. On the other hand, when they are behind, they get closer to the right answer.

But we know from past research that people have a “wishful thinking” bias, meaning they say their candidate is doing better than polling data suggest. That’s why there’s a positive bias when people are evaluating how their candidate will do, according to Figure 2A (Study 1).

The pattern in the data suggests that people are more accurate after seeing a probabilistic forecast for a losing candidate because of this effect, and not necessarily because they better understand that candidate’s actual chances of victory.

4. Perhaps even more importantly, none of the results here changed when we excluded the margin of error from the projections we presented to people. That suggests that the public may not understand error in the same way that statisticians do, and therefore may not be well-equipped to understand what goes into changes in probabilistic forecast numbers. And of course, very small changes in vote share projection numbers and estimates of error correspond to much larger swings in probabilistic forecasts.
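To see just how sensitive win probabilities are to small shifts in the underlying projection, here is a quick sketch in R (the numbers are hypothetical, not from the study):

```r
# Hypothetical numbers (not from the study): with a fixed total-error SD of
# 2 percentage points, a 2-point shift in projected vote share swings the
# win probability by roughly 24 points
sigma <- 0.02
win_prob <- function(mu) 1 - pnorm(0.5, mean = mu, sd = sigma)
win_prob(0.51)  # ~0.69
win_prob(0.53)  # ~0.93
```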

5. Finally, as we point out in the paper, if probabilistic forecasters do not account for total error, they can really overestimate a candidate’s probability of winning. Of course, that’s because an estimate of the probability of victory bakes in estimates of error, which recent work has found is often about twice as large as the estimates of sampling error provided in many polls.

As Nate Silver has alluded to, if the forecaster does not account for unobserved error, including error that may be correlated across surveys, he or she will artificially inflate the estimated probability of victory or defeat. Of course, FiveThirtyEight does attempt to account for this error, and released far more conservative forecasts than others in this space in 2016.
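As a rough sketch of the point (with made-up numbers, not FiveThirtyEight’s model): evaluating the same projection with an error SD twice as large, to reflect total rather than sampling error, substantially deflates the win probability:

```r
# Made-up numbers: a 52% projection evaluated with sampling error only,
# versus total error assumed to be twice as large
mu <- 0.52
sampling_sd <- 0.01            # sampling error alone
total_sd <- 2 * sampling_sd    # total error roughly twice sampling error
1 - pnorm(0.5, mu, sampling_sd)  # ~0.98, a near-lock
1 - pnorm(0.5, mu, total_sd)     # ~0.84, considerably less certain
```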

Speaking in part to this issue, Andrew Gelman and Julia Azari recently concluded that “polling uncertainty could best be expressed not by speculative win probabilities but rather by using the traditional estimate and margin of error.” They seemed to be speaking about other forecasters, and did not directly reference FiveThirtyEight.

At the end of the day, it’s easy to see that a vote share projection of 55% means that “55% of the votes will go to Candidate A, according to our polling data and assumptions.” However, it’s less clear that an 87% win probability means that “if the election were held 1000 times, Candidate A would win 870 times, and lose 130 times, based on our polling data and assumptions.”

And most critically, we show that probabilistic forecasts showing more of a blowout could potentially lower voting. In Study 1, we provide limited evidence of this based on self-reports. In Study 2, we show that when participants face incentives designed to simulate real-world voting, they are less likely to vote when probabilistic forecasts show higher odds of one candidate winning. Yet they are not responsive to changes in vote share.

What’s with our mapping between vote share and probability?

The podcast questions how a 55% vote share with a 2% margin of error is equivalent to an 87% win probability. This illustrates a common problem people have when trying to understand win probabilities — it’s difficult to reason about the relationship between win-probabilities and vote share without actually running the numbers.

You can express a projection as either (1) the average vote share (either an electoral college vote share or the popular vote share)

\hat \mu_v = \frac{1}{N}\sum_{i=1}^{N}\bar x_i

and margin of error

\hat \mu_v \pm T^{0.975}_{df = N} \times \frac{ \hat \sigma_v}{\sqrt N}

Here the average for each survey is \bar x_i, and there are N = 20 surveys.

Or (2) the probability of winning — the probability that the vote share is greater than half, based on the observed vote share and standard error:

1 - \Phi \left (\frac{.5 - \hat \mu_v}{\hat\sigma_v} \right )

Going back to the example above, here’s the R code to generate those quantities:

svy_mean = .55
svy_SD = 0.04483415 # pooled SD across the 20 surveys; see appendix
N_svy = 20
margin_of_error = qt(.975, df = N_svy) * svy_SD/sqrt(N_svy)

svy_mean
[1] 0.55

margin_of_error
[1] 0.02091224

prob_win = 1 - pnorm(q = .50, mean = svy_mean, sd = svy_SD)
prob_win
[1] 0.8676222

More details about this approach are in our appendix. This is similar to how the Princeton Election Consortium generated win probabilities in 2016.


Of course, one can also use an approach based on simulation, as FiveThirtyEight does. In the case of the data we generated for our hypothetical election in Study 1, this approach is not necessary. However we recognize that in the case of real-world presidential elections, a simulation approach has clear advantages by virtue of allowing more flexible statistical assumptions and a better accounting of error.
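For the hypothetical projection above, a minimal simulation sketch (assuming normality, which real forecast models relax) recovers essentially the same number as the closed-form calculation:

```r
# Minimal simulation-based win probability (assumes normality; real
# forecast models are considerably richer than this)
set.seed(1)
sims <- rnorm(1e5, mean = 0.55, sd = 0.04483415)  # simulated vote shares
mean(sims > 0.5)  # ~0.87, close to the closed-form 0.8676
```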

Why does this matter?

To be clear, we are not analyzing real-world election returns. However, a lot of past research shows that when people think an election is in the bag, they tend to vote in real-world elections at lower rates. Our study provides evidence that probabilistic forecasts give people more confidence that one candidate will win and suggestive evidence that we should expect them to vote at lower rates after seeing probabilistic forecasts.

This matters a lot more if one candidate’s potential voters are differentially affected, and there’s evidence that may be the case.

1. Figure 2C in Study 1 suggests that the candidate who is ahead in the polls will be more affected by the increased certainty that probabilistic forecasts convey.

2. When you look at the balance of coverage of probabilistic forecasts on major television broadcasts, there is more coverage on MSNBC, which has a more liberal audience.

3. Consider who shares this material on social media, specifically the average self-reported ideology of people who share links to various sites hosting poll-aggregators on Facebook (data that come from this paper’s replication materials). The websites that present their results in terms of probabilities have left-leaning (negative) social media audiences; only one site, which doesn’t emphasize win-probabilities, has a conservative audience:

4. In 2016, the proportion of American National Election Study (ANES) respondents who thought the leading candidate would “win by quite a bit” was unusually high for Democrats…


5. And we know that people who say the leading presidential candidate will “win by quite a bit” in pre-election polling are about three percentage points less likely to report voting shortly after the election than people who say it’s a close race — and that’s after conditioning on election year, prior turnout, and party identification. The data here are from the ANES and go back to 1952.

These data do not conclusively show that probabilistic forecasts affected turnout in the 2016 election, but they do raise questions about the real world consequences of probabilistic forecasts.

What about media narratives?

We acknowledge that these effects may change depending on the context in which people encounter them — though people can certainly encounter a lone probability number in media coverage of probabilistic forecasts. We also acknowledge that our work cannot address how these effects compare to and/or interact with media narratives.

However, other work that is relevant to this question has found that aggregating all polls reduces the likelihood that news outlets focus on unusual polls that are more sensational or support a particular narrative.

In some ways, the widespread success of and reliance on these forecasts represents a triumph of scientific communication. In addition to greater precision compared with one-off horserace polls, probabilistic forecasts can quantify how likely a given U.S. presidential candidate is to win using polling data and complex simulation, rather than leaving the task of making sense of state and national polls to speculative commentary about “paths to victory,” as we point out in the paper. And as one of the hosts noted, we aren’t calling for an end to election projections.

Future work

We agree with the hosts that there are open questions about whether the public gives more weight to these probabilistic forecasts than other polling results and speculative commentary. We have also heard questions raised about how much probabilistic forecasts might drive media narratives. These questions may prove difficult to answer and we encourage research that explores them.

We hope this research continues to create a dialogue about how to best communicate polling data to the public. We would love to see more research into how the public consumes and is affected by election projections, including finding the most effective ways to convey uncertainty.


Replication of Study 2 in “Bias in the Flesh: Skin Complexion and Stereotype Consistency in Political Campaigns”

This post presents a replication of Messing et al. (2016, study 2), which showed that exposure to darker images of Barack Obama increased stereotype activation, as indicated by the tendency to finish incomplete word prompts—such as “W E L _ _ _ _”—in stereotype-consistent ways (“WELFARE”).

Overall, the replication shows that darker images of even counter-stereotypical exemplars like Barack Obama can increase stereotype activation, but that the strength of the effect is weaker than conveyed in the original study.  A reanalysis of the original study conducted in the course of this replication effort unearthed a number of problems that, when corrected, yield estimates of the effect that are consistent with those documented in the replication. This reanalysis also follows.

I’m posting this to

  1. disseminate a corrected version of the original study;
  2. show how I found those problems with the original study in the course of conducting this replication;
  3. circulate these generally confirmatory findings, along with a pooled analysis revealing a stronger effect among conservatives; and
  4. provide a demonstration of how replication almost always enhances our knowledge about the original research, which I hope may encourage others to invest the time and money in such efforts.

First some context.

The original study that formed the basis of the manuscript shows that more negative campaign ads in 2008 were also more likely to contain darker images of President Obama. In 2009, when I started this work, I was most proud of the method to collect data on skin complexion outlined in study 1. I included another study, what’s now study 3, which shows that 2012 ANES survey-takers were more likely to respond negatively to Chinese characters after being presented with darker images of Obama (a measure known as the Affect Misattribution Procedure, or AMP).

But the AMP was not a true experiment and a reviewer was concerned that Study 3 did not provide sufficiently rigorous, causal evidence that darker images alone can cause negative affect.  So I conducted an experiment that would establish a causal link between darker images of Obama and something I thought was even more important—stereotype activation. There were strong reasons to expect this effect based on past lab studies showing links between darker skin and negative stereotypes about Blacks, and past observational studies showing far more negative socioeconomic outcomes across the board among darker versus lighter skinned Black Americans. We found an effect and published the three studies.

This replication effort was prompted by a post-publication reanalysis and critique, which raised questions about potential weaknesses in the original analysis. My aim in replicating the study was to bring new data to the discussion and make sure we hadn’t polluted the literature with a false discovery.

The main objection was the way we formed our stereotype consistency index. The items assessing stereotype consistency comprised 11 words with missing blank spaces (e.g., L A _ _). Each fragment had as one possible solution a stereotype-related completion. The complete list follows: L A _ _ (LAZY); C R _ _ _ (CRIME); _ _ O R (POOR); R _ _ (RAP); WEL _ _ _ _ (WELFARE); _ _ C E (RACE); D _ _ _ Y (DIRTY); B R _ _ _ _ _ (BROTHER); _ _ A C K (BLACK); M I _ _ _ _ _ _ (MINORITY); D R _ _ (DRUG).

The author pointed out that there were many potential ways to analyze the original data—he claimed over 16 thousand. Yet very few of these are consistent with generally accepted research practices. We’ve known, arguably since the 16th century, that combining several measures reduces measurement error and hence variance in estimation. This is particularly important in social science, and especially for this particular study—it would be unwise to attempt to use a single word completion or an arbitrary subset thereof to measure a complex, noisy construct like stereotype activation as measured via a word completion game. Rather, taking the average or constructing an index based on clustering several measures should be expected to result in far less measurement error, which is what we did.
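The variance-reduction point can be illustrated with a toy simulation (the numbers here are arbitrary, chosen only to make the arithmetic visible):

```r
# Toy simulation: averaging k noisy items cuts error variance by roughly 1/k
set.seed(42)
true_score <- rnorm(1000)
items <- sapply(1:6, function(i) true_score + rnorm(1000, sd = 2))
var(items[, 1] - true_score)       # ~4: one item alone is very noisy
var(rowMeans(items) - true_score)  # ~0.67: the 6-item average is far less so
```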

Still, I am sympathetic to concerns about the garden of forking paths, which is part of the motivation for this replication.

In the original study, I formed this index based on what I judged to be the most unambiguously negative word-completions (lazy, dirty, poor), consistent with past work suggesting that darker complexion activates the most negative stereotypes about Blacks. I calculated that these were the three variables that also maximized intraclass correlation (ICC). As a robustness check, I also computed a measure that maximized alpha reliability (AR). This measure contained more items, and also seemed to include stereotype-consistent word completions that were on balance negative—lazy, dirty, poor, crime, black, and welfare. I should have, but did not, report results based on a simple average of these items, which were not conclusive.

The critical reanalysis cited above shows a handful of statistically significant patterns that are inconsistent with the expectations in the original study, which is suggestive evidence that it’s quite possible to find signal in noise if you’re analyzing arbitrary sets of variables with the originally collected data. However, as shown in the much larger replication sample below, none of these patterns replicate.

The critique also noted that we did not include an analysis of several trailing questions we included on the original survey. The concern is the file drawer problem – the incentives against and frequent failure to report null results – which obscures knowledge and is bad for the scientific enterprise.

I included those measures based on past work using the same images as stimuli, which found that darker images prompted more negative evaluations of Obama among people with more negative associations with Blacks, as measured using the Implicit Associations Test (IAT). But testing a specification that conditioned on our main outcome of interest—stereotype-consistent word completions—would mean conditioning on a post-treatment variable, particularly worrisome since we saw an effect on stereotype activation in the study.

Below, I pool the data and report another specification that does not require us to condition on post-treatment variables. It takes advantage of the fact that conservatives had significantly higher levels of stereotype activation (which was documented in the original study), and shows that the effect is in fact stronger among this subgroup, providing preliminary evidence in favor of this hypothesis.

The remainder of this post will present my own reanalysis of the original data, the replication, and finally some additional analysis of the data now possible with the larger, pooled data set.

Re-analysis of original data

In the process of collecting data for the replication studies, I used the same interface, simply appending the new data as additional respondents completed the survey experiment. When I geo-coded the IP address data in the full data set, I found a discrepancy between the cases I originally geo-coded as U.S. cases, and the cases that now resolved to U.S. locations in the complete data set.  Many of these respondents appeared in sequence, suggesting they may have been skipped, perhaps due to issues related to connectivity to the geo-location server I used.

This prompted me to conduct a full re-analysis of the data, which yields smaller estimates of stereotype activation. First, re-estimating the index yielded different items—‘black’ in place of ‘dirty’ for the ICC measure, and ‘race’ in place of ‘welfare’, ‘crime’, and ‘dirty’ in the AR measure. This is due in part to the way I computed the original indices and in part to correcting the geo-coding issue. In the original study, I computed the index of variables that maximized alpha and ICC by hand because the epiCalc::alphaBest function (now epiDisplay::alphaBest) does not return results (nor an error message) for these data. For the reanalysis, I wrote a function that computed variables to include in the index via successive removal of items. The overall alpha is actually slightly lower in the new AR measure, while the new ICC measure has a slightly higher correlation coefficient.
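The successive-removal idea can be sketched as follows (a hypothetical reimplementation for illustration, not my original code or epiDisplay::alphaBest):

```r
# Cronbach's alpha for a matrix of items (one column per item)
cronbach_alpha <- function(X) {
  k <- ncol(X)
  k / (k - 1) * (1 - sum(apply(X, 2, var)) / var(rowSums(X)))
}

# Successive removal: repeatedly drop the item whose removal most improves
# alpha; stop when no single drop helps (or a minimum item count is reached)
alpha_best <- function(X, min_items = 3) {
  repeat {
    current <- cronbach_alpha(X)
    dropped <- sapply(seq_len(ncol(X)),
                      function(j) cronbach_alpha(X[, -j, drop = FALSE]))
    if (ncol(X) <= min_items || max(dropped) <= current) return(colnames(X))
    X <- X[, -which.max(dropped), drop = FALSE]
  }
}
```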

For the sake of transparency, I first report results based on the original items included in the index as reported in Messing et al. 2015 using the updated data, then report the new ICC and AR measures.

Using the original indices with the errantly removed cases included, instead of a 36% increase in stereotype-consistent word completions using the ICC measure, this meant a revised estimate of a 20% increase in stereotype activation (M_Light = 0.33, M_Dark = 0.41, T(859.0) = 2.08, P = 0.038, two-sided). For the AR measure, instead of a 13% increase (M_Light = 0.97, M_Dark = 1.11, T(626.72) = 1.77, P = 0.078, two-sided), this meant an 8% increase (M_Light = 0.98, M_Dark = 1.06, T(850.9) = 1.12, P = 0.265, two-sided).

Re-estimating the indices when including all U.S. cases translates to less conclusive findings—a revised estimate of an 8% increase in stereotype activation in the original study (M_Light = 0.79, M_Dark = 0.86, T(850.7) = 1.27, P = 0.203, two-sided) using the ICC measure, and an 8% increase (M_Light = 0.87, M_Dark = 0.91, T(839.1) = 0.77, P = 0.439, two-sided) using the AR measure.

A slightly smaller effect was also observed when examining differences between conservatives and other participants. Correcting the geo-coding error and updating the indices reduced the estimate of stereotype activation for conservatives. Instead of a 53% increase, the original ICC measure yields a 29% increase (M_Other = 0.35, M_Conservative = 0.49, T(205.9) = 2.49, P = 0.013, two-sided).  The new ICC measure yields an 18% increase (M_Other = 0.80, M_Conservative = 0.98, T(207.9) = 2.41, P = 0.017, two-sided).  For the AR measure, instead of a 29% increase, this meant an 18% increase using either measure (original: M_Other = 0.99, M_Conservative = 1.18, T(210.4) = 2.11, P = 0.036, two-sided) (new: M_Other = 0.86, M_Conservative = 1.05, T(214.3) = 2.47, P = 0.014, two-sided).
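For reference, the comparisons reported throughout are Welch two-sample t-tests, which in R take this form (the data here are simulated placeholders with roughly the reported means, not the study’s data):

```r
# Welch two-sample t-test of the form reported above
# (simulated placeholder data, not the study's)
set.seed(10)
other <- rnorm(650, mean = 0.35, sd = 0.6)
conservative <- rnorm(210, mean = 0.49, sd = 0.6)
t.test(conservative, other)  # Welch-adjusted df; two-sided by default
```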

The replication

I conducted one exact replication and one very close replication with slightly different images, which I pooled for a total of 3,151 respondents, substantially more than the 630 included in the original writeup.  This gives me more statistical power and more precise estimates of the effect in question. (I provide results for each design separately – one of which appears underpowered – at the end of this post).

To be clear, I did not pre-register this replication. However, I’ve tried to err on the side of exhaustive reporting when the original study did not provide exacting specificity in analyzing the new data. Due to the nature of this replication—the presentation of the same analysis conducted in the original study—the p-values provide highly informative, if not conclusive evidence regarding the nature of the effect.

The average reported age was 36; 52% of participants identified as female; 84% identified as White, 8% as Black; 5% as Hispanic; and 3% as Other. 52% identified as liberal, 27% as moderate, 22% as conservative.

Recomputing the ICC index yielded the following items: black, poor, drug. Recomputing the AR index yielded: lazy, black, poor, welfare, crime, drug, which is close to the original study.

In the replication data, the ICC yields a 5% increase in stereotype activation (M_Light = 0.90, M_Dark = 0.95, T(3142.8) = 1.56, P = 0.119, two-sided). Similarly, the alpha measure yields a 5% increase (M_Light = 1.04, M_Dark = 1.09, T(3145.8) = 1.70, P = 0.089, two-sided).

The original study isn’t completely clear on the question of whether a replication should report on the recomputed ICC and AR measures, or the exact same items as in the original study, so it’s worth reporting both. The original ICC measure yields a 3% increase in stereotype activation (M_Light = 0.36, M_Dark = 0.37, T(3148.9) = 0.51, P = 0.611, two-sided). Using the original AR measure yields a 6% increase (M_Light = 1.01, M_Dark = 1.07, T(3147.9) = 1.91, P = 0.057, two-sided).

Finally, it’s worth reporting on an index that simply uses all stereotype-consistent items, which in the replication reveals a 5% increase in stereotype activation (M_Light = 1.30, M_Dark = 1.37, T(3147.7) = 2.05, P = 0.040, two-sided).

A pooled analysis, after normalizing the ICC and AR measures, yields similar results:

                      ICC     Alpha      ALL
  (Intercept)       -0.041   -0.028    1.292***
                    (0.034)  (0.034)  (0.036)
  cond: Dark/Light   0.067*   0.058    0.067*
                    (0.032)  (0.032)  (0.034)
  study              0.006   -0.001    0.008
                    (0.023)  (0.023)  (0.025)
  R-squared             0.0      0.0       0.0
  N                  4012     4012      4012
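The pooled models in the table are OLS regressions of the (normalized) index on the experimental condition and a study indicator; a sketch with simulated stand-in data (the variable names are my own, not those in the replication files):

```r
# Simulated stand-in for the pooled data (variable names are assumptions,
# not the replication files'); the model mirrors the table's specification
set.seed(7)
n <- 4012
pooled <- data.frame(
  cond  = rbinom(n, 1, 0.5),               # 1 = Dark condition
  study = sample(0:2, n, replace = TRUE)   # which of the three studies
)
pooled$index <- 0.07 * pooled$cond + rnorm(n)
fit <- lm(scale(index) ~ cond + study, data = pooled)
round(coef(summary(fit))["cond", ], 3)  # condition coefficient and its SE
```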

I also replicated this study with a different, lesser-known Black politician (Jesse White). However, a manipulation check revealed that only 36% of respondents said the candidate was Black in the “light” condition, compared to 83% in the darker condition, suggesting that any analysis would be severely confounded by perceived race of the target politician. (I did not ask this question in the Barack Obama studies).

Additional analysis

The superior power afforded by pooling all three studies may allow the exploration of treatment heterogeneity. Past work suggests the possibility that darker images might cause people inclined toward more stereotype-consistent responses to evaluate politicians more negatively. However, this analysis would condition on post-treatment variables, which in this case is particularly concerning since the treatment affects stereotype activation according to the original study and replication above.  As an alternative, I consider a specification that uses conservative identification instead, which is a strong predictor of stereotype activation (as shown in the original study), but shouldn’t be affected by the treatment. It reveals evidence for the predicted interactions, suggesting that when conservatives are exposed to darker rather than lighter images of Obama, they have slightly “colder” feelings toward the former president (P = 0.039), perceive him to be less competent (P = 0.061), and less trustworthy (P = 0.083).

                             obama_therm  competence    trust
  (Intercept)                 69.446***    4.213***    3.867***
                              (0.705)     (0.029)     (0.031)
  cond: Dark/Light             0.451       0.024       0.045
                              (0.989)     (0.041)     (0.044)
  iscons                     -40.497***   -1.574***   -1.625***
                              (1.565)     (0.064)     (0.069)
  cond: Dark/Light x iscons   -4.462*     -0.167      -0.165
                              (2.160)     (0.089)     (0.095)
  R-squared                        0.3         0.3         0.2
  N                             3932        3928        3926
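The interaction specification behind this table can be sketched as follows (simulated stand-in data; the variable names are assumptions, not those in the replication files):

```r
# Simulated stand-in for the interaction specification in the table
# (variable names are assumptions, not the replication files')
set.seed(3)
n <- 3932
d <- data.frame(cond = rbinom(n, 1, 0.5), iscons = rbinom(n, 1, 0.22))
d$therm <- 69 + 0.5 * d$cond - 40 * d$iscons - 4.5 * d$cond * d$iscons +
  rnorm(n, sd = 20)
fit <- lm(therm ~ cond * iscons, data = d)
coef(fit)["cond:iscons"]  # the 'Dark/Light x iscons' interaction term
```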

A plot of the model predictions for the thermometer ratings suggests that the effect is concentrated among conservatives.


The more items one uses to form an index, the less noise we should expect, and the more likely any replication attempt should be to succeed. It should also mean greater statistical precision. This could explain the remaining discrepancy between this study and the original after adjusting for the geo-coding error pointed out above. It’s also possible that something about the timing of, or the subjects recruited in, the replication studies explains the observed differences.

Nonetheless, this replication provides evidence that darker images of Black political figures, or at least of President Barack Obama, do in fact activate stereotypes.  This much larger sample suggests that the true effect is smaller than what I found in the original study, which as noted above, contained some errors.

Replication materials available on dataverse.


Below I present alternate specifications estimated without pooling. These specifications suggest, first, that replication 2 (like the original study) was not well powered, and second, that the outcome measures with more items yield more reliable estimates.

Outcome measure summing all items:
Replication 1: M_Light = 1.29, M_Dark = 1.36, T(2115.4) = -1.59, P = 0.113, two-sided
Replication 2: M_Light = 1.30, M_Dark = 1.39, T(982.7) = -1.35, P = 0.177, two-sided

Original Alpha outcome measure:
Replication 1: M_Light = 0.99, M_Dark = 1.06, T(2114.9) = -1.79, P = 0.073, two-sided
Replication 2: M_Light = 1.04, M_Dark = 1.09, T(979.8) = -0.85, P = 0.393, two-sided

Newly estimated Alpha outcome measure:
Replication 1: M_Light = 1.02, M_Dark = 1.09, T(2109.4) = -1.77, P = 0.077, two-sided
Replication 2: M_Light = 1.07, M_Dark = 1.10, T(984.6) = -0.53, P = 0.597, two-sided

Original ICC outcome measure:
Replication 1: M_Light = 0.89, M_Dark = 0.94, T(2100.6) = -1.49, P = 0.137, two-sided
Replication 2: M_Light = 0.93, M_Dark = 0.96, T(985.9) = -0.67, P = 0.506, two-sided

Newly estimated ICC outcome measure:
Replication 1: M_Light = 0.35, M_Dark = 0.36, T(2122.5) = -0.22, P = 0.826, two-sided
Replication 2: M_Light = 0.38, M_Dark = 0.40, T(978.7) = -0.72, P = 0.472, two-sided
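The fractional degrees of freedom above indicate Welch two-sample t-tests, which is what R's `t.test()` runs by default. A minimal sketch on simulated stand-in data (the means and sample sizes here are illustrative, not the study's):

```r
# Welch t-test of a stereotype index by image condition, on simulated data
# chosen only to resemble the scale of the means above.
set.seed(7)
d <- data.frame(index = c(rnorm(1100, mean = 1.29, sd = 0.9),
                          rnorm(1100, mean = 1.36, sd = 0.9)),
                cond = rep(c("Light", "Dark"), each = 1100))
tt <- t.test(index ~ cond, data = d)  # var.equal = FALSE (Welch) is the default
tt$method
```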


A replication of a prior critique and reanalysis: the patterns that ran contrary to our original findings are not significant in the replication data.

Variable                          Effect size   p-value
feeling therm                     -0.028        0.439
race minority welfare crime rap    0.022        0.538
race minority welfare rap          0.017        0.629
race minority rap                  0.011        0.752
race                               0.006        0.861
minority                          -0.013        0.713
rap                                0.024        0.502
welfare                            0.014        0.695
comp                              -0.013        0.720
crime                              0.016        0.659
trust                              0.002        0.946
brother                            0.042        0.236
drug                               0.008        0.822
lazy                               0.026        0.474
black                              0.084        0.019
dirty                              0.028        0.438
poor                              -0.002        0.957
allwcs                             0.073        0.040
original                           0.018        0.611
alpha                              0.068        0.057

Exposure to Ideologically Diverse News and Opinion, Future Research

This article has moved:

Eytan Bakshy & Solomon Messing

Earlier this month, we published an early access version of our paper in ScienceExpress (Bakshy et al. 2015), “Exposure to ideologically diverse news and opinion on Facebook.” The paper constitutes the first attempt to quantify the extent to which ideologically cross-cutting hard news and opinion is shared by friends, appears in algorithmically ranked News Feeds, and is actually consumed (i.e., clicked through to read).

We are grateful for the widespread interest in this paper, which grew out of two threads of related research that we began nearly five years ago: Eytan and Lada’s work on the role of social networks in information diffusion (Bakshy et al. 2012) and Sean and Solomon’s work on selective exposure in social media (Messing and Westwood 2012).

While Science papers are explicitly prohibited from suggesting future directions for research, we would like to shed additional light on our study and raise a few questions that we would be excited to see addressed in future work.

Tradeoffs when Selecting a Population
There were tradeoffs when deciding on who to include in this study. While we could have examined all U.S. adults on Facebook, we focused on people who identify as liberals or conservatives and encounter hard news, opinion, and other political content in social media regularly. We did so because many important questions around “echo chambers” and “filter bubbles” on Facebook relate to this subpopulation, and we used self-reported ideological preferences to define it.

Using self-reported ideological preferences in online profiles is not the only way to measure ideology or define the population of interest. Yet, people who publicly identify as liberals or conservatives in their Facebook profiles are an interesting and important subpopulation worthy of study for many reasons. As Hopkins and King 2010 have pointed out, studying the expression and behavior of those who are politically engaged online is of interest to political scientists studying activists (Verba, Schlozman, and Brady 1995), the media (Drezner and Farrell 2004), public opinion (Gamson 1992), social networks (Adamic and Glance 2005; Huckfeldt and Sprague 1995), and elite influence (Grindle 2005; Hindman, Tsioutsiouliklis, and Johnson 2003; Zaller 1992).

This subpopulation has limitations and is not the only population of interest. The data are not appropriate for those who seek estimates of the entire U.S. public, people without strong opinions, or people not on Facebook (at least not without extrapolation, re-weighting, additional evidence, etc.). While our data could plausibly also provide good estimates for the population of people who are ideologically active and have clear preferences, we are not claiming that’s necessarily the case—that remains to be determined in future work.

We’d like to help other researchers looking to study other populations understand more about the population we’ve defined. An important question in this regard is what proportion of active U.S. adults actually report an identifiable left/right/center ideology in their profile. That number is 25%, or 10.1 million people.

It’s also informative to examine the proportion of those users who provide identifiable profile affiliations conditional on demographics and Facebook usage:

Age          Percent reporting ideological affiliation
18-24        21.6%
25-44        28.5%
45-64        24.3%
65+          21.4%

Gender       Percent reporting ideological affiliation
Female       21.9%
Male         30.6%

Login days   Percent reporting ideological affiliation
105-140      18.9%
140-185      26.7%

Clearly those who report an ideology in their profile tend to be more active on Facebook. They are also more likely to be men, which is consistent with the well-documented gender gap in American politics (Box-Steffensmeier 2004).

It’s possible that these individuals differ from other Facebook users in other ways. It seems plausible to expect these people to have higher levels of political interest, a stronger sense of political ideology and political identity, and to be more likely to be active in politics than most others on Facebook. It’s also possible that these individuals are more extroverted than the average user, especially in the somewhat taboo domain of politics. These possibilities also strike us as interesting questions for study in future work.

How to Measure Ideology
We hope others will replicate this work using other populations and ways of measuring ideology, which will provide a broader view of exposure to political media. Data on ideology could be collected by, for example, surveying users, imputing ideology based on user behavior, or joining data to the voter file. Each of these methods has advantages and potential challenges.

Using surveys in future work would allow researchers to collect data on ideology in a way that can facilitate comparisons with much of the extant literature in political science, and allow researchers to sample from a less politically engaged population. Of course, this could be tricky because survey response rates might be affected by the phenomenon under study. In other words, the salience of political discussion from the right or left, and/or prior choices to consume content could make people more/less likely to respond to a survey asking about ideology, or affect the way they report the strength of their ideological preferences. This could confound measurement in a way that would be difficult to detect and correct. Yet it would be fascinating to see how survey results compare to the results in this study.

We would also encourage the application of large-scale methods that impute individuals’ ideological leanings using social networks or revealed preferences. This would have the advantage of allowing researchers to estimate ideological preferences for a broader population, and could be applied to empirical contexts for which self-reported ideological affiliations are not present.

However, these approaches present challenges. Imputing ideology based on social networks would make it difficult to estimate what proportion of people’s networks contain individuals from the other side. Bond and Messing (2015) and Barberá (2015) discuss some of the challenges related to estimating ideology based on revealed preferences. Another challenge specific to the quantities estimated in our paper is that because behavior may be caused by the composition of individuals’ social networks, what their friends share, and how they engage with Facebook, using revealed preferences to select the population could introduce endogenous selection bias (Elwert and Winship 2014). A study that negotiates these issues would be a tremendously valuable contribution. Similar methods could also be used to obtain measures of the ideological alignment of content.

Lastly, researchers could use party registration from the voter file. This approach would yield millions of records, but it has different selection problems—match rates may differ by region, state, age, gender, etc. Again, the advantage of approaches like this is that these studies complement each other and provide a fuller picture of how exposure to viewpoints from the other side occurs in social media.

Future work should also examine how exposure varies in different subpopulations. For example, one hypothesis to test is whether those with weaker or less consistent ideological preferences have more cross-cutting content shared by friends, rendered in social media streams, and selected for reading. Some preliminary analysis suggests that indeed, among the individuals in our study, those with a weaker stated ideological affiliation have on average more cross-cutting content at each stage in the exposure process.


Other Data Sources
There are many other important questions related to this paper that necessitate new data sources: Does encountering cross-cutting content increase or decrease attitude polarization? What about attitudes toward members of the other side? Does it change specific policy preferences? Are liberals and conservatives more or less likely to see content in News Feed because it was cross-cutting? Do they actively avoid cross-cutting political content because of expressions in the title or because of the fact that the media source is suggestive of a cross-cutting article? How do changes to ranking algorithms and user interfaces affect selective exposure? And how can we better understand actual discourse about politics in social media, rather than merely shared media content?

Answering these questions necessitates collecting innovative data sets via online experimentation (Berinsky et al. 2012), social media (Ryan and Broockman 2012), crowdsourcing (Budak et al. 2014), large scale field experimentation (King et al. 2014), observational social media data, clever ways to collect data about individual differences in ranking (Hannak et al. 2013), smart ways to combine behavioral and survey data (Chen et al. 2014), and panel data (Athey and Mobius 2012, Flaxman et al. 2014).

Many of these are causal questions necessitating experimental and/or quasi-experimental designs. For example, the extent to which people select content because it is cross-cutting could be investigated using experiments like this one (e.g., Messing and Westwood 2012) or through identifying sources of natural exogenous variation. And while Diana Mutz and others have done ground-breaking research on the effects of encountering cross-cutting arguments on political attitudes (Mutz 2002b) and behavior (Mutz 2002a), more research into how these effects play out in the long term (using approaches like Druckman et al. 2012) would be of tremendous benefit to the literature. It is difficult to expose people to any sort of argument for a long period of time (say, over the course of a U.S. national political campaign cycle) in a way that is not confounded with people’s existing preferences and social environment, though creative quasi-experimental work (Martin and Yurukoglu 2014) is emerging in this area.

Many of these questions necessitate that researchers identify the effects of cross-cutting arguments both on and off Facebook. To get a full picture of how cross-cutting arguments affect politics requires understanding the myriad of ways individuals get information, both on the Internet (Flaxman et al. 2014) and offline (Mutz 2002a), what kinds of information people discuss in offline contexts (Mutz 2002b), and the relative influence of all of these factors on opinions.

Finally, if individuals’ online networks and choices do substantially impact the diversity of news in individuals’ overall “information diets,” future research could examine the effects of connecting those with more disparate views (Klar 2014), encouraging consumption of cross-cutting content (Agapie and Munson 2015), or simply encouraging individuals to read more diverse news by making individuals more aware of the balance of news they consume (Munson et al. 2013).

These questions are especially important in light of the fact that there are substantial opportunities for people to read more news on Facebook. The plots below illustrate the average proportion of stories shared by friends, those that are seen in News Feed, and those clicked on for liberals and conservatives in the study. Clearly there is an opportunity to read more news from either side.


Finally, we believe that reproducing, replicating, and conducting additional analyses on extant data sets is extremely important and helps generate ideas for future work (King 1995, Leeper 2015). In that spirit, we have created a Facebook Dataverse repository in the Harvard Dataverse. The repository includes replication data, scripts, as well as some additional supplementary data and code for extending our work.


E. Bakshy, S. Messing, L.A. Adamic. 2015. Exposure to ideologically diverse news and opinion on Facebook. Science.

E. Bakshy, I. Rosenn, C.A. Marlow, L.A. Adamic. 2012. The Role of Social Networks in Information Diffusion. ACM WWW 2012.

S. Messing and S.J. Westwood. 2012. Selective Exposure in the Age of Social Media: Endorsements Trump Partisan Source Affiliation When Selecting News Online. Communication Research.

P. Barberá (2015). Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis, 23(1), 76-91.

R. Bond, S. Messing. 2015. Quantifying Social Media’s Political Space: Estimating Ideology from Publicly Revealed Preferences on Facebook. American Political Science Review.

F. Elwert and C. Winship. 2014. Endogenous Selection Bias: The Problem of Conditioning on a Collider Variable. Annual Review of Sociology.

G. King, J. Pan, and M. E. Roberts. 2014. Reverse-Engineering Censorship in China: Randomized Experimentation and Participant Observation. Science.

C. Budak, S. Goel, J.M. Rao. 2014. Fair and Balanced? Quantifying Media Bias Through Crowdsourced Content Analysis. Working paper (November 17, 2014).

S. Athey, M. Mobius. The Impact of News Aggregators on Internet News Consumption: The Case of Localization. Working paper.

A. Hannak, P. Sapiezynski, A. Molavi Kakhki, B. Krishnamurthy, D. Lazer, A. Mislove, C. Wilson. 2013. Measuring personalization of web search. ACM WWW 2013.

A. Chen and A. Owen and M. Shi. Data Enriched Linear Regression. Working paper.

G.J. Martin, A. Yurukoglu. Bias in Cable News: Real Effects and Polarization. Working paper.

S.R. Flaxman, S. Goel, J.M. Rao. Filter Bubbles, Echo Chambers, and Online News Consumption. Working paper.

D.C. Mutz. 2002. The Consequences of Cross-Cutting Networks for Political Participation. American Journal of Political Science.

D.C. Mutz. 2002. Cross-cutting Social Networks: Testing Democratic Theory in Practice. American Political Science Review.

J. N. Druckman, J. Fein, & T. Leeper. 2012. A source of bias in public opinion stability. American Political Science Review.

E. Agapie, S.A. Munson. 2015. “Social Cues and Interest in Reading Political News Stories.” AAAI ICWSM 2015.

S. Klar. 2014. Partisanship in a Social Setting. American Journal of Political Science.

S.A. Munson, S.Y. Lee, P. Resnick. 2013. Encouraging Reading of Diverse Political Viewpoints with a Browser Widget. AAAI ICWSM 2013.

G. King. 1995. “Replication, Replication.” Political Science and Politics.

T. Leeper. 2015. What’s in a Name? The Concepts and Language of Replication and Reproducibility. Blog post.


When to Use Stacked Barcharts?

This article has moved:


Yesterday a few of us on Facebook’s Data Science Team released a blogpost showing how candidates are campaigning on Facebook in the 2014 U.S. midterm elections. It was picked up in the Washington Post, in which Reid Wilson calls us “data wizards.” Outstanding.

I used Hadley Wickham’s ggplot2 for every visualization in the post except a map that Arjun Wilkins produced using D3, and for the first time I used stacked bar charts.  Now as I’ve stated previously, one should generally avoid bar charts, and especially stacked bar charts, except in a few specific circumstances.

But let’s talk about when not to use stacked bar charts first—I had the pleasure of chatting with Kaiser Fung of JunkCharts fame the other day, and I think what makes his site so compelling is the mix of schadenfreude and Fremdscham that makes taking apart someone else’s mistake such an effective teaching strategy and such a memorable read. I also appreciate the subtle nod to junk art.

Here’s a typical, terrible stacked bar chart, originally published on a Wall Street Journal blogpost. It shows the share of the personal computing device market by operating system over time. The problem with using a stacked bar chart is that there are only two common baselines for comparison (the top and bottom of the plotting area), but we are interested in the relative share of more than two OS brands. The post is really concerned with Microsoft, so one solution would be to plot Microsoft versus the rest, or perhaps Microsoft on top versus Apple on the bottom with “Other” in the middle. Then we’d be able to compare the over-time market share for Apple and Microsoft. As the author points out, an over-time trend can also be visualized with line plots.

By far the worst offender I found in my 5-minute Google search was from JunkCharts and originally published on Vox. These cumulative sum plots are so bad I was surprised to see them still up. The first problem is that the plots represent an attempt to convey way too much information—either plot total sales, or pick a few key brands that are most interesting and plot them on a multi-line chart or set of faceted time series plots. The only brand for which you can quickly get a sense of sales over time is the Chevy Volt, because it’s on the baseline. I’m sure the authors also wanted to convey the proportion of sales each year, but if you want to do that, just plot the relative sales. Of course, the order in which the bars appear on the plot has no organizing principle, and you need to constantly move your eyes back and forth from the legend to the plot when trying to make sense of this monstrosity.

As Kaiser notes in his post, less is often more. Here’s his redux, which uses lines and aggregates by both quarter and brand, resulting in a far superior visualization:

So when *should* you use a stacked bar chart? Here are two scenarios with examples, inspired by work with Eytan Bakshy and conversations with Ta Chiraphadhanakul and John Myles White.

1. You care about comparing the proportion of two things, in this case the share of posts by Democrats and Republicans, along a variety of dimensions.  In this case those dimensions consist of keyword (dictionary-based) categories (above) and LDA topics (below).  When these are sorted by relative proportion, the reader gains insight into which campaign strategies and issues are used more by Republican or Democratic candidates.


2. You care about comparing proportions along an ordinal, additive variable such as 5-point party identification, along a set of dimensions.  I provide an example from a forthcoming paper below (I’ll re-insert the axis labels once it’s published).  Notice that it draws the reader toward two sets of comparisons across dimensions — one for strong Democrats and Republicans, the other for the set of *all* Democrats and *all* Republicans.


Of course, R code to produce these plots follows:

# Uncomment this line and install if necessary:
# install.packages(c("dplyr", "ggplot2", "scales"))
library(dplyr)
library(ggplot2)
library(scales)

# We start with the raw number of posts for each party for
# each candidate. Then we compute the total by party and
# category.
catsByParty <- catsByParty %>%
  group_by(party, all_cats) %>%
  summarise(tot = sum(posts))

# Next, compute the proportion by party for each category
# using dplyr::mutate
catsByParty <- catsByParty %>%
  group_by(all_cats) %>%
  mutate(prop = tot / sum(tot))

# Now compute the difference by category and order the
# categories by that difference:
catsByParty <- catsByParty %>%
  group_by(all_cats) %>%
  mutate(pdiff = diff(prop))

catsByParty$all_cats <- reorder(catsByParty$all_cats, -catsByParty$pdiff)

# And plot:
ggplot(catsByParty, aes(x = all_cats, y = prop, fill = party)) +
  scale_y_continuous(labels = percent_format()) +
  geom_bar(stat = 'identity') +
  geom_hline(yintercept = .5, linetype = 'dashed') +
  coord_flip() +
  theme_bw() +
  ylab('Democrat/Republican share of page posts') +
  xlab('') +
  scale_fill_manual(values = c('blue', 'red')) +
  theme(legend.position = 'none') +
  ggtitle('Political Issues Discussed by Party\n')

Visualization Series: Using Scatterplots and Models to Understand the Diamond Market

This article has moved:


My last post railed against the bad visualizations that people often use to plot quantitative data by groups, and pitted pie charts, bar charts and dot plots against each other for two visualization tasks.  Dot plots came out on top.  I argued that this is because humans are good at the cognitive task of comparing position along a common scale, compared to making judgements about length, area, shading, direction, angle, volume, curvature, etc.—a finding credited to Cleveland and McGill.  I enjoyed writing it and people seemed to like it, so I’m continuing my visualization series with the scatterplot.

A scatterplot is a two-dimensional plane on which we record the intersection of two measurements for a set of case items—usually two quantitative variables.  Just as humans are good at comparing position along a common scale in one dimension, our visual capabilities allow us to make fast, accurate judgements and recognize patterns when presented with a series of dots in two dimensions.  This makes the scatterplot a valuable tool for data analysts both when exploring data and when communicating results to others.  

In this post—part 1—I’ll demonstrate various uses for scatterplots and outline some strategies to help make sure key patterns are not obscured by the scale or by qualitative group-level differences in the data (e.g., the relationship between test scores and income differs for men and women). The motivation in this post is to come up with a model of diamond prices that you can use to help make sure you don’t get ripped off, specified based on insight from exploratory scatterplots combined with (somewhat) informed speculation. In part 2, I’ll discuss the use of panels aka facets aka small multiples to shed additional light on key patterns in the data, and local regression (loess) to examine central tendencies in the data. There are far fewer bad examples of this kind of visualization in the wild than the 3D barplots and pie charts mocked in my last post, though I was still able to find a nice case of MS-Excel gridline abuse and this lovely scatterplot + trend-line.

The scatterplot has a richer history than the visualizations I wrote about in my last post.  The scatterplot’s face forms a two-dimensional Cartesian coordinate system, and Descartes’ invention/discovery of this eponymous plane in 1637 represents one of the most fundamental developments in science.  The Cartesian plane unites measurement, algebra, and geometry, depicting the relationship between variables (or functions) visually. Prior to the Cartesian plane, mathematics was divided into algebra and geometry, and the unification of the two made many new developments possible.  Of course, this includes modern map-making—cartography, but the Cartesian plane was also an important step in the development of calculus, without which very little of our modern world would be possible.


Indeed, the scatterplot is a powerful tool to help understand the relationship between two variables in your data set, and especially if that relationship is non-linear. Say you want to get a sense of whether you’re getting ripped off when shopping for a diamond. You can use data on the price and characteristics of many diamonds to help figure out whether the price advertised for any given diamond is reasonable, and you can use scatterplots to help figure out how to model that data in a sensible way. Consider the important relationship between the price of a diamond and its carat weight (which corresponds to its size):


A few things pop out right away.  We can see a non-linear relationship, and we can also see that the dispersion (variance) of the relationship increases as carat size increases.  With just a quick look at a scatterplot of the data, we’ve learned two important things about the functional relationship between price and carat size. We’ve also therefore learned that running a linear model on this data as-is would be a bad idea.

Before moving forward, I want to provide a proper introduction to the diamonds data set, which I’m going to use for much of the rest of this post. Hadley’s ggplot2 ships with a data set that records the carat size and the price of more than 50 thousand diamonds, collected in 2008, and if you’re in the market for a diamond, exploring this data set can help you understand what’s in store and at what price point. This is particularly useful because each diamond is unique in a way that isn’t true of most manufactured products we are used to buying—you can’t just plug in a model number and look up the price on Amazon. And even an expert cannot incorporate as much information about price as a picture of the entire market informed by data (though there’s no substitute for qualitative expertise to make sure your diamond is what the retailer claims).

But even if you’re not looking to buy a diamond, the socioeconomic and political history of the diamond industry is fascinating.  Diamonds birthed the mining industry in South Africa, which is now by far the largest and most advanced economy in Africa.  I worked a summer in Johannesburg, and can assure you that South Africa’s cities look far more like L.A. and San Francisco than Lagos, Cairo, Mogadishu, Nairobi, or Rabat.  Diamonds drove the British and Dutch to colonize southern Africa in the first place, and have stoked conflicts ranging from the Boer Wars to modern day wars in Sierra Leone, Liberia, Côte d’Ivoire, Zimbabwe and the DRC, where the 200 carat Millennium Star diamond was sold to DeBeers at the height of the civil war in the 1990s.  Diamonds were one of the few assets that Jews could conceal from the Nazis during the “Aryanization of Jewish property” in the 1930s, and the Congressional Research Service reports that Al Qaeda has used conflict diamonds to skirt international sanctions and finance operations from the 1998 East Africa Bombings to the September 11th attacks.

Though the diamonds data set is full of prices and fairly esoteric certification ratings, hidden in the data are reflections of how a legendary marketing campaign permeated and was subsumed by our culture, hints about how different social strata responded, and insight into how the diamond market functions as a result.

The story starts in 1870 according to The Atlantic, when many tons of diamonds were discovered in South Africa near the Orange River.  Until then, diamonds were rare—only a few pounds were mined from India and Brazil each year.  At the time diamonds had no use outside of jewelry (unlike today, when they have many industrial applications), so price depended only on scarce supply.  Hence, the project’s investors formed the De Beers Cartel in 1888 to control the global price—by most accounts the most successful cartel in history, controlling 90% of the world’s diamond supply until about 2000.  But World War I and the Great Depression saw diamond sales plummet.

In 1938, according to the New York Times’ account, the De Beers cartel wrote Philadelphia ad agency N. W. Ayer & Son, to investigate whether “the use of propaganda in various forms” might jump-start diamond sales in the U.S., which looked like the only potentially viable market at the time.  Surveys showed diamonds were low on the list of priorities among most couples contemplating marriage—a luxury for the rich, “money down the drain.”  Frances Gerety, who the Times compares to Madmen’s Peggy Olson, took on the DeBeers’ account at N.W. Ayer & Son, and worked toward the company’s goal “to create a situation where almost every person pledging marriage feels compelled to acquire a diamond engagement ring.”  A few years later, she coined the slogan, “Diamonds are forever.”

The Atlantic’s Jay Epstein argues that this campaign gave birth to modern demand-advertising—the objective was not direct sales, nor brand strengthening, but simply to impress the glamour, sentiment and emotional charge contained in the product itself.  The company gave diamonds to movie stars, sent out press packages emphasizing the size of diamonds celebrities gave each other, loaned diamonds to socialites attending prominent events like the Academy Awards and Kentucky Derby, and persuaded the British royal family to wear diamonds over other gems.  The diamond was also marketed as a status symbol, to reflect “a man’s … success in life,” in ads with “the aroma of tweed, old leather and polished wood which is characteristic of a good club.”  A 1980s ad introduced the two-month benchmark: “Isn’t two months’ salary a small price to pay for something that lasts forever?”


By any reasonable measure, Frances Gerety succeeded—getting engaged means getting a diamond ring in America. Can you think of a movie where two people get engaged without a diamond ring? When you announce your engagement on Facebook, what icon does the site display?  Still think this marketing campaign might not be the most successful mass-persuasion effort in history?  I present to you a James Bond film, whose title bears the diamond cartel’s trademark:

Awe-inspiring and terrifying.  Let’s open the data set.  The first thing you should consider doing is plotting key variables against each other using the ggpairs() function.  This function plots every variable against every other, pairwise.  For a data set with as many rows as the diamonds data, you may want to sample first; otherwise things will take a long time to render.  Also, if your data set has more than about ten columns, there will be too many plotting windows, so subset on columns first.

# Uncomment this line and install if necessary:
# install.packages(c("ggplot2", "GGally"))
library(ggplot2)
library(GGally)

diasamp = diamonds[sample(nrow(diamonds), 10000), ]
ggpairs(diasamp, params = c(shape = I('.'), outlier.shape = I('.')))

* R style note: I started using the “=” operator over “<-” after reading John Mount’s post on the topic, which shows how using “<-” (but not “=”) incorrectly can result in silent errors.  There are other good reasons: 1.) WordPress and R-Bloggers occasionally mangle “<-” thinking it is HTML code in ways unpredictable to me; 2.) “=” is what every other programming language uses; and 3.) (as pointed out by Alex Foss in comments) consider “foo<-3” — did the author mean to assign foo to 3 or to compare foo to -3?  Plus, 4.) the way R interprets that expression depends on white space—and if I’m using an editor like Emacs or Sublime where I don’t have a shortcut key assigned to “<-“, I sometimes get the whitespace wrong.  This means spending extra time and brainpower on debugging, both of which are in short supply.  Anyway, here’s the plot:


What’s happening is that ggpairs is plotting each variable against the others in a pretty smart way. In the lower triangle of the plot matrix, it uses grouped histograms for qualitative-qualitative pairs and scatterplots for quantitative-quantitative pairs.  In the upper triangle, it plots grouped histograms for qualitative-qualitative pairs (using the x- instead of the y-variable as the grouping factor), boxplots for qualitative-quantitative pairs, and provides the correlation for quantitative-quantitative pairs.

What we really care about here is price, so let’s focus on that.  We can see what might be relationships between price and clarity, and between price and color, which we’ll keep in mind for later when we start modeling our data; but the critical factor driving price is the size/weight of a diamond. Yet as we saw above, the relationship between price and diamond size is non-linear. What might explain this pattern?  On the supply side, larger contiguous chunks of diamond without significant flaws are probably much harder to find than smaller ones.  This may help explain the exponential-looking curve—and I thought I noticed this when I was shopping for a diamond for my soon-to-be wife. Of course, this is related to the fact that the weight of a diamond is a function of volume, and volume is a function of x * y * z, suggesting that we might be especially interested in the cube root of carat weight.

On the demand side, customers in the market for a less expensive, smaller diamond are probably more sensitive to price than more well-to-do buyers. Many less-than-one-carat customers would surely never buy a diamond were it not for the social norm of presenting one when proposing.  And, there are *fewer* consumers who can afford a diamond larger than one carat.  Hence, we shouldn’t expect the market for bigger diamonds to be as competitive as that for smaller ones, so it makes sense that the variance as well as the price would increase with carat size.

Often the distribution of any monetary variable will be highly skewed and vary over orders of magnitude. This can result from path-dependence (e.g., the rich get richer) and/or the multiplicative processes (e.g., year-on-year inflation) that produce the ultimate price/dollar amount. Hence, it’s a good idea to look into compressing any such variable by putting it on a log scale (for more, take a look at this guest post on Tal Galili’s blog).

p = qplot(price, data=diamonds, binwidth=100) +
    theme_bw() +
    ggtitle('Price')
p

p = qplot(price, data=diamonds, binwidth = 0.01) +
    scale_x_log10() +
    theme_bw() +
    ggtitle('Price (log10)')
p


Indeed, we can see that the prices for diamonds are heavily skewed, but when put on a log10 scale seem much better behaved (i.e., closer to the bell curve of a normal distribution).  In fact, we can see that the data show some evidence of bimodality on the log10 scale, consistent with our two-class, “rich-buyer, poor-buyer” speculation about the nature of customers for diamonds.

Let’s re-plot our data, but now let’s put price on a log10 scale:

library(scales)  # for log10_trans() and trans_new()

p = qplot(carat, price, data=diamonds) +
    scale_y_continuous(trans=log10_trans()) +
    theme_bw() +
    ggtitle('Price (log10) by Carat')
p


Better, though still a little funky—let’s try using the cube-root of carat as we speculated about above:

cubroot_trans = function() trans_new('cubroot',
    transform = function(x) x^(1/3),
    inverse = function(x) x^3)

p = qplot(carat, price, data=diamonds) +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() +
    ggtitle('Price (log10) by Cubed-Root of Carat')


Nice, looks like an almost-linear relationship after applying the transformations above to get our variables on a nice scale.


Note that until now I haven’t done anything about overplotting—where multiple points take on the same value, often due to rounding.  Indeed, price is rounded to dollars and carats are rounded to two digits.  Not bad, though when we’ve got this much data we’re going to have some serious overplotting.

head(sort(table(diamonds$carat), decreasing=TRUE))
#  0.3 0.31 1.01  0.7 0.32    1 
# 2604 2249 2242 1981 1840 1558 

head(sort(table(diamonds$price), decreasing=TRUE))
# 605 802 625 828 776 698 
# 132 127 126 125 124 121 

Often you can deal with this by making your points smaller, using “jittering” to randomly shift points to make multiple points visible, and using transparency, which can be done in ggplot using the “alpha” parameter.

p = ggplot( data=diamonds, aes(carat, price)) +
    geom_point(alpha = 0.5, size = .75, position='jitter') +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() +
    ggtitle('Price (log10) by Cubed-Root of Carat')


This gives us a better sense of how dense and sparse our data are at key places.

Using Color to Understand Qualitative Factors

When I was looking around at diamonds, I also noticed that clarity seemed to factor into price.  Of course, many consumers are looking for a diamond of a certain size, so we shouldn’t expect clarity to be as strong a factor as carat weight.  And I must admit that even though my grandparents were jewelers, I initially had a hard time discerning a diamond rated VVS1 from one rated SI2.  Surely most people need a loupe to tell the difference.  And, according to BlueNile, the cut of a diamond has a much more consequential impact on that “fiery” quality that jewelers describe as the quintessential characteristic of a diamond.  On clarity, the website states, “Many of these imperfections are microscopic, and do not affect a diamond’s beauty in any discernible way.” Yet, clarity seems to explain an awful lot of the remaining variance in price when we visualize it as a color on our plot:

p = ggplot( data=diamonds, aes(carat, price, colour=clarity)) +
    geom_point(alpha = 0.5, size = .75, position='jitter') +
    scale_colour_brewer(type = 'div',
        guide = guide_legend(title = NULL, reverse=T,
            override.aes = list(alpha = 1))) +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() + theme(legend.key = element_blank()) +
    ggtitle('Price (log10) by Cubed-Root of Carat and Color')


Despite what BlueNile says, we don’t see as much variation on cut (though most diamonds in this data set are ideal cut anyway):

p = ggplot( data=diamonds, aes(carat, price, colour=cut)) +
    geom_point(alpha = 0.5, size = .75, position='jitter') +
    scale_colour_brewer(type = 'div',
        guide = guide_legend(title = NULL, reverse=T,
            override.aes = list(alpha = 1))) +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() + theme(legend.key = element_blank()) +
    ggtitle('Price (log10) by Cube-Root of Carat and Cut')


Color seems to explain some of the variance in price as well, though BlueNile states that all color grades from D-J are basically not noticeable.

p = ggplot( data=diamonds, aes(carat, price, colour=color)) +
    geom_point(alpha = 0.5, size = .75, position='jitter') +
    scale_colour_brewer(type = 'div',
        guide = guide_legend(title = NULL, reverse=T,
            override.aes = list(alpha = 1))) +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() + theme(legend.key = element_blank()) +
    ggtitle('Price (log10) by Cube-Root of Carat and Color')


At this point, we’ve got a pretty good idea of how we might model price. But there are a few problems with our 2008 data—not only do we need to account for inflation, but the diamond market is quite different now than it was in 2008. In fact, when I fit models to this data and then attempted to predict the price of diamonds I found on the market, I kept getting predictions that were far too low. After some additional digging, I found the Global Diamond Report. It turns out that prices plummeted in 2008 due to the global financial crisis, and since then prices (at least for wholesale polished diamonds) have grown at roughly a 6 percent compound annual rate. The rapidly growing number of couples in China buying diamond engagement rings might also help explain this increase. After looking at data on PriceScope, I realized that diamond prices grew unevenly across different carat sizes, meaning that the model I initially estimated couldn’t simply be adjusted for inflation. While I could have done OK with that model, I really wanted to estimate a new model based on fresh data.
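As a back-of-the-envelope illustration of compounding at that rate (the function and the $5,000 example are mine, not figures from the report):

```r
# Carry a price forward at a compound annual growth rate
cagr_adjust = function(price, years, rate = 0.06) {
    price * (1 + rate)^years
}

# A $5,000 diamond price from 2008, compounded forward six years:
round(cagr_adjust(5000, 6), 2)  # about $7,093
```

A uniform adjustment like this is exactly what turned out not to work here, since prices grew unevenly across carat sizes—hence the fresh data.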

Thankfully I was able to put together a Python script to scrape fresh diamond data without too much trouble. This dataset is about 10 times the size of the 2008 diamonds data set and features diamonds from all over the world certified by an array of authorities besides just the Gemological Institute of America (GIA). You can read in this data as follows (be forewarned—it’s over 500K rows):

library(RCurl)  # for getBinaryURL()
diamondsurl = getBinaryURL('')
load(rawConnection(diamondsurl))

My github repository has the code necessary to replicate each of the figures above—most look quite similar, though this data set contains much more expensive diamonds than the original. Regardless of whether you’re using the original diamonds data set or the current larger diamonds data set, you can estimate a model based on what we learned from our scatterplots. We’ll regress log price on the cube root of carat, carat, clarity, cut, and color. I’m using only GIA-certified diamonds in this model and looking only at diamonds under $10K because these are the type of diamonds sold at most retailers I’ve seen and hence the kind I care most about. By trimming the most expensive diamonds from the dataset, our model will also be less likely to be thrown off by outliers at the high end of price and carat. The new data set has mostly the same columns as the old one, so we can just run the following (if you want to run it on the old data set, just set data=diamonds).

diamondsbig$logprice = log(diamondsbig$price)
m1 = lm(logprice ~ I(carat^(1/3)),
    data=diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 'GIA',])
m2 = update(m1, ~ . + carat)
m3 = update(m2, ~ . + cut)
m4 = update(m3, ~ . + color + clarity)

library(memisc)  # for mtable()
mtable(m1, m2, m3, m4)

Here are the results for my recently scraped data set:

                    m1          m2          m3          m4     
(Intercept)       2.671***    1.333***    0.949***   -0.464*** 
                 (0.003)     (0.012)     (0.012)     (0.009)   
I(carat^(1/3))    5.839***    8.243***    8.633***    8.320*** 
                 (0.004)     (0.022)     (0.021)     (0.012)   
carat                        -1.061***   -1.223***   -0.763*** 
                             (0.009)     (0.009)     (0.005)   
cut: V.Good                               0.120***    0.071*** 
                                         (0.002)     (0.001)   
cut: Ideal                                0.211***    0.131*** 
                                         (0.002)     (0.001)   
color: K/L                                            0.117*** 
color: J/L                                            0.318*** 
color: I/L                                            0.469*** 
color: H/L                                            0.602*** 
color: G/L                                            0.665*** 
color: F/L                                            0.723*** 
color: E/L                                            0.756*** 
color: D/L                                            0.827*** 
clarity: I1                                           0.301*** 
clarity: SI2                                          0.607*** 
clarity: SI1                                          0.727*** 
clarity: VS2                                          0.836*** 
clarity: VS1                                          0.891*** 
clarity: VVS2                                         0.935*** 
clarity: VVS1                                         0.995*** 
clarity: IF                                           1.052*** 
R-squared             0.888       0.892      0.899        0.969
N                338946      338946     338946       338946    

Now those are some very nice R-squared values—we are accounting for almost all of the variance in price with the 4Cs.  If we want to know whether the price for a diamond is reasonable, we can now use this model and exponentiate the result (since we took the log of price).  We need to multiply the result by exp(sigma^2/2), because the error term is no longer zero in expectation once we exponentiate:

\displaystyle ~\\
E(\log(y) \mid \mathbf{X} = \mathbf{x}) = E(\mathbf{X}\beta + \epsilon)\\
~~~~~~~E(y \mid \mathbf{X} = \mathbf{x}) = E(\exp(\mathbf{X}\beta + \epsilon))\\
~~~~~~~~~~~~~~~~~~~~~= E(\exp(\mathbf{X}\beta) \times \exp(\epsilon))\\
~~~~~~~~~~~~~~~~~~~~~= E(\exp(\mathbf{X}\beta)) \times E(\exp(\epsilon))\\
~~~~~~~~~~~~~~~~~~~~~= \exp(\mathbf{X}\hat\beta) \times \exp\left(\frac{\hat\sigma^2}{2}\right)

To dig further into that last step, have a look at the Wikipedia page on log-normal distributed variables.

Thanks to Miguel for catching this.

Let’s take a look at an example from Blue Nile. I’ll use the full model, m4.

# Example from BlueNile
# Round 1.00 Very Good I VS1 $5,601

thisDiamond = data.frame(carat = 1.00, cut = 'V.Good', color = 'I', clarity='VS1')
modEst = predict(m4, newdata = thisDiamond, interval='prediction', level = .95)
exp(modEst) * exp(summary(m4)$sigma^2/2)

The results yield an expected value for price given the characteristics of our diamond, plus the upper and lower bounds of a 95% CI—note that because this is a linear model, predict() is just multiplying each model coefficient by each value in our data. It turns out that this diamond is a touch pricier than the expected value under the full model, though it is by no means outside our 95% CI. BlueNile, however, has by most accounts a stronger reputation than many other diamond sellers, and reputation is worth a lot in a business that relies on easy-to-forge certificates and in which the non-expert can be easily fooled.
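To see that predict() really is just a matrix product for a linear model, here’s a toy check using the built-in cars data rather than our diamonds fit:

```r
# For an lm, the prediction is the design-matrix row times the coefficients
m = lm(dist ~ speed, data = cars)
newd = data.frame(speed = 10)

by_hand = sum(coef(m) * c(1, newd$speed))  # intercept * 1 + slope * 10
isTRUE(all.equal(unname(predict(m, newdata = newd)), by_hand))  # TRUE
```

The same logic applies to m4, just with more columns in the design matrix (the cube root of carat, carat, and the dummy-coded cut, color, and clarity levels).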

This illustrates an important point about generalizing a model from one data set to another. First, there may be important differences between data sets—as I’ve speculated about above—making the estimates systematically biased. Second, overfitting—our model may be fitting noise present in one data set. Even a model cross-validated against out-of-sample predictions can be over-fit to noise that results in differences between data sets. Of course, while this model may give you a sense of whether your diamond is a rip-off relative to the diamonds in this data set, it’s not clear that it should be regarded as a source of universal truth about whether the price of a diamond is reasonable. Nonetheless, having the expected price along with a 95% interval is a lot more information than we had about the price we should be willing to pay for a diamond before we started this exercise.

An important point—even though we can predict prices almost perfectly based on a function of the 4c’s, one thing that you should NOT conclude from this exercise is that *where* you buy your diamond is irrelevant, which apparently used to be conventional wisdom in some circles.  You will almost surely pay more if you buy the same diamond at Tiffany’s versus Costco.  But Costco sells some pricy diamonds as well. Regardless, you can use this kind of model to give you an indication of whether you’re overpaying.

Of course, the value of a natural diamond is largely socially constructed. Like money, diamonds are only valuable because society says they are—there are no obvious economic efficiencies to be gained or return on investment in a diamond, except perhaps in a very subjective sense concerning your relationship with your significant other. To get a sense for just how much value is socially constructed, you can compare the price of a natural diamond to a synthetic diamond, which, thanks to recent technological developments, is of comparable quality to a “natural” diamond. Of course, natural diamonds fetch a dramatically higher price.

One last thing—there are few guarantees in life, and I offer none here. Though what we have here seems pretty good, data and models are never infallible, and obviously you can still get taken (or be persuaded to pass on a great deal) based on this model. Always shop with a reputable dealer, and make sure her incentives are aligned against selling you an overpriced diamond or, worse, one that doesn’t match its certificate. There’s no substitute for establishing a personal connection and lasting business relationship with an established jeweler you can trust.

One Final Consideration

Plotting your data can help you understand it and can yield key insights.  But even scatterplot visualizations can be deceptive if you’re not careful.  Consider another data set that comes with the alr3 package—soil temperature data from Mitchell, Nebraska, collected by Kenneth G. Hubbard from 1976 to 1992—which I came across in Weisberg, S. (2005). Applied Linear Regression, 3rd edition. New York: Wiley (from which I’ve shamelessly stolen this example).

Let’s plot the data, naively:

library(alr3)  # provides the Mitchell data
qplot(Month, Temp, data = Mitchell) + theme_bw()


Looks kinda like noise.  What’s the story here?

When all else fails, think about it.

What’s on the X axis?  Month.  What’s on the Y-axis?  Temperature.  Hmm, well there are seasons in Nebraska, so temperature should fluctuate every 12 months.  But we’ve put more than 200 months in a pretty tight space.

Let’s stretch it out and see how it looks:
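One way to stretch it (the 1/10 aspect ratio below is just my arbitrary choice) is to set the panel’s aspect ratio in theme():

```r
# install.packages(c('alr3', 'ggplot2'))  # if needed
library(alr3)     # provides the Mitchell soil-temperature data
library(ggplot2)

# A short, wide panel gives the x-axis room to breathe,
# so the 12-month seasonal cycle stops looking like noise
p = qplot(Month, Temp, data = Mitchell) +
    theme_bw() +
    theme(aspect.ratio = 1/10)
p
```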


Don’t make that mistake.

That concludes part I of this series on scatterplots.  Part II will illustrate the advantages of using facets/panels/small multiples, and show how tools to fit trendlines, including linear regression and local regression (loess), can help yield additional insight about your data. You can also learn more about exploratory data analysis via this Udacity course taught by my colleagues Dean Eckles, Moira Burke, and Chris Saden, which will be coming out in the next few weeks (right now there’s no content up on the page).


Streamline Your Mechanical Turk Workflow with MTurkR

I’ve been using Thomas Leeper‘s MTurkR package to administer my most recent Mechanical Turk study—an extension of work on representative-constituent communication claiming credit for pork benefits, with Justin Grimmer and Sean Westwood.  MTurkR is excellent, making it quick and easy to:

  • test a project in your MTurk sandbox
  • deploy a project live
  • download the completed HIT data from Amazon
  • issue credit to workers after you run your own validation checks in R
  • issue bonuses en-masse
  • all programmatically via the API.

In this post I’ll walk you through the set-up process and a couple of basic examples.  But first a few words to motivate your interest in Mechanical Turk.

On Mechanical Turk and Social Science

Mechanical Turk has proven to be a boon for social science research.  I discussed how it’s a great way to attain data based on human judgements (e.g., content analysis, coding images, etc.) in my last post on creating labels for supervised text classification.

It’s also a great source of experimental subjects in a field often plagued by small samples, which have lower power and higher false discovery rates (especially when researchers exploit “undisclosed flexibility” in their design and analysis).  Mechanical Turk provides a large, geographically and demographically diverse sample of participants, freeing researchers from the confines of undergraduate samples, which can respond differently than other key populations.

Mechanical Turk also provides a much lower-cost experimental platform.  Compared to a national survey research firm, the cost per subject is dramatically lower, and most survey firms simply do not allow the same design flexibility, generally limiting you to question-wording manipulations.  Mechanical Turk is also cheaper than running a lab study in terms of the researcher’s time, or that of her graduate students.  This is true even for web studies—just tracking undergraduate participants and figuring out who should get course credit is a mind-numbing, error-prone, and frustrating process.  Mechanical Turk makes it easy for the experimenter to manage data collection, perform validation checks, and quickly disburse payment to subjects who actually participated.  I’ll show you how to streamline this process to automate as much as possible in the next section.

But are results from Mechanical Turk valid, and will I be able to publish on Mturk data?  Evidence that the answer is “yes” accumulates.  Berinsky, Huber, and Lenz (2012) published a validation study in Political Analysis that replicates important published experimental work using the platform.  Furthermore, Justin Grimmer, Sean Westwood and I published a study in the American Political Science Review on how people respond when congressional representatives announce political pork, in which we provide extensive Mechanical Turk validation in the online appendix.  We provide demographic comparisons to the U.S. Census, show the regional distribution of Turkers by state and congressional district, and show that correlations between political variables are about the same as in national survey samples.  Sean and I also published a piece in Communication Research on how social cues drive the news we consume, which among other things replicates past findings on partisans’ preferences for news with a partisan slant.

But are Turkers paying enough attention to actually process my stimuli?  It’s hard to know for sure, but there is an array of questions one can deploy that are designed to make sure Turkers are paying attention.  When my group runs Mturk studies, we typically ask a few no-brainer questions to qualify at the beginning of any task (e.g., 5 + 7 = 13, True or False?), then bury some specific instructions in a paragraph of text toward the end (e.g., bla bla bla, write “I love Mechanical Turk” in the box asking if you have any comments).

But we are paying Turkers—doesn’t that make them more inclined toward demand effects?  Well, compared to what? Compared to a student who needs class credit or a paid subject?  On the contrary, running a study on the web ensures that researchers in a lab do not provide non-verbal cues (of which they are unaware) that can give rise to demand effects.  Of course, in any experimental setting it’s important to think carefully about potential demand effects, especially when deploying noticeable/obvious manipulations, and/or when you have a within-subject design such that respondents observe multiple conditions.


Install the most recent version of MTurkR from Tom’s github repo.  You can use Hadley’s excellent devtools package to install it:

# install.packages("devtools")
library(devtools)
install_github(repo = "MTurkR", username = "leeper")

If you don’t have a Mechanical Turk account, get one.  Next you’ll want to set up your account, fund it, and so on—there are a lot of good walkthroughs online, so just google it if you have trouble.

Also, set up your sandbox so you can test things before you actually spend real money.

You’ll also need to get your AWS Access Key ID and AWS Secret Access Key to get access to the API.  Click on “Show” under “Secret Access Key” to actually see it.

I believe you can use the MTurkR package to design a questionnaire and handle randomization, but that’s beyond the scope of this post—for now just create a project as you normally would via the GUI.


First, make sure your credentials work correctly and check your account balance (replace the dummy credentials with your own):
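Here’s a minimal sketch of that check using MTurkR’s credentials() and AccountBalance(); the key strings are placeholders for your own:

```r
library(MTurkR)

# Register your AWS keypair with MTurkR (placeholders -- use your own keys)
credentials(c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"))

# If the credentials are valid, this returns your current account balance
AccountBalance()
```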


Great, now R is talking to the API.  Now you can actually create HITs.  Let’s start in your sandbox.  Create a project in your sandbox, or just copy a project you’ve already created—copy the html in the design layout screen from your production account to your sandbox account.


Now you need to know the HIT layout id.  Make sure you are in the “Create” tab in the requester interface, and click on your Project Name.  This window will pop up:


Copy and paste the Layout ID in your R script, you’ll use it in a second.  With the Layout ID in hand, you can create HITs based on this layout.  Let’s create a HIT in our sandbox first.  We’ll first set the qualifications that Mechanical Turkers must meet, then tell the API to create a new HIT in the sandbox with 1200 assignments for $.50 each.  Modify the parameters below to fit your needs.

# First set qualifications
# ListQualificationTypes() to see different qual types
qualReqs = paste(

	# Set Location to US only
	GenerateQualificationRequirement(
		"Locale", "==", "US", qual.number=1),

	# Worker_PercentAssignmentsApproved
	GenerateQualificationRequirement(
		"000000000000000000L0", ">", "80",
		qual.number=2),

	# Un-comment after sandbox test
	# Worker_NumberHITsApproved
	# GenerateQualificationRequirement(
	#	"00000000000000000040", ">", "100",
	#	qual.number=3),

	sep="" )

# Create new batch of hits:
newHIT = CreateHIT(

	# layoutid in sandbox (paste your own sandbox Layout ID here):
	hitlayoutid="",

	# layoutid in production:
	# hitlayoutid="2C9X7H57DZKPHWJIU98DS25L8N41BW",

	annotation = "HET Experiment with Pre-Screen",
	assignments = "1200",
	title="Rate this hypothetical representative",
	description="It's easy, just rate this
		hypothetical representative on how well
		she delivers funds to his district",
	keywords="survey, question, answers, research,
			politics, opinion",
	reward=".50",
	duration=seconds(days=4),
	qual.req=qualReqs,
	sandbox=TRUE)
To check to make sure everything worked as intended, go to your worker sandbox and search for the HIT you just created.  See it? Great.

Here’s how to check on the status of your HIT:

# Get HITId (record result below)
newHIT$HITId
# "2C2CJ011K274LPOO4SX1EN488TRCAG"

HITStatus(hit="2C2CJ011K274LPOO4SX1EN488TRCAG")

And now here’s where the really awesome and time-saving bit comes in, downloading the most recent results.  If you don’t use the API, you have to do this manually from the website GUI, which is a pain.  Here’s the code:

review = GetAssignments(hit="2C2CJ011K274LPOO4SX1EN488TRCAG",
	status="Submitted", return.all=T)

Now you probably want to actually run your study with real subjects.  You can copy your project design HTML back to the production site and repeat the above for production when you are ready to launch your HIT.

Often, Turkers will notice something about your study and point it out, hoping that it will prove useful and you will grant them a bonus.  If they mention something helpful, grant them a bonus!  Here’s how (replace dummy workerid with the id for the worker to whom you wish to grant a bonus):

## Grant bonus to worker who provided helpful comment:
bonus = GrantBonus(workers="A2VDVPRPXV3N59",
	assignments=review$AssignmentId[review$WorkerId=="A2VDVPRPXV3N59"],
	amounts=".25",  # example amount
	reasons="Thanks for the feedback!")

You’ll also save time when you go to validate and approve HITs, which MTurkR allows you to do from R.  My group usually uses Qualtrics for questionnaires that accompany our studies where we build in the attention check described above.  Here’s how to automate the attention check and approve HITs for those who passed:

svdat = read.csv(unzip("data/"), skip=1,
	stringsAsFactors=FALSE)
# 'comments' below stands in for the free-text comment column in your export:
approv = agrep("I love Mechanical Turk", svdat$comments,
	max.distance=.3)

correct = review$AssignmentId[
	gsub(" ", "", review$confcode) %in% svdat$GUID[approv] ]

# To approve:
approve = approve(assignments=correct)

Note that “agrep()” is an approximate matching function which I use in case Turkers add quotes or other miscellaneous text that shouldn’t disqualify their work.
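For instance (the comment strings here are made up):

```r
comments = c("'I love Mechanical Turk!'",  # stray quotes, exact phrase inside
             "i love mechanicl turk",      # typo, but within the edit distance
             "great HIT, thanks")          # not the requested phrase
agrep("I love Mechanical Turk", comments,
      max.distance = 0.3, ignore.case = TRUE)  # matches elements 1 and 2
```

Here max.distance = 0.3 allows edits of up to 30% of the pattern’s length, so quotes and small typos don’t cost a Turker their payment.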

But that’s not all—suppose we want to check the average time per HIT (to make sure we are paying Turkers enough) and/or start analyzing our data. We can do this via the API with MTurkR rather than downloading things from the web.

# check average time:
review = GetAssignments(hit="2C2CJ011K274LPOO4SX1EN488TRCAG",
	status="Approved", return.all=T)

# Check distribution of time each Turker takes
# (AcceptTime/SubmitTime come back from the API):
took = as.POSIXct(review$SubmitTime) - as.POSIXct(review$AcceptTime)
summary(as.numeric(took, units = "mins"))
If we discover we aren’t paying Turkers enough in light of how long they spend on the HIT (which, in addition to being morally wrong, means that we will probably not collect a sufficient number of responses in a timely fashion), MTurkR allows us to quickly remedy the situation.  We can expire the HIT, grant bonuses to those who completed the low-pay HIT, and relaunch.  Here’s how:

# Low pay hits:
lowPayHIT = "2JV21O3W5XH0L74WWYBYKPLU3XABH0"
# Review remaining HITS and approve:
review = GetAssignments(hit="2JV21O3W5XH0L74WWYBYKPLU3XABH0",
status="Submitted", return.all=T)

correct = review$AssignmentId[ which( gsub(" ", "", review$confcode) %in% svdat$GUID[approv] ) ]

approve = approve(assignments=correct)

# Expire remaining:
ExpireHIT(hit="2JV21O3W5XH0L74WWYBYKPLU3XABH0")
approvedLowMoneyHits = GetAssignments(hit="2JV21O3W5XH0L74WWYBYKPLU3XABH0",
	status="Approved", return.all=T)

# Grant bonus to workers who completed hit already:
bonus = GrantBonus(workers=approvedLowMoneyHits$WorkerId,
	amounts=".25",  # example amount: the difference in pay
	reasons="Upping the pay for this HIT, thanks for completing it!")

We can now run the administrative side of our Mechanical Turk HITs from R directly.  It’s a much more efficient workflow.


Generating Labels for Supervised Text Classification using CAT and R


The explosion in the availability of text has opened new opportunities to exploit text as data for research. As Justin Grimmer and Brandon Stewart discuss in the above paper, there are a number of approaches to reducing human text to data, with various levels of computational sophistication and human input required. In this post, I’ll explain how to use the Coding Analysis Toolkit (CAT) to help you collect human evaluations of documents, which is a necessary part of many text analyses, and especially so when you have a specific research question that entails precisely characterizing whether a particular document contains a particular type of content. CAT facilitates fast data entry and handles data management when you have multiple human coders. Its default output can be tricky to deal with, however, so I’ll also provide R code to extract useable data from CAT’s XML output, which should serve as a good intro to data munging with XML for the uninitiated. I’ll also show you how to compute metrics that will help diagnose the reliability of your coding system, which entails using the melt and cast functionality in Hadley’s ‘reshape’ package to get the data in the right shape, then feeding the results to the ‘irr’ package.

In future posts, I’ll explain how to use these labels to train various machine learning algorithms aka classification models to automatically classify documents in a large corpus. I’ll talk about how to extract features using R and the ‘tm’ package; the problem of classification in high-dimensional spaces (e.g., there are many many words in natural language) and how we can exploit the bias-variance tradeoff to get traction on this problem; the various models that are generally well suited for text classification like the lasso, elastic net, SVMs and random forests; the importance of properly tuning these models; and how to use cross-validation to avoid overfitting these models to your data. (For a preview check out my slides and R labs for my short course on Analyzing Text as Data that I presented at the Stanford Computational Social Science Workshop)

Is human categorization/content analysis the right solution?

Social scientists have been trying to reduce the complexities and subjectivities of human language to objective data for a long time, calling it “content analysis.” It is no easy task. If you decide to use human coders, I suggest you read Kim Neuendorf’s book, and you can find some nice resources on her website that may prove helpful. You’d also do well to read Krippendorff’s eponymous classic.

If you are trying to characterize the types of things that occur in your documents, or to discover categories in them, it might make more sense to go with a fully computational approach, employing unsupervised machine learning methods that cluster documents based on word features (e.g., counts of single words, or unigrams, and of word combinations, or N-grams). You can take a look at the Grimmer and Stewart paper above for more details on this approach, or check out the notes from lectures 5–7 of Justin’s course on the topic. If you are interested in additional computational approaches/features, have a look at Dan Jurafsky’s course From Language to Information.
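To make the unsupervised route concrete, here is a minimal sketch using base R’s kmeans() on a tiny invented document-term matrix — in practice you would build the matrix from your corpus (e.g., with the ‘tm’ package), and the word counts here are made up purely for illustration:

```r
# Toy document-term count matrix (rows = documents, columns = words);
# in practice, build one from your corpus with the 'tm' package.
dtm <- matrix(
  c(5, 0, 1,
    4, 1, 0,
    0, 6, 2,
    1, 5, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("budget", "earmark", "reform"))
)

set.seed(1)                        # k-means uses random starting centers
fit <- kmeans(dtm, centers = 2, nstart = 5)
split(rownames(dtm), fit$cluster)  # documents grouped by cluster
```

With real text you would typically weight the counts (e.g., tf-idf) and try several values of k, but the workflow is the same: featurize, cluster, then inspect which documents land together.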

But if you have a specific research question in mind, which entails precisely characterizing whether a particular document contains a particular type of content, you probably need people to read and classify each document according to a coding scheme. Of course, this can become expensive or impossible if you are dealing with a large corpus of documents. If so, you can first have people classify a sample of documents, then use those classifications as labels to train supervised classifiers. Once you have a classifier that performs well, you can use it to classify the rest of the documents in your large data set. I have put together some slides on this process, along with an R lab with example code. Also of interest may be some slides I put together on acquiring and pre-processing text data from the web with R, along with computational labs on interacting with APIs and scraping/regex.

If instead you care about generalizing from your sample to the proportion of documents in each category in your population of documents, check out the R package ReadMe, which implements “A Method of Automated Nonparametric Content Analysis for Social Science”; an industrial implementation of the method is offered by the big-data social media analytics company Crimson Hexagon. If you decide on this approach, you’ll still need human-labeled categories to start, so keep reading.

Is CAT the right solution for your categorization task?

Before we get to CAT, I want to talk about human classification with Amazon’s Mechanical Turk, which is great when the classification task is simple and the documents/units of text to classify are short. Mturk is especially useful when the categorization task becomes mind-numbingly boring with repetition, because when one Turker burns out on your task and stops working, fresh Turkers can continue the work. Hence, the main advantage of Mturk is that you can get simple classification tasks done much more quickly than by relying on research assistants/undergraduates/employees/etc.—in fact, I’ve used Mturk to classify thousands of short open responses to the question “What’s the most important issue facing the nation?” into Gallup’s categories in a matter of hours.

Often overlooked are the nice interfaces that Mturk provides both to its requesters (e.g., you), which make data management far easier than keeping track of Excel spreadsheets/Google Docs edited by multiple coders, and to its workers, which translate to fewer mouse clicks, less fatigue, and probably lower error rates. Panos Ipeirotis has some nice slides (the best material is in slides 30–50) plus open-source code to help ensure this kind of crowd-sourced data is of the highest quality.

But often the classification task is complicated and people need training to do it correctly and efficiently. I’d direct you to the various links above for guidance on building codebooks for complicated classification schemes and training coders. A great human-coder system is well conceptualized, highly relevant to the corpus in question, and contains crystal-clear instructions for your coders—providing flow charts and diagrams seems to be especially helpful. When Justin, Sean, and I were implementing a coding protocol recently, we used this flow chart (from this LaTeX file) to complement our actual codebook.

Why is CAT better?

My experiences with Mechanical Turk spoiled me—all of the complexities of dealing with multiple coders entering data and managing that data were abstracted away by the system that Mturk has in place. What I’d done in the past—having analysts enter data in a Google Doc or worse, MS-Excel—was keystroke and mouse-click intensive, which meant it was time-consuming and error-prone for coders when they were entering data, and for me when I was merging/cleaning data from multiple coders.

CAT is the best solution I’ve come across yet. Its interface isn’t aesthetically perfect, but it gets the job done well. It minimizes keystrokes and requires no mouse clicks for data entry, so you’ll see your coders work faster and probably happier. It maintains your data, alleviating the need to manage spreadsheets and concerns about coders making errors while transcribing codes into a spreadsheet. Because it also handles the back end of things, there’s no need to maintain your own servers/SQL database/etc. But it’s open source, so if you need to use your own servers, you can download the source and set it up yourself.

Getting Started

Head over to the CAT website and register for an account. Maybe poke around a bit and familiarize yourself with the menus. When you’re ready to upload a data set, go to Datasets –> Upload Raw Data Set. CAT wants a .zip file with each of your documents in an individual .txt file.

If you have your text in a vector in R, say called text_vector, you can output each element to individual .txt files as follows:

for (i in seq_along(text_vector)) {
  capture.output(text_vector[i], file = paste("doc_number_", i, ".txt", sep = ""))
}
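One caveat: capture.output() writes R’s print representation of each string (including the leading “[1]” index and surrounding quotes). If you want the raw text in each file instead, a writeLines() variant works — a sketch with invented toy data:

```r
# Toy data standing in for your real text_vector
text_vector <- c("First press release.", "Second press release.")

# Write the raw text of each element to its own .txt file
for (i in seq_along(text_vector)) {
  writeLines(text_vector[i], con = paste0("doc_number_", i, ".txt"))
}
```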

Next put these files in a zip archive and upload to CAT. You can upload another file that specifies the coding scheme, but it’s probably easier just to make the coding scheme using the interface on the website later.

When you’re done, go to Datasets –> View Raw Datasets and click on the dataset you just uploaded. From this page, you can manage coders associated with the data sets. If you click on Manage Sub-Accounts, you can easily create new accounts for your coders. Add yourself and other coders to the dataset and be sure to click the “Set Chosen Coders” button when you’re done.

Next, implement your carefully constructed coding scheme. From the same “View Raw Dataset” page, click on “Add or modify codes in the dataset” (on the right, under “Toolbox”). Add a sensible name for each code and enter a shortcut key—this lets your coders hit a single key to code a document and move on to the next. When you’re done, hit Finished.

I highly recommend you test out your coding scheme yourself. You’ll also probably want to consult your coders and iteratively tweak your codebook according to qualitative input from your coders (this is discussed in full in the content analysis links above).

Then set your coders loose on a hundred documents or so.

Exporting Data from CAT

This process is a bit more complicated than it sounds. If you download data in CSV format, it comes out jagged (i.e., not rectangular), and hence it’s not immediately usable in R. The .CSV file becomes especially convoluted if you change your code labels, add coders, etc.

Better to just deal with the XML output, so I’ll introduce data munging/ETL with XML below. XML is like HTML, but its tags describe data rather than formatting; we’ll use those tags to build queries that return the data we want. XML is semi-structured data with a general tree structure, rather than the rectangular structure that R likes. The tree structure saves space and is highly flexible, though it can be hard to work with at first. If you’ve never seen XML before, you’d do well to check out Jennifer Widom’s excellent lecture on XML in her Coursera course. XML is also used in various useful data APIs, which you can learn more about in Sean Westwood’s short course on the topic.

To get the XML output for your project, from the “View Raw Dataset” page, select “Download Coded Text File (XML Format)” from the drop-down menu and then click on “Download Data” (on the right under “Toolbox”).

Here’s how to read in and clean the resulting file. You need this step because (1) R wants the encoding to be utf-8 but the CAT file says it’s utf-16, and (2) the XML package doesn’t like “&#x0;” strings (the character reference for the null character), which frequently occur in the CAT output. Thanks to Jannike Ballo for pointing out that what I was doing previously didn’t work and helping me figure this out.

doc <- readLines("")  # path to the XML file you downloaded from CAT

# check out what's in the file:
head(doc)

# Fix utf-8 issue:
doc <- gsub("utf-16", "utf-8", doc)

# Remove null-character references, which the XML package rejects:
grep("&#x0;", doc, fixed = TRUE)

doc <- gsub("&#x0;", "", doc, fixed = TRUE)

First a bit of background on this particular data. These represent a small subsample of documents that Justin Grimmer, Sean Westwood and I were putting together for a project looking at how congressional representatives claim credit for expenditures in their district. We were most concerned with identifying press releases that claimed credit for an expenditure in the district (CC_expenditure). But we also wanted to code items that explicitly criticized earmarks or advocated for earmark reform (Egregious); items that were speaking to local constituents—advertising constituent service that the candidate performed or community events the candidate attended to build name recognition (Adv_par_const); and items that were explicitly taking a national policy position or speaking to non-local audiences (Position_taking_other).

Have a look at the file above in an editor so you get a sense of its structure. The file first provides a lot of metadata: information about the project, about each of the codes used, and about the coders, followed by the full text of each document. After the full text comes the actual data we want—how each item (paragraphId) was coded (codeId) and by which coder (coderId). It looks like this:
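(The original sample didn’t survive formatting, so here is a rough sketch of that portion’s structure. The paragraphId, codeId, and coderId tag names come from the XPath queries used later; the wrapper element name and the ID values are invented for illustration:)

```xml
<!-- Illustrative sketch only; wrapper name and IDs are made up -->
<codedParagraph>
  <paragraphId>3172_1</paragraphId>
  <codeId>101</codeId>
  <coderId>7</coderId>
</codedParagraph>
```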


It’s the values inside each of those tags that we want. Here’s how to get them: (1) parse the XML so R recognizes the tags and values properly, using the XML package; and (2) extract the values and get them into a data frame for analysis using XPath. Step (2) involves telling R to traverse the XML data and return the value in each of the paragraphId, codeId, and coderId tags.

# Parse the XML
# uncomment the line below to install the XML package
# install.packages('XML')
library(XML)

doc <- xmlInternalTreeParse(doc, asText = TRUE)

# That was easy, now for #2:
para = unlist(xpathApply(doc, "//paragraphId", xmlValue))
code = unlist(xpathApply(doc, "//codeId", xmlValue))
coderid = unlist(xpathApply(doc, "//coderId", xmlValue))

# Now put into a data frame:
alldat <- data.frame(para, coder=coderid, code)

That’s great, but if you want human-readable data, you need to do a few more things. Let’s pull each of the codeIds and codenames, then use that to map each of the codeIds in our data back to human-readable codes. We’ll do the same thing for our coders and give a number to each of the coding units (paragraphId).

# now map back to human readable values:
codeids <- unlist(xpathApply(doc, "//code", xmlGetAttr, "codeId" ))
codenames <- unlist(xpathApply(doc, "//code", xmlValue))
alldat$codes <- codenames[match(alldat$code, codeids)]

coderids <- unlist(xpathApply(doc, "//coder", xmlGetAttr, "coderId" ))
codernames <- unlist(xpathApply(doc, "//coder", xmlValue))
alldat$coder <- codernames[match(alldat$coder, coderids)]

# paragraph num:
paragraphIds <- unlist(xpathApply(doc, "//paragraph", xmlGetAttr, "paragraphId"))
pgnum <- as.numeric(unlist(lapply(strsplit(paragraphIds, "_"), function(x) x[[2]])))
alldat$pgnum <- pgnum[match(para, paragraphIds)]

# paragraph tag:
paragraphTag <- unlist(xpathApply(doc, "//paragraph", xmlGetAttr, "paragraphTag"))
alldat$paragraphTag <- paragraphTag[match(para, paragraphIds)]

Excellent, now we have our data in a very nice rectangular format.

Basic diagnostics

Two of the most helpful diagnostics when assessing inter-coder reliability are confusion matrices and Krippendorff’s Alpha. Confusion matrices are a bit easier to produce when the data is in this format so that’s where I’ll start.

A confusion matrix is just a contingency table that helps us figure out whether any two coders are scoring items the same way: entry (i, j) counts the items that one coder put in category i and the other put in category j. If this sounds confusing, don’t worry—it will become clear in the example below. One compact way to get it is to take the paragraph-code incidence matrix for each coder, then multiply each pair of matrices. Here’s how to do it:

# get paragraph-code incidence matrix for each coder:
alltabs <- table(alldat$para, alldat$codes, alldat$coder )

coder1 <- alltabs[,,1]
coder2 <- alltabs[,,2]
coder3 <- alltabs[,,3]

# Multiply together to get confusion matrix for each pair
# of coders:
coder12 <- t(coder1) %*% coder2
coder23 <- t(coder2) %*% coder3
coder13 <- t(coder1) %*% coder3

# Clean up column names so we can read things clearly:
dimnames(coder12)[[2]] <- substr( dimnames(coder12)[[2]], 1, 6)
dimnames(coder23)[[2]] <- substr( dimnames(coder23)[[2]], 1, 6)
dimnames(coder13)[[2]] <- substr( dimnames(coder13)[[2]], 1, 6)

# Take a look (pay attention to the counts on the diagonal—
# the items each pair of coders coded identically):
coder12
coder23
coder13

Here’s what the first confusion matrix looks like:

> coder12
                        Adv_pa CC_exp Egregi Positi
  Adv_par_const             25      0      0      6
  CC_expenditure             4     12      0      6
  Egregious                  0      0      0      0
  Position_taking_other     11      1      0     35

It shows the incidence between coder 1’s and coder 2’s codes, with coder 1’s codes on the rows and coder 2’s codes on the columns. So coders 1 and 2 both coded 25 of the same items as “Adv_par_const,” but for 11 items coder 1 coded “Position_taking_other” where coder 2 coded “Adv_par_const.”
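To see the mechanics of the cross-product, here’s a toy example with invented data—two coders, three items, two codes. Each incidence matrix has one row per item, with a 1 in the column of the assigned code; multiplying the transpose of one by the other counts every (row-code, column-code) co-occurrence:

```r
# Item-by-code incidence matrices for two coders (toy data)
coderX <- matrix(c(1, 0,
                   1, 0,
                   0, 1),
                 nrow = 3, byrow = TRUE, dimnames = list(NULL, c("A", "B")))
coderY <- matrix(c(1, 0,
                   0, 1,
                   0, 1),
                 nrow = 3, byrow = TRUE, dimnames = list(NULL, c("A", "B")))

# Entry [i, j] counts items coderX put in code i and coderY put in code j
t(coderX) %*% coderY
#   A B
# A 1 1
# B 0 1
```

Both coders agree on item 1 (code A) and item 3 (code B); they disagree on item 2, which shows up off the diagonal.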

This can help diagnose which categories are creating the most confusion. We can see that our coders are confusing “Adv_par_const” and “Position_taking_other” more often than “CC_expenditure” and “Adv_par_const.” For us, that meant we focused on distinguishing these two categories in our training sessions.

It’s also useful to look at Krippendorff’s alpha to get a sense for the global agreement between all coders. We can compute Krippendorff’s alpha using the “irr” package.

But first, a little data munging is in order. The irr package expects a matrix with a single row for each document and a column for each coder. Currently, of course, our data is in “long” format, with one row for each document-coder pair. Luckily, we can use the “reshape” package to “melt” our data and then “cast” it into the format we want. In this case, “melt” does not actually change the shape of our data—it’s already long—it simply adds a “variable” column and a “value” column, which “cast” requires. Next, we make the value numeric so that irr will be happy. Lastly, we cast the data into the format we want, with a column for each coder.

# install.packages(c('reshape', 'irr'))  # uncomment to install
library(reshape)
library(irr)

alltabsm <- melt(alldat, measure.vars = c("codes"))

# Make the "value" column numeric for irr
alltabsm$value <- as.numeric(alltabsm$value)
alltabsrs <- cast(alltabsm[, which(names(alltabsm) != "code")], ... ~ coder)

And lastly, run the kripp.alpha() function on the columns that contain the coders and codes.

kripp.alpha(t(as.matrix(alltabsrs[,5:7])) )

Now, all that’s left is to get our output. What we want is the modal (most common) code for each article; with three coders, the median of the numeric codes equals the mode whenever at least two coders agree. Let’s also take a look at a histogram of the results.

alltabsrs$modecode <- apply(alltabsrs[, 5:7], 1, median)
hist(alltabsrs$modecode)


You can see from the histogram that code 3 (Egregious) was rare, as we expected. The other codes look reasonable.

And now we’ve got what we need! A data frame with each item, each code for each coder, and the most commonly occurring code that we can use as the actual label in our analysis.
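One caveat on the median shortcut: if all three coders disagree, the median just picks the middle numeric code, which is nobody’s majority. A direct mode avoids that edge case—a minimal base-R sketch (the function name is my own):

```r
# Return the most frequent value; ties resolve to the lowest code,
# since table() orders its cells by sorted unique values
modal_code <- function(x) {
  tab <- table(x)
  names(tab)[which.max(tab)]
}

modal_code(c(2, 2, 4))  # "2": two coders agree
modal_code(c(1, 3, 3))  # "3"
```

You could apply this per row in place of median(), and flag three-way disagreements for adjudication rather than letting any automatic rule decide them.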
