Visualization Series: Using Scatterplots and Models to Understand the Diamond Market


My last post railed against the bad visualizations that people often use to plot quantitative data by groups, and pitted pie charts, bar charts and dot plots against each other for two visualization tasks.  Dot plots came out on top.  I argued that this is because humans are good at the cognitive task of comparing position along a common scale, compared to making judgements about length, area, shading, direction, angle, volume, curvature, etc.—a finding credited to Cleveland and McGill.  I enjoyed writing it and people seemed to like it, so I’m continuing my visualization series with the scatterplot.

A scatterplot is a two-dimensional plane on which we record the intersection of two measurements for a set of case items—usually two quantitative variables.  Just as humans are good at comparing position along a common scale in one dimension, our visual capabilities allow us to make fast, accurate judgements and recognize patterns when presented with a series of dots in two dimensions.  This makes the scatterplot a valuable tool for data analysts both when exploring data and when communicating results to others.  

In this post—part 1—I’ll demonstrate various uses for scatterplots and outline some strategies to help make sure key patterns are not obscured by the scale or qualitative group-level differences in the data (e.g., the relationship between test scores and income differs for men and women). The motivation in this post is to come up with a model of diamond prices that you can use to help make sure you don’t get ripped off, specified based on insight from exploratory scatterplots combined with (somewhat) informed speculation. In part 2, I’ll discuss the use of panels aka facets aka small multiples to shed additional light on key patterns in the data, and local regression (loess) to examine central tendencies in the data. There are far fewer bad examples of this kind of visualization in the wild than the 3D barplots and pie charts mocked in my last post, though I was still able to find a nice case of MS-Excel gridline abuse and this lovely scatterplot + trend-line.

The scatterplot has a richer history than the visualizations I wrote about in my last post.  The scatterplot’s face forms a two-dimensional Cartesian coordinate system, and Descartes’ invention/discovery of this eponymous plane around 1637 represents one of the most fundamental developments in science.  The Cartesian plane unites measurement, algebra, and geometry, depicting the relationship between variables (or functions) visually.  Prior to the Cartesian plane, mathematics was divided into algebra and geometry, and the unification of the two made many new developments possible.  Of course, this includes modern map-making (cartography), but the Cartesian plane was also an important step in the development of calculus, without which very little of our modern world would be possible.

Diamonds

Indeed, the scatterplot is a powerful tool to help understand the relationship between two variables in your data set, and especially if that relationship is non-linear. Say you want to get a sense of whether you’re getting ripped off when shopping for a diamond. You can use data on the price and characteristics of many diamonds to help figure out whether the price advertised for any given diamond is reasonable, and you can use scatterplots to help figure out how to model that data in a sensible way. Consider the important relationship between the price of a diamond and its carat weight (which corresponds to its size):

[Figure: scatterplot of diamond price by carat weight]

A few things pop out right away.  We can see a non-linear relationship, and we can also see that the dispersion (variance) of the relationship increases as carat size increases.  With just a quick look at a scatterplot of the data, we’ve learned two important things about the functional relationship between price and carat size.  We’ve also learned that running a linear model on this data as-is would be a bad idea.

Before moving forward, I want to provide a proper introduction to the diamonds data set, which I’m going to use for much of the rest of this post.  Hadley’s ggplot2 ships with a data set that records the carat size and the price of more than 50 thousand diamonds, collected in 2008 from http://www.diamondse.info/, and if you’re in the market for a diamond, exploring this data set can help you understand what’s in store and at what price point.  This is particularly useful because each diamond is unique in a way that isn’t true of most manufactured products we are used to buying—you can’t just plug in a model number and look up the price on Amazon.  And even an expert cannot incorporate as much information about price as a picture of the entire market informed by data (though there’s no substitute for qualitative expertise to make sure your diamond is what the retailer claims).

But even if you’re not looking to buy a diamond, the socioeconomic and political history of the diamond industry is fascinating.  Diamonds birthed the mining industry in South Africa, which is now by far the largest and most advanced economy in Africa.  I worked a summer in Johannesburg, and can assure you that South Africa’s cities look far more like L.A. and San Francisco than Lagos, Cairo, Mogadishu, Nairobi, or Rabat.  Diamonds drove the British and Dutch to colonize southern Africa in the first place, and have stoked conflicts ranging from the Boer Wars to modern day wars in Sierra Leone, Liberia, Côte d’Ivoire, Zimbabwe and the DRC, where the 200 carat Millennium Star diamond was sold to DeBeers at the height of the civil war in the 1990s.  Diamonds were one of the few assets that Jews could conceal from the Nazis during the “Aryanization of Jewish property” in the 1930s, and the Congressional Research Service reports that Al Qaeda has used conflict diamonds to skirt international sanctions and finance operations from the 1998 East Africa Bombings to the September 11th attacks.  


Though the diamonds data set is full of prices and fairly esoteric certification ratings, hidden in the data are reflections of how a legendary marketing campaign permeated and was subsumed by our culture, hints about how different social strata responded, and insight into how the diamond market functions as a result.

The story starts in 1870, according to The Atlantic, when many tons of diamonds were discovered in South Africa near the Orange River.  Until then, diamonds were rare—only a few pounds were mined from India and Brazil each year.  At the time, diamonds had no use outside of jewelry (unlike today, when they have many industrial applications), so price depended only on scarce supply.  Hence, the project’s investors formed the De Beers Cartel in 1888 to control the global price—by most accounts the most successful cartel in history, controlling 90% of the world’s diamond supply until about 2000.  But World War I and the Great Depression saw diamond sales plummet.


In 1938, according to the New York Times’ account, the De Beers cartel wrote to Philadelphia ad agency N. W. Ayer & Son to investigate whether “the use of propaganda in various forms” might jump-start diamond sales in the U.S., which looked like the only potentially viable market at the time.  Surveys showed diamonds were low on the list of priorities among most couples contemplating marriage—a luxury for the rich, “money down the drain.”  Frances Gerety, who the Times compares to Mad Men’s Peggy Olson, took on the De Beers account at N. W. Ayer & Son and worked toward the company’s goal “to create a situation where almost every person pledging marriage feels compelled to acquire a diamond engagement ring.”  A few years later, she coined the slogan, “Diamonds are forever.”

[Image: a 2000 De Beers diamond ring magazine advertisement]

The Atlantic’s Jay Epstein argues that this campaign gave birth to modern demand-advertising—the objective was not direct sales, nor brand strengthening, but simply to impress the glamour, sentiment and emotional charge contained in the product itself.  The company gave diamonds to movie stars, sent out press packages emphasizing the size of diamonds celebrities gave each other, loaned diamonds to socialites attending prominent events like the Academy Awards and Kentucky Derby, and persuaded the British royal family to wear diamonds over other gems.  The diamond was also marketed as a status symbol, to reflect “a man’s … success in life,” in ads with “the aroma of tweed, old leather and polished wood which is characteristic of a good club.”  A 1980s ad introduced the two-month benchmark: “Isn’t two months’ salary a small price to pay for something that lasts forever?”


By any reasonable measure, Frances Gerety succeeded—getting engaged means getting a diamond ring in America. Can you think of a movie where two people get engaged without a diamond ring? When you announce your engagement on Facebook, what icon does the site display?  Still think this marketing campaign might not be the most successful mass-persuasion effort in history?  I present to you a James Bond film, whose title bears the diamond cartel’s trademark:

[Image: poster for the James Bond film Diamonds Are Forever]

Awe-inspiring and terrifying.  Let’s open the data set.  The first thing you should consider doing is plotting key variables against each other using the ggpairs() function, which plots every variable against every other, pairwise.  For a data set with as many rows as the diamonds data, you may want to sample first; otherwise, things will take a long time to render.  Also, if your data set has more than about ten columns, there will be too many plotting windows, so subset on columns first.

# Uncomment these lines and install if necessary:
#install.packages('GGally')
#install.packages('ggplot2')
#install.packages('scales')
#install.packages('memisc')

library(ggplot2)
library(GGally)
library(scales)
data(diamonds)

# Sample 10,000 rows so ggpairs renders in a reasonable amount of time
diasamp = diamonds[sample(1:length(diamonds$price), 10000),]
ggpairs(diasamp, params = c(shape = I("."), outlier.shape = I(".")))

* R style note: I started using the “=” operator over “<-” after reading John Mount’s post on the topic, which shows how using “<-” (but not “=”) incorrectly can result in silent errors.  There are other good reasons: 1.) WordPress and R-Bloggers occasionally mangle “<-”, thinking it is HTML code, in ways unpredictable to me; 2.) “=” is what every other programming language uses; and 3.) (as pointed out by Alex Foss in comments) consider “foo<-3” — did the author mean to assign 3 to foo or to compare foo to -3?  Plus, 4.) the way R interprets that expression depends on white space—and if I’m using an editor like Emacs or Sublime where I don’t have a shortcut key assigned to “<-”, I sometimes get the whitespace wrong.  This means spending extra time and brainpower on debugging, both of which are in short supply.  Anyway, here’s the plot:

[Figure: ggpairs plot matrix of the sampled diamonds data]

What’s happening is that ggpairs is plotting each variable against the other in a pretty smart way.  In the lower triangle of the plot matrix, it uses grouped histograms for qualitative-qualitative pairs and scatterplots for quantitative-quantitative pairs.  In the upper triangle, it plots grouped histograms for qualitative-qualitative pairs (using the x- instead of the y-variable as the grouping factor), boxplots for qualitative-quantitative pairs, and provides the correlation for quantitative-quantitative pairs.

What we really care about here is price, so let’s focus on that.  We can see what might be relationships between price and clarity and between price and color, which we’ll keep in mind for later when we start modeling our data, but the critical factor driving price is the size/weight of a diamond.  Yet as we saw above, the relationship between price and diamond size is non-linear.  What might explain this pattern?  On the supply side, larger contiguous chunks of diamond without significant flaws are probably much harder to find than smaller ones.  This may help explain the exponential-looking curve—and I thought I noticed this when I was shopping for a diamond for my soon-to-be wife.  Of course, this is related to the fact that the weight of a diamond is a function of volume, and volume is a function of x * y * z, suggesting that we might be especially interested in the cube root of carat weight.

On the demand side, customers in the market for a less expensive, smaller diamond are probably more sensitive to price than more well-to-do buyers. Many less-than-one-carat customers would surely never buy a diamond were it not for the social norm of presenting one when proposing.  And, there are *fewer* consumers who can afford a diamond larger than one carat.  Hence, we shouldn’t expect the market for bigger diamonds to be as competitive as that for smaller ones, so it makes sense that the variance as well as the price would increase with carat size.
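
Before moving on, here’s a quick numeric check of the cube-root intuition from the supply-side discussion above (a minimal sketch, not in the original post), comparing how strongly raw carat and the cube root of carat correlate with log price:

# Correlation of log price with raw carat vs. the cube root of carat
with(diamonds, cor(log10(price), carat))
with(diamonds, cor(log10(price), carat^(1/3)))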

Often the distribution of any monetary variable will be highly skewed and vary over orders of magnitude.  This can result from path-dependence (e.g., the rich get richer) and/or the multiplicative processes (e.g., year-on-year inflation) that produce the ultimate price/dollar amount.  Hence, it’s a good idea to look into compressing any such variable by putting it on a log scale (for more, take a look at this guest post on Tal Galili’s blog).

p = qplot(price, data=diamonds, binwidth=100) +
    theme_bw() +
    ggtitle("Price")
p

p = qplot(price, data=diamonds, binwidth = 0.01) +
    scale_x_log10() +
    theme_bw() +
    ggtitle("Price (log10)")
p

[Figures: histograms of price and of price on a log10 scale]

Indeed, we can see that the prices for diamonds are heavily skewed, but when put on a log10 scale seem much better behaved (i.e., closer to the bell curve of a normal distribution).  In fact, we can see that the data show some evidence of bimodality on the log10 scale, consistent with our two-class, “rich-buyer, poor-buyer” speculation about the nature of customers for diamonds.
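
One quick way to eyeball that bimodality (a small sketch, not in the original post) is a density plot of price on the same log10 scale:

p = qplot(price, data=diamonds, geom="density") +
    scale_x_log10() +
    theme_bw() +
    ggtitle("Price density (log10)")
p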

Let’s re-plot our data, but now let’s put price on a log10 scale:

p = qplot(carat, price, data=diamonds) +
    scale_y_continuous(trans=log10_trans() ) +
    theme_bw() +
    ggtitle("Price (log10) by Cubed-Root of Carat")
p

[Figure: Price (log10) by Carat]

Better, though still a little funky—let’s try using the cube root of carat, as we speculated about above:

# Define a cube-root transformation for ggplot scales (trans_new comes from the 'scales' package)
cubroot_trans = function() trans_new("cubroot",
                                     transform = function(x) x^(1/3),
                                     inverse = function(x) x^3)

p = qplot(carat, price, data=diamonds) +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() +
    ggtitle("Price (log10) by Cubed-Root of Carat")
p

[Figure: Price (log10) by Cube-Root of Carat]

Nice, looks like an almost-linear relationship after applying the transformations above to get our variables on a nice scale.

Overplotting

Note that until now I haven’t done anything about overplotting—where multiple points take on the same value, often due to rounding.  Indeed, price is rounded to dollars and carats are rounded to two digits.  Not bad, though when we’ve got this much data we’re going to have some serious overplotting.

head(sort(table(diamonds$carat), decreasing=TRUE ))
#  0.3 0.31 1.01  0.7 0.32    1 
# 2604 2249 2242 1981 1840 1558 

head(sort(table(diamonds$price), decreasing=TRUE ))
# 605 802 625 828 776 698 
# 132 127 126 125 124 121 

Often you can deal with this by making your points smaller, using “jittering” to randomly shift points to make multiple points visible, and using transparency, which can be done in ggplot using the “alpha” parameter.

p = ggplot( data=diamonds, aes(carat, price)) +
    geom_point(alpha = 0.5, size = .75, position="jitter") +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() +
    ggtitle("Price (log10) by Cubed-Root of Carat")
p

[Figure: Price (log10) by Cube-Root of Carat, with jitter and transparency]

This gives us a better sense of how dense and sparse our data is at key places.

Using Color to Understand Qualitative Factors

When I was looking around at diamonds, I also noticed that clarity seemed to factor into price.  Of course, many consumers are looking for a diamond of a certain size, so we shouldn’t expect clarity to be as strong a factor as carat weight.  And I must admit that even though my grandparents were jewelers, I initially had a hard time discerning a diamond rated VVS1 from one rated SI2.  Surely most people need a loupe to tell the difference.  And, according to BlueNile, the cut of a diamond has a much more consequential impact on that “fiery” quality that jewelers describe as the quintessential characteristic of a diamond.  On clarity, the website states, “Many of these imperfections are microscopic, and do not affect a diamond’s beauty in any discernible way.”  Yet, clarity seems to explain an awful lot of the remaining variance in price when we visualize it as a color on our plot:

p = ggplot( data=diamonds, aes(carat, price, colour=clarity)) +
    geom_point(alpha = 0.5, size = .75, position="jitter") +
    scale_colour_brewer(type = "div",
        guide = guide_legend(title = NULL, reverse=T,
            override.aes = list(alpha = 1))) +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() + theme(legend.key = element_blank()) +
    ggtitle("Price (log10) by Cubed-Root of Carat and Color")
p

[Figure: Price (log10) by Cube-Root of Carat, colored by clarity]

Despite what BlueNile says, we don’t see as much variation on cut (though most diamonds in this data set are ideal cut anyway):

p = ggplot( data=diamonds, aes(carat, price, colour=cut)) +
    geom_point(alpha = 0.5, size = .75, position="jitter") +
    scale_colour_brewer(type = "div",
        guide = guide_legend(title = NULL, reverse=T,
            override.aes = list(alpha = 1))) +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() + theme(legend.key = element_blank()) +
    ggtitle("Price (log10) by Cube-Root of Carat and Cut")
p

[Figure: Price (log10) by Cube-Root of Carat, colored by cut]

Color seems to explain some of the variance in price as well, though BlueNile states that all color grades from D-J are basically not noticeable.

p = ggplot( data=diamonds, aes(carat, price, colour=color)) +
    geom_point(alpha = 0.5, size = .75, position="jitter") +
    scale_colour_brewer(type = "div",
        guide = guide_legend(title = NULL, reverse=T,
            override.aes = list(alpha = 1))) +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() + theme(legend.key = element_blank()) +
    ggtitle("Price (log10) by Cube-Root of Carat and Color")
p

[Figure: Price (log10) by Cube-Root of Carat, colored by diamond color]

At this point, we’ve got a pretty good idea of how we might model price.  But there are a few problems with our 2008 data—not only do we need to account for inflation, but the diamond market is quite different now than it was in 2008.  In fact, when I fit models to this data and then attempted to predict the price of diamonds I found on the market, I kept getting predictions that were far too low.  After some additional digging, I found the Global Diamond Report.  It turns out that prices plummeted in 2008 due to the global financial crisis, and since then prices (at least for wholesale polished diamonds) have grown at roughly a 6 percent compound annual rate.  The rapidly growing number of couples in China buying diamond engagement rings might also help explain this increase.  After looking at data on PriceScope, I realized that diamond prices grew unevenly across different carat sizes, meaning that the model I initially estimated couldn’t simply be adjusted for inflation.  While I could have done OK with that model, I really wanted to estimate a new model based on fresh data.
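
For intuition, here’s a rough sketch (with illustrative numbers, not the adjustment the post actually uses) of what compounding a 2008 price forward at that 6 percent rate would look like:

# Naive adjustment: compound a 2008 price forward at an assumed 6% annual rate
price_2008 = 5000
years = 2014 - 2008
price_2008 * 1.06^years   # roughly 7093

As noted above, this kind of uniform adjustment isn’t enough when prices grow unevenly across carat sizes, which is why re-estimating the model on fresh data is preferable.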

Thankfully I was able to put together a python script to scrape diamondse.info without too much trouble. This dataset is about 10 times the size of the 2008 diamonds data set and features diamonds from all over the world certified by an array of authorities besides just the Gemological Institute of America (GIA). You can read in this data as follows (be forewarned—it’s over 500K rows):

#install.packages('RCurl')
library('RCurl')
diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
load(rawConnection(diamondsurl))

My github repository has the code necessary to replicate each of the figures above—most look quite similar, though this data set contains much more expensive diamonds than the original.  Regardless of whether you’re using the original diamonds data set or the current larger diamonds data set, you can estimate a model based on what we learned from our scatterplots.  We’ll regress log price on the cube root of carat, carat, cut, color, and clarity.  I’m using only GIA-certified diamonds in this model and looking only at diamonds under $10K because these are the type of diamonds sold at most retailers I’ve seen and hence the kind I care most about.  By trimming the most expensive diamonds from the dataset, our model will also be less likely to be thrown off by outliers at the high end of price and carat.  The new data set has mostly the same columns as the old one, so we can just run the following (if you want to run it on the old data set, just set data=diamonds).

diamondsbig$logprice = log(diamondsbig$price)
m1 = lm(logprice~  I(carat^(1/3)), 
    data=diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == "GIA",])
m2 = update(m1, ~ . + carat)
m3 = update(m2, ~ . + cut )
m4 = update(m3, ~ . + color + clarity)

#install.packages('memisc')
library(memisc)
mtable(m1, m2, m3, m4)

Here are the results for my recently scraped data set:

===============================================================
                    m1          m2          m3          m4     
---------------------------------------------------------------
(Intercept)       2.671***    1.333***    0.949***   -0.464*** 
                 (0.003)     (0.012)     (0.012)     (0.009)   
I(carat^(1/3))    5.839***    8.243***    8.633***    8.320*** 
                 (0.004)     (0.022)     (0.021)     (0.012)   
carat                        -1.061***   -1.223***   -0.763*** 
                             (0.009)     (0.009)     (0.005)   
cut: V.Good                               0.120***    0.071*** 
                                         (0.002)     (0.001)   
cut: Ideal                                0.211***    0.131*** 
                                         (0.002)     (0.001)   
color: K/L                                            0.117*** 
                                                     (0.003)   
color: J/L                                            0.318*** 
                                                     (0.002)   
color: I/L                                            0.469*** 
                                                     (0.002)   
color: H/L                                            0.602*** 
                                                     (0.002)   
color: G/L                                            0.665*** 
                                                     (0.002)   
color: F/L                                            0.723*** 
                                                     (0.002)   
color: E/L                                            0.756*** 
                                                     (0.002)   
color: D/L                                            0.827*** 
                                                     (0.002)   
clarity: I1                                           0.301*** 
                                                     (0.006)   
clarity: SI2                                          0.607*** 
                                                     (0.006)   
clarity: SI1                                          0.727*** 
                                                     (0.006)   
clarity: VS2                                          0.836*** 
                                                     (0.006)   
clarity: VS1                                          0.891*** 
                                                     (0.006)   
clarity: VVS2                                         0.935*** 
                                                     (0.006)   
clarity: VVS1                                         0.995*** 
                                                     (0.006)   
clarity: IF                                           1.052*** 
                                                     (0.006)   
---------------------------------------------------------------
R-squared             0.888       0.892      0.899        0.969
N                338946      338946     338946       338946    
===============================================================

Now those are some very nice R-squared values—we are accounting for almost all of the variance in price with the 4Cs.  If we want to know whether the price for a diamond is reasonable, we can now use this model and exponentiate the result (since we took the log of price).  Let’s take a look at an example from Blue Nile.  I’ll use the full model, m4.

# Example from BlueNile
# Round 1.00 Very Good I VS1 $5,601

thisDiamond = data.frame(carat = 1.00, cut = "V.Good", color = "I", clarity="VS1")
modEst = predict(m4, newdata = thisDiamond, interval="prediction", level = .95)
exp(modEst)

#        fit     lwr      upr
# 1 5040.436 3730.34 6810.638

The results yield an expected value for price given the characteristics of our diamond and the upper and lower bounds of a 95% prediction interval—note that because this is a linear model, predict() is just multiplying each model coefficient by each value in our data.  It turns out that this diamond is a touch pricier than the expected value under the full model, though it is by no means outside our 95% interval.  BlueNile has by most accounts a better reputation than diamondse.info, however, and reputation is worth a lot in a business that relies on easy-to-forge certificates and one in which the non-expert can be easily fooled.
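
To see the “multiply each coefficient by each value” point concretely, here’s a minimal sketch using the simplest model, m1, which has only an intercept and the cube root of carat (it won’t match the m4 estimate above exactly, since m1 omits the other predictors):

# m1: log(price) ~ intercept + b * carat^(1/3)
b = coef(m1)                        # named vector: "(Intercept)", "I(carat^(1/3))"
logpred = b[1] + b[2] * 1.00^(1/3)  # the linear combination, done by hand
exp(logpred)                        # back on the dollar scale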

This illustrates an important point about generalizing a model from one data set to another.  First, there may be important differences between data sets—as I’ve speculated about above—making the estimates systematically biased.  Second, overfitting—our model may be fitting noise present in the data set.  Even a model cross-validated against out-of-sample predictions can be overfit to noise that produces differences between data sets.  Of course, while this model may give you a sense of whether your diamond is a rip-off relative to diamondse.info diamonds, it’s not clear that diamondse.info should be regarded as a source of universal truth about whether the price of a diamond is reasonable.  Nonetheless, having the expected price at diamondse.info with a 95% interval is a lot more information than we had about the price we should be willing to pay for a diamond before we started this exercise.

An important point—even though we can predict diamondse.info prices almost perfectly based on a function of the 4c’s, one thing that you should NOT conclude from this exercise is that *where* you buy your diamond is irrelevant, which apparently used to be conventional wisdom in some circles.  You will almost surely pay more if you buy the same diamond at Tiffany’s versus Costco.  But Costco sells some pricy diamonds as well. Regardless, you can use this kind of model to give you an indication of whether you’re overpaying.

Of course, the value of a natural diamond is largely socially constructed.  Like money, diamonds are only valuable because society says they are—there are no obvious economic efficiencies to be gained or return on investment in a diamond, except perhaps in a very subjective sense concerning your relationship with your significant other.  To get a sense of just how much value is socially constructed, you can compare the price of a natural diamond to a synthetic diamond, which, thanks to recent technological developments, is of comparable quality to a “natural” diamond.  Of course, natural diamonds fetch a dramatically higher price.

One last thing—there are few guarantees in life, and I offer none here.  Though what we have here seems pretty good, data and models are never infallible, and obviously you can still get taken (or be persuaded to pass on a great deal) based on this model.  Always shop with a reputable dealer, and make sure her incentives are aligned against selling you an overpriced diamond or, worse, one that doesn’t match its certificate.  There’s no substitute for establishing a personal connection and lasting business relationship with an established jeweler you can trust.

One Final Consideration

Plotting your data can help you understand it and can yield key insights.  But even scatterplot visualizations can be deceptive if you’re not careful.  Consider another data set that comes with the alr3 package—soil temperature data from Mitchell, Nebraska, collected by Kenneth G. Hubbard from 1976 to 1992, which I came across in Weisberg, S. (2005). Applied Linear Regression, 3rd edition. New York: Wiley (from which I’ve shamelessly stolen this example).

Let’s plot the data, naively:

#install.packages('alr3')
library(alr3)
data(Mitchell)
qplot(Month, Temp, data = Mitchell) + theme_bw()

[Figure: Mitchell soil temperature by month]

Looks kinda like noise.  What’s the story here?

When all else fails, think about it.

What’s on the X axis?  Month.  What’s on the Y-axis?  Temperature.  Hmm, well there are seasons in Nebraska, so temperature should fluctuate every 12 months.  But we’ve put more than 200 months in a pretty tight space.

Let’s stretch it out and see how it looks:
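
One way to do the stretching (a minimal sketch; not necessarily how the original figure was produced) is to render the same plot to a very wide, short graphics device:

# Save the same plot to a wide, short device so the 12-month cycle becomes visible
p = qplot(Month, Temp, data = Mitchell) + theme_bw()
ggsave("mitchell_wide.png", p, width = 12, height = 2)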

[Figure: Mitchell soil temperature by month, stretched to a wide aspect ratio]

Don’t make that mistake.

That concludes part I of this series on scatterplots.  Part II will illustrate the advantages of using facets/panels/small multiples, and show how tools to fit trendlines including linear regression and local regression (loess) can help yield additional insight about your data. You can also learn more about exploratory data analysis via this Udacity course taught by my colleagues Dean Eckles and Moira Burke, and Chris Saden, which will be coming out in the next few weeks (right now there’s no content up on the page).


Streamline Your Mechanical Turk Workflow with MTurkR

I’ve been using Thomas Leeper‘s MTurkR package to administer my most recent Mechanical Turk study—an extension of work on representative-constituent communication claiming credit for pork benefits, with Justin Grimmer and Sean Westwood.  MTurkR is excellent, making it quick and easy to:

  • test a project in your MTurk sandbox
  • deploy a project live
  • download the completed HIT data from Amazon
  • issue credit to workers after you run your own validation checks in R
  • issue bonuses en-masse
  • all programmatically via the API.

In this post I’ll walk you through the set-up process and a couple of basic examples.  But first a few words to motivate your interest in Mechanical Turk.

On Mechanical Turk and Social Science

Mechanical Turk has proven to be a boon for social science research.  I discussed how it’s a great way to attain data based on human judgements (e.g., content analysis, coding images, etc.) in my last post on creating labels for supervised text classification.

It’s also a great source of experimental subjects in a field often plagued by small samples, which have lower power and higher false discovery rates (especially when researchers exploit “undisclosed flexibility” in their design and analysis).  Mechanical Turk provides a large, geographically and demographically diverse sample of participants, freeing researchers from the confines of undergraduate samples, which can respond differently than other key populations.

Mechanical Turk also provides a much lower-cost experimental platform.  Compared to a national survey research firm, the cost per subject is dramatically lower, and most survey firms simply do not allow the same design flexibility, generally limiting you to question-wording manipulations.  Mechanical Turk is also cheaper than running a lab study in terms of the researcher’s time, or that of her graduate students.  This is true even for web studies—just tracking undergraduate participants and figuring out who should get course credit is a mind-numbing, error-prone, and frustrating process.  Mechanical Turk makes it easy for the experimenter to manage data collection, perform validation checks, and quickly disburse payment to subjects who actually participated.  I’ll show you how to streamline this process to automate as much as possible in the next section.

But are results from Mechanical Turk valid, and will I be able to publish on Mturk data?  Evidence that the answer is “yes” accumulates.  Berinsky, Huber, and Lenz (2012) published a validation study in Political Analysis that replicates important published experimental work using the platform.  Furthermore, Justin Grimmer, Sean Westwood and I published a study in the American Political Science Review on how people respond when congressional representatives announce political pork, in which we provide extensive Mechanical Turk validation in the online appendix.  We provide demographic comparisons to the U.S. Census, show the regional distribution of Turkers by state and congressional district, and show that correlations between political variables are about the same as in national survey samples.  Sean and I also published a piece in Communication Research on how social cues drive the news we consume, which among other things replicates past findings on partisans’ preferences for news with a partisan slant.

But are Turkers paying enough attention to actually process my stimuli?  It’s hard to know for sure, but there is an array of questions one can deploy that are designed to make sure Turkers are paying attention.  When my group runs Mturk studies, we typically ask a few no-brainer questions to qualify at the beginning of any task (e.g., 5 + 7 = 13, True or False?), then bury some specific instructions in a paragraph of text toward the end (e.g., bla bla bla, write “I love Mechanical Turk” in the box asking if you have any comments).

But we are paying Turkers; doesn’t that make them more inclined toward demand effects?  Well, compared to what?  Compared to a student who needs class credit or a paid subject?  On the contrary, running a study on the web ensures that researchers in a lab do not provide non-verbal cues (of which they are unaware) that can give rise to demand effects.  Of course, in any experimental setting it’s important to think carefully about potential demand effects, especially when deploying noticeable/obvious manipulations and/or when you have a within-subject design such that respondents observe multiple conditions.

Set-Up

Install the most recent version of MTurkR from Tom’s github repo.  You can use Hadley’s excellent devtools package to install it:

# install.packages("devtools")
library(devtools)
install_github(repo="MTurkR",
    username = "leeper")

If you don’t have a Mechanical Turk account, get one.  Next you want to set up your account, fund it, etc.; there are a lot of good walkthroughs, so just google it if you have trouble.

Also, set up your sandbox so you can test things before you actually spend real money.

You’ll also need to get your AWS Access Key ID and AWS Secret Access Key to get access to the API.  Click on “Show” under “Secret Access Key” to actually see it.

I believe you can use the MTurkR package to design a questionnaire and handle randomization, but that’s beyond the scope of this post—for now just create a project as you normally would via the GUI.

Examples

First, make sure your credentials work correctly and check your account balance (replace the dummy credentials with your own):

require("MTurkR")
credentials(c("AKIAXFL6UDOMAXXIHBQZ","oKPPL/ySX8M7RIXquzUcuyAZ8EpksZXmuHLSAZym"))
AccountBalance()

Great, now R is talking to the API and you can actually create HITs.  Let’s start in your sandbox.  Create a project in your sandbox, or just copy a project you’ve already created—copy the HTML in the design layout screen from your production account to your sandbox account.

[Screenshot: the Edit HTML Source view in the requester interface]

Now you need to know the HIT layout id.  Make sure you are in the “Create” tab in the requester interface, and click on your Project Name.  This window will pop up:

[Screenshot: the project window showing the HIT Layout ID]

Copy and paste the Layout ID in your R script, you’ll use it in a second.  With the Layout ID in hand, you can create HITs based on this layout.  Let’s create a HIT in our sandbox first.  We’ll first set the qualifications that Mechanical Turkers must meet, then tell the API to create a new HIT in the sandbox with 1200 assignments for $.50 each.  Modify the parameters below to fit your needs.

# First set qualifications
# ListQualificationTypes() to see different qual types
qualReqs = paste(

	# Set Location to US only
	GenerateQualificationRequirement(
		"Location","==","US"),

	# Worker_PercentAssignmentsApproved
	GenerateQualificationRequirement(
		"000000000000000000L0", ">", "80",
		qual.number=2),

	# Un-comment after sandbox test
	# Worker_NumberHITsApproved
	# GenerateQualificationRequirement(
	#	"00000000000000000040", ">", "100",
	#	qual.number=3),

	sep="" )

# Create new batch of hits:
newHIT = CreateHIT(

	# layoutid in sandbox:
	hitlayoutid="22P2J1LY58B74P14WC6KKD16YGOR6N",
	sandbox=T,

	# layoutid in production:
	# hitlayoutid="2C9X7H57DZKPHWJIU98DS25L8N41BW",

	annotation = "HET Experiment with Pre-Screen",
	assignments = "1200",
	title="Rate this hypothetical representative",
	description="It's easy, just rate this
		hypothetical representative on how well
		she delivers funds to his district",
	reward=".50",
	duration=seconds(hours=4),
	expiration=seconds(days=7),
	keywords="survey, question, answers, research,
			politics, opinion",
	auto.approval.delay=seconds(days=15),
	qual.reqs=qualReqs
)

To make sure everything worked as intended, go to your worker sandbox and search for the HIT you just created.  See it?  Great.

Here’s how to check on the status of your HIT:

# Get HITId (record result below)
newHIT$HITId
# "2C2CJ011K274LPOO4SX1EN488TRCAG"

HITStatus(hit="2C2CJ011K274LPOO4SX1EN488TRCAG")

And now here’s where the really awesome and time-saving bit comes in: downloading the most recent results.  If you don’t use the API, you have to do this manually from the website GUI, which is a pain.  Here’s the code:

review = GetAssignments(hit="2C2CJ011K274LPOO4SX1EN488TRCAG",
	status="Submitted", return.all=T)

Now you probably want to actually run your study with real subjects.  You can copy your project design HTML back to the production site and repeat the above for production when you are ready to launch your HIT.

Often, Turkers will notice something about your study and point it out, hoping that it will prove useful and you will grant them a bonus.  If they mention something helpful, grant them a bonus!  Here’s how (replace dummy workerid with the id for the worker to whom you wish to grant a bonus):

## Grant bonus to worker who provided helpful comment:
bonus = GrantBonus(workers="A2VDVPRPXV3N59",
	assignments=review$AssignmentId[review$WorkerId=="A2VDVPRPXV3N59"],
	amounts="1.00",
	reasons="Thanks for the feedback!")

You’ll also save time when you go to validate and approve HITs, which MTurkR allows you to do from R.  My group usually uses Qualtrics for questionnaires that accompany our studies where we build in the attention check described above.  Here’s how to automate the attention check and approve HITs for those who passed:

svdat = read.csv(unzip("data/Het_Survey_Experiment.zip"), skip=1, as.is=T)
approv = agrep("I love Mechanical Turk", max.distance=.3,
svdat$Any.other.comments.or.questions.)
svdat$Any.other.comments.or.questions.[approv]

correct = review$AssignmentId[
	gsub(" ", "", review$confcode) %in% svdat$GUID[approv] ]

# To approve:
approve = approve(assignments=correct)

Note that “agrep()” is an approximate matching function which I use in case Turkers add quotes or other miscellaneous text that shouldn’t disqualify their work.

But that’s not all.  Suppose we want to check the average time per HIT (to make sure we are paying Turkers enough) and/or start analyzing our data.  With MTurkR, we can do this via the API rather than downloading things from the website.

# check average time:
review = GetAssignments(hit="2C2CJ011K274LPOO4SX1EN488TRCAG",
	status="Approved", return.all=T)

# Check distribution of time each Turker takes:
quantile(review$SecondsOnHIT/60)

If we discover we aren’t paying Turkers enough in light of how long they spend on the HIT, which, in addition to being morally wrong, means that we will probably not collect a sufficient number of responses in a timely fashion, MTurkR allows us to quickly remedy the situation.  We can expire the HIT, grant bonuses to those who completed the low-pay HIT, and relaunch.  Here’s how:

# Low pay hits:
HITStatus(hit="2JV21O3W5XH0L74WWYBYKPLU3XABH0")

# Review remaining HITS and approve:
review = GetAssignments(hit="2JV21O3W5XH0L74WWYBYKPLU3XABH0",
status="Submitted", return.all=T)

correct = review$AssignmentId[ which( gsub(" ", "", review$confcode) %in% svdat$GUID[approv] ) ]

approve = approve(assignments=correct)

# Expire remaining:
ExpireHIT(hit="2JV21O3W5XH0L74WWYBYKPLU3XABH0")

approvedLowMoneyHits = GetAssignments(hit="2JV21O3W5XH0L74WWYBYKPLU3XABH0",
	status="Approved", return.all=T)

# Grant bonus to workers who completed hit already:
bonus = GrantBonus(workers=approvedLowMoneyHits$WorkerId,
	assignments=approvedLowMoneyHits$AssignmentId,
	amounts="1.00",
	reasons="Upping the pay for this HIT, thanks for completing it!")

We can now run the administrative side of our Mechanical Turk HITs from R directly.  It’s a much more efficient workflow.


Generating Labels for Supervised Text Classification using CAT and R


The explosion in the availability of text has opened new opportunities to exploit text as data for research.  As Justin Grimmer and Brandon Stewart discuss in their paper “Text as Data,” there are a number of approaches to reducing human text to data, with various levels of computational sophistication and human input required.  In this post, I’ll explain how to use the Coding Analysis Toolkit (CAT) to help you collect human evaluations of documents, which is a necessary part of many text analyses, and especially so when you have a specific research question that entails precisely characterizing whether a particular document contains a particular type of content.  CAT facilitates fast data entry and handles data management when you have multiple human coders.  Its default output can be tricky to deal with, however, so I’ll also provide R code to extract useable data from CAT’s XML output, which should serve as a good intro to data munging with XML for the uninitiated.  I’ll also show you how to compute metrics that will help diagnose the reliability of your coding system, which entails using the melt and cast functionality in Hadley’s ‘reshape’ package to get the data in the right shape, then feeding the results to the ‘irr’ package.

In future posts, I’ll explain how to use these labels to train various machine learning algorithms aka classification models to automatically classify documents in a large corpus. I’ll talk about how to extract features using R and the ‘tm’ package; the problem of classification in high-dimensional spaces (e.g., there are many many words in natural language) and how we can exploit the bias-variance tradeoff to get traction on this problem; the various models that are generally well suited for text classification like the lasso, elastic net, SVMs and random forests; the importance of properly tuning these models; and how to use cross-validation to avoid overfitting these models to your data. (For a preview check out my slides and R labs for my short course on Analyzing Text as Data that I presented at the Stanford Computational Social Science Workshop)

Is human categorization/content analysis the right solution?

Social scientists have been trying to reduce the complexities and subjectivities of human language to objective data for a long time, calling it “content analysis.”  It is no easy task.  If you decide to use human coders, I suggest you read Kim Neuendorf’s book, and you can find some nice resources on her website that may prove helpful.  You’d also do well to read Krippendorff’s eponymous classic.

If you are trying to characterize the type of things that occur or discover categories in your documents, it might make more sense to go with a fully computational approach, employing unsupervised machine learning methods that cluster documents based on word features (e.g., simple word counts, i.e., unigrams, or combinations thereof, N-grams).  You can take a look at the Grimmer and Stewart paper above for more details on this approach, or check out the notes from lectures 5 – 7 from Justin’s course on the topic.  If you are interested in additional computational approaches/features, have a look at Dan Jurafsky’s course From Language to Information.

But if you have a specific research question in mind, which entails precisely characterizing whether a particular document contains a particular type of content, you probably need people to read and classify each document according to a coding scheme. Of course, this can become expensive or impossible if you are dealing with a large corpus of documents. If so, you can first have people classify a sample of documents, which can then be used as labels to train supervised classifiers. Once you have a classifier that performs well, you can use it to classify the rest of the documents in your large data set. I have put together some slides on this process, along with an R lab with example code. Also of interest may be some slides I put together on acquiring and pre-processing text data from the web with R, along with computational labs on interacting with APIs and scraping/regex.

If instead you care about making generalizations about the proportion of documents in any given category (in your population of documents based on your sample), check out the R package ReadMe, which implements A Method of Automated Nonparametric Content Analysis for Social Science, and an industrial implementation of the method is offered at big data social media analytics company Crimson Hexagon. If you decide on this approach, you’ll still need human-labeled categories to start, so keep reading.

Is CAT the right solution for your categorization task?

Before we get to CAT, I want to talk about human classification with Amazon’s Mechanical Turk, which is great when the classification task is simple and the documents/units of text to classify are short. Mturk is especially useful when the categorization task becomes mind-numbingly boring with repetition, because when one Turker burns out on your task and stops working, fresh Turkers can continue the work. Hence, the main advantage to Mturk is that you can get simple classification tasks done much more quickly than by relying upon research assistants/undergraduates/employees/etc—in fact I’ve used Mturk to classify thousands of short open responses to the question “What’s the most important issue facing the nation” into Gallup’s categories in a matter of hours.

Often overlooked are the nice interfaces that Mturk provides both to its requesters (e.g., you), which make data management far easier than keeping track of excel spreadsheets/google docs edited by multiple coders, and to its workers, which translate to fewer mouse clicks, less fatigue, and probably lower error rates.  Panos Ipeirotis has some nice slides (best stuff in slides 30-50) + open source code to help ensure this kind of crowd-sourced data is of the highest quality.

But often the classification task is complicated and people need training to do it correctly and efficiently.  I’d direct you to the various links above for guidance on building codebooks for complicated classification schemes and training coders.  A great human-coder system is well-conceptualized, highly relevant to the corpus in question, and contains crystal clear instructions for your coders—providing flow charts and diagrams seems to be especially helpful.  When Justin, Sean and I were implementing a coding protocol recently, we used this flow chart (from this latex file) to complement our actual codebook.

Why is CAT better?

My experiences with Mechanical Turk spoiled me—all of the complexities of dealing with multiple coders entering data and managing that data were abstracted away by the system that Mturk has in place. What I’d done in the past—having analysts enter data in a Google Doc or worse, MS-Excel—was keystroke and mouse-click intensive, which meant it was time-consuming and error-prone for coders when they were entering data, and for me when I was merging/cleaning data from multiple coders.

CAT is the best solution I’ve come across yet.  Its interface isn’t aesthetically perfect, but it gets the job done well.  It minimizes keystrokes and requires no mouse clicks for data entry, so you’ll see your coders work faster and probably happier.  It maintains your data, alleviating the need to manage spreadsheets and concerns about your coders making errors when transcribing codes to a spreadsheet.  Because it also handles the back end of things, there’s no need to maintain your own servers/SQL database/etc.  But it’s open source, so if you need to use your own servers, you can download the source and set it up yourself.

Getting Started

Head over to the CAT website and register for an account.  Maybe poke around the website a bit and familiarize yourself with the menus.  When you’re ready to upload a data set, go to Datasets –> Upload Raw Data Set.  CAT wants a .zip file with all of your documents in individual .txt files.

If you have your text in a vector in R, say called text_vector, you can output each element to individual .txt files as follows:

for( i in 1:length(text_vector)){
  capture.output(text_vector[i], file = paste("doc_number_", i, ".txt", sep="") )
}

Next put these files in a zip archive and upload to CAT. You can upload another file that specifies the coding scheme, but it’s probably easier just to make the coding scheme using the interface on the website later.
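
If you want to stay in R for this step as well, here’s a minimal sketch using the built-in zip() utility (the file pattern matches the loop above; the archive name is just an example, and a zip program must be available on your system):

# Bundle the .txt files written above into a single archive for upload to CAT
txt_files = list.files(pattern = "^doc_number_.*\\.txt$")
zip(zipfile = "cat_upload.zip", files = txt_files)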

When you’re done, go to Datasets –> View Raw Datasets and click on the dataset you just uploaded. From this page, you can manage coders associated with the data sets. If you click on Manage Sub-Accounts, you can easily create new accounts for your coders. Add yourself and other coders to the dataset and be sure to click the “Set Chosen Coders” button when you’re done.

Next, implement your carefully constructed coding scheme. From the same “View Raw Dataset” page, click on “Add or modify codes in the dataset” (on the right under “Toolbox”). Add a sensical name for each code and enter a shortcut key—this will make it so your coders can just hit a button to code a document and move on to the next. When you’re done, hit finished.

I highly recommend you test out your coding scheme yourself. You’ll also probably want to consult your coders and iteratively tweak your codebook according to qualitative input from your coders (this is discussed in full in the content analysis links above).

Then set your coders loose on a hundred documents or so.

Exporting Data from CAT

This process is a bit more complicated than it sounds. If you download data in CSV format, it comes out jagged (i.e., not rectangular), and hence it’s not immediately useful in R. The .CSV file becomes especially convoluted if you change your code labels, add coders, etc.

Better to just deal with the XML output. I’ll introduce data-munging/ETL with XML below. XML is like HTML, but tags describe data, not formatting. We’ll use those tags to create queries that return the data we want. XML is semi-structured data, with a general tree structure, rather than the rectangular structure that R likes. This tree structure saves space and is highly flexible, though it can be hard to work with initially. If you’ve never seen XML before, you’d do well to check out Jennifer Widom’s excellent lecture on XML in her Coursera course. XML is often used in various useful data APIs, which you can learn more about by checking out Sean Westwood’s short course on the topic.

To get the XML output for your project, from the “View Raw Dataset” page, select “Download Coded Text File (XML Format)” from the drop-down menu and then click on “Download Data” (on the right under “Toolbox”).

Here’s how to read in and clean the resulting file. You need to do this step because (1) R wants the encoding to be utf-8 but the CAT file says it’s utf-16, and (2) the XML package doesn’t like “&#x0​;” strings (HTML notation for NULL), which frequently occur in the CAT output.

doc <- readLines("http://dl.dropbox.com/u/25710348/blog/sample.xml")

# check out what's in the file:
head(doc)

# Fix utf-8 issue:
doc <- gsub("utf-16", "utf-8", doc)

# Remove bad "&#x0" characters:
grep("&#x0​;", doc)

doc <- gsub("&#x0​;", "", doc)

First a bit of background on this particular data. These represent a small subsample of documents that Justin Grimmer, Sean Westwood and I were putting together for a project looking at how congressional representatives claim credit for expenditures in their district. We were most concerned with identifying press releases that claimed credit for an expenditure in the district (CC_expenditure). But we also wanted to code items that explicitly criticized earmarks or advocated for earmark reform (Egregious); items that were speaking to local constituents—advertising constituent service that the candidate performed or community events the candidate attended to build name recognition (Adv_par_const); and items that were explicitly taking a national policy position or speaking to non-local audiences (Position_taking_other).

Have a look at the file above in an editor so you get a sense of its structure. The file first provides a lot of meta-data about the project, about each of the codes used, the coders, then the full text of each document. After the full text comes the actual data we want—how each item (paragraphId) was coded (codeId) and by which coder (coderId). It looks like this:

      <codedDataItem>
        <paragraphId>4334458</paragraphId>
        <codeId>142061</codeId>
        <coderId>4506</coderId>
      </codedDataItem>

It’s the values inside each of the tags that we want. Here’s how we can get them: (1) parse the XML so R recognizes the tags and values properly using the XML package, and (2) extract those values and get them into a data frame for analysis using XPATH. (2) involves telling R to traverse the XML data and return the value in each of the paragraphId, codeId and coderId tags.

# Parse the XML
# uncomment the line below to install the XML package
# install.packages('XML')

library('XML')
doc <- xmlInternalTreeParse(doc, asText=T)

# That was easy, now for #2:
para = unlist(xpathApply(doc, "//paragraphId", xmlValue))
code = unlist(xpathApply(doc, "//codeId", xmlValue))
coderid = unlist(xpathApply(doc, "//coderId", xmlValue))

# Now put into a data frame:
alldat <- data.frame(para, coder=coderid, code)

That’s great, but if you want human-readable data, you need to do a few more things. Let’s pull each of the codeIds and codenames, then use that to map each of the codeIds in our data back to human-readable codes. We’ll do the same thing for our coders and give a number to each of the coding units (paragraphId).

# now map back to human readable values:
# CODES
codeids <- unlist(xpathApply(doc, "//code", xmlGetAttr, "codeId" ))
codenames <- unlist(xpathApply(doc, "//code", xmlValue))
alldat$codes <- codenames[match(alldat$code, codeids)]

# CODERS
coderids <- unlist(xpathApply(doc, "//coder", xmlGetAttr, "coderId" ))
codernames <- unlist(xpathApply(doc, "//coder", xmlValue))
alldat$coder <- codernames[match(alldat$coder, coderids)]

# paragraph num: note that paragraphIds and paragraphCodes are not created
# above -- they come from the <paragraph> elements in the same way as the
# coder/code attributes, e.g. something like the following (the exact
# attribute names depend on your CAT export):
# paragraphIds <- unlist(xpathApply(doc, "//paragraph", xmlGetAttr, "paragraphId"))
# paragraphCodes <- unlist(xpathApply(doc, "//paragraph", xmlGetAttr, "paragraphCode"))
pgnum <- as.numeric(unlist(lapply(strsplit(paragraphCodes, "_"), function(x) x[[2]] )))
alldat$pgnum <- pgnum[match(para, paragraphIds)]

# paragraph tag:
paragraphTag <- unlist(xpathApply(doc, "//paragraph", xmlGetAttr, "paragraphTag"))
alldat$paragraphTag <- paragraphTag[match(para, paragraphIds)]

Excellent, now we have our data in a very nice rectangular format.

Basic diagnostics

Two of the most helpful diagnostics when assessing inter-coder reliability are confusion matrices and Krippendorff’s Alpha. Confusion matrices are a bit easier to produce when the data is in this format so that’s where I’ll start.

A confusion matrix is just a contingency table, or incidence matrix, that helps us figure out whether any two coders are scoring things in the same way. It consists of the incidence matrix of codes for a pair of coders, where the entries are the sums of the incidences; if this sounds confusing, don’t worry, it will become clear in the example below. One compact way to get this is to build the paragraph-code incidence matrix for each coder, then multiply each pair of matrices. Here’s how to do it:

# get paragraph-code incidence matrix for each coder:
alltabs <- table(alldat$para, alldat$codes, alldat$coder )
dimnames(alltabs)[[3]]

coder1 <- alltabs[,,1]
coder2 <- alltabs[,,2]
coder3 <- alltabs[,,3]

# Multiply together to get confusion matrix for each pair
# of coders:
coder12 <- t(coder1) %*% coder2
coder23 <- t(coder2) %*% coder3
coder13 <- t(coder1) %*% coder3

# Clean up column names so we can read things clearly:
dimnames(coder12)[[2]] <- substr( dimnames(coder12)[[2]], 1, 6)
dimnames(coder23)[[2]] <- substr( dimnames(coder23)[[2]], 1, 6)
dimnames(coder13)[[2]] <- substr( dimnames(coder13)[[2]], 1, 6)

# Take a look:
coder12
coder23
coder13

# Pay attention to the sum on the diagonal:
sum(diag(coder12))
sum(diag(coder23))
sum(diag(coder13))

Here’s what the first confusion matrix looks like:

> coder12
                       
                        Adv_pa CC_exp Egregi Positi
  Adv_par_const             25      0      0      6
  CC_expenditure             4     12      0      6
  Egregious                  0      0      0      0
  Position_taking_other     11      1      0     35
                 

It shows the incidence between coder 1 and coder 2’s codes, with coder 1’s codes on the rows and coder 2’s codes on the columns. So coder 1 and coder 2 coded 25 of the same items as “Adv_par_const,” while for 11 items coder 1 coded “Position_taking_other” when coder 2 coded “Adv_par_const.”

This can help diagnose which categories are creating the most confusion. We can see that our coders are confusing “Adv_par_const” and “Position_taking_other” more often than “CC_expenditure” and “Adv_par_const.” For us, that meant we focused on distinguishing these two categories in our training sessions.
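If you want a single headline number for each pair before moving on to a chance-corrected measure, a quick sketch is simple percent agreement, i.e. the sum of the diagonal of the confusion matrix over its total:

# proportion of items that coders 1 and 2 coded identically
sum(diag(coder12)) / sum(coder12)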

It’s also useful to look at Krippendorff’s alpha to get a sense for the global agreement between all coders. We can compute Krippendorff’s alpha using the “irr” package.

But first a little data-munging is in order. The irr package expects data in a matrix with a single row for each document and columns for each coder. But of course, currently our data is in “long” format, with one line for each document-coder pair. Luckily, we can use the “reshape” package to “melt” our data then “cast” it into the format we want. In this case, “melt” does not actually change the shape of our data—it’s already long. It simply adds a “variable” column and a “value” column, which is necessary to use “cast.” Next, transform the variable to numeric so that irr will be happy. Lastly, cast the data into the format we want, with a column for each coder.

library(reshape)
alltabsm <- melt(alldat, measure.vars=c("codes"))

# Make the "value" column numeric for irr
alltabsm$value <- as.numeric(alltabsm$value)
alltabsrs <- cast(alltabsm[,which(names(alltabsm)!="code")], ... ~ coder)

And lastly, run the kripp.alpha() function on the columns that contain the coders and codes.

# KRIPP ALPHA
library(irr)
kripp.alpha(t(as.matrix(alltabsrs[,5:7])) )

Now, all that’s left is to get our output. What we want is to take the mode value for each article (which in this case is the same as the median). Let’s take a look at the histogram of the results.

alltabsrs$modecode <- apply(alltabsrs[,5:7], 1, median)
hist(alltabsrs$modecode)


You can see from the histogram that code 3 (Egregious) was rare, as we were expecting. The other codes look good.

And now we’ve got what we need! A data frame with each item, each code for each coder, and the most commonly occurring code that we can use as the actual label in our analysis.
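If you also want that modal code as a human-readable label, here is a minimal sketch, assuming the numeric codes correspond to the alphabetically sorted code names (which is what coercing a factor to numeric gives you):

# map the numeric mode back to a code label; "codelevels" and "modelabel"
# are just names used here for illustration
codelevels <- sort(unique(alldat$codes))
alltabsrs$modelabel <- codelevels[alltabsrs$modecode]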


Working with Bipartite/Affiliation Network Data in R


Data can often be usefully conceptualized in terms of affiliations between people (or other key data entities). It might be useful to analyze common group membership, common purchasing decisions, or common patterns of behavior. This post introduces bipartite/affiliation network data and provides R code to help you process and visualize this kind of data. I recently updated this for use with larger data sets, though I put it together a while back.

Preliminaries

Much of the material here is covered in the more comprehensive “Social Network Analysis Labs in R and SoNIA,” on which I collaborated with Dan McFarland, Sean Westwood and Mike Nowak.

For a great online introduction to social network analysis see the online book Introduction to Social Network Methods by Robert Hanneman and Mark Riddle.

Bipartite/Affiliation Network Data

A network can consist of different ‘classes’ of nodes. For example, a two-mode network might consist of people (the first mode) and groups in which they are members (the second mode). Another very common example of two-mode network data consists of users on a particular website who communicate in the same forum thread.

Here’s a short example of this kind of data. Run this in R for yourself – just copy and paste it into the command line or into a script and it will generate a data frame that we can use for illustrative purposes:

df <- data.frame( person =
    c('Sam','Sam','Sam','Greg','Tom','Tom','Tom','Mary','Mary'), group =
    c('a','b','c','a','b','c','d','b','d'), stringsAsFactors = F)

df
  person group
1    Sam     a
2    Sam     b
3    Sam     c
4   Greg     a
5    Tom     b
6    Tom     c
7    Tom     d
8   Mary     b
9   Mary     d

Fast, efficient two-mode to one-mode conversion in R

Suppose we wish to analyze or visualize how the people are connected directly – that is, what if we want the network of people where a tie between two people is present if they are both members of the same group? We need to perform a two-mode to one-mode conversion.

To convert a two-mode incidence matrix to a one-mode adjacency matrix, one can simply multiply an incidence matrix by its transpose, which sums the common 1’s between rows. Recall that matrix multiplication entails multiplying the k-th entry of a row in the first matrix by the k-th entry of a column in the second matrix, then summing, such that the ij-th entry of the resulting matrix is the dot product of the i-th row of the first matrix and the j-th column of the second. In mathematical notation:

\displaystyle  AB=  \begin{bmatrix}  a & b\\  c & d  \end{bmatrix}  \begin{bmatrix}  e & f\\  g & h  \end{bmatrix} =  \begin{bmatrix}  ae+bg & af+bh\\  ce+dg & cf+dh  \end{bmatrix}

Notice further that multiplying a matrix by its transpose yields the following:

\displaystyle  AA'=  \begin{bmatrix}  a & b\\  c & d  \end{bmatrix}  \begin{bmatrix}  a & c\\  b & d  \end{bmatrix} =  \begin{bmatrix}  aa+bb & ac+bd\\  ca+db & cc+dd  \end{bmatrix}

Because our incidence matrix consists of 0’s and 1’s, the off-diagonal entries represent the total number of common columns, which is exactly what we wanted. We’ll use the %*% operator to tell R to do exactly this. Let’s take a look at a small example using toy data of people and groups to which they belong. We’ll coerce the data to an incidence matrix, then multiply the incidence matrix by its transpose to get the number of common groups between people.

This is easy to do using the matrix algebra functions included in R. But first, you need to restructure your (edgelist) network data as an incidence matrix. An incidence matrix records a 1 for row-column combinations where a tie is present and 0 otherwise. One easy way to do this in R is to use the table function and then coerce the table object to a matrix object:

m <- table( df )
M <- as.matrix( m )
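With the toy data coerced to a matrix, the projection promised above is one line (a quick sketch; Mrow is just the name I’m using for the person-by-person matrix):

# entry [i,j] counts the groups that person i and person j share;
# the diagonal counts the number of groups each person belongs to
Mrow <- M %*% t(M)
Mrow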

If you are using the network or sna packages, a network object can be coerced via as.matrix(your-network); with the igraph package use get.adjacency(your-network).

This is great, but what about if we are working with a really large data set? Network data is almost always sparse—there are far more pairwise combinations of potential connections than actual observed connections. Hence, we’d actually prefer to keep the underlying data structured in edgelist format, but we’d also like access to R’s matrix algebra functionality.

We can get the best of both worlds using the Matrix library to construct a sparse triplet representation of a matrix. But we’d also like to avoid building the entire incidence matrix and just feed Matrix our edgelist directly, a point that came up in a recent conversation I had with Sean Taylor. We feed Matrix our ‘person’ column to index i (rows in the new incidence matrix), our ‘group’ column to index j (columns in the new incidence matrix), and we repeat ‘1’ for the length of the edgelist to denote an incidence.

library('Matrix')
A <- spMatrix(nrow=length(unique(df$person)),
		ncol=length(unique(df$group)),
		i = as.numeric(factor(df$person)),
		j = as.numeric(factor(df$group)),
		x = rep(1, length(as.numeric(df$person))) )
row.names(A) <- levels(factor(df$person))
colnames(A) <- levels(factor(df$group))
A

We will either convert to the ‘mode’ represented by the columns or by the rows.

To get the one-mode representation of ties between rows (people in our example), multiply the matrix by its transpose. Note that you must use the matrix-multiplication operator %*% rather than a simple asterisk. The R code is:

Arow <- A %*% t(A)

But we can still do better! The function tcrossprod is faster and more efficient for this:

Arow <- tcrossprod(A)

Arow will now represent the one-mode matrix formed by the row entities—people will have ties to each other if they are in the same group, in our example. Here’s what it looks like:

Arow
4 x 4 sparse Matrix of class "dgCMatrix"
     Greg Mary Sam Tom
Greg    1    .   1   .
Mary    .    2   1   2
Sam     1    1   3   2
Tom     .    2   2   3

To get the one-mode matrix formed by the column entities (groups in our example, where each entry counts the number of people two groups share), enter the following command:

Acol <- t(A) %*% A

Again, we can do this more efficiently with a crossproduct function; for t(A) %*% A the right tool is crossprod():

Acol <- crossprod(A)

And the resulting co-membership matrix contains the following counts (the sparse print will show zeros as dots):

Acol
  a b c d
a 2 1 1 0
b 1 3 2 2
c 1 2 2 1
d 0 2 1 2

Although we’ve used a very small network for our example, this code scales well to the analysis of much larger networks in R.

Analysis of Two Mode Data and Mobility

Let’s work with some actual affiliation data, collected by Dan McFarland on student extracurricular affiliations. It’s a longitudinal data set, with 3 waves – 1996, 1997, 1998.  It consists of students (anonymized) and the student organizations in which they are members (e.g. National Honor Society, wrestling team, cheerleading squad, etc.).

What we’ll do is read in the data, make some mode conversions, visualize the networks in various ways, compute some centrality measures, and then compute transition probabilities (the probability that a member of one group will stay a member of the same group or become a member of a new group).

# Load the "igraph" library
library("igraph")

# (1) Read in the data files, NA data objects coded as "na"
magact96 = read.delim("http://dl.dropbox.com/u/25710348/snaimages/mag_act96.txt",
    na.strings = "na")
magact97 = read.delim("http://dl.dropbox.com/u/25710348/snaimages/mag_act97.txt",
    na.strings = "na")
magact98 = read.delim("http://dl.dropbox.com/u/25710348/snaimages/mag_act98.txt",
    na.strings = "na")

Missing data is coded as “na” in this data, which is why we gave R the command na.strings = “na”.
These files consist of four columns of individual-level attributes (ID, gender, grade, race), then a bunch of group membership dummy variables (coded “1” for membership, “0” for no membership).  We need to set aside the first four columns (which do not change from year to year).

magattrib = magact96[,1:4]

g96 <- as.matrix(magact96[,-(1:4)]); row.names(g96) = magact96$ID.
g97 <- as.matrix(magact97[,-(1:4)]); row.names(g97) = magact97$ID.
g98 <- as.matrix(magact98[,-(1:4)]); row.names(g98) = magact98$ID.

By using the [,-(1:4)] index, we drop those columns so that we have a student-by-group incidence matrix for each year, and then tell R to set the row names of the matrix to the student’s ID. Note that we need to keep the “.” after ID in this dataset (because it’s in the name of the variable).

Now we load these two-mode matrices into igraph:

i96 <- graph.incidence(g96, mode=c("all") )
i97 <- graph.incidence(g97, mode=c("all") )
i98 <- graph.incidence(g98, mode=c("all") )

Plotting two-mode networks

Now, let’s plot these graphs. The igraph package has excellent plotting functionality that allows you to assign visual attributes to igraph objects before you plot. The alternative is to pass 20 or so arguments to the plot.igraph() function, which gets really messy.

Let’s assign some attributes to our graph. First we set vertex attributes, making sure to make them slightly transparent by setting the alpha channel in the rgb(r,g,b,alpha) function we use to specify the color. This makes it much easier to look at a really crowded graph, which might look like a giant hairball otherwise.

You can read up on the RGB color model here.

Each node (or “vertex”) object is accessible by calling V(g), and you can call (or create) a node attribute by using the $ operator so that you call V(g)$attribute. Here’s how to set the color attribute for a set of nodes in a graph object:

V(i96)$color[1:1295] <- rgb(1,0,0,.5)
V(i96)$color[1296:1386] <- rgb(0,1,0,.5)

Notice that we index the V(g)$color object by a seemingly arbitrary value, 1295.  This marks the end of the student nodes, and 1296 is the first group node. You can view which nodes are which by typing V(i96). R prints out a list of all the nodes in the graph, and those with a number are obviously different from those that consist of a group name.
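If you’d rather not look that boundary up by hand, bipartite graphs built with graph.incidence() also carry a logical “type” vertex attribute (FALSE for the row/student nodes, TRUE for the column/group nodes), so a small sketch of the same assignment is:

# color student nodes (type == FALSE) red and group nodes (type == TRUE)
# green, keying off the bipartite "type" attribute instead of hard-coded indices
V(i96)$color <- ifelse(V(i96)$type, rgb(0,1,0,.5), rgb(1,0,0,.5))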

Now we’ll set some other graph attributes:

V(i96)$label <- V(i96)$name
V(i96)$label.color <- rgb(0,0,.2,.5)
V(i96)$label.cex <- .4
V(i96)$size <- 6
V(i96)$frame.color <- NA

You can also set edge attributes. Here we’ll make the edges nearly transparent and slightly yellow because there will be so many edges in this graph:

E(i96)$color <- rgb(.5,.5,0,.2)

Now, we’ll open a pdf “device” on which to plot. This is just a connection to a pdf file. Note that the code below will take a minute or two to execute (or longer on an older machine).

pdf("i96.pdf")
plot(i96, layout=layout.fruchterman.reingold)
dev.off()

Note that we’ve used the Fruchterman-Reingold force-directed layout algorithm here.  Generally speaking, when you have a ton of edges the Kamada-Kawai layout algorithm works well, but it can get really slow for networks with a lot of nodes. Also, for larger networks, layout.fruchterman.reingold.grid is faster, but it can fail to produce a plot with any meaningful pattern if you have too many isolates, as is the case here. Experiment for yourself.

Here’s what we get:


It’s oddly reminiscent of a crescent and star, but impossible to read. Now, if you open the pdf output, you’ll notice that you can zoom in on any part of the graph ad infinitum without losing any resolution. How is that possible in such a small file? It’s possible because the pdf device output consists of data based on vectors: lines, polygons, circles, ellipses, etc., each specified by a mathematical formula that your pdf program renders when you view it. Regular bitmap or jpeg picture output, on the other hand, consists of a pixel-coordinate mapping of the image in question, which is why you lose resolution when you zoom in on a digital photograph or a plot produced with most other programs.
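For comparison, here’s what writing the same plot to a bitmap device would look like (a quick sketch; the file name and dimensions are arbitrary):

# a bitmap device renders to a fixed pixel grid, so zooming in just shows
# bigger pixels rather than more detail
png("i96.png", width=1000, height=1000)
plot(i96, layout=layout.fruchterman.reingold)
dev.off()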

Let’s remove all of the isolates (the crescent), change a few aesthetic features, and replot. First, we’ll remove the isolates by deleting all nodes with a degree of 0, meaning that they have zero edges. Then, we’ll suppress labels for students and make their nodes smaller and more transparent. Then we’ll make the edges narrower and more transparent. Then, we’ll replot using various layout algorithms:

i96 <- delete.vertices(i96, V(i96)[ degree(i96)==0 ])
V(i96)$label[1:857] <- NA
V(i96)$color[1:857] <-  rgb(1,0,0,.1)
V(i96)$size[1:857] <- 2

E(i96)$width <- .3
E(i96)$color <- rgb(.5,.5,0,.1)

pdf("i96.2.pdf")
plot(i96, layout=layout.kamada.kawai)
dev.off()

pdf("i96.3.pdf")
plot(i96, layout=layout.fruchterman.reingold.grid)
dev.off()

pdf("i96.4.pdf")
plot(i96, layout=layout.fruchterman.reingold)
dev.off()

I personally prefer the Fruchterman-Reingold layout in this case. The nice thing about this layout is that it really emphasizes centrality–the nodes that are most central are nearly always placed in the middle of the plot. Here’s what it looks like:

Very pretty, but you can’t see which groups are which at this resolution. Zoom in on the pdf output, and you can see things pretty clearly.

Two mode to one mode data transformation

We’ve emphasized groups in this visualization so much that we might want to just create a network consisting of group co-membership. First we need to create a new network object. We’ll do that the same way for this network as for our example at the top of this page:

g96e <- t(g96) %*% g96
g97e <- t(g97) %*% g97
g98e <- t(g98) %*% g98

i96e <- graph.adjacency(g96e, mode = "undirected")

Now we need to transform the graph so that multiple edges become an attribute ( E(g)$weight ) of each unique edge:

E(i96e)$weight <- count.multiple(i96e)
i96e <- simplify(i96e)

Now we’ll set the other plotting parameters as we did above:

# Set vertex attributes
V(i96e)$label <- V(i96e)$name
V(i96e)$label.color <- rgb(0,0,.2,.8)
V(i96e)$label.cex <- .6
V(i96e)$size <- 6
V(i96e)$frame.color <- NA
V(i96e)$color <- rgb(0,0,1,.5)

# Set edge alpha (transparency) according to edge weight
egam <- (log(E(i96e)$weight)+.3)/max(log(E(i96e)$weight)+.3)
E(i96e)$color <- rgb(.5,.5,0,egam)

We set edge alpha (transparency) as a function of how many edges exist between two nodes, or in this case, how many students each pair of groups has in common. For illustrative purposes, let’s compare how the Kamada-Kawai and Fruchterman-Reingold algorithms render this graph:

pdf("i96e.pdf")
plot(i96e, main = "layout.kamada.kawai", layout=layout.kamada.kawai)
plot(i96e, main = "layout.fruchterman.reingold", layout=layout.fruchterman.reingold)
dev.off()

I like the Kamada-Kawai layout for this graph, because the center of the graph is too busy otherwise. And here’s what the resulting plot looks like:


You can check out the difference between each layout yourself. Here’s what the pdf output looks like.  Page 1 shows the Kamada-Kawai layout and page 2 shows the Fruchterman-Reingold layout.

Group overlap networks and plots

Now we might also be interested in the percent overlap between groups. Note that this will be a directed graph, because the percent overlap will not be symmetric across groups–for example, it may be that 3/4 of Spanish NHS members are in NHS, but only 1/8 of NHS members are in the Spanish NHS. We’ll create this graph for all years in our data (though we could do it for one year only).

First we’ll need to create a percent overlap graph. We start by dividing each row by the diagonal (this is really easy in R):

ol96 <- g96e/diag(g96e)
ol97 <- g97e/diag(g97e)
ol98 <- g98e/diag(g98e)

Next, sum the matrices and set any NA cells (caused by dividing by zero in the step above) to zero:

magall <- ol96 + ol97 + ol98
magall[is.na(magall)] <- 0

Note that magall now consists of a percent overlap matrix, but because we’ve summed over 3 years, the maximum is now 3 instead of 1.
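If you would rather keep the overlap on its original 0-1 scale, you could average over the three years instead (a small sketch; the rest of the post sticks with the summed matrix):

# average percent overlap across the three years, on a 0-1 scale
magallavg <- magall / 3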

Let’s compute average club size by averaging the diagonals of the three yearly matrices:

magdiag <- apply(cbind(diag(g96e), diag(g97e), diag(g98e)), 1, mean )

Finally, we’ll generate centrality measures for magall. When we create the igraph object from our matrix, we need to set weighted=T because otherwise igraph dichotomizes edges at 1. This can distort our centrality measures because now edges represent  more than binary connections–they represent the percent of membership overlap.

magallg <- graph.adjacency(magall, weighted=T)

# Degree
V(magallg)$degree <- degree(magallg)

# Betweenness centrality
V(magallg)$btwcnt <- betweenness(magallg)

Before we plot this, we should probably filter some of the edges, otherwise our graph will probably be too busy to make sense of visually.  Take a look at the distribution of connection strength by plotting the density of the magall matrix:

plot(density(magall))

Nearly all of the edge weights are below 1–or in other words, the percent overlap for most pairs of clubs is less than 1/3. Let’s filter at 1, so that an edge is present only when the overlap exceeds 1/3 of the members of the group in question.

magallgt1 <- magall
magallgt1[magallgt1<1] <- 0
magallggt1 <- graph.adjacency(magallgt1, weighted=T)

# Removes loops:
magallggt1 <- simplify(magallggt1, remove.multiple=FALSE, remove.loops=TRUE)

Before we do anything else, we’ll create a custom layout based on Fruchterman-Reingold, wherein we adjust the coordinates by hand using the tkplot GUI tool to make sure all of the labels are visible. This is very useful if you want to create a really sharp-looking network visualization for publication.

magallggt1$layout <- layout.fruchterman.reingold(magallggt1)
V(magallggt1)$label <- V(magallggt1)$name
tkplot(magallggt1)

Let the plot load, then maximize the window, and select View -> Fit to Screen so that you get maximum resolution for this large graph. Now hand-place the nodes, making sure no labels overlap:

Pay special attention to whether the labels overlap (or might overlap if the font was bigger) along the vertical. Save the layout coordinates to the graph object:

magallggt1$layout <- tkplot.getcoords(1)

We use “1” here because this was the first tkplot object we called. If you have called tkplot a few times, use the number of the last plot object. You can tell which object is visible because at the top of the tkplot interface you’ll see something like “Graph plot 1” or, in the case of my screenshot above, “Graph plot 7” (it was the seventh time I called tkplot).
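A slightly more robust approach (a small sketch) is to capture the id that tkplot() returns, rather than counting plots by hand:

# tkplot() returns the id of the plot window it creates
tkid <- tkplot(magallggt1)
# ... adjust the node positions by hand in the Tk window, then:
magallggt1$layout <- tkplot.getcoords(tkid)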

# Set vertex attributes
V(magallggt1)$label <- V(magallggt1)$name
V(magallggt1)$label.color <- rgb(0,0,.2,.6)
V(magallggt1)$size <- 6
V(magallggt1)$frame.color <- NA
V(magallggt1)$color <- rgb(0,0,1,.5)

# Set edge attributes
E(magallggt1)$arrow.size <- .3

# Set edge alpha (transparency) according to edge weight
egam <- (E(magallggt1)$weight+.1)/max(E(magallggt1)$weight+.1)
E(magallggt1)$color <- rgb(.5,.5,0,egam)

One thing that we can do with this graph is to set label size as a function of degree, which adds a “tag-cloud”-like element to the visualization:

V(magallggt1)$label.cex <- V(magallggt1)$degree/(max(V(magallggt1)$degree)/2)+ .3
#note, unfortunately one must play with the formula above to get the
#ratio just right

Let’s plot the results:

pdf("magallggt1customlayout.pdf")
plot(magallggt1)
dev.off()

Note that we used the custom layout; because we made it part of the igraph object magallggt1, we did not need to specify it in the plot command.

Here’s the pdf output, and here’s what it looks like:


This visualization reveals much more information about our network than our crescent-star visualization.


Visualization series: Insight from Cleveland and Tufte on plotting numeric data by groups

After my post on making dotplots with concise code using plyr and ggplot, I got an email from my dad who practices immigration law and runs a website with a variety of immigration resources and tools.  He pointed out that the post was written for folks who already know that they want to make dot plots, and who already know about bootstrapped standard errors.  That’s not many people.

In an attempt to appeal to a broader audience, I’m starting a series in which I’ll outline the key principles I use when developing a visualization.  In this post, I’ll articulate these principles, which combine some of Tufte’s aesthetic guidelines with Cleveland’s scientific approach to visualization, which is based on the psychological processes involved in making sense of visualizations, and has been rigorously tested via randomized controlled experiments.  Based on these principles, I’ll argue that dotplots and scatterplots are better than other types of plots (especially pie charts) in most situations.  In later posts, I’ll demonstrate another innovation whose widespread use I’ll credit to Cleveland and Tufte: the use of multiple panels (aka small multiples, trellis graphics, facets, generalized draftsman’s displays, multivar charts) to clearly convey the same information embedded in more complex and difficult to read visualizations, including multiple line plots and mosaic plots. In future posts I’ll also emphasize why it is important to provide some indication of the noise present in the underlying data using error bars or bands.  Along the way, I’ll put you to the test–I’ll present some visualizations of the same data using different visualization techniques and ask you to try to get as much information as you can in 2 seconds from each type of visualization.

A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below).  Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points.  Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive.

Yet most visualizations are flawed, drawn using elements that make it unnecessarily difficult for the human visual system to make sense of things.  I see a lot of these visualizations attending research presentations, screening incoming draft manuscripts as the assistant editor for Political Communication, and as a consumer of media info-graphics (CNN is especially bad, have a look at this monstrosity).  Kevin Fox has an especially compelling visual speaking to this here. A big part of the problem is that Microsoft makes it easy to draw flashy but ultimately confusing visualizations in Excel.  If you are too busy to read this post in full, follow this short list of guidelines and you’ll be on your way to producing elegant visualizations that impose a minimal cognitive burden on your audience:

  1. Never represent something in 2 or worse yet 3 dimensions if it can be represented in one—NEVER use pie charts, 3-D pie charts, stacked bar charts, or 3-D bar charts.
  2. Remove as much chart junk as possible–unnecessary gridlines, shading, borders, etc.
  3. Give your audience a sense of the noise present in your data–draw error bars or confidence bands if you are plotting estimates.
  4. If you want to plot multiple types of groups on a single outcome (the visual analog of cross-tabulations/marginals), use multi-paneled plots. These can also help if overplotting looks too cluttered.
  5. Avoid mosaic plots. Instead use paneled histograms.
  6. Ditch the legend if you can (you almost always can).

The rest of the content in this series emphasizes why it makes sense to follow these guidelines. In this post I’ll look at the first point in detail and touch on the sixth. These two guidelines are most relevant when you want to look at a quantitative variable (e.g., earnings, vote-share, temperature, etc.) across different qualitative groupings (e.g., industry segment, candidate, party, racial group, season, etc.).  This is one of the most common visualization tasks in business, media, and social science, and for this task people often use pie charts and/or bar charts, and occasionally dot plots.

The science of graphical perception

When most people think about visualization, they think first of Edward Tufte.  Tufte emphasizes integrity to the data, showing relationships between phenomena, and above all else aesthetic minimalism.  I appreciate his ruthless crusade against chart junk and pie charts (nice quote from Data without Borders). We share an affinity for multipanel plotting approaches, which he calls “small multiples,” (thanks to Rebecca Weiss for pointing this out) though I think people give Tufte too much credit for their invention—both juiceanalytics and infovis-wiki write that Cleveland introduced the concept/principle. However, both Cleveland and Tufte published books in 1983 discussing the use of multipanel displays; David Smith over at Revolutions writes that “the “small-multiples” principle of data visualization [was] pioneered by Cleveland and popularized in Tufte’s first book”; and the earliest reference to a work containing multipanel displays I could find was published *long* before Tufte’s 1983 work–Seder, Leonard (1950), “Diagnosis with Diagrams—Part I”, Industrial Quality Control (New York, New York: American Society for Quality Control) 7 (1): 11–19.

I’m less sure about Tufte’s advice to always show axes starting at zero, which can make comparison between two groups difficult, and to “show causality,” which can end up misleading your readers.  Of course, the visualizations on display in the glossy pages of Tufte’s books are beautiful–they belong  in a museum.  But while his books are full of general advice that we should all keep in mind when creating plots, he does not put forth a theory of what works and what doesn’t when trying to visualize data.

Cleveland (with Robert McGill) develops such a theory and subjects it to rigorous scientific testing. In my last post I linked to one of Cleveland’s studies showing that dots (or bars) aligned on the same scale are indeed the best visualization to convey a series of numerical estimates.  In this work, Cleveland examined how accurately our visual system can process visual elements or “perceptual units” representing underlying data.  These elements include markers aligned on the same scale (e.g., dot plots, scatterplots, ordinary bar charts), the length of lines that are not aligned on the same scale (e.g., stacked bar plots), area (pie charts and mosaic plots), angles (also pie charts), shading/color, volume, curvature, and direction.

He runs two experiments: the first compares judgements about relative position (grouped bar charts) to judgements based only on length (stacked bar charts); the second compares judgements about relative position (ordinary bar charts) to judgements about angles/area (pie charts).  Here are the materials he uses, courtesy of the Stanford Computer Graphics Lab:

The results are resoundingly clear—judgements about position relative to a baseline are dramatically more accurate than judgements about angles, area, or length (with no baseline).  Hence, he suggests that we replace pie charts with bar charts or dot plots and that we substitute stacked bar charts for grouped bar charts.

A striking and often overlooked finding in this work is the fact that the group of participants without technical training, “mostly ordinary housewives” as Cleveland describes them, performed just as well as the group of mostly men with substantial technical training and experience.   This finding provides evidence for something that I’ve long suspected: that visualizations make it easier for people lacking quantitative experience to understand your results, serving to level the playing field.  If you want your findings to be broadly accessible, it’s probably better to present a visualization rather than a bunch of numbers.  It also suggests that if someone is having trouble interpreting your visualizations, it’s probably your fault.

Dotplots versus pie charts and stacked barplots

Now let’s put this to the test.  Take a look at each visualization below for two seconds, looking for the percent of the vote that Mitt Romney, Ron Paul, and Jon Huntsman got.

Which is easiest to read? Which conveys information most accurately? Let’s first take a look at the most critical information–the order in which the candidates placed.  In all plots, the candidates are arrayed in order from highest to lowest vote share, and it’s easy to see that Mitt won.  But once we start looking at who came in second, third, and so on, differences emerge.  It’s slightly harder to process order in the pie chart because your eye has to go around the plot rather than up and down in a straight line.  In the stacked bar chart, we need to look up which color corresponds to which candidate in the legend (which Tufte told us not to use), adding a layer of cognitive processing.

Second, which conveys estimates most accurately? The dot plot is the clear winner here.  We can quickly see that Romney got about 37%, Paul got about 24%, and Huntsman got about 16%, just by looking at dots relative to the axis.  When we look at the pie chart, it’s really tough to estimate the exact percent each candidate got.  Same with the stacked bar chart. We could add numbers to the pie and bar charts, which would even things out to some extent, but then why not just display a table with exact percents?

One argument I used to hear all the time when I worked in industry is that pie charts “convey a sense of proportion.”  Well, sure, I guess I can kind of guestimate that Ron Paul’s vote share is about 1/4.  What about Jon Huntsman? Hmm, it looks like about 15 percent, which is 3/20.  But wait, why do I want to convert things into fractions anyway? I don’t think in terms of fractions, I think in terms of percents.  And if I really care about proportion, I suppose I could extend the axis from 0 to 100.

Suppose I want to plot results for the top 15 candidates, not just the top 6?  Here’s what happens:

No contest, the pie chart fails completely.  We’d need to add a legend with colors for each candidate, which adds another layer of cognitive processing–we’d need to look up each color in the legend as we go.  And even after adding the legend, you wouldn’t be able to distinguish the lower performing candidates from, say, write-in votes because the pie slices would be too small.  The stacked bar chart will fail for the same reasons, so I’ve excluded it in the interest of brevity.  Note that we don’t need to add colors to the dotplot to convey the same information, which saves an extra plotting element that we can use to represent something else (say, candidates’ campaign funds or total assets).  And, on top of it all, the dot plot takes up less screen/page real estate!

Why do I use dot plots instead of ordinary bar charts? A nice visualization guide from perceptualedge.com points out that often we want to visualize differences between groups in only a narrow range (they use an example wherein monthly expenses vary from $4,250-$5,500). But the length of a bar is supposed to facilitate accurate comparisons between values, so when you use a bar plot starting from $4,250, the lengths of the bars dramatically exaggerate the actual differences. Dot plots do not have this problem because dots encode values using only location, so one must reference the axis to interpret the value.
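Here’s a quick base-R sketch of that point, using made-up numbers in the range they describe; plot the same values as bars with an axis starting at $4,000 and the differences will look far more dramatic than they are:

# toy monthly expenses in a narrow range (values are invented for illustration)
expenses <- c(Jan = 4250, Feb = 4400, Mar = 5100, Apr = 5500)

# a dot plot encodes each value by position alone, so readers judge the
# values against the axis rather than comparing bar lengths
dotchart(expenses, xlab = "Monthly expenses ($)")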

A related point is that bars are often used to convey counts–we use them in histograms to represent frequency and to track, say, counts of dollars earned/raised in bar charts.  In fact, a team of doctors I work with at the med school recently sent in a manuscript to Radiology containing a bar chart plotting mean values between groups; they got back the following comment from the statistical reviewer: “the y-axis is quantitative but the data are represented using bars as if the data were counts.”  People often use bar plots to convey estimates of means (and I’ve certainly done this), which can serve to exaggerate differences in means and hence effect sizes if you do not plot the bars from zero.

In addition, dot plots have aesthetic advantages.  They convey the numerical estimate in question with a single one-dimensional point, rather than a two dimensional bar.  There’s simply less that the eye needs to process.  Accordingly, if a pattern across qualitative groupings exists, it’s often easier to see with a dot plot.  For example, below I plot the average user ratings for each article to which Sean Westwood and I exposed subjects in a news reading experiment.  The pattern that emerges is an “S” curve in which one or two stories dominate the ratings, most are sort of average, and a few are uniformly terrible. Note that you’d probably want to use something like this more for yourself than to communicate your results to others as it might overload your audience with too much information–you’d do better to select a subset of these articles or remove some of the ones in the middle (thanks to Yph Lelkes for making this point).

One question that remains is: if pie charts are so bad, why are they so common? Perhaps we like them because we find them comforting, just as we find pies and pizza? Well, if so, we’d expect pie charts to be less common in places like Japan and China where people grow up eating different food.  Consider info-graphics in newspapers: I haven’t yet done a systematic content analysis, but I was unable to find a single pie chart in Japan’s Yomiuri Shimbun nor the Asahi Shimbun; nor in China’s Beijing Daily nor Sing Tao Daily.  I did see plenty of maps, however, which I suppose one could argue are reminiscent of noodles.

Implementation

The most efficient way to produce solid visualizations with the ability to implement multiple panels, proper standard error estimates, and dot plots is probably in R using the ggplot2 package.  If you do not have time to learn R and remain tied to MS-Excel, stick to ordinary barplots to visualize quantitative variables among multiple groups (not recommended).

Otherwise, if you don’t already use it, download R and a decent editor like Rstudio.  Then get started with ggplot2 and dot plots by running the following code chunk which will replicate the election figure above:


pres <- read.csv("http://www.stanford.edu/~messing/primaryres.csv", as.is=T)

# sort data in order of percent of vote:
pres <- pres[order(pres$Percentage, decreasing=T), ]

# only show top 15 candidates:
pres <- pres[1:15,]

# create a percentage variable
pres$Percentage <- pres$Percentage*100

# reorder the Candidate factor by percentage for plotting purposes:
pres$Candidate <- reorder(pres$Candidate, pres$Percentage)

# To install ggplot2, run the following line after deleting the #
#install.packages("ggplot2")

library(ggplot2)
ggplot(pres, aes(x = Percentage, y = factor(Candidate) )) +
		geom_point() +
		theme_bw() + opts(axis.title.x = theme_text(size = 12, vjust = .25))+
		xlab("Percent of Vote") + ylab("Candidate") +
		opts(title = expression("New Hampshire Primary 2012"))

After loading our data and running a few preliminary data processing operations, we pass ggplot our data set, “pres,” then we tell it what aesthetic elements we want to use, in this case that x is going to be our “Percentage” variable and y is going to be our “Candidate” variable. We tell ggplot that we want to display points for every xy pair. We also tell it to use the black and white theme, and pass some obscure axis options that ensure the axis title plots correctly. Then we tell it what to label the x and y axes, and give it a title.

We can also reproduce the article ratings by story plot above using ggplot2 (even though I originally produced the plot using the lattice package).

# load the data
load(file("http://www.stanford.edu/~messing/db.Rda"))

# if you haven't installed plyr, delete the # and run this line:
# install.packages("plyr")

library(plyr)

# first we use plyr to calculate the mean rating and SE for each story
ratingdat <- ddply(db, c("story"), summarise,
		M = mean(rating, na.rm=T),
		SE = sd(rating, na.rm=T)/sqrt(length(na.omit(rating))),
		N = length(na.omit(rating)))

# make story into an ordered factor, ordering by mean rating:
ratingdat$story <- factor(ratingdat$story)
ratingdat$story <- reorder(ratingdat$story, ratingdat$M)

# take a look at our handiwork:
head(ratingdat)

# Now open a connection to a pdf file
pdf(file="plots/dotplot-story-rating.pdf", height=14, width=8.5)

ggplot(ratingdat, aes(x = M, xmin = M-SE, xmax = M+SE, y = story )) +
		geom_point() + geom_segment( aes(x = M-SE, xend = M+SE,
						y = story, yend=story)) +
		theme_bw() + opts(axis.title.x = theme_text(size = 12, vjust = .25))+
		xlab("Mean rating") + ylab("Story") +
		opts(title = expression("Rating article by Story, with SE"))
dev.off()


Putting it all together: concise code to make dotplots with weighted bootstrapped standard errors

Dotplots with plyr, boot, and ggplot2

I analyze a lot of experiments and there are many times when I want to quickly look at means and standard errors for each cell (experimental condition), or the same for each cell and individual-level attribute level (e.g., Democrat, Independent, Republican).  Many of these experiments are embedded in national or state-level surveys, for which each respondent is weighted so that the sample means for gender, age, and political affiliation approximate those of a probability sample. But if you use weights, it’s difficult to get unbiased estimates of the standard error of that weighted mean.

In this post, I’ll walk through how to use the powerful plyr package to summarize your weighted data, estimating weighted means and standard errors for each experimental cell using the bootstrap, and then generate dotplots to visualize the result using the ggplot2 package. Note that these are both Hadley Wickham’s packages; he’s also responsible for the immensely useful packages reshape and stringr, which I hope to write about in future posts.

First, I’ll describe the data we’re going to use. Just before the 2010 elections, my lab hired YouGov/Polimetrics to survey Illinois voters about a series of candidates for statewide office, most notably those for Secretary of State.  The major party candidates included the Black Democrat incumbent, Jesse White, and a Hispanic Republican challenger, Robert Enriquez. We exposed respondents to different images of Jesse White, according to three experimental conditions: light, dark, and control (no picture). Our dependent measures included a question about how each respondent intended to vote and a feeling thermometer measuring affect toward each candidate. I use the net of the thermometer rating for White minus the thermometer rating for Enriquez (“netther”). For more on this type of measure, here’s a link to a study on feeling thermometers and their precision compared to questionnaire measures with fewer categories.

Even before we have to deal with the issue of weighted standard errors, it would be nice to find an elegant way to extract weighted statistics on subsets of our data corresponding to our experimental cells (or other qualitative grouping of our data, such as respondent party or ethnicity). It’s rather clunky and/or difficult to do this with other packages/functions, such as base::by or doBy::summaryBy because of the difficulty of passing more than one variable to the function in question, say “weighted.mean(),” applied to the subset of your data specified by a particular variable (see this discussion for details — thanks to David Freedman for the pointer to plyr::ddply). The survey::svymean function provides another potential alternative, but is nowhere near as flexible as is ddply.

We’ll use plyr::ddply, which handles this problem quite well. In the code below, we use “ddply(),” which takes a data frame for input and returns a data frame for output. The “.data” parameter specifies the data frame and the “.variables” parameter specifies the qualitative variable used to subset the data. We use “summarise()” as the function, which allows us to specify not only the function we want to pass over the subsets of our data (experimental cells), but also to tell ddply which variables in our data subset to pass to the function. For example, below we run the function “mean()” on the variable “netther” for each treatment group.

install.packages("plyr")
library("plyr")

# load the Illinois survey data
load(file("http://www.stanford.edu/~messing/data/IL2010.Rda"))

# look at ddply:
ddply(.data = IL2010, .variables= .(treat), .fun = summarise,
		mean = mean(x=netther, na.rm=T),
		se = sd(netther, na.rm=T)/sqrt(length(netther)) )

Now that we have some nice machinery to quickly summarize weighted data in hand, let’s get into bootstrapping. The computation of weighted standard errors for estimates of the weighted mean has no straightforward closed form solution, and hence bootstrapping is not a bad way to go. Bootstrapping consists of computing a statistic over and over, by resampling the data with replacement. For more see the UCLA stats webpage on bootstrapping in R.

I’ll use the boot package to implement bootstrapping, which requires us to specify a bootstrapping function that returns a statistic.

install.packages("boot")
library("boot")

sample.wtd.mean <- function(x, w, d) {
	return(weighted.mean(x = x[d], w = w[d], na.rm=T ))
}

Why do we need this function? It allows the boot function to sample our data many times (with replacement), each time computing the statistic we want to estimate, as mentioned above. Specifying the function this way makes the process quite elegant by making use of R’s indexing capabilities. boot passes this function an index of items to include from our original data in each of the resamples. To illustrate this, run the following code in R:

playdata <- runif(20, 1, 10)
playdata
d <- sample(20, replace = TRUE)
d
playdata[d]

The code “playdata[d]” returns the elements of playdata indexed by d. This is exactly how the boot function works. It passes our “sample.wtd.mean()” function our index of items (d) over and over again, each time based on sampling with replacement — utilizing the fast C code that R uses for indexing rather than slow R for-loops.

When boot is done, we take the mean or median of these statistics as our estimate, which is a loose definition of the bootstrap. We also use the standard deviation of these estimates as the estimate of the standard error of the statistic. Here’s what happens when we pass our function over our data:

b <- boot( IL2010$netther, statistic = sample.wtd.mean, R=1000, w=IL2010$weight)
b
sd(b$t)

[UPDATE 26 JAN 2012: previously this was coded incorrectly.  The "weights" parameter has been changed to "w." Thanks to Gaurav Sood and Yph Lelkes for pointing this out!]
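If you also want an interval estimate rather than just the standard error, the boot package will compute one from the same object (a quick sketch using the percentile method):

# percentile bootstrap confidence interval for the weighted mean
boot.ci(b, type = "perc")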

Now let’s put it all together using “plyr” and “ggplot2.” I use dotplots because they convey numbers more accurately than other types of plots, such as bar plots or (worst of all) pie charts. William Cleveland has conducted research showing that dots aligned on the same scale are indeed the best visualization to convey a series of numerical estimates (Correction, the actual study is here). His definition of “best” is based on the accuracy with which people interpret various graphical elements, or as he calls them, “perceptual units.”

We’ll first just visualize results by party:


install.packages("ggplot2")
library("ggplot2")

# Create nice party ID variable:
IL2010$pid[IL2010$pid==8] <- NA
IL2010$pidcut <- cut(IL2010$pid, c(-Inf, 1, 3, 4, 6, Inf) , 
		c("Strong Democrat", "Democrat", "Independent", "Republican", "Strong Republican"))
# clean up treatment variable
IL2010$treat <- factor(IL2010$treat, labels = c("Dark", "Light", "Control" ))
# Use plyr::ddply to produce estimates based on PID
pid.therm <- ddply(.data = IL2010, .variables= .(pidcut), .fun = summarise,		
		mean = weighted.mean(x=netther, w=weight, na.rm=T),
		se = sd(boot(netther, sample.wtd.mean, R = 1000, w = weight)$t))
# plot it
ggplot(na.omit(pid.therm), aes(x = mean, xmin = mean-se, xmax = mean+se, y = factor(pidcut))) +
		geom_point() + geom_segment( aes(x = mean-se, xend = mean+se,
						y = factor(pidcut), yend=factor(pidcut))) +
		theme_bw() + opts(axis.title.x = theme_text(size = 12, vjust = .25))+
		xlab("Mean thermometer rating") + ylab("Treatment") +
		opts(title = expression("Thermometer ratings for Jessie White by party"))

That’s a pretty big ggplot command, so let’s discuss each element. The first is the data to plot, which is the object produced via ddply. The second specifies the aesthetic elements of the plot, including the x and y values of the plot, and the boundaries of the plot based on specified minimum and maximum endpoints (“xmin = mean-se, xmax = mean+se”). Next, we add points to the plot by invoking “geom_point().” If we only wanted points, we could stop here. We plot lines representing standard errors with “geom_segment()” specifying x and xend. Next, we specify that we want to use theme_bw(), which I find cleaner than the default parameters. Then we make some minor adjustments via opts for aesthetics. Lastly, we set the y and x labels and the title.

Well, it’s pretty clear from our plot that there’s a lot of action on partisan ID. Let’s look at how each group is affected by the treatment. To do so, we’ll produce the plot at the top of this article, putting each partisan group in a panel via “facet_wrap(),” and plotting each treatment group. First, we’ll generate the summary data using plyr; all we need to do is add another variable (“treat”).

# PID x Treatment
trtbypid.therm <- ddply(.data = IL2010, .variables= .(treat, pidcut), .fun = summarise,		
		mean = weighted.mean(x=netther, w=weight, na.rm=T),
		se = sd(boot(netther, sample.wtd.mean, R = 1000, w = weight)$t))

ggplot(na.omit(trtbypid.therm), aes(x = mean, xmin = mean-se, xmax = mean+se, y = factor(treat))) + 
		geom_point() + geom_segment( aes(x = mean-se, xend = mean+se, 
						y = factor(treat), yend=factor(treat))) +
		facet_wrap(~pidcut, ncol=1) +
		theme_bw() + opts(axis.title.x = theme_text(size = 12, vjust = .25))+ 
		xlab("Mean thermometer rating") + ylab("Treatment") +
		opts(title = expression("Thermometer ratings for Jessie White by treatment condition and pid"))

It looks like Republicans generally don’t like the dark image, while Democrats do. But what if we want to look at the way that racial attitudes interact with the treatments? We measured implicit racial attitudes via the IAT. Let’s take a look.

# Fix up the data to remove the "Control" condition and all "na's"
ilplot <- IL2010[-which(is.na(IL2010$pidcut)),]
ilplot$treat[ilplot$treat=="Control"] <- NA
ilplot$treat <- ilplot$treat[drop=T]
ilplot <- ilplot[-which(is.na(ilplot$treat)),]
ilplot$dscore[which(ilplot$dscore< -1)] <- NA

# Plot it
ggplot(ilplot, aes(x = dscore, y = netther, colour=treat)) +
		geom_point() + geom_smooth(method = "loess", size = 1.5) +
		facet_wrap(~pidcut, nrow=1) +
		theme_bw() + opts(axis.title.x = theme_text(size = 12, vjust = .25))+
		xlab("IAT d-score (- pro black, + anti-black)") + ylab("Thermometer rating") +
		opts(title = expression("Thermometer ratings for Jessie White by treatment condition and iat"))

Visualization of IAT x skin complexion manipulation (conditional on party id ) via ggplot2

With this plot, we don’t really need to subset the data because we are using all of the points and just using loess to draw lines representing the conditional means across every value of IAT score. The plot shows what seems to be an interaction between IAT score and treatment, such that those with high (anti-black) IAT scores generally favor the lighter skinned image.


Map the distribution of your sample by geolocating ip addresses or zip codes

Using the maps package to plot your geolocated sample


Yesterday I wanted to create a map of participants from a study on social media and partisan selective exposure that Sean Westwood and I ran recently, with participants from Amazon’s Mechanical Turk.  We recorded ip addresses for each Turker participant, so I used a geolocation service to lookup the latitude and longitude associated with each participant’s ip address, and then created a nice little map in R.  This can also be done if you have each participant’s zip+4 code, which is probably more accurate as well.

At any rate, we made a map that looks something like the picture above.

In this post I’ll illustrate how to make a map like this first using zip+4 codes and then using ip addresses.  We’ll use RCurl to send HTTP requests to a service that returns the relevant geolocation data, and then extract the relevant info by parsing JSON or HTML output (using regular expressions).

First we’ll pass zip+4 codes to the Yahoo! PlaceFinder Application Programming Interface (API) to return the state, latitude and longitude.  We’ll use their JSON output, because it’s easier to parse in R than the alternative (XML).

The first thing you’ll need to do is to get yourself a Yahoo! Application ID, so that you can use Yahoo’s API. This should only take about ten minutes. When you’ve got it, replace your appid where it says “[INSERT YOUR YAHOO ID HERE]” in the code below.

Here’s the R code:

###############################################################################
################## Example 1: Geolocating Zip +4 Codes ########################
###############################################################################

zips <- c("48198-6275", "75248-5000", "27604-1520", "53010-3212", "95453-9630",
		"10548-1241", "70582-6107", "84106-2494", "78613-3249", "18705-3906",
		"23108-0274", "23108-0274", "67361-1732", "28227-1454", "37027-5902",
		"75067-4208", "32505-2315", "80905-4449", "49008-1818", "08034-2616",
		"04347-1204", "94703-1708", "32822-4969", "44035-6205", "60192-1114",
		"45030-4933", "32210-3625", "80439-8782", "68770-7013", "42420-8832",
		"83843-2219", "33064-2555", "75067-6605", "22301-1266", "78757-1652",
		"85140-5232", "19067-6225", "06410-4210", "63109-3347", "89103-4803",
		"99337-5714", "37067-8123", "94582-4921", "65401-3065", "46614-1320",
		"29732-3810", "79412-1142", "23226-3006", "78217-1542", "10128-4923",
		"15145-1635", "48135-1093", "50621-0683", "32713-3803", "08251-1332",
		"14445-2024", "83210-0054", "27525-8790", "32609-8860", "53711-3548",
		"49601-2214", "15324-0010", "37604-3508", "85041-7739", "01906-1317",
		"10552-2605", "20120-6322", "19934-1263", "87124-2735", "22192-4415",
		"45344-9130", "08618-1464", "83646-3473", "80911-9007", "13057-2023",
		"30033-2016", "30039-6323", "20109-8120", "98043-2105", "34655-3129",
		"60465-1326", "33433-5746", "30707-1636", "98102-4432", "92037-7407",
		"95054-4144", "94703-1708", "94110-2712", "92562-5073", "31418-0341",
		"10009-1819", "73703-6736", "98554-0175", "65649-8267", "21704-7864",
		"15209-2420", "61550-2715", "92506-5317", "60611-3801", "48161-4064",
		"77365-3268", "50022-1388", "63841-1227", "36207-5817", "22121-0232",
		"73115-4446", "85340-7341", "44420-3415", "07663-4913", "10065-6716",
		"91606-2213", "19120-9237", "80015-2718", "97401-4448", "46637-5424",
		"29609-5425", "20815-6414", "45373-1731", "57106-6205", "71744-9486",
		"14222-1922", "28306-3901", "84074-2684", "28605-8292", "59405-8649")
 
###############################################################################
############################# Geolocate data ##################################
###############################################################################

# First load RCurl, which we'll use to send HTTP requests to the Yahoo! API
library(RCurl)

# Next load RJSONIO, an R library that deals with JavaScript Object Notation (JSON) 
# input/output.  In my experience this library is a bit faster and more robust than the rjson package.   
library(RJSONIO)

# create variables to store output:
state <- character(length(zips))
lat <- character(length(zips))
lon <- character(length(zips))
appid <- "&appid=[INSERT YOUR YAHOO ID HERE]"

# try a sample query
ipinfo <- fromJSON(getURL(paste("http://where.yahooapis.com/geocode?flags=J&country=USA&q=", zips[1], appid, sep="")))

# the good stuff is here:
ipinfo$ResultSet$Results[[1]]
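
# (if you want to see where things live in the parsed JSON, it can help to inspect
# the structure, e.g.:)
# str(ipinfo$ResultSet, max.level = 1)
# names(ipinfo$ResultSet$Results[[1]])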

for(i in 1:length(zips)){
	print(paste("i = ", i, "  zip = ", zips[i] ))
	
	# Read in geolocation data for this zip+4 code
	ipinfo <- fromJSON(getURL(paste("http://where.yahooapis.com/geocode?flags=J&country=USA&q=", zips[i], appid, sep="")))
	
	# only keep results if the API actually found a match ("Found" is element 6 of the ResultSet)
	if(as.numeric(ipinfo$ResultSet$Found)>0){
		# get the two-letter state code
		try(suppressWarnings(state[i] <- ipinfo$ResultSet$Results[[1]][["statecode"]])) 
		print(state[i])

		# get latitude
		try(suppressWarnings(lat[i] <- ipinfo$ResultSet$Results[[1]][["latitude"]])) 
		print(lat[i])
		
		# get longitude
		try(suppressWarnings(lon[i] <- ipinfo$ResultSet$Results[[1]][["longitude"]])) 
		print(lon[i])
		
	}
	
}

###############################################################################
####################### draw maps based on zip+4 codes ########################
###############################################################################

# We'll draw a point at each observation's latitude and longitude, and
# we'll shade in each state according to how many cases we have in that state

# first count the number of cases in each state:
mtstates <- table(state)
mtstates

# Convert state abbreviations to full state names: 
mtfullstatenames <- tolower(state.name[match(names(mtstates), state.abb)])
mtfullstatenames
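
# (match() gives the position of each abbreviation in state.abb, which we then use
# to index state.name; e.g. state.name[match("MI", state.abb)] returns "Michigan")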

mtfullstatenames[names(mtstates)=="DC"] <- "district of columbia"
names(mtstates) <- mtfullstatenames

# use "cut" to group the count data into several groups
mtstatesb <- cut(mtstates, breaks= c(-Inf, 0, 1, 4, 7, Inf), labels=c("0", "1", "2-4", "5-7", "8+")  )  
mtstatesb 
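
# for example, state counts of 0, 1, 3, 6, and 9 would land in the buckets
# "0", "1", "2-4", "5-7", and "8+" respectively:
cut(c(0, 1, 3, 6, 9), breaks = c(-Inf, 0, 1, 4, 7, Inf), labels = c("0", "1", "2-4", "5-7", "8+"))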

library(maps)
# generate a map object so that we can match names to the state level shape files
usmap <- map("state")


# Some of the state level shape files are broken down by sub-state region.  We
# just care about the political boundaries for the purposes of this project, 
# so we will not differentiate.  Thus, we need to make all names within a 
# state the same, which will mean some names are repeated.  
# Luckily the usmap$names are as follows: state:subregion
usmap$names

# so we can just split the strings based on the ":" character.  We'll use
# sapply( ..., "[[", 1) which extracts the first element when a list is returned:
mapstates <- sapply(strsplit(usmap$names, ":"), "[[", 1)
mapstates
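
# (e.g., for hypothetical entries "new york:manhattan" and "new york:long island",
# this approach returns "new york" for both:)
sapply(strsplit(c("new york:manhattan", "new york:long island"), ":"), "[[", 1)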

# Now we need to generate a color for each item in mapstates so we can color in the map. 
# We first use "match()" to index our count of observations per state in mtstatesb by mapstates.  
# shapeFileN will contain the grouped counts specified in mtstatesb, but indexed by 
# the names in mapstates, some of which will of course be repeated as discussed above. 

shapeFileN <- mtstatesb[match(mapstates,names(mtstates))]

# Now we can generate a series of grey colors indexed by the count of observations per state,
# indexed so that the observations match the shape files that come with the "state" object in
# the maps package:
cols <- rev(grey.colors(5))[shapeFileN]
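
# (indexing a vector by a factor uses the factor's underlying integer codes, so e.g. a
# state in the third bucket, "2-4", picks up the third color of the reversed palette:)
rev(grey.colors(5))[factor("2-4", levels = levels(mtstatesb))]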

# now let's draw the map:
png("demosamplemap.png", width=1000, height=700)

# draw the map object using the colors specified above
map("state", col= cols, fill=T)

# draw latitude and longitude points for each observation:
points(as.numeric(lon), as.numeric(lat), cex = .3, col = rgb(0,0,0,.5))

# draw a legend
legend("bottomright", levels(mtstatesb), fill = c("white", rev(grey.colors(5))[2:5]), bty="n")

# put a title on it:
title("Geographic distribution of Zip+4 codes")
dev.off()

Note that for publication purposes, you’ll want to create a pdf instead of a png file.
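For example, something like this (the file name is arbitrary; note that pdf() takes its width and height in inches rather than pixels):

pdf("demosamplemap.pdf", width = 10, height = 7)
map("state", col = cols, fill = T)
points(as.numeric(lon), as.numeric(lat), cex = .3, col = rgb(0, 0, 0, .5))
legend("bottomright", levels(mtstatesb), fill = c("white", rev(grey.colors(5))[2:5]), bty = "n")
title("Geographic distribution of Zip+4 codes")
dev.off()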

OK, now for our second example, which geolocates IP addresses.  I found an excellent IP database website that is reasonably current and has nicely formatted HTML: http://www.ip-db.com/.  You can look up an IP address as follows: http://www.ip-db.com/[ip address].  Take a look at the HTML for http://www.ip-db.com/64.90.182.55, one of the NIST Internet Time Servers, which I'll use to create the example code.  If you view the HTML source, you'll see that the relevant state is embedded in this line:

Region:</b></font></td><td width="10">&nbsp;</td><td><font face="arial" size="2">New York</td>

The latitude and longitude, which we'll need to create the map, are also embedded:

Lat/Long:
<a href="http://maps.google.com/maps?om=1&q=40.7089+-74.0012&z=10" target="_blank">40.7089 -74.0012</a></td>

Now we'll use RCurl to request each IP address's page from this website, then use regular expressions to extract the relevant data from the HTML that comes back. Code below:
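To see the extraction idea in miniature before the full loop, here's the number-matching step run on a snippet like the one above (the snippet string itself is just for illustration):

library(stringr)
snippet <- "target=\"_blank\">40.7089 -74.0012</a></td>"
str_extract_all(snippet, "\\-?\\d+(\\.\\d+)?")[[1]]
# [1] "40.7089"  "-74.0012"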

###############################################################################
################## Example 2: Geolocating IP Addresses ########################
###############################################################################

ips <- c("64.90.182.55", "96.47.67.105", "206.246.122.250", "129.6.15.28", "64.236.96.53", 
		"216.119.63.113", "64.250.177.145", "208.66.175.36", "173.14.55.9", "64.113.32.5", 
		"66.219.116.140", "24.56.178.140", "132.163.4.101", "132.163.4.102", "132.163.4.103",
		"192.43.244.18", "128.138.140.44", "128.138.188.172", "198.60.73.8", "64.250.229.100",
		"131.107.13.100", "207.200.81.113", "69.25.96.13", "216.171.124.36", "64.147.116.229")

# create variables to store output:
state <- character(length(ips))
lat <- character(length(ips))
lon <- character(length(ips))

# load stringr up front; we'll use it for regular expression matching inside the loop
library(stringr)

# iterate over IP addresses:
for(i in 1:length(ips)){
	print(paste("i = ", i, " ip: ", ips[i]))

	ipinfo <- getURL(paste("http://www.ip-db.com/", ips[i], sep=""))
	
	# use regular expressions to extract the key info:
	
	# Get state
	# We use the regex (\\w\\s?)+ where \\w = word character, \\s? = possible space character, 
	# () groups the pattern, and + means one or more.  
	# note that we must escape " characters by \" to include them in the pattern
	pattrn <- "Region:</b></font></td><td width=\"10\">&nbsp;</td><td><font face=\"arial\" size=\"2\">(\\w\\s?)+</td>"
	( results <- str_extract(ipinfo, pattrn))
	
	# Extract the actual text of the state and clean it up so we can actually use it: 
	state[i] <- str_extract(results, ">(\\w\\s?)+<")
	state[i] <- gsub(">", "", state[i])
	state[i] <- gsub("<", "", state[i])
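	# (e.g., if results contained ">New York<", state[i] is now just "New York")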
	
	print(state[i])
	
	# Now get latitude and longitude:
	# We can limit our pattern to target="_blank" because it's a unique pattern in the html
	# An example of our substring of interest follows: "target=\"_blank\">40.7089 -74.0012</a></td>"
	# So we use the pattern \\-?\\d+(\\.\\d+)? \\-?\\d+(\\.\\d+)? where \\d means digit,
	# ? makes the preceding element optional, () groups the pattern, and \\- is an escaped dash
	
	pattrn <- "target=\"_blank\">\\-?\\d+(\\.\\d+)? \\-?\\d+(\\.\\d+)?</a></td>"
	( results <- str_extract(ipinfo, pattrn))
		
	# Extract both latitude and longitude and clean them up so we can treat them as numeric later: 
	latlon <- str_extract_all(results, "\\-?\\d+(\\.\\d+)?")
	
	# str_extract_all() returns a list (one element per input string), so we unlist it
	# to get a plain character vector:
	latlon <- unlist(latlon)
	
	# there are two matches: the first is latitude, the second longitude:
	lat[i] <- latlon[1] 
	lon[i] <- latlon[2] 
	print(lat[i])
	print(lon[i])
	
	# Put a delay between entries so there's no danger of overwhelming the server
	Sys.sleep(sample(15:25,1)/10) 
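	# (sample(15:25,1)/10 draws a random delay between 1.5 and 2.5 seconds)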
}

###############################################################################
####################### draw maps based on ip addresses #######################
###############################################################################

# We'll draw a point at each observation's latitude and longitude, and
# we'll shade in each state according to how many cases we have in that state

# first count the number of cases in each state:
mtstates <- table(state)

# then match to the state.name object (which the maps object also uses)
mtfullstatenames <- tolower(state.name[match(names(mtstates), state.name)])

# use the lowercase full state names as the names of our count table:
names(mtstates) <- mtfullstatenames

# use "cut" to group the count data into several groups
mtstatesb <- cut(mtstates, breaks= c(-Inf, 0, 1, 2, 4, Inf), labels=c("0", "1", "2", "3-4", "5+")  )  

library(maps)
# generate a map object so that we can match names to the state level shape files
usmap <- map("state")


# Some of the state level shape files are broken down by sub-state region.  We
# just care about the political boundaries for the purposes of this project, 
# so we will not differentiate.  Thus, we need to make all names within a 
# state the same, which will mean some names are repeated.  
# Luckily the usmap$names are as follows: state:subregion
usmap$names

# so we can just split the strings based on the ":" character.  We'll use
# sapply( ..., "[[", 1) which extracts the first element when a list is returned:
mapstates <- sapply(strsplit(usmap$names, ":"), "[[", 1)
mapstates

# Now we need to generate a color for each item in mapstates so we can color in the map. 
# We first use "match()" to index our count of observations per state in mtstatesb by mapstates.  
# shapeFileN will contain the grouped counts specified in mtstatesb, but indexed by 
# the names in mapstates, some of which will of course be repeated as discussed above. 

shapeFileN <- mtstatesb[match(mapstates,names(mtstates))]

# Now we can generate a series of grey colors indexed by the count of observations per state,
# indexed so that the observations match the shape files that come with the "state" object in
# the maps package:
cols <- rev(grey.colors(5))[shapeFileN]

# now let's draw the map (using a different file name so we don't overwrite the zip+4 map):
png("demosamplemap_ips.png", width=1000, height=700)

# draw the map object using the colors specified above
map("state", col= cols, fill=T)

# draw latitude and longitude points for each observation:
points(as.numeric(lon), as.numeric(lat), cex = .3, col = rgb(0,0,0,.5))

# draw a legend
legend("bottomright", levels(mtstatesb), fill = c("white", rev(grey.colors(5))[2:5]), bty="n")

# put a title on it:
title("Geographic distribution of NIST Internet Time Servers")
dev.off()