Working with Bipartite/Affiliation Network Data in R


Data can often be usefully conceptualized in terms affiliations between people (or other key data entities). It might be useful analyze common group membership, common purchasing decisions, or common patterns of behavior. This post introduces bipartite/affiliation network data and provides R code to help you process and visualize this kind of data. I recently updated this for use with larger data sets, though I put it together a while back.

Preliminaries

Much of the material here is covered in the more comprehensive “Social Network Analysis Labs in R and SoNIA,” on which I collaborated with Dan McFarland, Sean Westwood and Mike Nowak.

For a great online introduction to social network analysis see the online book Introduction to Social Network Methods by Robert Hanneman and Mark Riddle.

Bipartite/Affiliation Network Data

A network can consist of different ‘classes’ of nodes. For example, a two-mode network might consist of people (the first mode) and groups in which they are members (the second mode). Another very common example of two-mode network data consists of users on a particular website who communicate in the same forum thread.

Here’s a short example of this kind of data. Run this in R for yourself – just copy an paste into the command line or into a script and it will generate a dataframe that we can use for illustrative purposes:

df <- data.frame( person =
    c('Sam','Sam','Sam','Greg','Tom','Tom','Tom','Mary','Mary'), group =
    c('a','b','c','a','b','c','d','b','d'), stringsAsFactors = F)

df
person group
1    Sam     a
2    Sam     b
3    Sam     c
4   Greg     a
5    Tom     b
6    Tom     c
7    Tom     d
8   Mary     b
9   Mary     d

Fast, efficient two-mode to one-mode conversion in R

Suppose we wish to analyze or visualize how the people are connected directly – that is, what if we want the network of people where a tie between two people is present if they are both members of the same group? We need to perform a two-mode to one-mode conversion.

To convert a two-mode incidence matrix to a one-mode adjacency matrix, one can simply multiply an incidence matrix by its transpose, which sum the common 1’s between rows. Recall that matrix multiplication entails multiplying the k-th entry of a row in the first matrix by the k-th entry of a column in the second matrix, then summing, such that the ij-th row-column entry in resulting matrix represents the dot-product of the i-th row of the first matrix and the j-th column of the second. In mathematical notation:

\displaystyle  AB=  \begin{bmatrix}  a & b\\  c & d  \end{bmatrix}  \begin{bmatrix}  e & f\\  g & h  \end{bmatrix} =  \begin{bmatrix}  ae+bg & af+bh\\  ce+dg & cf+dh  \end{bmatrix}

Notice further that multiplying a matrix by its transpose yields the following:

\displaystyle  AA'=  \begin{bmatrix}  a & b\\  c & d  \end{bmatrix}  \begin{bmatrix}  a & c\\  b & d  \end{bmatrix} =  \begin{bmatrix}  aa+bb & ac+bd\\  ca+db & cc+dd  \end{bmatrix}

Because our incidence matrix consists of 0’s and 1’s, the off-diagonal entries represent the total number of common columns, which is exactly what we wanted. We’ll use the %*% operator to tell R to do exactly this. Let’s take a look at a small example using toy data of people and groups to which they belong. We’ll coerce the data to an incidence matrix, then multiply the incidence matrix by its transpose to get the number of common groups between people.

This is easy to do using the matrix algebra functions included in R. But first, you need to restructure your (edgelist) network data as an incidence matrix. An incidence will record a 1 for row-column combinations where a tie is present and 0 otherwise. One easy way to do this in R is to use the table function and then coerce the table object to a matrix object:

m <- table( df )
M <- as.matrix( m )

If you are using the network or sna packages, a network object be coerced via as.matrix(your-network); with the igraph package use get.adjacency(your-network).

This is great, but what about if we are working with a really large data set? Network data is almost always sparse—there are far more pairwise combinations of potential connections than actual observed connections. Hence, we’d actually prefer to keep the underlying data structured in edgelist format, but we’d also like access to R’s matrix algebra functionality.

We can get the best of both worlds using the Matrix library to construct a sparse triplet representation of a matrix. But we’d also like to avoid building the entire incidence matrix and just feed Matrix our edgelist directly, a point that came up in a recent conversation I had with Sean Taylor. We feed Matrix our ‘person’ column to index ‘i’ (rows in the new incidence matrix), our ‘group’ column to index j (columns in the new incidence matrix), and we repeat ‘1’ for the length of the edgelist to denote an incidence.

library('Matrix')
A <- spMatrix(nrow=length(unique(df$person)),
        ncol=length(unique(df$group)),
        i = as.numeric(factor(df$person)),
        j = as.numeric(factor(df$group)),
        x = rep(1, length(as.numeric(df$person))) )
row.names(A) <- levels(factor(df$person))
colnames(A) <- levels(factor(df$group))
A

We will either convert to the ‘mode’ represented by the columns or by the rows.

To get the one-mode representation of ties between rows (people in our example), multiply the matrix by its transpose. Note that you must use the matrix-multiplication operator %*% rather than a simple astrisk. The R code is:

Arow <- A %*% t(A)

But we can still do better! The function tcrossprod is faster and more efficient for this:

Arow <- tcrossprod(A)

Arow will now represent the one-mode matrix formed by the row entities—people will have ties to each other if they are in the same group, in our example. Here’s what it looks like:

Arow
4 x 4 sparse Matrix of class "dgCMatrix"
     Greg Mary Sam Tom
Greg    1    .   1   .
Mary    .    2   1   2
Sam     1    1   3   2
Tom     .    2   2   3

To get the one-mode matrix formed by the column entities (i.e. the number of people) enter the following command:

Acol <- t(A) %*% A

Again, we can use tcrossprod to make this even more efficient:

Acol <- tcrossprod(t(A))

And the resulting co-membership matrix is as follows:

Mcol
group
group a b c d
a 2 1 1 0
b 1 3 2 2
c 1 2 2 1
d 0 2 1 2

Although we’ve used a very small network for our example, this code is highly extensible to the analysis of larger networks with R.

Analysis of Two Mode Data and Mobility

Let’s work with some actual affiliation data, collected by Dan McFarland on student extracurricular affiliations. It’s a longitudinal data set, with 3 waves – 1996, 1997, 1998.  It consists of students (anonymized) and the student organizations in which they are members (e.g. National Honor Society, wrestling team, cheerleading squad, etc.).

What we’ll do is to read in the data, explore it, make a few two-to-one mode conversions, and visualize it.

# Load the 'igraph' library
library('igraph')

# (1) Read in the data files, NA data objects coded as 'na'
magact96 = read.delim('http://dl.dropbox.com/u/25710348/snaimages/mag_act96.txt',
    na.strings = 'na')
magact97 = read.delim('http://dl.dropbox.com/u/25710348/snaimages/mag_act97.txt',
    na.strings = 'na')
magact98 = read.delim('http://dl.dropbox.com/u/25710348/snaimages/mag_act98.txt',
    na.strings = 'na')

Missing data is coded as “na” in this data, which is why we gave R the command na.strings = “na”.
These files consist of four columns of individual-level attributes (ID, gender, grade, race), then a bunch of group membership dummy variables (coded “1” for membership, “0” for no membership).  We need to set aside the first four columns (which do not change from year to year).

magattrib = magact96[,1:4]

g96 <- as.matrix(magact96[,-(1:4)]); row.names(g96) = magact96$ID.
g97 <- as.matrix(magact97[,-(1:4)]); row.names(g97) = magact97$ID.
g98 <- as.matrix(magact98[,-(1:4)]); row.names(g98) = magact98$ID.

By using the [,-(1:4)] index, we drop those columns so that we have a square incidence matrix for each year, and then tell R to set the row names of the matrix to the student’s ID. Note that we need to keep the “.” after ID in this dataset (because it’s in the name of the variable).

Now we load these two-mode matrices into igraph:

i96 <- graph.incidence(g96, mode=c('all') )
i97 <- graph.incidence(g97, mode=c('all') )
i98 <- graph.incidence(g98, mode=c('all') )

Plotting two-mode networks

Now, let’s plot these graphs. The igraph package has excellent plotting functionality that allows you to assign visual attributes to igraph objects before you plot. The alternative is to pass 20 or so arguments to the plot.igraph() function, which gets really messy.

Let’s assign some attributes to our graph. First we set vertex attributes, making sure to make them slightly transparent by altering the gamma, using the rgb(r,g,b,gamma) function to set the color. This makes it much easier to look at a really crowded graph, which might look like a giant hairball otherwise.

You can read up on the RGB color model here.

Each node (or “vertex”) object is accessible by calling V(g), and you can call (or create) a node attribute by using the $ operator so that you call V(g)$attribute. Here’s how to set the color attribute for a set of nodes in a graph object:

V(i96)$color[1:1295] <- rgb(1,0,0,.5)
V(i96)$color[1296:1386] <- rgb(0,1,0,.5)

Notice that we index the V(g)$color object by a seemingly arbitrary value, 1295.  This marks the end of the student nodes, and 1296 is the first group node. You can view which nodes are which by typing V(i96). R prints out a list of all the nodes in the graph, and those with a number are obviously different from those that consist of a group name.

Now we’ll set some other graph attributes:

V(i96)$label <- V(i96)$name
V(i96)$label.color <- rgb(0,0,.2,.5)
V(i96)$label.cex <- .4
V(i96)$size <- 6
V(i96)$frame.color <- NA

You can also set edge attributes. Here we’ll make the edges nearly transparent and slightly yellow because there will be so many edges in this graph:

E(i96)$color <- rgb(.5,.5,0,.2)

Now, we’ll open a pdf “device” on which to plot. This is just a connection to a pdf file. Note that the code below will take a minute or two to execute (or longer if you have a pre- Intel dual-core processor).

pdf('i96.pdf')
plot(i96, layout=layout.fruchterman.reingold)
dev.off()

Note that we’ve used the Fruchterman-Reingold force-directed layout algorithm here.  Generally speaking, the when you have a ton of edges, the Kamada-Kawai layout algorithm works well but, it can get really slow for networks with a lot of nodes. Also, for larger networks, layout.fruchterman.reingold.grid is faster, but can fail to produce a plot with any meaninful pattern if you have too many isolates, as is the case here. Experiment for yourself.

Here’s what we get:


It’s oddly reminiscent of a cresent and star, but impossible to read. Now, if you open the pdf output, you’ll notice that you can zoom in on any part of the graph ad infinitum without losing any resolution. How is that possible in such a small file? It’s possible because the pdf device output consists of data based on vectors: lines, polygons, circles, elipses, etc., each specified by a mathematical formula that your pdf program renders when you view it. Regular bitmap or jpeg picture output, on the other hand, consists of a pixel-coordinate mapping of the image in question, which is why you lose resolution when you zoom in on a digital photograph or a plot produced with most other programs.

Let’s remove all of the isolates (the cresent), change a few aesthetic features, and replot. First, we’ll remove isloates, by deleting all nodes with a degree of 0, meaning that they have zero edges. Then, we’ll suppress labels for students and make their nodes smaller and more transparent. Then we’ll make the edges more narrow more transparent. Then, we’ll replot using various layout algorithms:

i96 <- delete.vertices(i96, V(i96)[ degree(i96)==0 ])
V(i96)$label[1:857] <- NA
V(i96)$color[1:857] <-  rgb(1,0,0,.1)
V(i96)$size[1:857] <- 2

E(i96)$width <- .3
E(i96)$color <- rgb(.5,.5,0,.1)

pdf('i96.2.pdf')
plot(i96, layout=layout.kamada.kawai)
dev.off()

pdf('i96.3.pdf')
plot(i96, layout=layout.fruchterman.reingold.grid)
dev.off()

pdf('i96.4.pdf')
plot(i96, layout=layout.fruchterman.reingold)
dev.off()

I personally prefer the Fruchterman-Reingold layout in this case. The nice thing about this layout is that it really emphasizes centrality–the nodes that are most central are nearly always placed in the middle of the plot. Here’s what it looks like:

Very pretty, but you can’t see which groups are which at this resolution. Zoom in on the pdf output, and you can see things pretty clearly.

Two mode to one mode data transformation

We’ve emphasized groups in this visualization so much, that we might want to just create a network consisting of group co-membership. First we need to create a new network object. We’ll do that the same way for this network as for our example at the top of this page:

g96e <- t(g96) %*% g96
g97e <- t(g97) %*% g97
g98e <- t(g98) %*% g98

i96e <- graph.adjacency(g96e, mode = 'undirected')

Now we need to tansform the graph so that multiple edges become an attribute ( E(g)$weight ) of each unique edge:

E(i96e)$weight <- count.multiple(i96e)
i96e <- simplify(i96e)

Now we’ll set the other plotting parameters as we did above:

# Set vertex attributes
V(i96e)$label <- V(i96e)$name
V(i96e)$label.color <- rgb(0,0,.2,.8)
V(i96e)$label.cex <- .6
V(i96e)$size <- 6
V(i96e)$frame.color <- NA
V(i96e)$color <- rgb(0,0,1,.5)

# Set edge gamma according to edge weight
egam <- (log(E(i96e)$weight)+.3)/max(log(E(i96e)$weight)+.3)
E(i96e)$color <- rgb(.5,.5,0,egam)

We set edge gamma as a function of how many edges exist between two nodes, or in this case, how many students each group has in common. For illustrative purposes, let’s compare how the Kamada-Kawai and Fruchterman-Reingold algorithms render this graph:

pdf('i96e.pdf')
plot(i96e, main = 'layout.kamada.kawai', layout=layout.kamada.kawai)
plot(i96e, main = 'layout.fruchterman.reingold', layout=layout.fruchterman.reingold)
dev.off()

I like the Kamada-Kawai layout for this graph, because the center of the graph is too busy otherwise. And here’s what the resulting plot looks like:


You can check out the difference between each layout yourself. Here’s what the pdf output looks like.  Page 1 shows the Kamada-Kawai layout and page 2 shows the Fruchterman Reingold layout.

Group overlap networks and plots

Now we might also be interested in the percent overlap between groups. Note that this will be a directed graph, because the percent overlap will not be symmetric across groups–for example, it may be that 3/4 of Spanish NHS members are in NHS, but only 1/8 of NHS members are in the Spanish NHS. We’ll create this graph for all years in our data (though we could do it for one year only).

First we’ll need to create a percent overlap graph. We start by dividing each row by the diagonal (this is really easy in R):

ol96 <- g96e/diag(g96e)
ol97 <- g97e/diag(g97e)
ol98 <- g98e/diag(g98e)

Next, sum the matricies and set any NA cells (caused by dividing by zero in the step above) to zero:

magall <- ol96 + ol97 + ol98
magall[is.na(magall)] <- 0

Note that magall now consists of a percent overlap matrix, but because we’ve summed over 3 years, the maximun is now 3 instead of 1.

Let’s compute average club size, by taking the mean across each value in each diagonal:

magdiag <- apply(cbind(diag(g96e), diag(g97e), diag(g98e)), 1, mean )

Finally, we’ll generate centrality measures for magall. When we create the igraph object from our matrix, we need to set weighted=T because otherwise igraph dichotomizes edges at 1. This can distort our centrality measures because now edges represent  more than binary connections–they represent the percent of membership overlap.

magallg <- graph.adjacency(magall, weighted=T)

# Degree
V(magallg)$degree <- degree(magallg)

# Betweenness centrality
V(magallg)$btwcnt <- betweenness(magallg)

Before we plot this, we should probably filter some of the edges, otherwise our graph will probably be too busy to make sense of visually.  Take a look at the distribution of connection strength by plotting the density of the magall matrix:

plot(density(magall))

Nearly all of the edge weights are below 1–or in other words, the percent overlap for most clubs is less than 1/3. Let’s filter at 1, so that an edge will consists of group overlap of more than 1/3 of the group’s members in question.

magallgt1 <- magall
magallgt1[magallgt1&lt;1] <- 0
magallggt1 <- graph.adjacency(magallgt1, weighted=T)

# Removes loops:
magallggt1 <- simplify(magallggt1, remove.multiple=FALSE, remove.loops=TRUE)

Before we do anything else, we’ll create a custom layout based on Fruchterman.-Ringold wherein we adjust the coordates by hand using the tkplot gui tool to make sure all of the labels are visible. This is very useful if you want to create a really sharp-looking network visualization for publication.

magallggt1$layout <- layout.fruchterman.reingold(magallggt1)
V(magallggt1)$label <- V(magallggt1)$name
tkplot(magallggt1)

Let the plot load, then maximize the window, and select to View -> Fit to Screen so that you get maximum resolution for this large graph. Now hand-place the nodes, making sure no labels overlap:

Pay special attention to whether the labels overlap (or might overlap if the font was bigger) along the vertical. Save the layout coordinates to the graph object:

magallggt1$layout <- tkplot.getcoords(1)

We use “1” here because only if this was the first tkplot object you called. If you called tkplot a few times, use the last plot object. You can tell which object is visible because at the top of the tkplot interface, you’ll see something like “Graph plot 1” or in the case of my screenshot above “Graph plot 7” (it was the seventh time I called tkplot).

# Set vertex attributes
V(magallggt1)$label <- V(magallggt1)$name
V(magallggt1)$label.color <- rgb(0,0,.2,.6)
V(magallggt1)$size <- 6
V(magallggt1)$frame.color <- NA
V(magallggt1)$color <- rgb(0,0,1,.5)

# Set edge attributes
E(magallggt1)$arrow.size <- .3

# Set edge gamma according to edge weight
egam <- (E(magallggt1)$weight+.1)/max(E(magallggt1)$weight+.1)
E(magallggt1)$color <- rgb(.5,.5,0,egam)

One thing that we can do with this graph is to set label size as a function of degree, which adds a “tag-cloud”-like element to the visualization:

V(magallggt1)$label.cex <- V(magallggt1)$degree/(max(V(magallggt1)$degree)/2)+ .3
#note, unfortunately one must play with the formula above to get the
#ratio just right

Let’s plot the results:

pdf('magallggt1customlayout.pdf')
plot(magallggt1)
dev.off()

Note that we used the custom layout, which because we made part of the igraph object magallggt1, we did not need to specify in plot command.

Here’s the pdf output, and here’s what it looks like:


This visualization reveals much more information about our network than our cresent-star visualization.

Advertisements
Posted in R | 15 Comments

Visualization series: Insight from Cleveland and Tufte on plotting numeric data by groups

After my post on making dotplots with concise code using plyr and ggplot, I got an email from my dad who practices immigration law and runs a website with a variety of immigration resources and tools.  He pointed out that the post was written for folks who already know that they want to make dot plots, and who already know about bootstrapped standard errors.  That’s not many people.

In an attempt to appeal to a broader audience, I’m starting a series in which I’ll outline the key principles I use when developing a visualization.  In this post, I’ll articulate these principles, which combine some of Tuft’s aesthetic guidelines with Cleveland’s scientific approach to visualization, which is based on the psychological processes involved in making sense of visualizations, and has been rigorously tested via randomized controlled experiments.  Based on these principles, I’ll argue that dotplots and scatterplots are better than other types of plots (especially pie charts) in most situations.  In later posts, I’ll demonstrate another innovation whose widespread use I’ll credit to Cleveland and Tufte: the use of multiple panels (aka small multiples, trellis graphics, facets, generalized draftsman’s displays, multivar charts) to clearly convey the same information embedded in more complex and difficult to read visualizations, including multiple line plots and mosaic plots. In future posts I’ll also emphasize why it is important to provide some indication of the noise present in the underlying data using error bars or bands.  Along the way, I’ll put you to the test–I’ll present some visualizations of the same data using different visualization techniques and ask you to try to get as much information as you can in 2 seconds from each type of visualization.

A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below).  Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points.  Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive.

Yet most visualizations are flawed, drawn using elements that make it unnecessarily difficult for the human visual system to make sense of things.  I see a lot of these visualizations attending research presentations, screening incoming draft manuscripts as the assistant editor for Political Communication, and as a consumer of media info-graphics (CNN is especially bad, have a look at this monstrosity).  Kevin Fox has an especially compelling visual speaking to this here. A big part of the problem is that Microsoft makes it easy to draw flashy but ultimately confusing visualizations in Excel.  If you are too busy to read this post in full, follow this short list of guidelines and you’ll be on your way to producing elegant visualizations that impose a minimal cognitive burden on your audience:

  1. Never represent something in 2 or worse yet 3 dimensions if it can be represented in one—NEVER use pie charts, 3-D pie charts, stacked bar charts, or 3-D bar charts.
  2. Remove as much chart junk as possible–unnecessary gridlines, shading, borders, etc.
  3. Give your audience a sense of the noise present in your data–draw error bars or confidence bands if you are plotting estimates.
  4. If you want to plot multiple types of groups on a single outcome (the visual analog of cross-tabulations/marginals), use multi-paneled plots. These can also help if overploting looks too cluttered.
  5. Avoid mosaic plots. Instead use paneled histograms.
  6. Ditch the legend if you can (you almost always can).

The rest of the content in this series emphasizes why it makes sense to follow these guidelines. In this post I’ll look at the first point in detail and touch on the sixth. These two guidelines are most relevant when you want to look at a quantitative variable (e.g., earnings, vote-share, temperature, etc.) across different qualitative groupings (e.g., industry segment, candidate, party, racial group, season, etc.).  This is one of the most common visualization tasks in business, media, and social science, and for this task people often use pie charts and/or bar charts, and occasionally dot plots.

The science of graphical perception

When most people think about visualization, they think first of Edward Tufte.  Tufte emphasizes integrity to the data, showing relationships between phenomena, and above all else aesthetic minimalism.  I appreciate his ruthless crusade against chart junk and pie charts (nice quote from Data without Borders). We share an affinity for multipanel plotting approaches, which he calls “small multiples,” (thanks to Rebecca Weiss for pointing this out) though I think people give Tufte too much credit for their invention—both juiceanalytics and infovis-wiki write that Cleveland introduced the concept/principle. However, both Cleveland and Tufte published books in 1983 discussing the use of multipanel displays; David Smith over at Revolutions writes that “the “small-multiples” principle of data visualization [was] pioneered by Cleveland and popularized in Tufte’s first book”; and the earliest reference to a work containing multipanel displays I could find was published *long* before Tufte’s 1983 work–Seder, Leonard (1950), “Diagnosis with Diagrams—Part I”, Industrial Quality Control (New York, New York: American Society for Quality Control) 7 (1): 11–19.

I’m less sure about Tufte’s advice to always show axes starting at zero, which can make comparison between two groups difficult, and to “show causality,” which can end up misleading your readers.  Of course, the visualizations on display in the glossy pages of Tufte’s books are beautiful–they belong  in a museum.  But while his books are full of general advice that we should all keep in mind when creating plots, he does not put forth a theory of what works and what doesn’t when trying to visualize data.

Cleveland (with Robert McGill) develops such a theory and subjects it to rigorous scientific testing. In my last post I linked to one of Cleveland’s studies showing that dots (or bars) aligned on the same scale are indeed the best visualization to convey a series of numerical estimates.  In this work, Cleveland examined how accurately our visual system can process visual elements or “perceptual units” representing underlying data.  These elements include markers aligned on the same scale (e.g., dot plots, scatterplots, ordinary bar charts), the length of lines that are not aligned on the same scale (e.g., stacked bar plots), area (pie charts and mosaic plots), angles (also pie charts), shading/color, volume, curvature, and direction.

He runs two experiments: the first compares judgements about relative position (grouped bar charts) to judgements based only on length (stacked bar charts); the second compares judgements about relative position (ordinary bar charts) to judgements about angles/area (pie charts).  Here are the materials he uses, courtesy of the Stanford Computer Graphics Lab:

The results are resoundingly clear—judgements about position relative to a baseline are dramatically more accurate than judgements about angles, area, or length (with no baseline).  Hence, he suggests that we replace pie charts with bar charts or dot plots and that we substitute stacked bar charts for grouped bar charts.

A striking and often overlooked finding in this work is the fact that the group of participants without technical training, “mostly ordinary housewives” as Cleveland describes them, performed just as well as the group of mostly men with substantial technical training and experience.   This finding provides evidence for something that I’ve long suspected: that visualizations make it easier for people lacking quantitative experience to understand your results, serving to level the playing field.  If you want your findings to be broadly accessible, it’s probably better to present a visualization rather than a bunch of numbers.  It also suggests that if someone is having trouble interpreting your visualizations, it’s probably your fault.

Dotplots versus pie charts and stacked barplots

Now let’s put this to the test.  Take a look at each visualization below for two seconds, looking for the percent of the vote that Mitt Romney, Ron Paul, and Jon Huntsman got.

Which is easiest to read? Which conveys information most accurately? Let’s first take a look at the most critical information–the order in which the candidates placed.  In all plots, the candidates are arrayed in order from highest to least vote share, and it’s easy to see that Mitt won.  But once we start looking at who came in second, third, and so on, differences emerge.  It’s slightly harder to process order in the pie chart because your eye has to go around the plot rather than up and down in a straight line.  In the stacked bar chart, we need to look up which color corresponds to which candidate’s in the legend (as Tufte told us not to use), adding a layer of cognitive processing.

Second, which conveys estimates most accurately? The dot plot is the clear winner here.  We can quickly see that Romney got about 37%, Paul got about 24%, and Huntsman got about 16%, just by looking at dots relative to the axis.  When we look at the pie chart, it’s really tough to estimate the exact percent each candidate got.  Same with the stacked bar chart. We could add numbers to the pie and bar charts, which would even things out to some extent, but then why not just display a table with exact percents?

One argument I used to hear all the time when I worked in industry is that pie charts “convey a sense of proportion.”  Well, sure, I guess I can kind of guestimate that Ron Paul’s vote share is about 1/4.  What about Jon Huntsman? Hmm, it looks like about 15 percent, which is 3/20.  But wait, why do I want to convert things into fractions anyway? I don’t think in terms of fractions, I think in terms of percents.  And if I really care about proportion, I suppose I could extend the axis from 0 to 100.

Suppose I want to plot results for the top 15 candidates, not just the top 6?  Here’s what happens:

No contest, the pie chart fails completely.  We’d need to add a legend with colors for each candidate, which adds another layer of cognitive processing–we’d need to look up each color in the lengend as we go.  And even after adding the legend, you wouldn’t be able to distinguish the lower performing candidates from say write-in votes because the pie slices would be too small.  The stacked bar chart will fail for the same reasons, so I’ve excluded it in the interest of brevity.  Note that we don’t need to add colors to the dotplot to convey the same information, which saves an extra plotting element that we can use to represent something else (say candidate’s campaign funds or total assets).  And, on top of it all, the dot plot takes up less screen/page real estate!

Why do I use dot plots instead of ordinary bar charts? A nice visualization guide from perceptualedge.com points out that often we want to only visualize differences between groups in a narrow range (they use an example wherein monthly expenses vary from $4,250-$5,500). But the length of a bar is supposed to facilitate accurate comparisons between values, so when you use a bar plot starting from $4,250, the length between bars dramatically exaggerates the actual differences. Dot plots do not have this problem because dot encode values using only location, so one must reference the axis to interpret the value.

A related points is that bars are often used to convey counts–we use them in histograms to represent frequency and track say counts of dollars earned/raised in bar charts.  In fact, a team of doctors I work with at the med school recently sent in a manuscript to Radiology containing a bar chart plotting mean values between groups; they got back the following comment from the statistical reviewer: “the y-axis is quantitative but the data are represented using bars as if the data were counts.”  People often use bar plots to convey estimates of means (and I’ve certainly done this), which can serve to exaggerate differences in means and hence effect sizes if you do not plot the bars from zero.

In addition, dot plots have aesthetic advantages.  They convey the numerical estimate in question with a single one-dimensional point, rather than a two dimensional bar.  There’s simply less that the eye needs to process.  Accordingly, if a pattern across qualitative groupings exists, it’s often easier to see with a dot plot.  For example, below I plot the average user ratings for each article to which Sean Westwood and I exposed subjects in a news reading experiment.  The pattern that emerges is an “S” curve in which one or two stories dominate the ratings, most are sort of average, and a few are uniformly terrible. Note that you’d probably want to use something like this more for yourself than to communicate your results to others as it might overload your audience with too much information–you’d do better to select a subset of these articles or remove some of the ones in the middle (thanks to Yph Lelkes for making this point).

One question that remains is if pie charts are so bad, why are they so common? Perhaps we like them because we find them comforting just as we find pies and pizza? Well if so we’d expect pie charts to be less common in places like Japan and China where people grow up eating different food.  Consider info-graphics in newspapers: I haven’t yet done a systematic content analysis, but I was unable to find a single pie chart in Japan’s Yomimuri Shimbun nor the Asahi Shimbun; nor in China’s Beijing Daily nor Sing Tao Daily.  I did see plenty of maps, however, which I suppose one could argue are reminiscent of noodles.

Implementation

The most efficient way to produce solid visualizations with the ability to implement multiple panels, proper standard error estimates, and dot plots is probably in R using the ggplot2 package.  If you do not have time to learn R and remain tied to MS-Excel stick to ordinary barplots to visualize quantitative variables among multiple groups (not recommended).

Otherwise, if you don’t already use it, download R and a decent editor like Rstudio.  Then get started with ggplot2 and dot plots by running the following code chunk which will replicate the election figure above:


pres <- read.csv("https://dl.dropboxusercontent.com/u/25710348/Blogposts/data/primaryres.csv", as.is=T)

# sort data in order of percent of vote:
pres <- pres[order(pres$Percentage, decreasing=T), ]

# only show top 15 candidates:
pres <- pres[1:15,]

# create a precentage variable
pres$Percentage <- pres$Percentage*100

# reorder the Candidate factor by percentage for plotting purposes:
pres$Candidate <- reorder(pres$Candidate, pres$Percentage)

# To install ggplot2, run the following line after deleting the #
#install.packages("ggplot2")

library(ggplot2)
ggplot(pres, aes(x = Percentage, y = factor(Candidate) )) +
  geom_point() +
  theme_bw() +  xlab("Percent of Vote") + ylab("Candidate") +
  ggtitle("New Hampshire Primary 2012")

After loading our data and running a few preliminary data processing operations, we pass ggplot our data set, “pres,” then we tell it what aesthetic elements we want to use, in this case that x is going to be our “Percentage” variable and y is going to be our “Candidate” variable. We tell ggplot that we want to display points for every xy pair. We also tell it to use the black and white theme, and pass some obscure axis options that ensures the axis plot correctly. Then we tell it what to label the x and y axis, and give it a title.

We can also reproduce the article ratings by story plot above using ggplot2 (even though I originally produced the plot using the lattice package).

# To install ggplot2, run the following line after deleting the #
#install.packages("ggplot2")

library(ggplot2)

load(file("https://dl.dropboxusercontent.com/u/25710348/Blogposts/data/db.Rda"))

# if you haven't installed dplyr, delete the # and run this line:
# install.packages("dplyr")

library(dplyr)

table(db$story)

# first we use plyr to calculate the mean rating and SE for each story
ratingdat <- db %>% group_by(story) %>%
  summarise(M = mean(rating, na.rm=T),
            SE = sd(rating, na.rm=T)/sqrt(length(na.omit(rating))),
            N = length(na.omit(rating)))

# make story into an ordered factor, ordering by mean rating:
ratingdat$story <- factor(ratingdat$story)
ratingdat$story <- reorder(ratingdat$story, ratingdat$M)

# take a look at our handiwork:


ggplot(ratingdat, aes(x = M, xmin = M-SE, xmax = M+SE, y = story )) +
  geom_point() + geom_segment( aes(x = M-SE, xend = M+SE,
                                   y = story, yend=story)) +
  theme_bw() + xlab("Mean rating") + ylab("Story") +
  ggtitle("Rating article by Story, with SE")

# Now save
ggsave(file="plots/dotplot-story-rating.pdf", height=14, width=8.5)

Posted in R | 26 Comments

Putting it all together: concise code to make dotplots with weighted bootstrapped standard errors

Dotplots with plyr, boot, and ggplot2

I analyze a lot of experiments and there are many times when I want to quickly look at means and standard errors for each cell (experimental condition), or the same for each cell and individual-level attribute level (e.g., Democrat, Independent, Republican).  Many of these experiments are embedded in national or state-level surveys, for which each respondent is weighted so that the sample means for gender, age, and political affiliation approximate those of a probability sample. But if you use weights, it’s difficult to get unbiased estimates of the standard error of that weighted mean.

In this post, I’ll walk through how to use the powerful plyr package to summarize your weighted data, estimating weighted means and standard errors for each experimental cell using the bootstrap, and then generate dotplots to visualize the result using the ggplot2 package. Note that these are both Hadley Wickham’s packages; he’s also responsible for the immensely useful packages reshape and stringr, which I hope to write about in future posts.

First, I’ll describe the data we’re going to use. Just before the 2010 elections, my lab hired YouGov/Polimetrics to survey Illinois voters about a series of candidates for statewide office, most notably those for Secretary of State.  The major party candidates included the Black Democrat incumbent, Jesse White, and a Hispanic Republican challenger, Robert Enriquez. We exposed respondents to different images of Jesse White, according to three experimental conditions: light, dark, and control (no picture). Our dependent measures included a question about how each respondent intended to vote and a feeling thermometer measuring affect toward each candidate. I use the net of the thermometer rating for Obama minus the thermometer rating for McCain (“netther”). For more on this type of measure, here’s a link to a study on feeling thermometers and their precision compared to questionnaire measures with fewer categories.

Even before we have to deal with the issue of weighted standard errors, it would be nice to find an elegant way to extract weighted statistics on subsets of our data corresponding to our experimental cells (or other qualitative grouping of our data, such as respondent party or ethnicity). It’s rather clunky and/or difficult to do this with other packages/functions, such as base::by or doBy::summaryBy because of the difficulty of passing more than one variable to the function in question, say “weighted.mean(),” applied to the subset of your data specified by a particular variable (see this discussion for details — thanks to David Freedman for the pointer to plyr::ddply). The survey::svymean function provides another potential alternative, but is nowhere near as flexible as is ddply.

We’ll use plyr::ddply, which handles this problem quite well. In the code below, we use “ddply(),” which takes a data frame for input and returns a data frame for output. The “.data” parameter specifies the data frame and the “.variables” parameter specifies the qualitative variable used to subset the data. We use “summarise()” as the function, which allows us to specify not only the function we want to pass over the subsets of our data (experimental cells), but also to tell ddply which variables in our data subset to pass to the function. For example, below we run the function “mean()” on the variable “netther” for each treatment group.

install.packages("plyr")
library("plyr")

# load the Illinois survey data
load(file("dl.dropboxusercontent.com/u/25710348/Blogposts/data/IL2010.Rda"))

# look at ddply:
ddply(.data = IL2010, .variables= .(treat), .fun = summarise,
		mean = mean(x=netther, na.rm=T),
		se = sd(netther, na.rm=T)/sqrt(length(netther)) )

Now that we have some nice machinery to quickly summarize weighted data in hand, let’s get into bootstrapping. The computation of weighted standard errors for estimates of the weighted mean has no straighforward closed form solution, and hence bootstrapping is not a bad way to go. Bootstrapping consists of computing a statistic over and over, by resampling the data with replacement. For more see the UCLA stats webpage on bootstrapping in R.

I’ll use the boot package to implement bootstrapping, which requires us to specify a bootstrapping function that returns a statistic.

install.packages("boot")
library("boot")

sample.wtd.mean <- function(x, w, d) {
	return(weighted.mean(x = x[d], w = w[d], na.rm=T ))
}

Why do we need this function? It allows the boot function to sample our data many times (w/ replacement), each time computing the statistic we want to estimate, as mentioned above. Specifying that function make this process quite elegant by making use of R’s indexing capabilities. Boot pass this function an index of items to include from our original data in each of the resamples. To illustrate this, run the following code in R:

playdata <- runif(20, 1, 10)
playdata
d <- sample(20, replace = TRUE)
d
playdata[d]

The code “playdata[d]” returns the “dth” element from playdata. This is exactly how the boot function works. It passes our “sample.wtd.mean()” function our index of items (d) over and over again based on sampling without replacement — utilizing the fast c code that R uses for indexing rather than slow R for-loops.

When boot is done, we take the mean or median of these statistics as our estimate, which is a loose definition of the bootstrap. We also use the standard deviation of these estimates as the estimate of the standard error of the statistic. Here’s what happens when we pass our function over our data:

b <- boot( IL2010$netther, statistic = sample.wtd.mean, R=1000, w=IL2010$weight)
b
sd(b$t)

[UPDATE 26 JAN 2012: perviously this was coded incorrectly.  The “weights” parameter has been changed to “w.” Thanks to Gaurav Sood and Yph Lelkes for pointing this out!]

Now let’s put it all together using “plyr” and “ggplot2.” I use dotplots because they convey numbers more accurately than other types of plots, such as bar plots or (worst of all) pie charts. William Cleveland has conducted research showing that dots aligned on the same scale are indeed the best visualization to convey a series of numerical estimates (Correction, the actual study is here). His definition of “best” is based on the accuracy with which people interpret various graphical elements, or as he calls them, “perceptual units.”

We’ll first just visualize results by party:


install.packages("ggplot2")
library("ggplot2")

# Create nice party ID variable:
IL2010$pid[IL2010$pid==8] <- NA
IL2010$pidcut <- cut(IL2010$pid, c(-Inf, 1, 3, 4, 6, Inf) , 
		c("Strong Democrat", "Democrat", "Independent", "Republican", "Strong Republican"))
# clean up treatment variable
IL2010$treat <- factor(IL2010$treat, labels = c("Dark", "Light", "Control" ))
# Use plyr::ddply to produce estimates based on PID
pid.therm <- ddply(.data = IL2010, .variables= .(pidcut), .fun = summarise,		
		mean = weighted.mean(x=netther, w=weight, na.rm=T),
		se = sd(boot(netther, sample.wtd.mean, R = 1000, w = weight)$t))
# plot it
ggplot(na.omit(pid.therm), aes(x = mean, xmin = mean-se, xmax = mean+se, y = factor(pidcut))) +
		geom_point() + geom_segment( aes(x = mean-se, xend = mean+se,
						y = factor(pidcut), yend=factor(pidcut))) +
		theme_bw() + opts(axis.title.x = theme_text(size = 12, vjust = .25))+
		xlab("Mean thermometer rating") + ylab("Treatment") +
		opts(title = expression("Thermometer ratings for Jessie White by party"))

That’s a pretty big ggplot command, so let’s discuss each element. The first is the data to plot, which is the object produced via ddply. The second specifies the aesthetic elements of the plot, including the x and y values of the plot, and the boundaries of the plot based on specified minimum and maximum endpoints (“xmin = mean-se, xmax = mean+se”). Next, we add points to the plot by envoking “geom_point().” If we only wanted points, we could stop here. We plot lines representing standard errors with “geom_segment()” specifying x and xend. Next, we specify that we want to use theme_bw(), which I find cleaner than the default parameters. Then we make some minor adjustments via opts for aesthetics. Lastly, we set the y and x labels and the title.

Well, it’s pretty clear from our plot that there’s a lot of action on partisan ID. Let’s look at how each group is affected by the treatment. To do so, we’ll produce the plot at the top of this article, putting each partisan group in a panel via “facet_wrap(),” and plotting each treatment group. First, we’ll generate the summary data using plyr, all we need to do is add another variable (“treat”).

# PID x Treatment
trtbypid.therm <- ddply(.data = IL2010, .variables= .(treat, pidcut), .fun = summarise,		
		mean = weighted.mean(x=netther, w=weight, na.rm=T),
		se = sd(boot(netther, sample.wtd.mean, R = 1000, w = weight)$t))

ggplot(na.omit(trtbypid.therm), aes(x = mean, xmin = mean-se, xmax = mean+se, y = factor(treat))) + 
		geom_point() + geom_segment( aes(x = mean-se, xend = mean+se, 
						y = factor(treat), yend=factor(treat))) +
		facet_wrap(~pidcut, ncol=1) +
		theme_bw() + opts(axis.title.x = theme_text(size = 12, vjust = .25))+ 
		xlab("Mean thermometer rating") + ylab("Treatment") +
		opts(title = expression("Thermometer ratings for Jessie White by treatment condition and pid"))

It looks like Republicans generally don’t like the dark image, while Democrats do. But what if we want to look at the way that racial attitudes interact with the treatments? We measured implicit racial attitudes via the IAT). Let’s take a look.

# Fix up the data to remove the "Control" condition and all "na's"
ilplot <- IL2010[-which(is.na(IL2010$pidcut)),]
ilplot$treat[ilplot$treat=="Control"] <- NA
ilplot$treat <- ilplot$treat[drop=T]
ilplot <- ilplot[-which(is.na(ilplot$treat)),]
ilplot$dscore[which(ilplot$dscore< -1)] <- NA

# Plot it
ggplot(ilplot, aes(x = dscore, y = netther, colour=treat)) +
		geom_point() + geom_smooth(method = "loess", size = 1.5) +
		facet_wrap(~pidcut, nrow=1) +
		theme_bw() + opts(axis.title.x = theme_text(size = 12, vjust = .25))+
		xlab("IAT d-score (- pro black, + anti-black)") + ylab("Thermometer rating") +
		opts(title = expression("Thermometer ratings for Jessie White by treatment condition and iat"))

Visualization of IAT x skin complexion manipulation (conditional on party id ) via ggplot2

With this plot, we don’t really need to subset the data because we are using all of the points and just using lowess to draw lines representing the conditional means across every value of IAT score. The plot shows what seems to be an interaction between IAT score and treatment, such that those with high (anti-black) IAT scores generally favor the lighter skinned image.

Posted in R | 15 Comments

Map the distribution of your sample by geolocating ip addresses or zip codes

Using the maps package to plot your geolocated sample


Yesterday I wanted to create a map of participants from a study on social media and partisan selective exposure that Sean Westwood and I ran recently, with participants from Amazon’s Mechanical Turk.  We recorded ip addresses for each Turker participant, so I used a geolocation service to lookup the latitude and longitude associated with each participant’s ip address, and then created a nice little map in R.  This can also be done if you have each participant’s zip+4 code, which is probably more accurate as well.

At any rate, we made a map that looks something like the picture above.

In this post I’ll illustrate how to make a map like this first using zip+4 codes and then using ip addresses.  We’ll use RCurl to send an html POST command to some website that will give us the relevant geolocation data, and then extract the relevant info by parsing JSON or html output (using regular expressions).

First we’ll pass zip+4 codes to the Yahoo! PlaceFinder Application Programming Interface (API) to return the state, latitude and longitude.  We’ll use their JSON output, because it’s easier to parse in R than the alternative (XML).

The first thing you’ll need to do is to get yourself a Yahoo! Application ID, so that you can use Yahoo’s API. This should only take about ten minutes. When you’ve got it, replace your appid where it says “[INSERT YOUR YAHOO ID HERE]” in the code below.

Here’s the R code:

###############################################################################
################## Example 1: Geolocating Zip +4 Codes ########################
###############################################################################

zips <- c("48198-6275", "75248-5000", "27604-1520", "53010-3212", "95453-9630",
		"10548-1241", "70582-6107", "84106-2494", "78613-3249", "18705-3906",
		"23108-0274", "23108-0274", "67361-1732", "28227-1454", "37027-5902",
		"75067-4208", "32505-2315", "80905-4449", "49008-1818", "08034-2616",
		"04347-1204", "94703-1708", "32822-4969", "44035-6205", "60192-1114",
		"45030-4933", "32210-3625", "80439-8782", "68770-7013", "42420-8832",
		"83843-2219", "33064-2555", "75067-6605", "22301-1266", "78757-1652",
		"85140-5232", "19067-6225", "06410-4210", "63109-3347", "89103-4803",
		"99337-5714", "37067-8123", "94582-4921", "65401-3065", "46614-1320",
		"29732-3810", "79412-1142", "23226-3006", "78217-1542", "10128-4923",
		"15145-1635", "48135-1093", "50621-0683", "32713-3803", "08251-1332",
		"14445-2024", "83210-0054", "27525-8790", "32609-8860", "53711-3548",
		"49601-2214", "15324-0010", "37604-3508", "85041-7739", "01906-1317",
		"10552-2605", "20120-6322", "19934-1263", "87124-2735", "22192-4415",
		"45344-9130", "08618-1464", "83646-3473", "80911-9007", "13057-2023",
		"30033-2016", "30039-6323", "20109-8120", "98043-2105", "34655-3129",
		"60465-1326", "33433-5746", "30707-1636", "98102-4432", "92037-7407",
		"95054-4144", "94703-1708", "94110-2712", "92562-5073", "31418-0341",
		"10009-1819", "73703-6736", "98554-0175", "65649-8267", "21704-7864",
		"15209-2420", "61550-2715", "92506-5317", "60611-3801", "48161-4064",
		"77365-3268", "50022-1388", "63841-1227", "36207-5817", "22121-0232",
		"73115-4446", "85340-7341", "44420-3415", "07663-4913", "10065-6716",
		"91606-2213", "19120-9237", "80015-2718", "97401-4448", "46637-5424",
		"29609-5425", "20815-6414", "45373-1731", "57106-6205", "71744-9486",
		"14222-1922", "28306-3901", "84074-2684", "28605-8292", "59405-8649")
 
###############################################################################
############################# Geolocate data ##################################
###############################################################################

# First load RCurl, which we'll use to send HTML POST commands to the Yahoo! API
library(RCurl)

# Next load RJSONIO, an R library that deals with JavaScript Object Notation (JSON) 
# Input/Output.  In my experience this library is a bit faster and more robust than RJSON.   
library(RJSONIO)

# create variables to store output:
state <- character(length(zips))
lat <- character(length(zips))
lon <- character(length(zips))
appid <- "&appid=[INSERT YOUR YAHOO ID HERE]"

# try a sample query
ipinfo <- fromJSON(getURL(paste("http://where.yahooapis.com/geocode?flags=J&country=USA&q=", zips[1], appid, sep="")))

# the good stuff is here:
ipinfo$ResultSet$Results[[1]]

for(i in 1:length(zips)){
	#i=1
	print(paste("i = ", i, "  zip = ", zips[i] ))
	
	# Read in geolocation data
	ipinfo <- fromJSON(getURL(paste("http://where.yahooapis.com/geocode?flags=J&country=USA&q=", zips[i], appid, sep="")))
	
	if(as.numeric(ipinfo$ResultSet[6])>0){
		# state
		try(suppressWarnings(state[i] <- ipinfo$ResultSet$Results[[1]][24])) 
		print(ipinfo$ResultSet$Results[[1]][24])

		# get lattitude
		try(suppressWarnings(lat[i] <- ipinfo$ResultSet$Results[[1]][2])) 
		print(ipinfo$ResultSet$Results[[1]][2])
		
		# get longitude
		try(suppressWarnings(lon[i] <- ipinfo$ResultSet$Results[[1]][3])) 
		print(ipinfo$ResultSet$Results[[1]][3])
		
	}
	
}

###############################################################################
####################### draw maps based on ip addresses #######################
###############################################################################

# We'll draw points for each observation of lattitude and longitude, and
# we'll shade in each state according to how many cases we have in each state

# first count the number of cases in each state:
mtstates <- table(state)
mtstates

# Convert state abbreviations to full state names: 
mtfullstatenames <- tolower(state.name[match(names(mtstates), state.abb)])
mtfullstatenames

mtfullstatenames[names(mtstates)=="DC"] <- "district of columbia"
names(mtstates) <- mtfullstatenames

# use "cut" to group the count data into several groups
mtstatesb <- cut(mtstates, breaks= c(-Inf, 0, 1, 4, 7, Inf), labels=c("0", "1", "2-4", "5-7", "8+")  )  
mtstatesb 

library(maps)
# generate a map object so that we can match names to the state level shape files
usmap <- map("state")


# Some of the state level shape files are broken down by sub-state region.  We
# just care about the political boundaries for the purposes of this project, 
# so we will not differentiate.  Thus, we need to make all names within a 
# state the same, which will mean some names are repeated.  
# Luckily the usmap$names are as follows: state:subregion
usmap$names

# so we can just split the strings based on the ":" character.  We'll use
# sapply( ..., "[[", 1) which extracts the first element when a list is returned:
mapstates <- sapply(strsplit(usmap$names, ":"), "[[", 1)
mapstates

# Now we need to generate a color for each item in mapstates so we can color in the map. 
# We first use "match()" to index our count of observations per state in mtstatesb by mapstates.  
# shapeFileN will contain the numeric range specified in mstatesb, but indexed by the 
# the names in mapstates, some of which will of course be repeated as discussed above. 

shapeFileN <- mtstatesb[match(mapstates,names(mtstates))]

# Now we've can generate a series of grey colors indexed by the count of observations per state,
# indexed so that the the observations match the shape files that come with the "state" object in
# the maps package:
cols <- rev(grey.colors(5))[shapeFileN]

# now let's draw the map:
png("demosamplemap.png", width=1000, height=700)

# draw the map object using the colors specified above
map("state", col= cols, fill=T)

# draw lattitude and longitude points for each observation:
points(lon, lat, cex = .3, col = rgb(0,0,0,.5))

# draw a legend
legend("bottomright", levels(mtstatesb), fill = c("white", rev(grey.colors(5))[2:5]), bty="n")

# put a title on it:
title("Geographic distribution of Zip+4 codes")
dev.off()

Note that for publication purposes, you’ll want to create a pdf instead of a png file.

OK, now for our second example, which will geolocate using ip addresses.  I found an excellent ip database website, it’s pretty current and has nicely formatted html: http://www.ip-db.com/.  You can lookup an ip address as follows: http://www.ip-db.com/%5Bipaddress%5D.  Take a look at the html for http://www.ip-db.com/64.90.182.55, which is one of the NIST Internet Time Servers, which I’ll be using to create example code.  If you view the html source, you’ll see that the relevant state is embedded in this line:

Region: <span style="font-family: arial; font-size: x-small;">New York

Also embedded is the lattitude and longitude, which we will need to create the map:

Lat/Long:
<span style="font-family: arial; font-size: x-small;"><a href="http://maps.google.com/maps?om=1&q=40.7089+-74.0012&z=10" target="_blank">40.7089 -74.0012</a>

Now we’ll just use RCurl to send a set of ip addresses to this website, then use Regular Expressions to extract the relevant data from the html that the website returns. Code below:

###############################################################################
################## Example 2: Geolocating IP Addresses ########################
###############################################################################

ips <- c("64.90.182.55", "96.47.67.105", "206.246.122.250", "129.6.15.28", "64.236.96.53", 
		"216.119.63.113", "64.250.177.145", "208.66.175.36", "173.14.55.9", "64.113.32.5", 
		"66.219.116.140", "24.56.178.140", "132.163.4.101", "132.163.4.102", "132.163.4.103",
		"192.43.244.18", "128.138.140.44", "128.138.188.172", "198.60.73.8", "64.250.229.100",
		"131.107.13.100", "207.200.81.113", "69.25.96.13", "216.171.124.36", "64.147.116.229")

# create variables to store output:
state <- character(length(ips))
lat <- character(length(ips))
lon <- character(length(ips))

# iterate over ip addresses:
for(i in 18:length(ips)){
	#i=5	
	print(paste("i = ", i, " ip: ", i))

	ipinfo <- getURL(paste("http://www.ip-db.com/", ips[i], sep=""))
	
	# use RegEx to extract key info:
	library(stringr)
	
	# Get state
	# We use the regex (\\w\\s?)+ where \\w = word character, \\s? = possible space character, 
	# () groups the pattern, and + means one or more.  
	# note that we must escape " characters by \" to include them in the pattern
	pattrn <- "Region:</b></font></td><td width=\"10\">&nbsp;</td><td><font face=\"arial\" size=\"2\">(\\w\\s?)+</td>"
	( results <- str_extract(ipinfo, pattrn))
	
	# Extract the actual text of the state and clean it up so we can actually use it: 
	state[i] <- str_extract(results, ">(\\w\\s?)+<")
	state[i] <- gsub(">", "", state[i])
	state[i] <- gsub("<", "", state[i])
	
	print(state[i])
	
	# Now get lattitude and longitude:
	# We can limit our pattern to target="_blank" because it's a unique pattern in the html
	# An example of our substring of interest follows: "target=\"_blank\">40.7089 -74.0012</a></td>"
	# So we use this pattern \\-?\\d\\d(\\.\\d+)? \\-?\\d\\d(\\.\\d+)? where \\d means digit,
	# means optional, () groups the pattern, and \\- is an escaped dash
	
	pattrn <- "target=\"_blank\">\\-?\\d+(\\.\\d+)? \\-?\\d+(\\.\\d+)?</a></td>"
	( results <- str_extract(ipinfo, pattrn))
		
	# Extract both lattitude an longitude and clean it up so we can save it as numeric in the data: 
	latlon <- str_extract_all(results, "\\-?\\d+(\\.\\d+)?")
	
	# the code above returns a list in case results has more than one set of characters in the character
	# vector, so we need to unlist it to make it into a normal vector:
	latlon <- unlist(latlon)
	
	# since there are two results, we return one for each for lat and lon:
	lat[i] <- latlon[1] 
	lon[i] <- latlon[2] 
	print(lat[i])
	print(lon[i])
	
	# Put a delay between entries so there's no danger of overwhelming the server
	Sys.sleep(sample(15:25,1)/10) 
}

###############################################################################
####################### draw maps based on ip addresses #######################
###############################################################################

# We'll draw points for each observation of lattitude and longitude, and
# we'll shade in each state according to how many cases we have in each state

# first count the number of cases in each state:
mtstates <- table(state)

# then match to the state.name object (which the maps object also uses)
mtfullstatenames <- tolower(state.name[match(names(mtstates), state.name)])

# extract the state names to a separate object:
names(mtstates) <- mtfullstatenames

# use "cut" to group the count data into several groups
mtstatesb <- cut(mtstates, breaks= c(-Inf, 0, 1, 2, 4, Inf), labels=c("0", "1", "2", "3-4", "4+")  )  

library(maps)
# generate a map object so that we can match names to the state level shape files
usmap <- map("state")


# Some of the state level shape files are broken down by sub-state region.  We
# just care about the political boundaries for the purposes of this project, 
# so we will not differentiate.  Thus, we need to make all names within a 
# state the same, which will mean some names are repeated.  
# Luckily the usmap$names are as follows: state:subregion
usmap$names

# so we can just split the strings based on the ":" character.  We'll use
# sapply( ..., "[[", 1) which extracts the first element when a list is returned:
mapstates <- sapply(strsplit(usmap$names, ":"), "[[", 1)
mapstates

# Now we need to generate a color for each item in mapstates so we can color in the map. 
# We first use "match()" to index our count of observations per state in mtstatesb by mapstates.  
# shapeFileN will contain the numeric range specified in mstatesb, but indexed by the 
# the names in mapstates, some of which will of course be repeated as discussed above. 

shapeFileN <- mtstatesb[match(mapstates,names(mtstates))]

# Now we've can generate a series of grey colors indexed by the count of observations per state,
# indexed so that the the observations match the shape files that come with the "state" object in
# the maps package:
cols <- rev(grey.colors(5))[shapeFileN]

# now let's draw the map:
png("demosamplemap.png", width=1000, height=700)

# draw the map object using the colors specified above
map("state", col= cols, fill=T)

# draw lattitude and longitude points for each observation:
points(lon, lat, cex = .3, col = rgb(0,0,0,.5))

# draw a legend
legend("bottomright", levels(mtstatesb), fill = c("white", rev(grey.colors(5))[2:5]), bty="n")

# put a title on it:
title("Geographic distribution of NIST Internet Time Servers")
dev.off()
Posted in R | 3 Comments