I’ve been doing research on how opinion changes over the electoral cycle, and I wanted to visualize the dates of the primary elections, but there are a whole lot of them. Rather than copying them down by hand, I decided to scrape them from a PDF. I’ve included the full code to reproduce what I did below.

Now, you might notice that the document in question is pretty short and you could probably copy and paste this thing. My main interest here was to use this as a toy example to show how scraping structured data from a document like this could be done. Hopefully it’s useful as a reference, even if it’s a little overkill for this specific instance. This code scales pretty well to a document of any size.

Note: see the Tools section below for links to more info on what’s going on here

In R, a good tool for this is tabulizer, a wrapper for a library of Java tools (Tabula).

First, keep in mind: this package is just a way for you to use R to talk to Java. I don’t do Java myself and ran into a few wonky errors. See note 1 at the bottom for a little more detail in case you run into them.

I installed from the github repo using devtools (you’ll need to install devtools first, if you haven’t already):

devtools::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"), args = "--no-multiarch")

# Load in packages

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.3.4     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(stringr)
library(tabulizer)
library(purrr)

The FEC has election dates posted as PDFs. You could probably scrape these from Wikipedia or something similar, but I like to go straight to the source.

prim_loc <- 'https://transition.fec.gov/pubrec/2008pdates.pdf'

prim <- extract_tables(prim_loc)

It might take a minute. Now you’ve got all the data stored in a useful R object - a list where each top-level element is a page of the PDF.

str(prim)
## List of 6
##  $ : chr [1:41, 1:5] "STATE" "" "" "" ...
##  $ : chr [1:40, 1:5] "STATE" "" "" "" ...
##  $ : chr [1:46, 1:6] "STATE" "" "" "" ...
##  $ : chr [1:29, 1:6] "STATE" "" "" "" ...
##  $ : chr [1:47, 1:7] "STATE" "" "" "Iowa" ...
##  $ : chr [1:36, 1:7] "STATE" "" "" "D.C." ...

Let’s look at the first page. Each element of the list is stored as a two-dimensional character matrix: the rows and columns extracted from that page.

str(prim[[1]])
##  chr [1:41, 1:5] "STATE" "" "" "" "Alabama" "Alaska" ...
prim[[1]][c(1:20), ]
##       [,1]             [,2]           [,3]                
##  [1,] "STATE"          "PRESIDENTIAL" "PRESIDENTIAL"      
##  [2,] ""               "PRIMARY"      "CAUCUS DATE"       
##  [3,] ""               "DATE"         ""                  
##  [4,] ""               ""             ""                  
##  [5,] "Alabama"        "2/5"          ""                  
##  [6,] "Alaska"         ""             "2/5"               
##  [7,] "American Samoa" ""             "2/23 (Republicans)"
##  [8,] ""               ""             "2/5 (Democrats)"   
##  [9,] "Arizona"        "2/5"          ""                  
## [10,] "Arkansas"       "2/5"          ""                  
## [11,] "California"     ""             ""                  
## [12,] ""               "2/5"          ""                  
## [13,] "Colorado"       ""             "2/5"               
## [14,] "Connecticut"    "2/5"          ""                  
## [15,] "Delaware"       "2/5"          ""                  
## [16,] ""               ""             ""                  
## [17,] "D.C."           "2/12"         ""                  
## [18,] "Florida"        "1/29"         ""                  
## [19,] "Georgia"        "2/5"          ""                  
## [20,] "Guam"           ""             "3/8 (Republicans)" 
##       [,4]                    [,5]                
##  [1,] "FILING"                "INDEPENDENT 1"     
##  [2,] "DEADLINE FOR"          "FILING DEADLINE"   
##  [3,] "PRIMARY"               "FOR GENERAL"       
##  [4,] "BALLOT ACCESS"         "ELECTION"          
##  [5,] "11/7"                  "9/6"               
##  [6,] "n/a"                   "8/6"               
##  [7,] "n/a"                   "n/a"               
##  [8,] ""                      ""                  
##  [9,] "12/17 5pm"             "6/4 5pm"           
## [10,] "11/19 Noon"            "8/4"               
## [11,] "12/4 (Democrats)"      ""                  
## [12,] "11/23 (Other Parties)" "8/8"               
## [13,] "n/a"                   "6/17 3pm"          
## [14,] "12/17 4pm"             "8/6"               
## [15,] "12/10"                 "7/25 (Independent)"
## [16,] ""                      "9/1 (Third/Minor)" 
## [17,] "12/14 5pm"             "8/27"              
## [18,] "10/31"                 "7/15"              
## [19,] "11/1"                  "7/15"              
## [20,] "n/a"                   "n/a"
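To make that structure concrete without needing the PDF (or Java), here is a minimal sketch built on a made-up list of two small character matrices. The `pages` object and the "State C" row are invented for illustration; only the shape mirrors what extract_tables() returned above.

```r
# Hypothetical stand-in for the object returned by extract_tables():
# a list holding one character matrix per page.
pages <- list(
  matrix(c("STATE", "Alabama", "Alaska",   # column 1
           "PRIMARY", "2/5", "",           # column 2
           "CAUCUS", "", "2/5"),           # column 3
         nrow = 3),
  matrix(c("STATE", "State C",
           "PRIMARY", "1/1",
           "CAUCUS", ""),
         nrow = 2)
)

length(pages)              # number of pages in the list
first_page <- pages[[1]]   # [[ ]] pulls out a single page as a matrix
dim(first_page)            # rows and columns found on that page
first_page[2:3, 1:2]       # matrix indexing: rows 2-3, columns 1-2
```

Double brackets select one page; single brackets with a row and a column index then slice inside that page.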

Let’s just slice off the stuff we need. First, drop the pages we don’t need (the pages are redundant; the actual primary data, shown in alphabetical order, are on pages 1 and 2). Second, we don’t care about the filing deadlines, so let’s just keep the first three columns (with states and dates). Third, let’s drop those headers (the first four rows) and keep everything else.

I make them into actual data.frames (well, tibbles) so we can use dplyr verbs on them, then combine all the list elements into one big table and give the columns useful variable names. The last two steps turn empty strings into NA and drop rows where everything is empty.

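# Keep pages 1-2, drop the four header rows, keep the first three columns,
# then stack the pages into one table and clean up the empty cells.
prim_df <-
  prim %>%
  .[1:2] %>%
  map(., `[`, c(-1:-4), c(1:3)) %>%
  map(as_tibble) %>%
  bind_rows() %>%
  set_names(c("state", "primary", "caucus")) %>%
  # Empty strings become NA. From dplyr 1.1.0 on, na_if() no longer accepts a
  # whole data frame; use mutate(across(everything(), ~ na_if(.x, ""))) there.
  na_if("") %>%
  filter(!(is.na(state) & is.na(primary) & is.na(caucus)))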

head(prim_df)
## # A tibble: 6 x 3
##            state primary             caucus
##            <chr>   <chr>              <chr>
## 1        Alabama     2/5               <NA>
## 2         Alaska    <NA>                2/5
## 3 American Samoa    <NA> 2/23 (Republicans)
## 4           <NA>    <NA>    2/5 (Democrats)
## 5        Arizona     2/5               <NA>
## 6       Arkansas     2/5               <NA>
tail(prim_df)
## # A tibble: 6 x 3
##           state primary            caucus
##           <chr>   <chr>             <chr>
## 1          <NA>    <NA> 4/5 (Republicans)
## 2    Washington    2/19               2/9
## 3 West Virginia    5/13              <NA>
## 4     Wisconsin    2/19              <NA>
## 5       Wyoming    <NA> 1/5 (Republicans)
## 6          <NA>    <NA>   3/8 (Democrats)

Note two things: 1) I put backticks around the subset operator, `[`, so that I can pass it to map as an ordinary function. 2) map comes from purrr. Like the apply() family of functions in base R (or the ply family from the earlier plyr package), and like map in many other languages, it applies a function to each element of a list or vector; map itself always returns a list of the same length as its input. So in this case, I get back a list of matrices with the top four rows removed and only the first three columns retained.
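Here is a small sketch of the backtick trick on its own, using invented two-page data with a single header row (the `pages` object and the "State C"/"State D" rows are hypothetical). It shows that passing `[` to map with extra arguments gives the same result as writing the subsetting out in an anonymous function.

```r
library(purrr)

# Made-up pages: one header row followed by two data rows, four columns each.
pages <- list(
  matrix(c("STATE", "Alabama", "Alaska",
           "PRIMARY", "2/5", "",
           "CAUCUS", "", "2/5",
           "FILING", "11/7", "n/a"),
         nrow = 3),
  matrix(c("STATE", "State C", "State D",
           "PRIMARY", "1/1", "1/2",
           "CAUCUS", "", "",
           "FILING", "n/a", "n/a"),
         nrow = 3)
)

# `[`(x, i, j) is the same as x[i, j], so map() can hand the row and
# column indices to `[` as extra arguments.
trimmed <- map(pages, `[`, -1, 1:3)

# The same operation written with an explicit anonymous function.
trimmed_alt <- map(pages, function(x) x[-1, 1:3])

identical(trimmed, trimmed_alt)
```

The backtick form is shorter; the anonymous function is easier to read if you haven't seen operators used as functions before.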

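Now some general cleanup. The end of the pipeline above already turned empty strings into NA and got rid of rows where everything’s empty. What’s left is to fill in states where there’s an implied value that isn’t included in the PDF: a blank state cell belongs to the state named in the row above it.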

prim_df<- 
prim_df %>%
fill(state)

head(prim_df)
## # A tibble: 6 x 3
##            state primary             caucus
##            <chr>   <chr>              <chr>
## 1        Alabama     2/5               <NA>
## 2         Alaska    <NA>                2/5
## 3 American Samoa    <NA> 2/23 (Republicans)
## 4 American Samoa    <NA>    2/5 (Democrats)
## 5        Arizona     2/5               <NA>
## 6       Arkansas     2/5               <NA>
tail(prim_df)
## # A tibble: 6 x 3
##            state primary            caucus
##            <chr>   <chr>             <chr>
## 1 Virgin Islands    <NA> 4/5 (Republicans)
## 2     Washington    2/19               2/9
## 3  West Virginia    5/13              <NA>
## 4      Wisconsin    2/19              <NA>
## 5        Wyoming    <NA> 1/5 (Republicans)
## 6        Wyoming    <NA>   3/8 (Democrats)
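The two cleanup ideas (empty strings to NA, then filling the state name downward) can be tried on a three-row toy table. This is only a sketch: the `toy` tibble is built by hand from rows shown above, and it uses mutate(across(...)) to apply na_if() column by column, which assumes dplyr 1.0 or later rather than the versions loaded in this post.

```r
library(dplyr)
library(tidyr)

# Hand-built toy table: the second row has a blank state cell that
# really belongs to American Samoa.
toy <- tibble(
  state   = c("American Samoa", "", "Arizona"),
  primary = c("", "", "2/5"),
  caucus  = c("2/23 (Republicans)", "2/5 (Democrats)", "")
)

cleaned <- toy %>%
  # apply na_if() to every column, one column at a time
  mutate(across(everything(), ~ na_if(.x, ""))) %>%
  # carry the last non-missing state name down into the gap
  fill(state)

cleaned
```

By default fill() works downward, which is what we want here because the PDF prints the state name only on the first line of each multi-line entry.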

A couple spot revisions…

prim_df$caucus[prim_df$caucus == '1/25-2/7'] <- '1/25-2/7 (Republicans)'
prim_df$caucus[prim_df$caucus == '(Republicans)'] <- NA

Now, rather than a wide format with a column for each type of election, let’s make just one row per election and get rid of those where there’s no date value.

primary_dates_2008 <-
prim_df %>%  
gather('primary', 'caucus', key = 'elex_type', value = 'date') %>%
filter(!(is.na(date)))

head(primary_dates_2008)
## # A tibble: 6 x 3
##         state elex_type  date
##         <chr>     <chr> <chr>
## 1     Alabama   primary   2/5
## 2     Arizona   primary   2/5
## 3    Arkansas   primary   2/5
## 4  California   primary   2/5
## 5 Connecticut   primary   2/5
## 6    Delaware   primary   2/5
tail(primary_dates_2008)
## # A tibble: 6 x 3
##            state elex_type               date
##            <chr>     <chr>              <chr>
## 1    Puerto Rico    caucus 2/24 (Republicans)
## 2 Virgin Islands    caucus    2/9 (Democrats)
## 3 Virgin Islands    caucus  4/5 (Republicans)
## 4     Washington    caucus                2/9
## 5        Wyoming    caucus  1/5 (Republicans)
## 6        Wyoming    caucus    3/8 (Democrats)
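For reference, the same wide-to-long reshaping can be sketched with pivot_longer(), which newer tidyr releases recommend over gather(). This swaps in a different function from the one I used above and assumes tidyr 1.0 or later; the `toy` tibble is hand-built from a few rows of the table.

```r
library(tidyr)

# Hand-built toy table in the wide layout: one column per election type.
toy <- tibble(
  state   = c("Alabama", "Alaska", "Washington"),
  primary = c("2/5", NA, "2/19"),
  caucus  = c(NA, "2/5", "2/9")
)

long <- pivot_longer(
  toy,
  cols = c(primary, caucus),
  names_to = "elex_type",    # old column names end up here
  values_to = "date",        # cell values end up here
  values_drop_na = TRUE      # plays the role of filter(!is.na(date))
)

long
```

One difference to be aware of: pivot_longer() keeps the rows for each state together, whereas gather() stacks all the primaries first and then all the caucuses, so the row order won't match the output shown above.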

Finally, use regular expressions to split the date and the party label into separate columns, then parse the dates.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
primary_dates_2008 <-
primary_dates_2008 %>%
mutate(
    elex_party = str_extract(date, regex("[a-zA-Z]+")),
    date = str_extract(date, regex("\\d+/\\d+")) %>%
        paste0("/2008") %>%
        mdy()
    )


head(primary_dates_2008)
## # A tibble: 6 x 4
##         state elex_type       date elex_party
##         <chr>     <chr>     <date>      <chr>
## 1     Alabama   primary 2008-02-05       <NA>
## 2     Arizona   primary 2008-02-05       <NA>
## 3    Arkansas   primary 2008-02-05       <NA>
## 4  California   primary 2008-02-05       <NA>
## 5 Connecticut   primary 2008-02-05       <NA>
## 6    Delaware   primary 2008-02-05       <NA>
tail(primary_dates_2008)
## # A tibble: 6 x 4
##            state elex_type       date  elex_party
##            <chr>     <chr>     <date>       <chr>
## 1    Puerto Rico    caucus 2008-02-24 Republicans
## 2 Virgin Islands    caucus 2008-02-09   Democrats
## 3 Virgin Islands    caucus 2008-04-05 Republicans
## 4     Washington    caucus 2008-02-09        <NA>
## 5        Wyoming    caucus 2008-01-05 Republicans
## 6        Wyoming    caucus 2008-03-08   Democrats
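To see what the two regular expressions do in isolation, here is a short sketch run on three raw strings of the kinds that appear in the table (the `raw` vector is assembled by hand for illustration).

```r
library(stringr)
library(lubridate)

raw <- c("2/5", "2/23 (Republicans)", "1/25-2/7 (Republicans)")

# First run of letters, if there is one: the party label.
party <- str_extract(raw, "[a-zA-Z]+")

# First month/day pair; for a date range this keeps only the start date.
day <- str_extract(raw, "\\d+/\\d+")

# Append the year and parse as month/day/year.
parsed <- mdy(paste0(day, "/2008"))

data.frame(raw, party, day, parsed)
```

Note that str_extract() only returns the first match, so a range like 1/25-2/7 is reduced to its first day, and a string with no letters gets NA for the party.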

Here’s a quick-and-dirty plot just to check that this all makes sense:

library(ggplot2)

ggplot() +
  geom_text(data = primary_dates_2008,
            aes(x = date, y = 0, label = state, color = coalesce(elex_party, 'Both'), angle = 90),
            size = 3, position = position_jitter(width = 2, height = 8), alpha = 0.6) +
  scale_color_manual(values = c("black", "blue", "red"), drop = FALSE) +
  theme_minimal() +
  # color encodes the party (or 'Both' when none is listed), so title the legend to match
  guides(color = guide_legend(title = "Party")) +
  labs(x = "", y = "", title = "Primary dates across 2008 election cycle") +
  theme(panel.background = element_rect(fill = "white", colour = "white"), axis.text.y=element_blank()) +
  scale_x_date(date_labels = "%b %Y")

Yeah, this looks about right: Iowa and New Hampshire at the front; a giant jumble around Super Tuesday and lots of dead space in the middle of spring.


Tools:

A few links to tutorials on the functions I use above

tabulizer

purrr: map

dplyr: filter, mutate, bind_rows, na_if

tidyr: gather, fill

magrittr: %>%, set_names

stringr: str_replace, str_extract, etc

lubridate

ggplot2


Some coding concepts:

Regular expressions: Introduction and cheatsheet

`[`: Advanced R > Functions. See ‘Infix functions’


  1. Apparently a lot of folks have trouble using R packages that have a Java dependency, or getting them to work with RStudio, if that’s how you do your work. This is too bad, because there’s a lot of utility you can get out of them.
    A brief summary of what I’ve found in dealing with this: you need to have the Java runtime environment installed before doing any of this. Simple enough. But sometimes the R packages and the environments where they run have trouble finding the right libraries to get R and Java to talk to each other. In the case of RStudio, I found a solution that involved 1) installing rJava first, 2) from the terminal, entering R CMD javareconf to find out where the libraries in question are stored, and 3) loading them directly in R before loading the tabulizer package. In my case that’s:
    dyn.load('/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/jre/lib/server/libjvm.dylib')
    If you’re running on a Mac, odds are you’ll just need to replace the jdk1.8.0_144.jdk part with whatever JDK version you’ve installed, and javareconf should tell you what you need to know.