Outline

  1. dplyr
  2. magrittr

The dplyr package makes data manipulation much easier. It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate those thoughts into code. It also uses efficient data storage backends, so you spend less time waiting for the computer.

Here’s an useful cheat seet:

https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

1. dplyr

Let’s play with another dataset from the “Zelig” package

# Clear all
rm(list=ls())

# This dataset contains voting data for the 1988 Mexican presidential election.
#installpackages('Zelig')
library(Zelig)
data("mexico")

#Remember the following commands to get familiar with your data
dim(mexico)         # Dimensions of the data frame.             Syntax: dim(x)
head(mexico,10)   # Shows first n rows.                     Syntax: head(x,n) 
tail(mexico,10)   # Shows last n rows.                  Syntax: head(x,n)
str(mexico)           # Displays the structure of an object.    Syntax: str(x)
summary(mexico)   # Displays summary statistics.            Syntax: summary(x)
colnames(mexico)    # Column names of an object.            Syntax: colnames(x)
#?mexico


## First, let's transform our data into a tbl_df form. This is just for covenience
#install.packages('dplyr')
library(dplyr)
mexico <- tbl_df(mexico)
## Dimensions of dataset
dim(mexico)
## [1] 1359   33
str(mexico)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1359 obs. of  33 variables:
##  $ vote88  : int  2 2 3 1 3 3 1 1 3 1 ...
##  $ pri82   : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ pan82   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ novote82: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ deathok : int  2 3 1 1 3 3 3 1 1 1 ...
##  $ forinvok: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ limimp  : int  3 3 1 1 1 3 1 1 1 1 ...
##  $ paydebt : int  3 3 3 3 3 3 3 3 1 3 ...
##  $ keepind : int  3 3 3 2 1 3 1 1 3 3 ...
##  $ polint  : int  1 2 1 1 4 2 3 4 2 3 ...
##  $ auth    : int  3 2 4 1 2 4 3 3 3 3 ...
##  $ natecon : int  2 3 1 1 1 4 3 2 2 3 ...
##  $ futecok : int  1 1 2 1 2 1 3 2 2 2 ...
##  $ persecon: int  2 3 3 1 2 2 2 2 2 3 ...
##  $ futperok: int  4 4 3 1 2 3 2 1 3 3 ...
##  $ school  : int  3 6 7 1 3 6 8 5 6 8 ...
##  $ age     : int  30 22 32 55 57 21 53 37 24 37 ...
##  $ female  : int  0 0 1 0 0 0 1 1 0 0 ...
##  $ prof    : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ working : int  0 0 0 1 0 1 0 0 0 0 ...
##  $ union   : int  1 0 0 1 0 0 0 0 1 0 ...
##  $ townsize: int  4 4 4 4 4 4 4 4 4 4 ...
##  $ north   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ south   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ zmgm    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ religion: int  3 1 5 2 1 1 6 6 5 1 ...
##  $ pristr  : int  1 3 1 2 1 2 3 2 2 3 ...
##  $ othcok  : int  3 3 3 3 3 1 1 2 3 3 ...
##  $ othsocok: int  2 1 1 2 1 2 1 2 2 2 ...
##  $ ratemdm : int  1 7 4 1 1 7 10 6 4 9 ...
##  $ traitmjc: int  0 4 0 0 0 0 1 0 0 0 ...
##  $ traitccs: int  0 0 4 2 4 0 1 1 2 0 ...
##  $ one     : int  1 1 1 1 1 1 1 1 1 1 ...

Single table verbs

dplyr aims to provide a function for each basic verb of data manipulation:

We can forget now about ‘$’ , ‘[ ]’, and other annoyances!!!

Filter and Slice

### Lets filter all voters younger than 25 and female
### Syntax: filtert(df, statements)

mexico_young<- filter(mexico, age<25 & female==1)
dim(mexico_young)

## Equivalent to:
#mexico[mexico$age<25 & mexico$female==1, ]


###Slice selects rows by position
slice(mexico, 1:5)

Arrange

### Reorders rows
mexico
mexico <- arrange(mexico, vote88)
mexico

##We can use several rows
mexico
mexico <- arrange(mexico, vote88, -pri82)  # Now vote goes from 1 to 3 and, within each pri is ordered 
mexico

Select

#### Lets keep only some variables: vote in 88, age and sex
### Syntax: select(df, variables)
mexico_short <- select(mexico, vote88, age, female)


### We can rename with this function
mexico_short <- select(mexico, vote88, age, sex=female)

Mutate

### Add new columns that are transformation of existing columns
### Syntax: mutate(df, new_variables)
mexico_young <- mutate(mexico, young_female= (age<25 & female==1)) #This creates a logical statement

mexico_young <- mutate(mexico_young, young_female= as.numeric(young_female)) ## Transform to binary

## TRANSMUTE: If you want to keep these variables only

transmute(mexico, young_female= (age<25 & female==1))

Summarise

# It collapses a data frame to a single row
summarise(mexico, female=mean(female) )
summarise(mexico, age=mean(age) )

sample_n

## Randomply selects rows
sample_n(mexico_short, 10)
## # A tibble: 10 × 3
##    vote88   age   sex
##     <int> <int> <int>
## 1       1    21     0
## 2       1    34     1
## 3       3    40     0
## 4       3    29     0
## 5       1    46     1
## 6       2    22     1
## 7       2    40     1
## 8       2    18     0
## 9       1    45     1
## 10      1    35     1

Grouped operations in dplyr

The real power of dplyr comes when we combine arguments. In dplyr, you do this by with the group_by() function. It breaks down a dataset into specified groups of rows.

For Stata users: This is similar to the ‘collapse’ function’

Grouping affects the verbs as follows:

  • grouped select(): is the same as ungrouped select(), except that grouping variables are always retained.

  • grouped arrange(): orders first by the grouping variables

  • mutate() and filter(): are most useful in conjunction with window functions (like rank(), or min(x) == x). They are described in detail in vignette(“window-functions”).

  • sample_n(): and sample_frac() sample the specified number/fraction of rows in each group.

  • slice(): extracts rows within each group.

  • summarise(): is powerful and easy to understand, as described in more detail below.

Let’s apply this to the flights data from the nycflights13 package which contains on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

First, we can create a dataset for an specific variable collapsed by carrier, for example the average departure delay

### Lets calculate the average delay in departure by carrier
# install.packages('nycflights13')
library(nycflights13)
str(flights)
### Collapse and summarise
delay_carrier <- group_by(flights, carrier)  # This first statement only specifies the new arrangement but 
                                             # on its own is not very useful
delay_carrier <- summarise(delay_carrier, departure=mean(dep_delay, na.rm=T))
delay_carrier
## # A tibble: 16 × 2
##    carrier departure
##      <chr>     <dbl>
## 1       9E 16.725769
## 2       AA  8.586016
## 3       AS  5.804775
## 4       B6 13.022522
## 5       DL  9.264505
## 6       EV 19.955390
## 7       F9 20.215543
## 8       FL 18.726075
## 9       HA  4.900585
## 10      MQ 10.552041
## 11      OO 12.586207
## 12      UA 12.106073
## 13      US  3.782418
## 14      VX 12.869421
## 15      WN 17.711744
## 16      YV 18.996330

Summarise is useful with aggregate functions, which take a vector of values and return a single number. There are many useful examples of such functions in base R like min(), max(), mean(), sum(), sd(), median(), and IQR(). dplyr provides a handful of others. For example, we could use these to find the number of planes and the number of flights that go to each possible destination:

destinations <- group_by(flights, dest) #Group by destination 
summarise(destinations,
  planes = n_distinct(tailnum),
  flights = n()
)

destinations

Combining datasets

dplyr has some very useful commands to perform merges.

  • left_join(a, b, by = “x1”): Join matching rows from b to a.

  • right_join(a, b, by = “x1”): Join matching rows from a to b.

  • inner_join(a, b, by = “x1”): Join data. Retain only rows in both sets.

  • full_join(a, b, by = “x1”): Join data. Retain all values, all rows

### Let's create two dataframes

athletes <- data.frame(name= c("Phelps", "Biles", "Ledecky",  "Bolt", "Thompsom" ,
                                "Douglas","Nadal", "Murray")  ,
                          sport = c("swimming", "gymnastics", "swimming", "track", 
                                    "track", "gymnastics", "tennis","tennis"),
                          gender = c(0, 1, 1,0, 1, 1, 0,0),
                          country = c("US","US","US", "Jamaica", "Jamaica", 
                                      "US", "Spain", "Australia"))
                          
dim(athletes)


countries <- data.frame(country= c("US", "Jamaica", "Spain", "China"),
                        medals = c(121,11,17, 70))

dim(countries)

Inner join:

Inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.

### Inner Join ####


olympics <- inner_join(athletes, countries)
## Joining, by = "country"
## Warning in inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
olympics
#We lose Murray in the join because, although he appears in x = athletes, 
#his country does not appear in y = countries. The join result has all variables from x =  plus yr_founded, from y.

Semi Join

semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. This is a filtering join.

olympics <- semi_join(athletes, countries)
## Joining, by = "country"
olympics

#Now the effects of switching the x and y roles is more clear. The result resembles x = countries, but the county #China is lost, because there are no observations  in y = athletes.

olympics <- semi_join(countries, athletes)
## Joining, by = "country"
olympics

Left join

left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.

olympics <- left_join(athletes, countries)
## Joining, by = "country"
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
olympics

#We basically get x = athletes back, but with the addition of variable medals,
#which is unique to y = countries. Murray, whose publisher does not appear in y =countries, has an NA for medals.


##Anti join ###
#anti_join(x, y): Return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join.

olympics <- anti_join(athletes, countries)
## Joining, by = "country"
olympics

#We keep only Murray now (and do not get medals).

Full join

full_join(x, y): Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. This is a mutating join.

olympics <- full_join(athletes, countries)
## Joining, by = "country"
## Warning in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
olympics

#We get all rows of x = athletes plus a new row from y = countries, containing China. We get all variables from x = athletes AND all variables from y = countries. Any row that derives solely from one table or the other carries NAs in the variables found only in the other table.

Exercise

  1. For the flights data select the variables: year, month, day and dest

  2. Filter this dataset so you keep only flights that happened in decemeber

  3. Summarise this data counting how many flights happened by destination during december

2. magrittr

The Pipe Concept: “The magrittr (to be pronounced with a sophisticated french accent) is a package with two aims: to decrease development time and to improve readability and maintainability of code.” This packages provides the pipe operator %>% which is used to pass objects to functions without directly storing them in your environment. So we do something like x %>% f instead of f(x).

For example, if we want to group our data by day, select the delay variables and obtain the average delays at arrival and departure by day, AND keep the days with mean times higher than 30 min….UFFFF. This is how it would look

daily  <- group_by(flights, year, month, day)
daily  <- select(daily, arr_delay, dep_delay)
## Adding missing grouping variables: `year`, `month`, `day`
daily_mean <- summarise(daily, 
     arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE))
filter(daily_mean, arr>30 | dep>30)

Let’s peform this with pipes

#install.packages('magritrr')
library(magrittr)


flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
## Adding missing grouping variables: `year`, `month`, `day`
## Source: local data frame [49 x 5]
## Groups: year, month [11]
## 
##     year month   day      arr      dep
##    <int> <int> <int>    <dbl>    <dbl>
## 1   2013     1    16 34.24736 24.61287
## 2   2013     1    31 32.60285 28.65836
## 3   2013     2    11 36.29009 39.07360
## 4   2013     2    27 31.25249 37.76327
## 5   2013     3     8 85.86216 83.53692
## 6   2013     3    18 41.29189 30.11796
## 7   2013     4    10 38.41231 33.02368
## 8   2013     4    12 36.04814 34.83843
## 9   2013     4    18 36.02848 34.91536
## 10  2013     4    19 47.91170 46.12783
## # ... with 39 more rows

This new version is cleaner and more readable.

Exercise

  1. Again, count how many flights happened by destination in december but now using the %>% operator.

  2. Using the %>%operator and dplyr functions calculate the average number of flights that were delayed at departure by month for more than 30 minutes. To do so follow these steps: 1) select the month and the dep_delay variable, 2) using mutate create a nuew variable that is 1 if the departure delay was more than 30 and zero otherwise, 3) collapse by month and 3.1) summarise to get the mean (be careful with NA’s)