The dplyr package makes data manipulation much easier. It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate those thoughts into code. It also uses efficient data storage backends, so you spend less time waiting for the computer.
Here’s an useful cheat seet:
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Let’s play with another dataset from the “Zelig” package
# Clear all
rm(list=ls())
# This dataset contains voting data for the 1988 Mexican presidential election.
#installpackages('Zelig')
library(Zelig)
data("mexico")
#Remember the following commands to get familiar with your data
dim(mexico) # Dimensions of the data frame. Syntax: dim(x)
head(mexico,10) # Shows first n rows. Syntax: head(x,n)
tail(mexico,10) # Shows last n rows. Syntax: head(x,n)
str(mexico) # Displays the structure of an object. Syntax: str(x)
summary(mexico) # Displays summary statistics. Syntax: summary(x)
colnames(mexico) # Column names of an object. Syntax: colnames(x)
#?mexico
## First, let's transform our data into a tbl_df form. This is just for covenience
#install.packages('dplyr')
library(dplyr)
mexico <- tbl_df(mexico)
## Dimensions of dataset
dim(mexico)
## [1] 1359 33
str(mexico)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1359 obs. of 33 variables:
## $ vote88 : int 2 2 3 1 3 3 1 1 3 1 ...
## $ pri82 : int 0 0 1 1 0 0 1 1 0 0 ...
## $ pan82 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ novote82: int 0 0 0 0 0 0 0 0 0 1 ...
## $ deathok : int 2 3 1 1 3 3 3 1 1 1 ...
## $ forinvok: int 3 3 3 3 3 3 3 3 3 3 ...
## $ limimp : int 3 3 1 1 1 3 1 1 1 1 ...
## $ paydebt : int 3 3 3 3 3 3 3 3 1 3 ...
## $ keepind : int 3 3 3 2 1 3 1 1 3 3 ...
## $ polint : int 1 2 1 1 4 2 3 4 2 3 ...
## $ auth : int 3 2 4 1 2 4 3 3 3 3 ...
## $ natecon : int 2 3 1 1 1 4 3 2 2 3 ...
## $ futecok : int 1 1 2 1 2 1 3 2 2 2 ...
## $ persecon: int 2 3 3 1 2 2 2 2 2 3 ...
## $ futperok: int 4 4 3 1 2 3 2 1 3 3 ...
## $ school : int 3 6 7 1 3 6 8 5 6 8 ...
## $ age : int 30 22 32 55 57 21 53 37 24 37 ...
## $ female : int 0 0 1 0 0 0 1 1 0 0 ...
## $ prof : int 0 0 0 0 0 0 0 0 0 1 ...
## $ working : int 0 0 0 1 0 1 0 0 0 0 ...
## $ union : int 1 0 0 1 0 0 0 0 1 0 ...
## $ townsize: int 4 4 4 4 4 4 4 4 4 4 ...
## $ north : int 1 1 1 1 1 1 1 1 1 1 ...
## $ south : int 0 0 0 0 0 0 0 0 0 0 ...
## $ zmgm : int 0 0 0 0 0 0 0 0 0 0 ...
## $ religion: int 3 1 5 2 1 1 6 6 5 1 ...
## $ pristr : int 1 3 1 2 1 2 3 2 2 3 ...
## $ othcok : int 3 3 3 3 3 1 1 2 3 3 ...
## $ othsocok: int 2 1 1 2 1 2 1 2 2 2 ...
## $ ratemdm : int 1 7 4 1 1 7 10 6 4 9 ...
## $ traitmjc: int 0 4 0 0 0 0 1 0 0 0 ...
## $ traitccs: int 0 0 4 2 4 0 1 1 2 0 ...
## $ one : int 1 1 1 1 1 1 1 1 1 1 ...
Single table verbs
dplyr aims to provide a function for each basic verb of data manipulation:
filter() (and slice()): Return rows with matching conditions.
arrange(): Sort a variable in descending order.
select(): Keeps only the specified variables
distinct(): Retain only unique/distinct rows from an input tbl.
mutate(): Creates new variables
summarise(): Summarise multiple values to a single value.
sample_n(): Makes it easy to select random rows from a table.
We can forget now about ‘$’ , ‘[ ]’, and other annoyances!!!
### Lets filter all voters younger than 25 and female
### Syntax: filtert(df, statements)
mexico_young<- filter(mexico, age<25 & female==1)
dim(mexico_young)
## Equivalent to:
#mexico[mexico$age<25 & mexico$female==1, ]
###Slice selects rows by position
slice(mexico, 1:5)
### Reorders rows
mexico
mexico <- arrange(mexico, vote88)
mexico
##We can use several rows
mexico
mexico <- arrange(mexico, vote88, -pri82) # Now vote goes from 1 to 3 and, within each pri is ordered
mexico
#### Lets keep only some variables: vote in 88, age and sex
### Syntax: select(df, variables)
mexico_short <- select(mexico, vote88, age, female)
### We can rename with this function
mexico_short <- select(mexico, vote88, age, sex=female)
### Add new columns that are transformation of existing columns
### Syntax: mutate(df, new_variables)
mexico_young <- mutate(mexico, young_female= (age<25 & female==1)) #This creates a logical statement
mexico_young <- mutate(mexico_young, young_female= as.numeric(young_female)) ## Transform to binary
## TRANSMUTE: If you want to keep these variables only
transmute(mexico, young_female= (age<25 & female==1))
# It collapses a data frame to a single row
summarise(mexico, female=mean(female) )
summarise(mexico, age=mean(age) )
## Randomply selects rows
sample_n(mexico_short, 10)
## # A tibble: 10 × 3
## vote88 age sex
## <int> <int> <int>
## 1 1 21 0
## 2 1 34 1
## 3 3 40 0
## 4 3 29 0
## 5 1 46 1
## 6 2 22 1
## 7 2 40 1
## 8 2 18 0
## 9 1 45 1
## 10 1 35 1
The real power of dplyr comes when we combine arguments. In dplyr, you do this by with the group_by() function. It breaks down a dataset into specified groups of rows.
For Stata users: This is similar to the ‘collapse’ function’
Grouping affects the verbs as follows:
grouped select(): is the same as ungrouped select(), except that grouping variables are always retained.
grouped arrange(): orders first by the grouping variables
mutate() and filter(): are most useful in conjunction with window functions (like rank(), or min(x) == x). They are described in detail in vignette(“window-functions”).
sample_n(): and sample_frac() sample the specified number/fraction of rows in each group.
slice(): extracts rows within each group.
summarise(): is powerful and easy to understand, as described in more detail below.
Let’s apply this to the flights data from the nycflights13
package which contains on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
First, we can create a dataset for an specific variable collapsed by carrier, for example the average departure delay
### Lets calculate the average delay in departure by carrier
# install.packages('nycflights13')
library(nycflights13)
str(flights)
### Collapse and summarise
delay_carrier <- group_by(flights, carrier) # This first statement only specifies the new arrangement but
# on its own is not very useful
delay_carrier <- summarise(delay_carrier, departure=mean(dep_delay, na.rm=T))
delay_carrier
## # A tibble: 16 × 2
## carrier departure
## <chr> <dbl>
## 1 9E 16.725769
## 2 AA 8.586016
## 3 AS 5.804775
## 4 B6 13.022522
## 5 DL 9.264505
## 6 EV 19.955390
## 7 F9 20.215543
## 8 FL 18.726075
## 9 HA 4.900585
## 10 MQ 10.552041
## 11 OO 12.586207
## 12 UA 12.106073
## 13 US 3.782418
## 14 VX 12.869421
## 15 WN 17.711744
## 16 YV 18.996330
Summarise is useful with aggregate functions, which take a vector of values and return a single number. There are many useful examples of such functions in base R like min(), max(), mean(), sum(), sd(), median(), and IQR(). dplyr provides a handful of others. For example, we could use these to find the number of planes and the number of flights that go to each possible destination:
destinations <- group_by(flights, dest) #Group by destination
summarise(destinations,
planes = n_distinct(tailnum),
flights = n()
)
destinations
dplyr has some very useful commands to perform merges.
left_join(a, b, by = “x1”): Join matching rows from b to a.
right_join(a, b, by = “x1”): Join matching rows from a to b.
inner_join(a, b, by = “x1”): Join data. Retain only rows in both sets.
full_join(a, b, by = “x1”): Join data. Retain all values, all rows
### Let's create two dataframes
athletes <- data.frame(name= c("Phelps", "Biles", "Ledecky", "Bolt", "Thompsom" ,
"Douglas","Nadal", "Murray") ,
sport = c("swimming", "gymnastics", "swimming", "track",
"track", "gymnastics", "tennis","tennis"),
gender = c(0, 1, 1,0, 1, 1, 0,0),
country = c("US","US","US", "Jamaica", "Jamaica",
"US", "Spain", "Australia"))
dim(athletes)
countries <- data.frame(country= c("US", "Jamaica", "Spain", "China"),
medals = c(121,11,17, 70))
dim(countries)
Inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.
### Inner Join ####
olympics <- inner_join(athletes, countries)
## Joining, by = "country"
## Warning in inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
olympics
#We lose Murray in the join because, although he appears in x = athletes,
#his country does not appear in y = countries. The join result has all variables from x = plus yr_founded, from y.
semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. This is a filtering join.
olympics <- semi_join(athletes, countries)
## Joining, by = "country"
olympics
#Now the effects of switching the x and y roles is more clear. The result resembles x = countries, but the county #China is lost, because there are no observations in y = athletes.
olympics <- semi_join(countries, athletes)
## Joining, by = "country"
olympics
left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.
olympics <- left_join(athletes, countries)
## Joining, by = "country"
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
olympics
#We basically get x = athletes back, but with the addition of variable medals,
#which is unique to y = countries. Murray, whose publisher does not appear in y =countries, has an NA for medals.
##Anti join ###
#anti_join(x, y): Return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join.
olympics <- anti_join(athletes, countries)
## Joining, by = "country"
olympics
#We keep only Murray now (and do not get medals).
full_join(x, y): Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. This is a mutating join.
olympics <- full_join(athletes, countries)
## Joining, by = "country"
## Warning in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
olympics
#We get all rows of x = athletes plus a new row from y = countries, containing China. We get all variables from x = athletes AND all variables from y = countries. Any row that derives solely from one table or the other carries NAs in the variables found only in the other table.
For the flights data select the variables: year, month, day and dest
Filter this dataset so you keep only flights that happened in decemeber
Summarise this data counting how many flights happened by destination during december
The Pipe Concept: “The magrittr (to be pronounced with a sophisticated french accent) is a package with two aims: to decrease development time and to improve readability and maintainability of code.” This packages provides the pipe operator %>%
which is used to pass objects to functions without directly storing them in your environment. So we do something like x %>% f
instead of f(x)
.
For example, if we want to group our data by day, select the delay variables and obtain the average delays at arrival and departure by day, AND keep the days with mean times higher than 30 min….UFFFF. This is how it would look
daily <- group_by(flights, year, month, day)
daily <- select(daily, arr_delay, dep_delay)
## Adding missing grouping variables: `year`, `month`, `day`
daily_mean <- summarise(daily,
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE))
filter(daily_mean, arr>30 | dep>30)
Let’s peform this with pipes
#install.packages('magritrr')
library(magrittr)
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
## Adding missing grouping variables: `year`, `month`, `day`
## Source: local data frame [49 x 5]
## Groups: year, month [11]
##
## year month day arr dep
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 16 34.24736 24.61287
## 2 2013 1 31 32.60285 28.65836
## 3 2013 2 11 36.29009 39.07360
## 4 2013 2 27 31.25249 37.76327
## 5 2013 3 8 85.86216 83.53692
## 6 2013 3 18 41.29189 30.11796
## 7 2013 4 10 38.41231 33.02368
## 8 2013 4 12 36.04814 34.83843
## 9 2013 4 18 36.02848 34.91536
## 10 2013 4 19 47.91170 46.12783
## # ... with 39 more rows
This new version is cleaner and more readable.
Again, count how many flights happened by destination in december but now using the %>%
operator.
Using the %>%
operator and dplyr
functions calculate the average number of flights that were delayed at departure by month for more than 30 minutes. To do so follow these steps: 1) select the month and the dep_delay variable, 2) using mutate create a nuew variable that is 1 if the departure delay was more than 30 and zero otherwise, 3) collapse by month and 3.1) summarise to get the mean (be careful with NA’s)