Instructor: Edgar Franco

Outline

Reading data
Selecting and subsampling data
Merging & appending datasets

6. Reading data

R can read and data from and to different formats.

1. Delimited data files read.csv(), read.table(), read.delim()
1. Excel files read.xlsx() package(“xlsx”)
b.1 Excel files read.xlsx2() package(“xlsx”).Faster, but you will have to define column classes manually.
1. STATA files read.dta() package(“foreign”)
c.1 STATA 13 and up files read.dta13() package(“readstata13”)
1. SPS files read.spss() package(“foreign”)
1. Online databases read.table(“path”)
1. Databases in packages

Functions read.csv(), read.table(), read.delim() read delimited data files, .csv, and .txt files

Similarly, functions write.csv() and write.table() export delimited data files, .csv, and .txt files.

The package “xlsx” reads, writes, and formats Excel 2007 and Excel 97/2000/XP/2003 files

The package “foreign” reads and writes data stored by some versions of Stata, SAS, SPSS, dBase files, and other programs.

Function read.dbf() reads a DBF file into a data frame. This is useful when working with geographic information systems like GIS or QGIS. Some of the spatial or geographical data for this systems come in the form of .dbf files.

Reading a .csv

In the following link you’l find a .csv version of our dataset E.

https://www.dropbox.com/s/18d4ko92s4ux2lt/E.csv?dl=0

Lets read it using read.csv. First you should store it in the same directory than this script.

E <- read.csv("E.csv")

6.1. Databases in packages

Remember: “Packages” are collections of objects including databases, functions, models, and compiled code.

Some packages are already installed in your computer and contain baseline functions and data.

Other functions need to be downloaded from the Comprehensive R Archive Network (CRAN).

You can see the available databases in a package by typing data(package=“package.name”)

For example, let’s work with the package “Zelig”

First load the package into the session using the function ‘library()’

library("Zelig")

## Loading required package: survival

Note that “Zelig” contains different types of models, functions, and databases.

To list the available datasets in a package, use the function ‘data()’

Syntax: ‘data(package = “package.name”)’

data(package=“Zelig”)

The function ‘data()’ is also used to load a particular dataset into the workspace. Syntax: ‘data(“dataset.name”)’

In this and future sessions we will be working with the “turnout” dataset in the “Zelig” package.

data(package="Zelig")       # Lists the available datasets in a package

data("turnout")             # Loads a dataset into the workspace

Now the dataset “turnout” is available for use.

ls()

Almost all datasets from packages have a help file with a brief description and codebook of the data.

To access the help file of a data, use the operator ‘?’ or the function ‘help(“dataset.name”)’

?turnout help(“turnout”)

Now we now that dataset “turnout” contains individual-level turnout data.

It pools several American National Election Surveys conducted during the 1992 presidential election year.

Only the first 2,000 observations were included in the sample data.

Remember the following commands to get familiar with your data

dim(turnout) # Dimensions of the data frame. Syntax: dim(x)
head(turnout,10) # Shows first n rows. Syntax: head(x,n)
tail(turnout,10) # Shows last n rows. Syntax: head(x,n)
str(turnout) # Displays the structure of an object. Syntax: str(x)
summary(turnout) # Displays summary statistics. Syntax: summary(x)
colnames(turnout) # Column names of an object. Syntax: colnames(x)

7. Selecting and subsampling data

Selecting elements in objects is one of the most important skills to learn in R.

A dataset is a two dimensional object in R. Each element in a dataset has a set of coordinates: c(row,column)

To subsample data in a matrix, table, or dataframe, you should provide the row and column NUMBERS or NAMES of the elements you want to select.

Use data[rows,columns] to indicate these numbers, leave a blank to select all.

Let’s go back to our turnout dataset

head(turnout)

7.1. Subsampling data

Remember: for a dataset the index is [Row, Column]

# Let's select some rows
head(turnout)
turnout[3  ,  ]         # Row 3, all columns
turnout[c(3:5) ,  ]     # Rows 3 to 5, all columns
turnout[ c(1,3,5),  ]   # Rows 1, 3, and 5, all columns

# To take a random subsample, you can use the function 'sample()'
# Note that 'sample()' only works with vectors.
# Syntax: sample(vector,size)
sample(c(1:2000),10)

# Then we can select a random sample of the rows and all columns.
head(turnout)
turnout[sample(c(1:2000),10) ,  ]   # Random sample of size 10 for rows, all columns


# For illustration purposes, let's keep the first 10 rows of the 'turnout' dataset.
turnout <- turnout[c(1:10),]
turnout

# Columns or variables in datasets can be selected in three different ways
# Suppose we want to select the column for age.

# These four notations are equivalent
turnout[  ,  2]     # All rows, column 2
turnout[  , "age"]  # All rows, column "age"
turnout[  , 'age']  # All rows, column 'age'
turnout$age         # Variable age in dataset turnout
# The sign "$" indicates that vector "age" is an element of the dataset "turnout"

# To select two or more variables, you can use one the following notations
turnout
turnout[  , c(2,4)]             # All rows, columns 2 & 3
turnout[  , c("age","income")]  # All rows, column "age" and "income"
turnout[  , c('age','income')]  # All rows, column 'age' and 'income'

# You can also use the function 'data.frame()'
data.frame(turnout$age,turnout$income) # Data frame variables age and income 
                                           # in dataset 'turnout'
# This produces a new data frame.
class(data.frame(turnout$age,turnout$income)) 

# Note the missing variable names in this new dataset.
# To fix this, you just need name your variables again.
data.frame(age = turnout$age, income = turnout$income)

## NOTE, none of these subsamples exist in the workspace since we didn't assign them an object name.
#'object <- data.frame'
ls()

NOTE: Everytime you run a function with a randomizing component the results might be different. You can set a seed to obtain the same results

# Try, in this order:

sample(c(1:2000),10)
sample(c(1:2000),10)

set.seed(2)
sample(c(1:2000),10)
set.seed(2)
sample(c(1:2000),10)

### In this case, everytime you run a randomizing function just after set.seed(number) you'll get the same results

7.2. Subsampling data using conditionals

To select data samples that satisfy one or more conditions, use the following conditionals

‘<’ less than
‘<=’ less or equal than
‘>’ greater than
‘>=’ greater or equal than
‘==’ equal to
‘!=’ different to

You can use more than one conditional to select samples of your data by using

& and
‘|’ or

# Let's use the first 20 rows of the 'turnout' data again.
rm(list=ls())
library("Zelig")
data("turnout")
turnout_short <- turnout[c(1:20),]
turnout_short

# First, let's select the observations for people that voted in the 1992 presidential election (vote==1).
# There are two types of vectors that are helpful for selecting cases.

turnout_short$vote == 1
# Reports whether a condition is TRUE or FALSE for each row.

which(turnout_short$vote == 1)

# Reports the row numbers of those cases that satisfy a condition.
                                    
# Note that the first vector reports TRUE or FALSE for all rows in the data.
# The second vector only reports sucess cases.

# Using these vectors, we can subsample our data
turnout_short

# The two following notations are equivalent:
turnout_short[turnout_short$vote==1         , ]  # Select cases (rows) where condition == TRUE
turnout_short[which(turnout_short$vote == 1), ]  # Select cases (rows) that satisfy the condition



############
# Now, let's select the observations for people that had 40 years old or less and voted in the 1992 presidential election.
# Conditions vote == 1 and age <= 40
turnout_short

# Once again, two types of vectors will be helpful to make this selection
turnout_short$vote == 1 & turnout_short$age <= 40
# Reports whether a condition is TRUE or FALSE for each row.

which(turnout_short$vote == 1 & turnout_short$age <= 40)
# Reports the row numbers of those cases that satisfy a condition

# Using these vectors, we can subsample our data
turnout_short

# The following notations are equivalent
turnout_short[turnout_short$vote==1 & turnout_short$age <= 40, ]  # Select cases where condition == TRUE
turnout_short[which(turnout_short$vote == 1 & turnout_short$age <= 40), ] # Select cases that satisfy the condition

7.3. Dropping data

# Use the '-' operator to drop cases or variables
# Examples

turnout_short
turnout_short[-3, ]         # Drops row 3
turnout_short[-c(3:5),]     # Drops rows 3 to 5
turnout_short[-c(1,3,5),]       # Drops rows 1, 3, and 5

# To drop columns or variables, you can use any of the following notations.
# Remember: the function 'colnames()' return the columns names of a dataset.

# Suppose we want to drop the variable "income".
# The following notations are similar
turnout_short
colnames(turnout_short)
turnout_short[  ,  -4]                          # Drops the 4th column
turnout_short[, c("race","age","educate","vote")] # Subsamples all data except column "income" 
turnout_short[, c(colnames(turnout)!="income")]   # Keeps columns whose name is not equal to "income".
turnout_short[,-c(which(colnames(turnout_short)=="income"))] # Drops the column whose name is equal to "income"

EXERCISE 4 (Data Munging)

Estimate the mean of schooling years for people that were 40 years old or less and voted in the 1992 presidential election.
The function ‘with()’ constructs a local environment defined by the data.No additional references are required.
Syntax: with(data, ‘expression’). Use the function with() to obtain the education level for people that were 40 years old or less and voted in the 1992 presidential election.
Alternatively, you can use the function ‘subset()’. ‘subset()’ returns subsets of vectors, matrices or data frames which meet conditions. Syntax: ’subset(data.frame, subset = conditions for rows, select = conditions for columns). First, from the dataset turnout_short get the people who voted using the function subset. Second, with the same function get the people that were 40 years old or less and voted Finally, get the level of schooling for these people.
For the dataset turnout_short, drop the cases for people that were 30 years old or less and voted. Hint. You can use the ‘which()’ function.

Solutions

8. Merging & appending datasets

8.1. Adding additional data

There are at least 3 ways to add additional data to a data frame:

Appending new data.
Adding existing information from a keyed table using the function ‘match()’
Merging 2 datasets using the function ‘merge()’

# Let's create first two databases

Data_A <- data.frame(
                    ID = c(1:5),
                    name = c("Edgar","Katie","John","Mary","Stephen"),
                    age = c(31,26,34,21,20),
                    department = c("PoliSci","Education","Sociology","Economics","Education"),
                    treatment = c(1,1,0,1,0),
                    score_time1 = c(8.5, 8.6, 7.4, 9.3, 9.2)
                    )

Data_B <- data.frame(
                    name = c("Edgar","Katie","John","Stephen","Stephen","David"),
                    score_time2 = c(8.7, 8.8, 7.3, 9.3, 9.5, 8.2)
                    )

Data_A
Data_B

# Note:
# 'Mary' is not in Data_B
# 'David'is not in Data_A
# 'Stephen' has two rows with different information in Data_B

8.1.1. Appending two datasets

## Supose we have the following information for David
data.david <- data.frame(ID = 6, name = "David" , age = 25 , 
                         department = "Anthropology", 
                         treatment = 0, score_time1 = 8.3)
data.david
Data_A

## Note that data.david and Data_A have the same columns
## To append two datasets, you can use the function 'rbind()' which stands for row binding 
rbind(Data_A,data.david)

# Let's append the data for 'David' and replace Data_A
Data_A <- rbind(Data_A,data.david)
Data_A

8.1.2. Merging two datasets

Let’s go back to our original datasets

Data_A <- data.frame(
                    ID = c(1:5),
                    name = c("Edgar","Katie","John","Mary","Stephen"),
                    age = c(31,26,34,21,20),
                    department = c("Edgar","Katie","Sociology","Economics","Education"),
                    treatment = c(1,1,0,1,0),
                    score_time1 = c(8.5, 8.6, 7.4, 9.3, 9.2)
                    )

Data_B <- data.frame(
                    name = c("Edgar","Katie","John","Stephen","Stephen","David"),
                    score_time2 = c(8.7, 8.8, 7.3, 9.3, 9.5, 8.2)
                    )

Data_A
Data_B

# Note:
# 'Mary' is not in Data_B
# 'David'is not in Data_A
# 'Stephen' has two rows with different information in Data_B

# The function 'merge()' in R merges two databases X and Y.
# Syntax:
#       merge(  x,                      # X dataset
#               y,                      # Y dataset
#               by.x = "Identifier",    # Key in X. More than 1 variable can be used.
#               by.y = "Identifier",    # Key in Y. More than 1 variable can be used.
#               all.x = TRUE,           # Keep all observations in X (TRUE/FALSE)
#               all.y = TRUE,           # Keep all observations in Y (TRUE/FALSE)
#               suffixes = c(".x",".y") # If other variables have the same name in X and Y
#               )                       #  rename them according to the source
                                        #  by adding the suffixes c(".x",".y")

# Notes:In Stata, the merge is based on variables that have the same name in both databases.
#       In R, these variables can have different names.
#       In Stata, the merge can be 1:1, m:1, 1:m, or m:m.
#       In R, these options can be approximated by changing the parameters 'all.x', 'all.y'.
#       Note that R doesn't check for unique identifiers.
#       If there are other variables with the same name, Stata keeps the data of the master dataset.
#       Instead, R allows the user to keep both columns.

# Let's try the 4 different merges
# Note the suffixes for name
merge(Data_A,Data_B,by.x="name", by.y="name", all.x=F, all.y=F)  # It throws 'Mary'and 'David'
merge(Data_A,Data_B,by.x="name", by.y="name", all.x=T, all.y=F)  # It throws 'David'

# More on this soon...

EXERCISE 5

Merge the two datasets (A and B ) by name excluding Mary. Remember that Mary is not in dataset B
Merge the two datasets (A and B) keeping all info on both datasets