There are multiple types of data manipulation that you are likely to use consistently across different projects and datasets. Subsetting, as covered in the previous lesson, is just one of them. While subsetting with brackets is useful, it can be cumbersome to work with, especially if you want to work with many different subsets of your data. This is where dplyr
comes in. dplyr
is a package for making data manipulation easier.
Packages are collections of functions (and sometimes data sets) that build upon the base functionality of R. R comes with many built-in functions, including the ones that we have been using so far (such as read.csv()
, round()
, and c()
, as well as operators like +
, -
, and >=
). As we will see later on, you can also write new functions, to extend what you can do in R. Researchers and others who write functions in R often bundle these new functions into packages, which can then be distributed more broadly. This is a major reason why R is so powerful; as people come up with new approaches and methods for working with data, these approaches can be made available to the wider community of R users.
Before you can use a new package for the first time, you need to download and install it on your computer. You only need to do this once. Once you’ve installed it, you will need to load it into R each time you want to access it. However, it is easy to write code at the top of your script to automatically load the packages that you need for that script.
To download dplyr
, run the following:
install.packages("dplyr")
You may or may not be asked to choose a CRAN mirror, which means that you have to decide which site from which to download the package. It doesn’t really matter which you choose, as the package will be the same in all locations.
In general, if you try to install a package and you get an error message saying that your version of R or RStudio is too old, then it is a good idea to update both of those (and then install the package).
Once you’ve installed dplyr
, you can load it into your R session as follows:
library("dplyr") # Load the package
If it has been loaded properly, you should be able to call up a help file with the following code:
?select
The dplyr
package contains functions that help with the most common data manipulation tasks, and it was designed to work directly with data frames. It ports much of the computation to C++, which can speed up run time. It can also work with data stored in an external database. For our purposes, though, it has some very useful functions that work quickly!
You may find this dplyr cheatsheet useful, both for the functions that we cover in class and for other useful functions that you may be interested in using later on.
In this lesson, we will learn some of the most commonly used dplyr
functions: select()
, filter()
, mutate()
, group_by()
, and summarize()
. We’ll work with the data frame trees
that we’ve used in previous lessons. Be sure to read in the file again if you are working in a new R session. Take a look at the output of str(trees)
if you need a reminder of the data that this file contains.
trees <- read.csv(file = "Data/trees.csv", stringsAsFactors = FALSE)
The first dplyr
function that we will use, select()
, allows you to subset a data frame by extracting specific variables (columns). The first argument to the function is the data frame (trees
), and the next arguments are the columns to be selected. For example, here are the first few lines of the result of a call to select()
on trees
:
select(trees, Site, Species, Count) # Your output will have more lines
## Site Species Count
## 1 Ottawa Pinus strobus 11
## 2 Ottawa Acer rubrum 18
## 3 Ottawa Cornus florida 28
## 4 Ottawa Quercus alba 15
## 5 Ottawa Liriodendron tulipifera 18
## 6 Ottawa Tsuga canadensis 30
What if we want to extract only specific rows, such as the observations where more than 40 trees of a certain species were counted? To do this, we can subset the data frame using the function filter()
. As with select, the first argument is the data frame, but the next arguments are logical statements specifying the data that we want to extract.
filter(trees, Count > 40)
## Province Site Plot Species Count
## 1 Ontario Ottawa 4 Acer rubrum 42
## 2 Ontario Kingston 2 Pinus strobus 44
## 3 Ontario Toronto 1 Quercus alba 41
## 4 Quebec Montreal 5 Tsuga canadensis 44
## 5 Quebec Quebec 5 Liriodendron tulipifera 41
You can assign the result to a new variable. For example:
trees_40 <- filter(trees, Count > 40)
trees_40
## Province Site Plot Species Count
## 1 Ontario Ottawa 4 Acer rubrum 42
## 2 Ontario Kingston 2 Pinus strobus 44
## 3 Ontario Toronto 1 Quercus alba 41
## 4 Quebec Montreal 5 Tsuga canadensis 44
## 5 Quebec Quebec 5 Liriodendron tulipifera 41
You can also include multiple conditions within the same filter()
function call. To do this, we can use the AND &
and OR |
functions that were introduced in the lesson Introduction to R. For example, we can subset the data to include observations with fewer than 20 trees for sites in New Brunswick (try this for yourself):
filter(trees, Count < 20 & Province == "New Brunswick")
But what if you wanted to both select columns and filter rows on the same data set, at the same time? Say, for example, you want to see all of the observations where more than 40 trees were counted, and you only want the Species
and Count
variables. There are several different ways to do this.
One option would be to filter rows, assign the result to an intermediate data frame, and then select columns of this new data frame. For example:
trees_40 <- filter(trees, Count > 40)
trees_40_speciesCount <- select(trees_40, Species, Count)
trees_40_speciesCount
## Species Count
## 1 Acer rubrum 42
## 2 Pinus strobus 44
## 3 Quercus alba 41
## 4 Tsuga canadensis 44
## 5 Liriodendron tulipifera 41
However, you now have an unnecessary data frame cluttering your environment, and as you work with your data, this can very quickly lead to a lot of unnecessary objects being stored.
A second option would be to nest statements inside each other. For example:
select(filter(trees, Count > 40), Species, Count)
## Species Count
## 1 Acer rubrum 42
## 2 Pinus strobus 44
## 3 Quercus alba 41
## 4 Tsuga canadensis 44
## 5 Liriodendron tulipifera 41
In this case, the result of the filter()
step is being used as the input to the select()
step. While this works fine, it can be difficult to interpret the code, especially if your statement becomes more complex.
The third option is to use a pipe. Pipes let you take the output from one function in R and send it (pipe it) directly to a second function. Pipes are a fairly recent addition to R, and are made available via the magrittr
package that is installed as part of dplyr
. (This might help explain the Magritte pun.) Pipes use an inline function that looks like this: %>%
. Operators for inline functions are used the way +
or -
would be used, i.e., they are placed in line with the rest of the code, rather than having a function name followed by parentheses that surround, or wrap, the code.
When you use a pipe, you don’t need to specify the first, required argument of the next function. This is because the first argument will be the output of the previous function.
With pipes, we would do the same filtering and selecting steps as above by running the following code:
# Note that the data frame argument for filter() and select()
# are not specified because it is implied that that is the output
# of the previous code.
trees %>%
filter(Count > 40) %>%
select(Species, Count)
## Species Count
## 1 Acer rubrum 42
## 2 Pinus strobus 44
## 3 Quercus alba 41
## 4 Tsuga canadensis 44
## 5 Liriodendron tulipifera 41
If you want to use pipes but aren’t sure about each part of the function, you can try running just one part at a time, selecting code up to (but not including) the pipe operator. For example, you can run trees
, then trees %>% filter(Count > 30)
, then the whole line. (If you run the pipe operator, R will expect more input. Try this out to demonstrate it to yourself.)
As before, we can assign the results to a new variable. Note that the name of the final, newly created variable is still all the way on the left, before the assignment operator <-
.
trees_sm <- trees %>%
filter(Count > 30) %>%
select(Species, Count)
Using pipes, subset trees
to include only the observations in plot 1 of each site. Do this both without and with pipes.
Use pipes and subsetting to determine how many species have fewer than 10 individuals in plot 2 in Saint John.
Using pipes, subset trees
to include only the observations from Fredericton, and only the columns indicating species and the number of each species counted. Assign the result to a new variable called trees_Fredericton
. Does it matter whether you use filter()
or select()
first? Why or why not?
Using pipes, filter()
, and str()
, figure out how many observations there are in a subset of trees
where the Plot
number is 4 and each species has at least 10 trees in that plot.
Often you’ll want to make a new column based on the content of a column that already exists. For example, you might want to do a unit conversion, or find the ratio of values in two columns. To do this, we can use the function mutate()
. With mutate()
, the first argument is the data set (passed to the function via a pipe). The following arguments consist of the name of the new variable, followed by the equal sign =
and code indicating how the new variable is formed.
For example, say each plot in the trees
study was 1,000 m2, and we wanted to convert this to the number of trees per hectare (1 ha = 10,000 m2). To figure this out, we’d want to multiply Count
by 10. We could do this using mutate()
as follows:
trees %>%
mutate(Count_ha = Count * 10)
If your output runs off the screen and you only want the top 6 lines to be displayed, you can pipe the output to the function head()
.
trees %>%
mutate(Count_ha = Count * 10) %>%
head()
## Province Site Plot Species Count Count_ha
## 1 Ontario Ottawa 1 Pinus strobus 11 110
## 2 Ontario Ottawa 1 Acer rubrum 18 180
## 3 Ontario Ottawa 1 Cornus florida 28 280
## 4 Ontario Ottawa 1 Quercus alba 15 150
## 5 Ontario Ottawa 1 Liriodendron tulipifera 18 180
## 6 Ontario Ottawa 1 Tsuga canadensis 30 300
You can also reassign the new data frame to the original trees
variable, so that the new variable Count_ha
is included in the trees
data frame.
trees <- trees %>%
mutate(Count_ha = Count * 10)
trees
data that includes the variables Site
, Species
, and Count_half
where Count_half
is equal to half the value of Count
and there are no Count_half
values less than 10. Hint: Think carefully about the order of operations!summarize()
functionVery often, you will want to apply an analysis to groups within your entire data set - for example, with the trees
data, you might be interested in total tree count per plot, or mean count per species across plots. This type of analysis can be approached with the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results. With dplyr
, you can do this easily with the group_by()
function.
The group_by()
function is often used along with the function summarize()
, which collapses each group into a single-row summary of that group. The arguments to group_by()
are the column names that contain the categorical variables for which you want to calculate the summary statistics, and summarize()
(or summarise()
) defines the summary column, similarly to how new columns were defined with mutate()
.
For example, to view the mean Count
by Species
across all sites and plots:
trees %>%
group_by(Species) %>%
summarize(meanCount = mean(Count))
## # A tibble: 6 × 2
## Species meanCount
## <chr> <dbl>
## 1 Acer rubrum 20.22
## 2 Cornus florida 21.50
## 3 Liriodendron tulipifera 21.42
## 4 Pinus strobus 19.02
## 5 Quercus alba 20.22
## 6 Tsuga canadensis 19.46
You can also group by multiple columns by including both as arguments to the group_by()
function. For example, to view the mean Count
by both site and species, you could do the following:
trees %>%
group_by(Site, Species) %>%
summarize(meanCount = mean(Count))
## Source: local data frame [60 x 3]
## Groups: Site [?]
##
## Site Species meanCount
## <chr> <chr> <dbl>
## 1 Fredericton Acer rubrum 18.6
## 2 Fredericton Cornus florida 20.2
## 3 Fredericton Liriodendron tulipifera 20.2
## 4 Fredericton Pinus strobus 16.0
## 5 Fredericton Quercus alba 19.6
## 6 Fredericton Tsuga canadensis 23.8
## 7 Kingston Acer rubrum 22.4
## 8 Kingston Cornus florida 25.8
## 9 Kingston Liriodendron tulipifera 12.0
## 10 Kingston Pinus strobus 23.8
## # ... with 50 more rows
Note that the output no longer runs off the screen. This is becuase dplyr has changed the data.frame
to a tbl_df
. This data structure is very similar to a data frame, and can be used in the same way, but it automatically shows fewer lines, to make the output more readable. If you want to display more lines on the screen, you can pipe the result to the print()
function, with the argument n
specifying the number of rows to display.
trees %>%
group_by(Site, Species) %>%
summarize(meanCount = mean(Count)) %>%
print(n = 15)
## Source: local data frame [60 x 3]
## Groups: Site [?]
##
## Site Species meanCount
## <chr> <chr> <dbl>
## 1 Fredericton Acer rubrum 18.6
## 2 Fredericton Cornus florida 20.2
## 3 Fredericton Liriodendron tulipifera 20.2
## 4 Fredericton Pinus strobus 16.0
## 5 Fredericton Quercus alba 19.6
## 6 Fredericton Tsuga canadensis 23.8
## 7 Kingston Acer rubrum 22.4
## 8 Kingston Cornus florida 25.8
## 9 Kingston Liriodendron tulipifera 12.0
## 10 Kingston Pinus strobus 23.8
## 11 Kingston Quercus alba 18.8
## 12 Kingston Tsuga canadensis 13.6
## 13 Moncton Acer rubrum 20.6
## 14 Moncton Cornus florida 18.0
## 15 Moncton Liriodendron tulipifera 14.6
## # ... with 45 more rows
Once you have grouped your data, you can summarize multiple variables at the same time, even on different variables. For example, we could find the minimum number of trees counted for each species using min()
, and also the highest plot number of each group using max()
.
trees %>%
group_by(Site, Species) %>%
summarize(meanCount = mean(Count), minCount = min(Count))
## Source: local data frame [60 x 4]
## Groups: Site [?]
##
## Site Species meanCount minCount
## <chr> <chr> <dbl> <int>
## 1 Fredericton Acer rubrum 18.6 9
## 2 Fredericton Cornus florida 20.2 2
## 3 Fredericton Liriodendron tulipifera 20.2 14
## 4 Fredericton Pinus strobus 16.0 3
## 5 Fredericton Quercus alba 19.6 14
## 6 Fredericton Tsuga canadensis 23.8 6
## 7 Kingston Acer rubrum 22.4 16
## 8 Kingston Cornus florida 25.8 20
## 9 Kingston Liriodendron tulipifera 12.0 2
## 10 Kingston Pinus strobus 23.8 9
## # ... with 50 more rows
You may often want to know the number of observations found for each factor or combination of factors in a data frame. dplyr
provides a function that does this directly: tally()
. For example, if we wanted to know how many individual species counts were made in each province, we could do the following:
trees %>%
group_by(Province) %>%
tally()
## # A tibble: 3 × 2
## Province n
## <chr> <int>
## 1 New Brunswick 90
## 2 Ontario 120
## 3 Quebec 90
In the above code, tally()
is applied to each of the groups created by group_by()
, and it counts the total number of records for each category.
We could also do this with the summarize()
function introduced above, by using it with the function n()
, which counts the number of observations in a group.
trees %>%
group_by(Province) %>%
summarize(Records = n())
## # A tibble: 3 × 2
## Province Records
## <chr> <int>
## 1 New Brunswick 90
## 2 Ontario 120
## 3 Quebec 90
Here, we have the same output as above, except that we’ve defined the variable that indicates the number of records for each province. Note that there are often multiple ways to accomplish the same task in R!
For the following questions, use dplyr
functions and pipes to display output that will answer the question.
For each site in the study, what is the mean number of trees counted in an observation (for any species)?
For each site in the study, what is the mean number of trees counted in an observation of Acer rubrum (red maple)? (Hint: What subset of the data do you want to group and summarize?)
How many observations are there for each site in the study?
Are there any observations for which no trees of a given species were counted? If so, which observations are they?
Using pipes and dplyr
functions, make a new data frame that shows how many species are found in each plot across the entire data set. Then, extend this code to subset the data and show whether any plots do not include all 6 species. (Hint: Be sure that you don’t include observations with 0 trees!)
Once you have used dplyr
functions to extract the data that you want to work with, or to summarize your raw data, you might want to export these new datasets to archive them, have them available for later use (e.g., plotting), or share them with colleagues.
To do this, you can use the write.csv()
function, which works similarly to the read.csv()
function. The write.csv()
function creates .csv files from data frames.
In the Organization and R Basics lesson, you should have created a folder in your working directory for output data, with a name along the lines of Data_output
. This is where you should put your newly generated, organized data frame. It is a good idea to keep your raw data and organized data separately. If your raw data is kept in its own folder, it lessens the chance of accidentally deleting or modifying it. On the other hand, the data in your Data_output
folder may be consistently generated by your scripts, and you know that you can delete the files because you have the scripts to recreate them.
Let’s say that we want to create a data set that shows the total number of trees counted for each plot in New Brunswick, and we want to write this data to file to share with a colleague. We can start by creating the new data frame and taking a look at it:
treeCount_NB <- trees %>%
filter(Province == "New Brunswick") %>%
group_by(Site, Plot) %>%
summarize(treeCount = sum(Count))
treeCount_NB
## Source: local data frame [15 x 3]
## Groups: Site [?]
##
## Site Plot treeCount
## <chr> <int> <int>
## 1 Fredericton 1 144
## 2 Fredericton 2 125
## 3 Fredericton 3 108
## 4 Fredericton 4 106
## 5 Fredericton 5 109
## 6 Moncton 1 84
## 7 Moncton 2 128
## 8 Moncton 3 96
## 9 Moncton 4 126
## 10 Moncton 5 133
## 11 Saint John 1 115
## 12 Saint John 2 127
## 13 Saint John 3 123
## 14 Saint John 4 138
## 15 Saint John 5 104
Then we call write.csv()
to write this data to file, in our Data_output
folder. This function requires as input the name of the data frame to write and the path (again, relative to your working directory) and filename of the file, including the file extension. You can also specify whether or not to include a separate variable for row names, which is done by default. In this case, that would be a column with the numbers 1 through 15, which we don’t want, so we will set row.names
to FALSE
:
write.csv(treeCount_NB, file = "Data_output/treeCount_NB.csv", row.names = FALSE)
Once you’ve done this, look in your Data_output
folder in the Files
pane to check that the file was written as expected. Now you have a file that you can read back into R in a separate script, or open in a text editor or in Excel.
This lesson is based on materials from Data Carpentry’s Data Analysis and Visualization in R curriculum (as of 11 October 2016), which is licensed under the Creative Commons CC-BY. This license allows sharing and adapting materials for any purpose, as long as attribution is given. Generally, the content, concepts, and flow are similar to the original lesson, but the words and some specific examples differ.