In this lesson, we will cover some of the basics of working with R and starting a new project. This will help you get your footing and be prepared for future lessons.
There are two aspects to the question of why to use R for data analysis:
Reproducibility
One of the major reasons to use coding and scripts to work with data is that it allows for reproducible science. If all of your analyses are written out and documented, and you are not working with the original data files directly, you can always go back to your code and re-run an analysis.
There are several options for programs, or combinations of programs, that can serve a similar purpose to R. One commonly used, free alternative is Python, which shares many of the same advantages as R. Differences in structure and available packages mean that each language may be somewhat better suited to different tasks (e.g., statistics, visualization, or programming) or scientific fields (e.g., whether a certain set of advanced functions or analyses is already available), but both are great options. There are also commercial alternatives for statistical analyses and programming, such as SAS or MATLAB.
So, what are the benefits of using R?
RStudio is an Integrated Development Environment (IDE) for R. Essentially, it is a graphical interface that makes it easier to work with R by pulling together different components of the typical analysis workflow. When you open RStudio, it automatically opens an R session. When you run commands in RStudio, it sends them to R and returns the result. RStudio also lets you navigate the files on your computer, inspect variables that you have created, and view plots.
Open RStudio on your own computer. Your screen should look something like this:
If you have not already opened a script, open a new one, in one of two ways:
Go to File -> New File -> R Script.
Click the new-file button in the toolbar and select R Script.
Your screen should now look something like this:
There are 4 panes, or areas, of the program window.
Try this out for yourself: Type the following, and then hit enter: 4 + 5. Then try a few different math problems.
Click Run in the top right corner of the script pane.
Press ctrl+enter. On a Mac, you can press command+enter.
To run several lines at once, select them first (then click the Run button, or use a keyboard shortcut).
Try this out for yourself: Type a math problem in your script (e.g., 4 + 5), then run the line in each of the ways described above.
Environment and History tabs:
The Environment tab displays the names of variables in your R session, and information on the structure and/or values of each.
The History tab shows you all of the code that you have run, either in this session, or as stored to an R history file (.Rhistory).
Try this out for yourself: Define a variable, x, by typing and running the following (in the console or in your new script): x <- 5. Then look at your Environment and History tabs.
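The information in the Environment tab can also be checked from the console. As a small sketch (the second variable, y, is added here purely for illustration), ls() lists the names of the variables defined in the current session:

```r
# Define a variable, as in the exercise above
x <- 5

# y is a hypothetical second variable, added only for illustration
y <- x + 1

# ls() returns the names of the objects in the current environment,
# which is the same information the Environment tab displays
ls()

# Typing a variable's name prints its value
x
```

Both x and y should now also appear in the Environment tab.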
Files: This is how you navigate directories on your computer.
Plots: This is where plots will be displayed.
Packages: This shows you which of your installed packages are loaded.
Help: This is where you can find help files.
Viewer: This is like the plots pane, except that it is especially for displaying web content.
Try this out for yourself: Navigate through your file system in the Files tab. Then try searching for something in the search bar of the Help tab - try the word “small”, and then the word “print”. Notice that when there is a specific entry for the word you searched, this entry will be displayed, but if not, you will see all the help pages that include this word.
It is a good idea to keep all of your project files together in one directory, and to have a consistent subdirectory structure. Besides the fact that this helps with general organization, it also allows you to work with relative file paths within a project. This means that you can specify a working directory and then indicate other file paths relative to this directory. This allows you to later move the project directory around, even onto a different computer, without breaking your code. Also, it will be simpler to point to the right directory when writing code to read or write data or other input or output.
Your project directory for this course should be modeled after this example:
Here, the name of the main project directory is IntroR_OnlineCourse, but you can use any name you’d like. This folder will be your working directory. In this case, this folder is in the Desktop folder, but you can store yours anywhere on your computer that makes sense to you. Within this folder, you should have the following subdirectories:
In your own work, you may choose to have other directories as well, but this is a good structure to model yours on.
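As a rough sketch of how such a project directory and relative paths fit together, the code below builds a small project inside a temporary directory. Only the Data and Scripts subdirectories are created, since those are the two named elsewhere in this lesson; the file name example.csv and the variable names are assumptions made for illustration.

```r
# Sketch only: the project is created inside a temporary directory so that
# running this does not touch your real files. In practice you would use a
# location such as your Desktop instead of tempdir().
project_dir <- file.path(tempdir(), "IntroR_OnlineCourse")

# Create the project directory and two subdirectories named in this lesson
dir.create(file.path(project_dir, "Data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "Scripts"), showWarnings = FALSE)

# Make the project directory the working directory, remembering the old one
old_wd <- setwd(project_dir)

# Paths can now be written relative to the project directory
data_path <- file.path("Data", "example.csv")  # hypothetical file name
write.csv(data.frame(a = 1:3), data_path, row.names = FALSE)
example <- read.csv(data_path)

# List the files in the project, with paths relative to the project directory
list.files(recursive = TRUE)

# Restore the previous working directory
setwd(old_wd)
```

Because the code only refers to Data/example.csv relative to the working directory, the whole project folder could be moved elsewhere and only the setwd() call would need to change.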
Now let’s start working in RStudio. We’ll begin by saving the new script that you have opened. You can do this in one of three ways:
Go to File -> Save.
Click the save button in the toolbar.
Press ctrl+S. On a Mac, you can press command+S.
When the message box opens, give your new script any name you’d like (CRI_R_Course_Scratchpad.R isn’t a bad idea), and save the script into the Scripts folder of your working directory.
While working through the lessons in the rest of this course, you are welcome to work in this file, or to open (and save!) others. Using different scripts for different lessons may help you if you want to reference class code later. In any case, please use individual R scripts for assignments, and save them with easily interpretable names!
And, most importantly: You can use the console to test lines of code, but store your code in saved scripts so that you can access it later! It is good practice to do so, and the habit will serve you well in the future.
To get used to working in R, let’s do some math in the console. The prompt symbol (>) means that R is ready to accept a command. Try typing a simple computation: type 3 + 3 and then hit enter.
The result should be printed below the problem, and then you should see another command prompt. Try a few other math problems on your own. You can use the operators that you might expect to be able to use, such as +, -, /, *, and parentheses ( and ).
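A few lines you might try are sketched below. The specific numbers are arbitrary; the point is that R follows the usual order of operations, and that parentheses override it.

```r
3 + 3        # addition
10 - 4       # subtraction
6 * 7        # multiplication
9 / 2        # division

# Multiplication and division are evaluated before addition and subtraction
2 + 3 * 4

# Parentheses change the order of evaluation
(2 + 3) * 4
```

The last two lines give 14 and 20, respectively.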
What happens if you don’t finish a line of code? Try typing 3 * and then hit enter.
You should get a + at the beginning of the next line.
This means that R is expecting more input. You can finish the line of code at this point (try typing 4 and then hit enter). If you are stuck in the middle of a complicated command and don’t know how to finish it, you can get out of this computation by pressing esc. This should get you back to the > command prompt.
Here is an example of a script in R:
There are several notable features of this script:
Comments: Comments are marked with #.
Assignment operator: The assignment operator is <-.
x <- 3 sets the value of x to 3. It can be read as “3 goes into x.”
Try assigning a value to x (e.g., x <- 3). Then type x and hit enter to see the value of x.
You can also use = instead of <-, but try to avoid this. Use <- to be explicit.
The RStudio keyboard shortcut for <- is alt + -.
$ operator: The $ operator accesses variables within a data frame.
trees$Count refers to the Count variable within the data frame trees. We will work with data frames and data frame variables in later lessons.
Functions: The script calls functions such as library() and print(). The values inside the parentheses are called arguments, and they determine how the function will operate.
Arguments can be named with = (e.g., file = "Data/trees.csv" on line 15).
Arguments can also be given without names, as in sqrt(4).
As an example, let’s work with the function round(). Try running the following:
round(3.14159)
## [1] 3
Note that the gray boxes indicate code, and the white boxes indicate output.
You may have noticed that the function rounds to the nearest whole number. Why does it do this, instead of rounding to some other number of significant digits? One way to look at the details of a function is to use the function args(), which gives you the function’s arguments. Try running the following:
args(round)
## function (x, digits = 0)
## NULL
We can get more information from the help file, which we can call with ?round.
From the output of args(round) and from the help file, we can learn that round is a function that has two arguments, x and digits. The default for digits is 0, which means that x will be rounded to 0 decimal places. Let’s try a different value for digits:
round(3.14159, digits = 2)
## [1] 3.14
You might have noticed that we specified the argument name for digits, but not for x. If you include the arguments in the order defined by the function, you don’t have to name them:
round(3.14159, 2)
## [1] 3.14
And if you name them, you can put them in any order:
round(digits = 2, x = 3.14159)
## [1] 3.14
…but for clarity, it is better to do the following: put the non-optional argument first, and then name the other arguments. I.e., use round(3.14159, digits = 2), as we did at the beginning.
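The script features described in this section (comments, assignment, the $ operator, and function calls with named arguments) can be pulled together in one short sketch. The data frame below is made up for illustration and stands in for the course's trees data, which is not reproduced here; the Species column and the variable mean_count are likewise assumptions.

```r
# Comments start with #, and R ignores everything after it on the line

# Assignment: store a small, made-up data frame in a variable named trees
# (hypothetical data; the course's real file is not reproduced here)
trees <- data.frame(
  Species = c("oak", "maple", "pine"),
  Count = c(12, 7, 30)
)

# The $ operator pulls one variable (column) out of the data frame
trees$Count

# Function calls: the first argument is given by position,
# and the optional argument is named
mean_count <- mean(trees$Count)
round(mean_count, digits = 1)

# sqrt() takes a single argument here, so no name is needed
sqrt(4)
```

Note that the round() call follows the recommendation above: the required argument comes first, and the optional digits argument is named.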
There are several options for getting help while working in RStudio.
If you know the name of a function and need a reminder of how it works, you can use ? to call up the help file, as in ?barplot. (Try this.)
If you just need a reminder of the arguments for a function, you can use the function args(), as in args(lm). (Try this.)
You can use ?? to search the help files for all installed functions, as in ??round. (Try this.)
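The three help options above can be tried together, as in the sketch below. The ? and ?? shortcuts are written in their function forms, help() and help.search(), and are wrapped in interactive() so that they only run when you are working at the console; the variable lm_args is introduced here for illustration.

```r
# Show the arguments of a function without opening its help file
args(lm)

# The argument names can also be extracted as a character vector
lm_args <- names(formals(lm))
lm_args

# help() and help.search() are the function forms of ? and ??.
# They open results in the Help tab, so run them only in an
# interactive session.
if (interactive()) {
  help("barplot")        # same as ?barplot
  help.search("round")   # same as ??round
}
```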
If you can’t figure out how to solve a problem, there may be helpful information online!
When you post a question in an online forum - or even if you ask someone you know! - it is a good idea to follow some guidelines:
Provide a small example that others can run, ideally using a dataset from the built-in package datasets. Run data() to get a list of available datasets.
Include the output of sessionInfo().
This lesson is based on materials from Data Carpentry’s Data Analysis and Visualization in R curriculum (as of 11 October 2016), which is licensed under Creative Commons CC-BY. This license allows sharing and adapting materials for any purpose, as long as attribution is given. Generally, the content, concepts, and flow are similar to the original lesson, but the words and some specific examples differ.