7 R
R is a powerful and expressive language, custom built for statistics and data science. Often compared to Python, R has particular strengths in:
- visualisation
- statistics and modelling
- extensions (packages)
- community support
The fact that these course notes have been compiled using R is testament to the versatility of the R package ecosystem. Specifically, these notes are written in R Markdown and converted to HTML using the bookdown package.
7.1 What is R?
From the R Project website:
R is a free software environment for statistical computing and graphics.
There is an interesting subtlety here: R is described as a software environment rather than a language. This ambiguity is intentional and dates back to the S language, on which R is based. The following quote from John Chambers sheds some light on this:
We wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.
This makes R a good entry point into data science for non-programmers, especially compared to Python, which typically requires a stronger understanding of programming to accomplish most data science tasks.
Some of the key selling points for using R include:
- Reproducibility, Reporting, and Automation
- Graphics and data visualisation
- Extensibility and community support
- RStudio Integrated Development Environment (IDE)
- A single tool which you can use for end-to-end projects
- Open source and used across many countries and industries
- Excellent online documentation and learning resources
7.2 How R Works
R is typically installed as an executable application on your computer. Once installed, there are a few different ways that it can be used:
- Interactive code execution
  - on the command line (type `R` then hit enter)
  - using the GUI (search for `R.app` or `R.exe` on macOS or Windows respectively)
- Script execution via Rscript
  - on the command line, type `Rscript my_script.R` to run a script from start to finish
- Interactive and script execution via an Integrated Development Environment (IDE)
  - RStudio is the most widely used IDE for R.
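To make the script option concrete, here is a minimal sketch of what a file like `my_script.R` might contain. The contents are hypothetical (the file name is only used as an example above) and it relies on the built-in `iris` dataset, so no packages need to be installed.

```r
# my_script.R (hypothetical contents, for illustration only)

# Mean sepal length for each species in the built-in iris dataset
sepal_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Print the result so it appears in the terminal when run via Rscript
print(round(sepal_means, 2))

# Report when the script finished
cat("Finished at", format(Sys.time()), "\n")
```

The same lines can be typed one at a time into an interactive R session, or the whole file can be run with `Rscript my_script.R`.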
There are also a number of R packages that allow you to run R as:
- an interactive notebook, via rmarkdown within RStudio
- an interactive website, via shiny within Shiny Server or RStudio Connect
- an Application Programming Interface (API), via plumber which runs directly on a server, or via RStudio Connect
RStudio Connect and Shiny Server Pro are commercial solutions offered by RStudio - you are unlikely to install them yourself. RStudio does offer Shiny application hosting as a service via shinyapps.io, which is free for small projects.
All of these deployment options have one thing in common: they all require the R software environment to be available on the system where they are being hosted. Unlike programs written in many compiled languages (C, C++, Swift, etc.), R code needs R to be installed on every computer where it runs - you can’t just send someone a file and expect them to run it without installing anything.
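If you are unsure which R installation a session or script is using, base R can report it. The sketch below is only an illustration; the variable names are arbitrary.

```r
# Version string of the R installation running this code
r_version <- R.version.string

# Directory where this R installation lives
r_home <- R.home()

cat("Running", r_version, "\n")
cat("Installed at", r_home, "\n")
```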
In practice this means that people will often share analytical work using either:
- hosted services, such as those offered by RStudio
- static documents created using tools like rmarkdown or flexdashboard, both of which create HTML documents that can be shared with anyone who has a web browser (which is just about everyone). rmarkdown can also create PDF documents, which are also widely accessible.
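As a rough sketch of the static document route, the code below writes a tiny R Markdown file to a temporary location and converts it to HTML. The document contents are made up for illustration, and the rendering step is skipped unless the rmarkdown package and pandoc are both available on your system.

```r
# Build the code-chunk fence programmatically to keep this example readable
fence <- strrep("`", 3)

# Write a minimal R Markdown document to a temporary file
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Example report\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(mtcars$mpg)",
  fence
), rmd_path)

# Render to HTML only if rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  message("Report written to: ", html_path)
} else {
  message("rmarkdown or pandoc not available; skipping render")
}
```

The resulting HTML file can be opened in any web browser, without R installed.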
You can also set up your own hosting environments by installing directly onto a server or by deploying a Docker container - we’ll cover the Docker option in detail in a later chapter of this course.
7.3 Setting up an R Environment
Installing R is really easy - just go to the R-Project Website to download an installer for your system. This will install the R command line tool and the R GUI, which will let you run R interactively, and let you run R scripts. You should also install RStudio which is an Integrated Development Environment (IDE) specifically designed to support data science with R.
For a more detailed guide on installing R and RStudio, you can look at the Introduction to R for Data Science.
During the installation, R sets up one or more “libraries” on your computer. These libraries contain R packages which can be used to extend the functionality of R. Some examples of popular packages include:
- ggplot2 - the most popular way to build beautiful visualisations using R
- dplyr - simple and powerful data manipulation
- data.table - alternative to dplyr
- httr - easy HTTP communications with R (used for communicating with APIs)
- xgboost - interface to the popular XGBoost machine learning toolkit
There are over 10,000 packages available through the Comprehensive R Archive Network (CRAN). You can use R to download and install these packages into the library on your computer. As an example, to download and install the dplyr package:
- Open an R session using RStudio
- In the R Console type `install.packages('dplyr')` and press Enter
This will download the package to your computer and install it into your library. To see where your libraries are located, you can type `.libPaths()` into the R Console then press Enter - this will show you one or more file paths where R has installed packages on your computer. You can learn more about how libraries work in R by reading the What is a Library section of Hadley Wickham’s R Packages book.
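The following sketch shows one way to inspect your libraries from the R Console and to check whether a package is already installed. It does not install anything; the variable names are illustrative only.

```r
# One or more directories where R looks for installed packages
lib_dirs <- .libPaths()
print(lib_dirs)

# TRUE if dplyr can be found in one of those libraries, FALSE otherwise
dplyr_installed <- requireNamespace("dplyr", quietly = TRUE)

if (dplyr_installed) {
  message("dplyr is installed")
} else {
  message("dplyr is not installed; run install.packages('dplyr') to add it")
}
```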
You will learn about how to load packages - and use the functions within them - as part of the DataCamp courses below.
7.4 Learning R with DataCamp
As a Data Scientist working in a business you would be expected to have at least one programming language (R or Python) in which you can:
- Import datasets from files and databases
- Manipulate, aggregate and summarise datasets
- Create compelling visualisations
- Train and evaluate models
In order to develop the first three skills on the list (we won’t have time to learn about modelling in this course), students will be given two assignments on DataCamp, which are compulsory but will not be assessed.
Subject | Topics |
---|---|
Introduction to R | Basics, vectors, matrices, factors, data frames, lists. |
Introduction to the Tidyverse | Data wrangling, visualisation, grouping and summarising. |
These two courses provide a foundation in using R for basic analysis tasks. The second course introduces the “Tidyverse”, a collection of packages developed by the RStudio team that aims to make powerful data analysis possible without needing to learn too much about programming. If you are new to programming, these two courses will take you from not being an R user to being able to manipulate datasets and perform complex analysis.
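To give a feel for the first three skills before you start the courses, here is a small sketch using only base R and the built-in `mtcars` dataset, so it runs without installing any packages. The file paths are temporary files created for the example; the Tidyverse course covers the same kinds of steps using packages such as dplyr and ggplot2.

```r
# 1. Import: write a built-in dataset to a CSV file, then read it back in
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)
cars <- read.csv(csv_path)

# 2. Manipulate and summarise: mean fuel economy by number of cylinders
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars, FUN = mean)
print(mpg_by_cyl)

# 3. Visualise: bar chart saved to a temporary PDF file
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path)
barplot(
  mpg_by_cyl$mpg,
  names.arg = mpg_by_cyl$cyl,
  xlab = "Cylinders",
  ylab = "Mean miles per gallon"
)
dev.off()
```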
As an alternative to these two DataCamp modules, you may wish to read R for Data Science which provides a more gentle introduction, with less focus on the R language and more focus on how to perform important analytics tasks. If you work through the examples yourself then it is likely to take longer than the DataCamp courses, but you’ll end up a much more confident R user because you will have written and executed lots of R code that helps you complete tasks that are very similar to what you will encounter in the workplace.
You may choose to either complete the DataCamp courses or read the R for Data Science book; you are not expected to do both. If you intend to use R as your primary language, you should also consider attempting DataCamp’s Data Scientist with R career track, which includes most of the R skills you’ll need as a professional data scientist. This will take you significantly longer than a single semester to complete, but you can make a start any time.
7.5 Other R Resources
Resource | Notes | Cost (AUD) |
---|---|---|
R for Data Science | R for Data Science is a comprehensive and opinionated introduction to using R for data science, and to programming in general. | Free! |
Advanced R | Advanced R is a deep-dive into the technical details of how to make the most of R’s advanced features when programming. | Free! |
R Packages | R Packages is a step-by-step guide to creating and distributing your own R packages. | Free! |
#rstats | The #rstats community is a gold mine of information and news about what people are doing with R. | Free! |