11 Containers

A container can be thought of as being a computer inside your computer. Containers are (mostly) a Linux-only technology which has been popularised in recent years by the emergence of Docker and now Kubernetes. The general idea behind using containers (or virtual machines) is that it makes reproducibility a lot easier when moving code and applications between different computing environments: if it runs on your computer, it should run anywhere.

Containers are conceptually similar to virtual machines; if you are already familiar with virtual machines then it may help you to consider some of the key differences between containers and virtual machines:

Advantages of Containers over Virtual Machines

Containers are lightweight because they do not contain an operating system.
Containers have the same performance as code running on the host operating system (up to three times more performance than virtual machines, when running on the same hardware).
Containers typically have a startup time in milliseconds (compared to minutes for a virtual machine).
Containers require less memory (RAM) than virtual machines.
Containers are defined using code, which means you can take advantage of code control systems like Git.
Containers (in particular Docker) encourage inheritance, so that you can minimise costly re-work.

Drawbacks of Containers compared to Virtual Machines

Containers always run on the Linux operating system (containers share the host operating system), whilst virtual machines can run a different operating system for each virtual machine.
Containers use process-level isolation, which is potentially less secure than virtual machines which are fully isolated.

Trying to tackle all of containerization as a topic would take multiple semesters, however as data scientists we’re only really interested in a very small subset of the capabilities. Specifically, data scientists using containers are normally doing one of the following things:

Deploying web apps (e.g. Shiny, Flask)
Deploy Application Programming Interfaces (APIs)
Deploy automated or scheduled jobs (e.g. scheduled model scoring, ETL jobs)
Build a completely reproducible environment for running experiments

Expectations about what “best practice” means when using containers for data science are constantly evolving (not to mention the changing tooling), so in this course we’ll cover just enough of the basics that you can understand the what and why of containers, collaborate with team mates who are using containers (e.g. development teams), and get you up and running with Docker on your own computer.

11.1 Installing Docker

For the examples in this chapter we’re going to use Docker - by far the most popular containerisation tool in 2019.

Firstly, head over to the Docker Desktop page and click Download Desktop. This will send you to a sign-up page at the Docker Hub website (we’ll look at what this is a little bit later) and you’ll need to create an account in order to download the software. You may choose to create an account using your personal email - we will not be using Docker Hub for assessment in this course so you are not required to your use student email.

Once you have created your account and logged in, you should look for the Download Docker Desktop button which will take you to the download page where you can download the installer for Docker Desktop. You can ignore the rest of the tutorial on the Docker Hub website as this is mostly focused around using their online service - just open the installer from your downloads folder and follow the prompts.

11.2 Using Containers

This section will be brief by necessity - as mentioned above we could write a whole course on containers and still not cover all of the features. Accordingly this will be something of a whirlwind tour of containerisation using Docker, and will serve to give you an overview of what containers can do and how to use them.

All of the commands in this section will be run from the terminal. They should work on both MacOS and Windows machines, although they have only been tested on MacOS.

11.2.1 Basic Operations

The first thing we will do is confirm that Docker is correctly installed and operating as expected. To do this we will run the command docker run hello-world which will instruct the Docker CLI to fetch the hello-world image from the Docker Hub service and run it on my machine. If you run this command on your own computer you should see the output below which confirms that Docker is installed and working correctly.

$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
1b930d010525: Pull complete
Digest: sha256:0e11c388b664df8a27a901dce21eb89f11d8292f7fca1b3e3c4321bf7897bffe
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

As mentioned above in the message that was printed to the console, Docker has performed four tasks:

The Docker client contacted the Docker daemon.
The Docker daemon pulled the “hello-world” image from the Docker Hub.
The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
The Docker daemon streamed that output to the Docker client, which sent it to your terminal.

There are two key terms used here:

image - a static set of binary files which store all of the information required to launch a container
container - an isolated operating system using containerisation (in this case via Docker) to run on a host operating system (in this case MacOS)

Importantly, a container is a running instance of an image.

Now that we’ve confirmed that Docker is working correctly, we’ll do something a little less trivial. Let’s use Docker to run the Ubuntu operating system as a container inside MacOS. More specifically, we’ll run bash inside the container, and we’ll use the uname utility to print the name of the operating system to the screen so that we can confirm we are in fact using an Ubuntu operating system inside the container.

Firstly, we’ll confirm that we’re currently using MacOS:

$ uname -a
Darwin Perrys-MacBook-Pro.local 18.6.0 Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64 x86_64

Then we’ll use Docker to run the ubuntu image in the same way as we ran the hello-world image above. This time, we’ll also add two command line arguments:

--interactive - this will allow us to interactively type commands into the container
--tty - this allocates a pseudo-TTY, which at the user-level just means that it will let the container print it’s output to the screen.

So essentially these two arguments are going to let us type commands into the container, and see the printed output from a container. We’ll also add one final argument to the end of our command: bash. This how we tell Docker what command we want to run inside the container - in this case we’re saying we want to run the bash shell inside the Ubuntu container.

$ docker run --interactive --tty ubuntu bash
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
6abc03819f3e: Pull complete
05731e63f211: Pull complete
0bd67c50d6be: Pull complete
Digest: sha256:f08638ec7ddc90065187e7eabdfac3c96e5ff0f6b2f1762cf31a4f49b53000a5
Status: Downloaded newer image for ubuntu:latest
root@6f0684d58dca:/#

This is very cool - now we have an Ubuntu command prompt inside a MacOS computer! And instead of taking 15 minutes to launch like a virtual machine, this only took a few seconds to download and then a few milliseconds to launch.

Let’s run the uname -a command again to prove that this is indeed an Ubuntu container:

$ uname -a
Linux 6f0684d58dca 4.9.125-linuxkit #1 SMP Fri Sep 7 08:20:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

We can also run other commands that only work in Ubuntu. For example, we can install R inside our container by using the apt-get utility. It’s okay if you don’t understand what these commands are doing, the key point is that these commands only work because we’re using Ubuntu; they do not work on MacOS

$ apt-get update
<output truncated>
$ apt-get install -y r-base
<output truncated>

To quit the Ubuntu container and return to MacOS, you can simply type exit. In addition to returning you to the host operating system, this will also suspend the running container. You can see the container is still cached on your system by using the docker ps -a command (which also shows that the hello-world container is still there as well).

$ docker ps -a
CONTAINER ID        IMAGE         COMMAND      CREATED          STATUS                      PORTS    NAMES
6f0684d58dca        ubuntu        "/bin/bash"  22 minutes ago   Exited (0) 14 seconds ago            confident_brown
4f8db72c3e9b        hello-world   "/hello"     34 minutes ago   Exited (0) 34 minutes ago            frosty_agnesi

This is interesting, but ultimately a waste of storage, so we can remove them using the docker rm command. You can refer to containers using their CONTAINER ID (e.g. 6f0684d58dca) or their NAMES which have been randomly generated by Docker to make them easier to type (e.g. confident_brown).

$ docker rm 6f0684d58dca
6f0684d58dca
$ docker rm frosty_agnesi
frosty_agnesi

We’ve now removed all of the containers from the machine, but luckily we’ve still got the images. We can see this using the docker image ls command, which lists all of the images stored on the machine.

$ docker image ls
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
ubuntu              latest              7698f282e524        3 weeks ago         69.9MB
hello-world         latest              fce289e99eb9        5 months ago        1.84kB

This is useful because it means next time we launch a container using one of these images, we won’t need to download it from the internet so it should be lightning fast to launch. If you run docker run --interactive --tty ubuntu bash again you’ll see that it launches in less than a second - one of the key advantages of containers over virtual machines.

11.2.2 Versioned Containers

Let’s look at an example that’s a bit more relevant to data science. Specifically we’ll use the rocker/tidyverse image, which is maintained by a small group of volunteers from the R community. We can see how this container works by running the following command: (keep in mind this image is over 700MB so it might take a few minutes to download)

$ docker run --interactive --tty rocker/tidyverse R
Unable to find image 'rocker/tidyverse:latest' locally
latest: Pulling from rocker/tidyverse
c5e155d5a1d1: Pull complete
17ea7c887607: Pull complete
75748969673e: Pull complete
4c4aaee098b4: Pull complete
65d10e2c0dee: Pull complete
816e1302c851: Pull complete
c56b4acb8961: Pull complete
05234b10d70b: Pull complete
Digest: sha256:9155ce73701c3294799e60ee67302a79be0ae8362fdedf489fbbc5063999c018
Status: Downloaded newer image for rocker/tidyverse:latest

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

As we can see from the messages above, this container comes with R 3.6.0 pre-installed, which is handy, but it also comes with all of the latest versions of the Tidyverse packages installed.

> library(tidyverse)
Registered S3 methods overwritten by 'ggplot2':
  method         from
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.1
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

You can now exit the container by closing R - simply type q() to quit R.

Imagine if you were just given access to a new server (or virtual machine) and wanted to install of these packages - it normally takes about 15 minutes to install them all one-by-one. Being able to just launch a container which contains all of these packages is useful all by itself. Even more impressive is the fact that this container actually contains the entire RStudio application as well!

If you want to convince yourself of this, run the following command:

$ docker run -e PASSWORD=secretpassword -p 8787:8787 rocker/rstudio

If you now open a web browser and navigate to http://localhost:8787 you’ll see your very own RStudio Server running R 3.6.0 with all of the Tidyverse packages installed (the username is rstudio and the password is secretpassword). This is part of why Docker is becoming so popular - it makes the administration of servers easier than ever before.

To shut down the container, return to the terminal window and press CTRL+C a few times until it closes.

This is all very cool, but what about versioning? This is probably the #1 feature for data scientists, and it’s really easy to use. Consider the following command:

$ docker run --interactive --tty rocker/tidyverse:3.3.1 R

This is almost identical to what we ran above, except that this time we specified a version tag: 3.3.1. Docker lets image creators use these tags to label specific versions of images, and in this case the rocker team have used these tags to specify which R version the image refers to. As you can see when you run the command above, using the 3.3.1 tag launches container running R 3.3.1:

$ docker run --interactive --tty rocker/tidyverse:3.3.1 R
Unable to find image 'rocker/tidyverse:3.3.1' locally
3.3.1: Pulling from rocker/tidyverse
<output truncated>
Status: Downloaded newer image for rocker/tidyverse:3.3.1

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

This is really powerful, because it means that you can use these tagged images to run previous releases of R, Python, or any other tools you are using. But even more powerful is the fact that these images are snapshots from a specific point in time - they’ll never change. This means that they have package versions that were in use at the time the snapshot was taken - let’s load the Tidyverse packages again to confirm:

> library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

It’s definitely running an older version of tidyverse - so old that it doesn’t even print version numbers! We’ll check dplyr manually to confirm:

> packageVersion('dplyr')
[1] ‘0.5.0’

When we ran the latest version of rocker/tidyverse we saw it was using version 0.8.1 of dplyr - in this snapshot version we’re using 0.5.0. In this way the container effectively lets you travel back in time to when the snapshot was taken, and run all of your R code using the specific versions of packages that were available back when R 3.3.1 was released.

Note that Docker convention is to use the tag latest by default when no tag is specified - in the case of this image the latest and 3.6.0 tags are identical.

This is a powerful capability when working on a piece of analysis where reproducibility is important. If you were to start such a project today, you could run your analysis inside a container based on rocker/tidyverse:3.5.3, and you could ensure that if you ever need to run the code again in future, you could always use this image to recreate your results without having to modify any code. And because the image also contains RStudio, you don’t even need to write all of your code on the command line! You can write it and run it from a familiar RStudio environment.

Keep in mind that the package snapshot only applies to packages which are part of the image. If you install packages as part of your analysis then these packages will be pulled from the CRAN website, and in future the versions may change. There are two options to work around this issue which you can research if you need to work around this:

Create your own image and install the R packages directly into the image using a Dockerfile, push that image to Docker Hub so that it is always available in future (preferred)
Once you finish running the container, save the suspended container as a new image (not preferred as this is harder to audit than using the code-based approach above)

Be careful about where you save your code and data for your analysis. If you save them to a location inside your container, you may lose your work when you close your container. If you want to save your work from within a container, you can research using mounted volumes to connect folders on your computer with folders inside the container.

We’ve covered a lot of material very quickly here, so let’s re-visit the key points:

Docker uses tags to snapshot image versions
For R and Python-based containers, these tags are normally based on R or Python versions, or based on dates (depending on the convention established by the authors). You can see what tags are available from the Docker Hub website.
If reproducibility is important, you may choose to perform your analysis inside a versioned container
If reproducibility is really important, you can save and export a new image based on the container you used for analysis.

If you want to see how the rOpenSci Consortium recommends using Docker containers for reproducible science, you should read their tutorial. Colin Fay also has a great guide to Docker for R users.

If you’re really interested in seeing how Docker can make your life a bit easier, you might like to see how these course notes are built and deployed automatically using a Docker image as the starting point, which saved me a huge amount of development time, speeds up build times (because the packages are pre-installed), and reduced (but in my case, didn’t entirely remove) the risk of things breaking when packages get updated.

11.2.3 Applications and Services

Another powerful feature of Docker is that it can easily be configured to run webservices - this was briefly demonstrated above when we looked at how the rocker/tidyverse container contains the open source version of RStudio Server and we can configure Docker to allow access through the browser. A full explanation of how to use this functionality is beyond the scope of the course, although you may like to try using some of these containers on your own, and research how to extend them (hint: Dockerfiles) to perform useful tasks.

rocker/shiny-verse - a versioned Shiny Server image with Tidyverse packages pre-installed
trestletech/plumber - a container with the plumber package (for building APIs) pre-installed and pre-configured. This image is not using tags consistently, so be careful when using it.

11.3 Containers in the workplace

Many organisations are moving towards containerisation, however the speed and approach will be different in every organisation. One of the key trends pushing this approach is the move towards 12 Factor architecture, which you might like to research if you’re interested in software development trends.

As a result you may start to see some of the following changes in your organisation:

Reduced use of virtual machines as a way of provisioning servers to end-users
Increased use of automated pipelines for building/compiling applications, closely tied to git repositories (continuous integration)
Increased use of automated pipelines for deploying containers into production environments, closely tied to git repositories (continuous deployment)
Moving away from large “monolithic” applications towards collections of services working together
“Decomposing” large applications into smaller services, where it is easier to understand what each of the small services is doing, and easier to make changes without impacting the whole application
Move workloads (containers, virtual machines) to cloud computing providers, potentially even using “container as a service” offerings
“Horizontal scaling”, where you can quickly increase or decrease the number of containers to deal with changing scale. For example you may have a machine learning model which can deal with 5 requests per second, but the process you’re supporting has a daily peak of 48 requests per second. Because containers are so quick to launch, you can simply launch 10 containers as the peak is approaching and deal with the peak load in parallel. In order for this to work containers need to be self-contained and stateless.

As a data scientist you’re only ever going to be a small part of the container ecosystem in your workplace, but it helps to understand what that ecosystem looks like so you can recognise the terms people use. Overall these trends are a good thing for data scientists as they lower the barrier for entry when putting services (e.g. machine learning models) into production. For example, if you developed a credit scoring model which needed to be included in a real-time decision making process, you would have previously had to hand your code over to a developer to incorporate into their software. With the move towards applications built as a collection of microservices, you can simply deploy a container which exposes an API, and your application can be written in R or Python regardless of what language the rest of the system is using. Disregarding compliance considerations, this would also allow you to update your model as regularly as necessary without having any dependencies on the rest of the system - because your service is isolated within it’s own container, you can make internal changes to your code as long as you don’t change the external API. We’ll look at APIs more closely in the final chapter of this course.

11.4 Learning More

The best resource for learning about Docker is the Docker website - they have lots of educational material available, and also provide links to sponsored materials including the Play with Docker website. Also mentioned above are the rOpenSci Consortium Docker tutorial and Colin Fay’s Docker for R users.

As with any other tool, the best way to learn is through trial-and-error. Install Docker on your own computer and experiment!