10 Unix Systems
As a data scientist it’s almost impossible to avoid Unix systems. Unix vs Windows isn’t a question of preference like R vs Python; if you don’t know Unix then you’re likely to struggle in the workplace. Unix is the operating system of data science, so you need to know how to use it.
One of the early stereotypes about a career in data science was that it was a ticket to being given a MacBook for work. This is because macOS is a Unix system, built on top of Apple’s Darwin operating system. Being a Unix operating system means that macOS provides the best of both worlds for a data scientist - a clean and user-friendly operating system with a powerful and well-supported Unix system underneath.
This isn’t to say that you must use macOS in order to be a successful data scientist - it is perfectly acceptable to use a Windows computer as long as you also know how to work with Unix systems. You can also install Linux on most “Windows” computers, which gives you all the benefits of a Unix-like system, with a slight trade-off in terms of polish and ease-of-use compared to macOS.
You may encounter Unix systems as a data scientist:
- when working on the command line using macOS or Linux operating systems
- when working with remote servers in your employer’s data centre
- when working with remote servers provided by AWS, Azure, GCP, etc
- when working with shared Jupyter or RStudio servers (these are hosted on Unix servers)
Whilst not related to data science, you are probably also using Unix every time you browse the internet (the large majority of servers on the internet run a Unix-like operating system, usually Linux) and every time you use your phone (iOS is derived from the same Unix base as macOS, and Android is built on the Linux kernel).
10.1 What is Unix?
Unix is a family of operating systems that derive from the original Unix operating system developed by AT&T Bell Labs in the 1970s. Some of the more well-known Unix operating systems include macOS, Solaris, BSD, IBM AIX and HP-UX. Linux is technically a Unix clone rather than a descendant of the original Unix, which is why it is usually described as a Unix-like system; for all practical purposes you can treat Linux distributions (including Ubuntu, Red Hat, etc) the same way as Unix operating systems.
Unix systems are characterized by a modular design that is sometimes called the “Unix philosophy”. This concept entails that the operating system provides a set of simple tools that each performs a limited, well-defined function, with a unified filesystem (the Unix filesystem) as the main means of communication, and a shell scripting and command language (the Unix shell) to combine the tools to perform complex workflows. Unix distinguishes itself from its predecessors as the first portable operating system: almost the entire operating system is written in the C programming language, thus allowing Unix to reach numerous platforms.
Wikipedia: Unix
Understanding these “simple tools” is the key to working productively with Unix. For example:
- learning bash (the Bourne Again SHell) lets you write simple scripts and navigate the Unix file system
- learning ssh lets you connect to other Unix machines
- learning cat lets you print the contents of a file to the terminal
- learning less lets you interactively scroll through long documents
This chapter will focus on teaching you how to navigate the Unix filesystem, how to use these “simple tools” to build complex workflows and get things done, and where to go when you need help.
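As a small illustration of combining simple tools, the sketch below finds the most common word in a short list. The word list is made-up sample data, and the | symbol (a "pipe", which passes the output of one command to the next) is not covered elsewhere in this chapter, so treat this as a preview rather than something you need to memorise.

```sh
# sort puts identical words next to each other,
# uniq -c collapses the repeats and prefixes each word with its count,
# sort -rn orders the rows by count (largest first),
# head -n 1 keeps only the first row.
top=$(printf 'python\nr\nsql\nr\npython\nr\n' | sort | uniq -c | sort -rn | head -n 1)
echo "$top"
```

Each tool does one small job; the pipeline as a whole answers a question that none of them could answer alone.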
You might have realised by now that “using Unix” means many different things depending on the context. Technically, every time you use a MacBook you are using Unix, but learning to use Safari to browse Facebook isn’t exactly going to help you with data science projects.
Some examples of using Unix for data science include:
- using ssh to connect to a remote analytics server
- using scp to move files between machines
- using a package manager (e.g. brew, apt, yum) to install command line tools
- using command line tools to interact with third-party services
- using wget to download files from the internet
- using top to monitor system resource utilisation
- using docker to manage and run containers
Of course it is possible to perform some of these tasks using Graphical User Interfaces (GUIs) without using the Unix command line, however there are a few key advantages of using the Unix CLI:
- New data science tools are normally written for Unix, and normally require the use of a Command Line Interface (CLI)
- Most tools with GUIs only let you use the most popular functionality - power users normally require access to the CLI to get all of the features they need
- Once you get the hang of them, most of the command line tools are much faster than the alternatives
- If you start writing your own tools for others to use, it is many orders of magnitude harder to create a GUI application. Command line interfaces are much easier to create.
10.2 Practicing Unix
If you’re using a macOS computer then you already have Unix - you can simply open Terminal.app and you’ll see a new window open where you can type and execute shell commands (recent versions of macOS use zsh rather than bash by default, but the basic commands in this chapter behave the same way in both). This also applies if you’re using Linux - look for an application called Terminal to get access to the command line.
If you’re using Windows then you’ll have to do a little more work to get access to a Unix-like system for practicing.
The easiest option is to install VirtualBox which lets you install and run virtual machines on your computer. Once you have installed VirtualBox, go to the Ubuntu Server page and download Ubuntu Server 18.04.2 LTS. While that file is downloading you can get ready to start your virtual machine:
- open VirtualBox and click New to create a new Virtual machine
- give your virtual machine a name (e.g. “My Linux Machine”)
- in the drop-down box for Type select Linux
- in the drop-down box for Version select Ubuntu (64-bit)
- click Continue
- when asked about memory size, select 1024MB and click Continue
- when asked about a virtual hard disk, select create a virtual hard disk now and then click Create
- when asked about hard disk file type, select VDI (VirtualBox Disk Image) and then click Continue
- when asked about storage on physical hard disk, select dynamically allocated and then click Continue
- when asked about file location and size, accept the default values and then click Create
You will then be taken back to the main screen of VirtualBox and you will see your new virtual machine is in the “off” state. Click Start to start the virtual machine, then VirtualBox will ask you for the location of a “virtual optical disk file” - this is referring to the .iso file you downloaded from Ubuntu earlier. Navigate to the location of this downloaded file then click Start.
You will see the machine start running within a few seconds, and it will install the Ubuntu operating system for you. Eventually it will start asking you some questions about language - use the keyboard and follow the prompts to complete the installation. You can accept all of the default values until you get to this screen:
You can enter whatever you like for these fields, making a note of your username and password so that you can log in to your virtual machine. You will need to use the tab key or arrow keys to move between text fields.
On the next page you may choose to install OpenSSH server (you do not need to, but you can use this for practice too if you like). Use the arrow keys to navigate to Done then press enter.
On the next screen, use the arrow keys to navigate to Done, then press enter. The installer will then take another minute or so to complete the installation, before prompting you to reboot. Press enter to reboot the virtual machine. The restart will pause to ask you to remove the installation CD - just press enter to proceed.
When the machine reboots it will pause when it is ready - press enter to bring up the username prompt. If you named your virtual machine linux-vm then your prompt will look like this:
If you used a different virtual machine name, then that name will appear here. To log in to your system:
- Enter your username, then press enter
- Enter your password, then press enter
You will then see about 20 lines of welcome information followed by a command prompt, which should look something like this:
You’re now ready to practice using the Unix command line. We will also look at how to connect to remote Unix servers using ssh (Mac) or PuTTY (Windows) later in this chapter.
Try it out by typing echo Hello World! into the window below, and then pressing Enter.
10.3 The Unix Shell
The shell gets its name from the notion that it is a user-facing “shell around the computer’s whirring innards”. The idea is that the designers of the operating system don’t want users tinkering with the internals of the operating system, so they built a shell around it, and expect users of the system to interact with the computer using that shell.
When people say “shell” these days they are almost always referring to the Bourne Again SHell (bash) which is effectively a global standard across all of the major Unix systems. From the user point of view bash is an example of a REPL - a Read-Evaluate-Print Loop. If you haven’t used command line tools before then this is a useful term to help understand how to use all interactive command line tools, including bash, R and Python. Every time you press enter in one of these interactive tools, the following events happen:
- the shell reads your command
- the shell evaluates your command, and calculates the output
- the shell prints the output to the terminal
- the shell prints a new prompt, ready for your next command
You can practice this using the shell below. Try running a few of these commands one after another (we’ll cover what they mean in the next section):
ls
pwd
ps
You can see how the bash shell is just repeating these same steps over and over again:
read -> evaluate -> print
10.4 Basic commands
Commands can range from simple instructions (e.g. ls which lists the files in your current working directory) all the way to complex scripts with hundreds of commands. For this course we’ll only look at how to use these simple commands to help you work in Unix environments.
10.4.1 Where am I?
The Unix file system is a little easier to understand than the Windows file system. At the very root of the file system is / - you can think of this like c:\ in Windows systems. This means that all full (absolute) file paths in Unix start with / in the same way that full Windows file paths begin with a drive letter such as c:\. As a user of Unix systems you typically don’t need to worry about which disk you are using, you only need to think about your location in the file system.
Folders are also separated in Unix using the forward slash / - this is different to Windows where the backslash \ is used to separate directories in a file path.
Sometimes you will also see Unix file paths end with another / - this is the same in Windows systems where file paths will often end with \. It is not required but is often used to show that the location is a directory rather than a file. As this is optional you do not need to type it, however it is a useful convention as it makes it clear to anyone reading that you’re referring to a directory rather than a file.
These are all examples of valid directory paths on my macOS (Unix) computer:
/Users/perrystephenson/
/Users/perrystephenson/code
/usr/bin/
and these are examples of valid file paths on my macOS (Unix) computer:
/Users/perrystephenson/tweets.csv
/Users/perrystephenson/data/faces/training/s1/1.pgm
/Users/perrystephenson/code/dsp/10-Unix-Systems.Rmd (this file!)
The biggest difference between macOS and Linux systems in terms of the file system is the way the user folders are arranged. In macOS systems the user folders are all stored inside /Users/, whilst in many Linux systems (particularly Ubuntu) the user folders are stored inside /home/.
Now that you understand what the file system is, how do you know where you are? In a terminal, you can use the pwd command (print working directory), which will tell you where you are. Try typing pwd into the terminal below and then pressing enter - it should tell you that you are located at /home/runner. The working directory is /home/runner because (in these embedded examples) we’re running commands as a user called runner, and this is the home directory for the runner user.
Home Directories
A “home directory” in Unix systems is similar in concept to “My Documents” in Windows. It is a folder where you can store anything you like, and by default it is normally private and not accessible by any other users of the computer.
The home directory is used so often in Unix systems that it has its own shortcut: ~. If you type echo ~ into the shell above you will see that it prints /home/runner again because ~ is just a shortcut to that folder. You can use ~ anywhere you would normally use a file path. For example, using the ~ (tilde) shortcut to show the locations of the three example files above:
~/tweets.csv
~/data/faces/training/s1/1.pgm
~/code/dsp/10-Unix-Systems.Rmd
This has some advantages and drawbacks, besides the obvious reduction in typing. These file locations are now specific to my user account, which means that whilst they work for me they will not work for anyone else on my computer (because the ~ shortcut will point to their own folder, not mine). This can cause issues! On the other hand if I am writing an R script and want to write a temporary file to disk, I can write it to ~/temp.rds and be confident that no matter who runs the script, it will store the file in that user’s home directory rather than my own.
~ is a shortcut to your home directory.
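The short sketch below shows the shortcut in action. It relies on the standard HOME environment variable, which is not mentioned elsewhere in this chapter but normally holds the same location that ~ expands to; tweets.csv is just the example file name used above and does not need to exist.

```sh
# ~ expands to the current user's home directory,
# which is normally also stored in the HOME variable.
echo ~
echo "$HOME"

# ~ can be used at the start of a longer path.
echo ~/tweets.csv

# Inside quotes ~ is NOT expanded, so this prints the characters as typed.
echo "~/tweets.csv"
```

The last line is a common source of confusion in scripts: if a path starting with ~ is wrapped in quotes, the shell leaves it alone.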
10.4.2 What is here?
One of the first things you might want to do is look around and see what is in your folder. To do this you’ll use the ls command (list directory contents). Try running the ls command in the shell below.
You should see three files listed: file_2.R, file_3.py and main.sh.
If you want to see more detail about these files, you can use ls -l to print the output in “long” format. Try this again using the shell above.
You can now see lots more information about the files in this directory, including:
- File permissions (beyond the scope of this course)
- File ownership (beyond the scope of this course)
- File size in bytes
- Last modified date and time
- File name
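If you are practicing on your own machine rather than in the embedded shell, the following sketch recreates a similar situation in a throwaway directory. The directory comes from mktemp -d (a standard tool for creating temporary directories, not covered in this chapter) and the file contents are made up.

```sh
# Work in a throwaway directory so the example does not touch real files.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create two small files to look at (contents are placeholders).
echo 'print("hello")' > file_3.py
echo 'echo hello' > main.sh

ls       # names only
ls -l    # long format: permissions, owner, size in bytes, modified time, name
```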
10.4.3 Arguments
When you type ls -l, you are providing an argument to the ls command. In general, when you execute a command in Unix systems, everything you type after the name of the command is an argument and you can provide more than one if needed.
Arguments allow you to:
- provide instructions to the command about how you want it to work
- provide input data to the command
Most Unix commands have more arguments than anyone could possibly remember, so for this course we’ll only learn about the useful arguments as they are needed.
There is one argument you should remember because it works for almost every Unix command: --help. This argument prints the help documentation for the command. You can try this out in the shell below using ls --help, which will print the help documentation for ls to the terminal (you will need to scroll up to read it all). Some commands, including the built-in macOS version of ls, do not support --help; for these you can read the manual page instead using man ls.
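To make the two roles of arguments concrete, here is a minimal sketch using ls. It introduces one argument not covered above, -a (show hidden files, whose names start with a dot), and uses made-up file names in a temporary directory.

```sh
demo_dir=$(mktemp -d)
touch "$demo_dir/visible.txt" "$demo_dir/.hidden"

ls "$demo_dir"         # input data: which directory to list
ls -a "$demo_dir"      # an instruction: also show hidden files
ls -l -a "$demo_dir"   # several arguments can be given at once
ls -la "$demo_dir"     # single-letter arguments can usually be combined
```

The last two commands produce the same listing; combining letters is just a typing shortcut.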
10.4.4 Moving around
So far we’ve learned how to see where we are (pwd) and look at which files are in the directory (ls). The next thing we need to learn to do is move around - this is done using the cd command (change directory).
If you use ls in the shell below, you’ll see that there are two directories - dir1 and dir2. If you use pwd you’ll also see that we’re in the same directory as before - /home/runner.
To move into the dir1 directory you just need to type cd dir1 and press enter. If you then type pwd again you’ll see that you’re now in /home/runner/dir1 - success! You can use ls to look around and see that there are two files in this folder: file1 and file2.
To move back “up” one level to the /home/runner directory, you can use one of two approaches:
- use cd /home/runner to change to the directory using the full path
- use cd .. to move “up” one directory using the .. shortcut
There is an even easier shortcut for going back to your home directory:
- use cd with no arguments to go straight back to your home directory.
For practice:
- use pwd to confirm that you are back in /home/runner
- use cd to navigate inside dir2
- use ls to list the files inside that directory
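The movements described in this section can be sketched as follows. The practice directories are created in a temporary location (using mktemp -d and mkdir, the latter covered later in this chapter), so the paths printed on your machine will differ from /home/runner.

```sh
# Build a small practice tree: a base directory containing dir1 and dir2.
base=$(mktemp -d)
mkdir "$base/dir1" "$base/dir2"

cd "$base/dir1"     # move using a full path
cd ..               # move up one level
parent=$(pwd)       # remember where we are now (the base directory)
cd dir2             # move down using a relative path
child=$(pwd)        # this is the base directory followed by /dir2
echo "$parent"
echo "$child"
cd                  # no arguments: straight back to the home directory
pwd
```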
10.4.5 File manipulation
Now that we can move around the file system, we need to learn to make changes to files using the bash shell. We’re going to use three commands:
- cp (copy)
- mv (move and rename)
- rm (remove)
We will look at each of these commands in detail.
10.4.5.1 Copy (cp)
The cp command copies a file from one location to another. The syntax for this command is cp [from] [to].
Using the shell below, you can use the following commands to copy two files from inside dir1 into the home directory:
cp /home/runner/dir1/file1 /home/runner/
cp /home/runner/dir1/file2 /home/runner/
You can now use ls to confirm that these two copy operations worked correctly and that copies of the files are now in your home directory.
This is the safest way to copy, however there are some shortcuts that you can use to make it much faster to type:
- Instead of using full paths in the from argument, you can use relative paths. Assuming that your working directory is /home/runner you can use dir1/file1 instead of typing out the whole path.
- Instead of typing out the current directory in the to argument, you can use the . shortcut. A single dot is a shortcut to your current working directory, so in this case you can replace /home/runner with just a single dot (.).
Using both of these together, you could write the following:
cp dir1/file1 .
cp dir1/file2 .
You can also use cp to copy whole directories. Because this requires cp to scan the whole directory (and any subdirectories) and then copy each and every file, you need to use the -r argument to tell cp to copy recursively. For example, to copy dir2 inside dir1, you could use cp -r dir2 dir1 (and then use cd and ls to check that it worked).
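If you want to try the copy commands on your own machine, the sketch below sets up a scratch directory that stands in for /home/runner. The file contents and the file name notes are made up for this example.

```sh
# A scratch directory standing in for /home/runner.
home_dir=$(mktemp -d)
cd "$home_dir"
mkdir dir1 dir2
echo 'one' > dir1/file1
echo 'two' > dir2/notes

cp dir1/file1 .     # relative path as the source, "." (here) as the destination
cp -r dir2 dir1     # copy the whole of dir2 (and its contents) inside dir1

ls . dir1 dir1/dir2 # check the results
```

Note that the originals are still in place afterwards: dir1/file1 and dir2 both still exist alongside their copies.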
10.4.5.2 Move (mv)
The mv command moves a file from one location to another. You can think of it as being just like the cp command, except that it deletes the original file/directory. You can try this out using the shell below - use the command mv dir2 dir1 to move dir2 inside dir1. Use ls and cd to confirm that dir2 has been removed from your home directory, and is now inside dir1.
Note that you do not need to use the -r argument when moving directories.
10.4.5.3 Rename (mv)
You can also use the mv command to rename files and directories. You can think of this like “moving to a new name”, because you can move and rename at the same time. Use the shell below to try using mv to rename:
- to rename dir2 to dir3, use mv dir2 dir3
- to rename main.sh to new.sh and move it inside dir1, use mv main.sh dir1/new.sh
Use ls and cd to move around and confirm that the renaming and moving have worked as expected.
Note that you can also rename files and directories during cp operations, using the same syntax.
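One detail worth seeing in practice is how mv decides between moving and renaming: if the destination is a directory that already exists, the source is moved inside it; otherwise the source is given the destination as its new name. A minimal sketch in a scratch directory:

```sh
work=$(mktemp -d)
cd "$work"
mkdir dir1 dir2
touch main.sh

mv dir2 dir3             # dir3 does not exist yet, so this is a rename
mv main.sh dir1/new.sh   # move into dir1 and rename in one step
mv dir3 dir1             # dir1 already exists, so dir3 is moved inside it

ls dir1                  # lists dir3 and new.sh
```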
10.4.5.4 Remove (rm)
The rm command removes files and directories.
- To remove a file, use rm [filename]
- To remove a directory, use rm -rf [directory]
The -rf argument when removing a directory combines two options. The r is the same recursive option used with the cp command. The f means force: it tells rm to delete everything without asking for confirmation. Deleting a whole directory is a potentially dangerous operation, and by adding f you are telling rm that you’re really sure about what you are doing.
Use the shell below to:
- remove main.sh (rm main.sh)
- remove dir2 (rm -rf dir2)
You can then use ls to confirm that both the file and the directory have been removed.
Be careful when using rm
Unlike Windows Explorer and macOS Finder, there is no Recycle Bin when using the shell. Once you delete a file or a directory you will never be able to get it back.
This is especially dangerous because it would be very easy to make a typo and run rm -rf /, which would delete every file on your computer. Make sure you never do this!
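One cautious habit, sketched below, is to list what you are about to delete before deleting it, and to leave out the f unless you actually need it. This is a suggestion rather than a rule, and the example runs in a scratch directory so nothing important is at risk. The -R argument to ls (list recursively) and the -i argument to rm (ask before each removal) are not covered above.

```sh
scratch=$(mktemp -d)
cd "$scratch"
mkdir dir2
touch main.sh dir2/file1

ls -R dir2      # look first: list everything that is about to be removed
rm main.sh      # remove a single file
rm -r dir2      # remove a directory and its contents

# When working interactively, -i makes rm ask before each removal:
#   rm -ri dir2
```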
10.4.6 Directories
The mkdir command lets you make a directory inside your current working directory. To create a new directory newdir inside the dir1 directory, you can use cd dir1 to change your working directory to dir1, then use mkdir newdir to create the new directory. Use ls to confirm that the command was successful.
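The sketch below repeats these steps in a scratch directory and shows two variations that are not covered above: giving mkdir a path instead of changing directory first, and the -p argument, which also creates any missing parent directories. The directory names other than dir1 and newdir are made up.

```sh
scratch=$(mktemp -d)
cd "$scratch"
mkdir dir1

cd dir1
mkdir newdir              # create newdir inside the current directory
cd ..

mkdir dir1/another        # or give a path instead of using cd first
mkdir -p data/raw/2017    # -p creates data and data/raw along the way

ls dir1
```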
10.5 Working with files
Now you know how to manipulate files in the Unix filesystem, but what good is this if you can’t read or edit files? This comes up all the time in the workplace:
- Working on a remote server (via SSH, which we will cover below) and you want to see the code in a script
- Working on a remote server and you want to make a small edit to a script
- Inspecting the output of a script and you want to check the first few lines in a very long CSV
- Checking the contents of a configuration file
10.5.1 Printing a file
The cat command lets you concatenate and print files to the screen. The syntax is cat [file1] [file2] [file3] ... and the command will print one file after another. Of course the concatenation feature is rarely used, and the most common usage of cat is printing a single file to the screen.
Using the shell below, use cat .gitignore to print the contents of .gitignore to the screen.
You should see that the file contains a list of files and folders to be excluded from Git commits.
10.5.2 Viewing a long file
You will regularly need to read files which are far longer than you can comfortably view in a terminal window. For these files you can use more or less.
The more command lets you view the first screen-full of content, and then lets you scroll through the document using space (which jumps a whole screen ahead) or enter (which jumps a line at a time). To use the more command to view a very long file in the shell below, type more LICENSE.md.
When you’ve finished viewing the document, press q to quit and return to the command prompt (or simply press space until you get to the end of the file).
The less command provides additional functionality, because it also lets you scroll backwards through a document. It’s also a bit easier to use because it lets you use the arrow keys to navigate the document. less is not installed on the embedded shell above, however you will find it on almost every system you encounter in real life.
If you just want to look at a few lines of a document, you can use the head or tail commands which let you view the top or bottom of a document quickly.
- head rstats_tweets_2017.csv will print the first 10 lines of the file
- head -1 rstats_tweets_2017.csv will print the first line of the file
- tail -1 rstats_tweets_2017.csv will print the last line of the file
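To see head and tail without needing a real data file, the sketch below generates a made-up 100-line file using seq (a standard tool that prints a sequence of numbers). It uses -n 1, which is the longer and more widely recommended spelling of -1.

```sh
scratch=$(mktemp -d)
cd "$scratch"
seq 1 100 > numbers.txt    # sample file: the numbers 1 to 100, one per line

head numbers.txt           # first 10 lines (the default)
head -n 1 numbers.txt      # first line only
tail -n 1 numbers.txt      # last line only
tail -n 3 numbers.txt      # last three lines
```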
10.5.3 Editing a file
Editing files in the shell can be fairly frustrating, and most people generally try to avoid it when possible. If you do find yourself with a need to edit a file directly in the shell (for example when tinkering with something on a remote server) you can normally use one of the following command line tools:
- nano - easiest to use, but less powerful than vim. Not always installed.
- vim - very powerful, but hard to use. Installed by default on most systems.
- vi - prehistoric precursor to vim. Only use this if nano and vim are unavailable on your system.
In all cases, opening a text editor is as simple as:
nano [filename]
vim [filename]
vi [filename]
Saving and closing is fairly straight-forward in nano (the keyboard commands are displayed on screen at all times) and basically impossible to remember in vim/vi. If you are going to need to edit files in the shell regularly it is probably worth learning at least some of the basic commands in vim, which can be done by using the vimtutor program installed alongside vim.
We’ve surpassed the capabilities of the embedded bash shell so you will not be able to try using any of these tools without crashing the shell. You can try using them on your own machine - nano and vim are both installed on macOS systems by default, and should be installed on most Linux distributions if you’re running Linux using a virtual machine on Windows.
10.5.4 Creating a file
As with editing, most of the time you will want to avoid using the command line to create new files. However I will show you two ways to do this so that you know what to search on Google if you ever need to do this.
# Creating a new empty file
touch my_new_empty_file
# Creating a file with a small amount of content
echo 'This is my content!' > my_new_file
You can run both of these commands in the shell below, and then use ls and cat to inspect the files you have created.
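One thing to be aware of is that > replaces the contents of a file if it already exists. The sketch below repeats the two commands in a scratch directory and adds >>, which is not covered above and appends to the end of the file instead of overwriting it.

```sh
scratch=$(mktemp -d)
cd "$scratch"

touch my_new_empty_file                    # creates an empty file
echo 'This is my content!' > my_new_file   # creates the file (or overwrites it)
echo 'A second line' >> my_new_file        # appends rather than overwriting

ls
cat my_new_file
```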
10.6 Connecting to servers
One of the most powerful features of the Unix command line is the ability to easily and seamlessly connect to remote servers; this is achieved using a tool called Secure Shell (SSH). SSH is a very powerful tool and it probably requires a whole course all by itself just to cover the most common features. In general however, the basic workflow for SSH looks something like this:
- Use ssh <server address> to connect to the server
- Enter your username and password interactively when prompted
- You will be connected to the remote server, and probably see a welcome message
- You can use the bash shell to interact with the server, including launching applications (like ls, pwd, less and others that we have learned about), launching scripts in R or Python (we’ll cover this in the next section), and just about anything else you can do on the command line on your own machine
- When you’re finished, use exit to close the connection and return to your own machine.
Many organisations will have some additional security controls in place when using SSH. Common security controls include:
- Banning the use of usernames and passwords to log in (you will need to use a set of cryptographic keys to connect)
- Using a “jump host” to let users connect to servers in secure environments - in this case you just need to connect to the jump host using SSH and then use SSH again to connect to the server you need access to.
Due to these differences in how most organisations use SSH we won’t be practicing how to use it as part of this course. If you do need to use it in the workplace, most organisations will have instructions on how to connect to servers using SSH.
10.7 Working with R and Python
One common task that you will likely want to perform over and over again when using the command line is running R and Python scripts, or interactively using the R and Python REPLs.
10.7.1 R scripts
R scripts are executed using the Rscript program, which is installed whenever you install R. To run a script called my_script.R from the command line, you just need to type the following:
Rscript my_script.R
This will run the entire R program from start to finish, and will print any outputs generated (normally using print() or message() statements) to the terminal.
10.7.2 R REPL
You can run R interactively by just typing R in the shell - you can quit the R application by running q() in the R REPL. You will find that running R in this way is pretty unpleasant (no RStudio!) however it can be really useful for troubleshooting R code on remote machines. Using R interactively can help you identify missing packages, environmental differences, or other things which could be causing unintended operation of your scripts.
10.7.3 R Shiny
If you want to run R Shiny in such a way that colleagues can see your work, then you will probably want to install Shiny Server on a Linux machine. This is beyond the scope of the course, however you can find the installation instructions here.
10.7.4 Python scripts
Running a Python script from the command line is just like running an R script, except that you use the main python3 application rather than a specific “script” version. To run my_script.py from the command line, you just need to type the following:
python3 my_script.py
This will run the entire Python program from start to finish, and will print any outputs generated to the terminal.
10.7.5 Python REPL
Whilst you can run Python interactively (by just typing python3) you will probably prefer using IPython. Assuming you have already installed IPython (if you installed Python using Anaconda then you likely already have IPython installed), you can simply type ipython to enter the IPython 3 REPL. You can quit the application by typing quit.
You can also launch the Jupyter Console (which is essentially the same thing, but launched using Jupyter) by typing jupyter console at the command line.
10.7.6 Python packages
Unlike R packages, Python packages are installed from outside Python, using an application called pip from the shell.
To install packages using pip, you just need to run pip install <packagename>. For example, to install tensorflow, you would type pip install tensorflow.
10.8 Further reading
There is so much to learn about Unix that you’ll likely spend your whole career learning about handy tools and neat ways of solving problems. For anything that comes up in the process of using Unix you’re probably best served by using Google to search for what you’re trying to do, and probably ending up looking at Stack Overflow.
For more structured learning materials you can learn more about Unix and Bash by reading The Unix Workbench by Sean Kross. Julia Evans (@b0rk) also publishes handy bite-size comics (Wizard Zines) which help explain complex Unix tools and concepts. Learn Python the Hard Way (one of the related materials from the Python Module) also includes a Command Line Primer which might be helpful.