3 Git

The name “Git” strikes fear into the hearts of just about everyone when they hear about it for the first time, but it doesn’t need to be this way! When you understand why Git exists (and what problems it solves for teams who use it) you’ll wonder how you ever collaborated without it.

Let’s start from the basics, and build up a case for why Git exists. We’ll keep track of features that we need to collaborate effectively, as well as features that would be handy.

Feel free to forget about code for now, and think about any document you work with regularly. This could be a text document, a spreadsheet, etc.


Imagine you have just started working on a document on your own computer. You can edit and save as many times as you like, which is fine because whilst you’re working on the document you know that you’re the only one who has access to make changes, and you can make sure no one else is editing it at the same time.

But there are still problems that pop up when you’re working on a document by yourself - what happens when you realise you made a mistake and deleted something accidentally? Obviously you have undo, but let’s be honest it’s not that reliable and it would be nice to have something better. It would be nice to have a detailed history for every document you are working on, and the ability to undo changes you made in the past without having to keep track of everything else you’ve changed since then.

Feature Request #1:

Easily undo changes to your files.

Other features that would be nice:

  • Undo changes you made in the past, and then re-apply all the changes you have made since then.
  • See a list of all of the changes you have made over time
  • See how your document now is different to the same document at a specific point in the past

What if you’re about to make some big changes to your file, and you want to save a copy so you can come back later if you stuff it all up? You save it as my_file_v1.txt and then start working on my_file_new.txt. You get to the point where you’re happy with it, save a new copy as my_file_final.txt, and send it to your boss. Your boss sends back my_file_final_boss_edits.txt and you incorporate those changes into my_file_final_v1.1.txt. The same thing goes back and forth for a while, with a new file name every time.

You’re probably thinking about collaborative editing tools at this point - things like Google Docs, Office 365 or even custom-built collaboration suites like Atlassian Confluence. And those tools undoubtedly solve some of these problems by keeping track of changes over time, and discouraging people from making copies of documents. However they still don’t have a way of saving a “version” of the document before major changes, which means you still see documents being saved on Google Sheets as “V1”, “V2” etc. These tools also typically don’t work very well for programming languages. They’re certainly an improvement over emailing, but wouldn’t it be good if there was a better way?

Feature Request #2:

Safely make major changes to your files without having to save them with funny file names.

Other features that would be nice:

  • Make many big changes at the same time, but not all in the same version of the file
  • Let your team make changes to your files, but without all the funny file names
  • Easy way to see the changes your boss made to your file (must be easier than Microsoft’s Track Changes…)

One of the nice features from online collaborative editors like Google Docs, Office 365 or Confluence is that two people can work on the same document at the same time. This helps teams work more efficiently because they’re not blocked while waiting for their teammates to finish working on a file. It would be great if this “Git” thing also allowed people to work on the same document at the same time.

Feature Request #3:

Multiple editors working on the same document at the same time.

Other features that would be nice:

  • Manual override for merging when two people have edited the same section of the document

Another thing you’ve probably experienced with collaboration through documents is that sometimes people make changes to things and you may:

  1. not realise that the change has been made, and
  2. not realise that the change was incorrect

It would be super handy if (once you start collaborating) all changes to the document go through some sort of quality review process.

Feature Request #4:

Flexible ways to review quality before applying changes.

Other features that would be nice:

  • As the document owner, I want to have the final say on all changes
  • During the early stages of document creation I don’t want to worry about the process

With a big team and competing demands for attention, it’s easy to come back to a project after a while and forget where everyone was up to. Which version of the document was the “master” document? Which version has the latest changes? What were you even working on?

Feature Request #5:

Standardised naming conventions to identify versions of documents.

Other features that would be nice:

  • Globally recognised naming conventions.

I think we’ve collected enough features to build the case that we need something to make it easier to collaborate, especially when working with code. If we can find a tool that helps us do all of these things, it would be pretty handy to know about!

3.1 Introducing Git

Git is a tool which gives you all of the features mentioned above. Most of these features are features of Git itself, but when combined with a remote repository like GitHub, Bitbucket or GitLab, it gives you all of the features above and more. We’ll start by looking at Git for now, then add in the remote repositories as we go.

On its own website, Git describes itself as:

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. Git is easy to learn and has a tiny footprint with lightning fast performance.

The main way people use the tool is via a Command Line Interface (CLI). This can be a bit of a pain if you don’t use a CLI regularly, but once you get the hang of it, it lets you work faster and allows you to automate things.

If you don’t use Git often and aren’t comfortable with using a CLI, you might like to look at GitHub Desktop or Atlassian Sourcetree, which both let you use a Graphical User Interface to work with Git on your own computer. Many popular Integrated Development Environments (IDEs) including RStudio, PyCharm and Atom also include basic Git GUIs.

Why not just use a GUI all the time?

Even if you plan to use the GUIs all the time, there are some really good reasons to learn the Git CLI when you’re getting started:

  • It is the same on every system, so you can work on any computer.
  • It has the widest range of features - the GUIs only have a subset.
  • It can be used when you’re connected to a remote server using SSH.
  • Sometimes you won’t have a choice.
  • It helps you understand what all of the other tools are doing.

3.2 First Steps with Git

You will definitely want to install Git on your own computer - you will find downloads for Mac, Windows and Linux here. This will install the Git Command Line Interface (CLI) which is what most people are referring to when they say Git.

An alternative (and probably easier) way to install Git is to install Sourcetree. Sourcetree sets up Git for you during the installation process, and it also gives you access to a popular GUI to give you an alternative to the CLI when things get confusing. We’ll be using Sourcetree later in the course, so you might as well install it now and let it set up Git for you at the same time.

If you decide to use Sourcetree you will need to create an account on Bitbucket. You should sign up with your UTS student email so that you automatically get given an Academic License.

You may be prompted to also install Mercurial - this is another (much less popular) tool and you do not need to install it.

When prompted whether you want to register using Bitbucket Server or Bitbucket, choose Bitbucket.

Once installed, you should open a terminal so we can run the configuration commands.

In this case, I will use Terminal to mean Terminal.app in MacOS, and the git bash executable in Windows. In both cases this will open a window with a CLI where you can run git commands.

This terminal window is actually running a piece of software called bash, which is the most popular Unix shell. A Unix shell is a command-line interpreter that provides a command line user interface for Unix-like operating systems. On macOS you’re running a shell that was shipped as part of the operating system (recent versions of macOS use a similar shell called zsh by default, which behaves the same way for the commands in this guide), but on Windows you’re running a stripped-down version of bash that is installed just to let you run git. For the purpose of everything covered in this guide, the commands for Windows and Mac should be identical.

We’ll cover bash in much more detail in a later section of the course.

Run each of the two commands below in your terminal (one at a time) to finish setting up Git. Replace John Doe with your own name (make sure to include the quotes), and replace the email address with your own email address. These identifiers are used to record who made each change to each file, so for the purpose of this course you will want to set user.name to the name other people call you at Uni, and user.email should be your UTS student email.

$ git config --global user.name "John Doe"
$ git config --global user.email John.Doe@student.uts.edu.au

Alternatively, Sourcetree will do this for you when you first open it.
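If you want to confirm that the settings were saved, you can ask Git to read them back. The sketch below is only an illustration: it points HOME at a throwaway folder (my assumption, so that the demo doesn’t touch your real settings), sets the two values, and then prints them. On your own computer you would only need the read-back commands.

```shell
# Assumption for this demo: use a throwaway HOME so real settings are untouched
export HOME="$(mktemp -d)"
unset XDG_CONFIG_HOME

# The same two commands as above
git config --global user.name "John Doe"
git config --global user.email John.Doe@student.uts.edu.au

# Read each value back
git config --global --get user.name
git config --global --get user.email

# Or list everything stored in the global configuration
git config --global --list
```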

A note on conventions

In code examples, I will use $ to mark the start of terminal inputs. The printed outputs (if any) will be shown below, and will not start with $. For example:

$ echo Hello World
Hello World

If you are following along with the examples, you should type the commands which begin with $ and you should expect the output to look similar to the lines without $.

You should also make sure that you have an easy-to-use text editor - if you don’t have one already then I suggest downloading Atom which is an open source text editor developed by the GitHub team.

3.2.1 Creating a Repository

Let’s start with a practical Git example using a single file - you’ve started working on an important document and you’ve decided you’d like to start tracking it using Git. Using any text editor (Atom, if you like), create a new file called ‘my_poem.txt’ and save it in a new folder called ‘poetry’.

Roses are red
Violets are blue
Try to love your data
It works hard for you

Your new directory and file should look something like this:

poetry
└── my_poem.txt

You will now need to navigate to this folder in bash. You will need to use the cd command to change directory until you have moved into the poetry directory.

You can use cd in a few ways to move around your files.

  • cd .. moves you “up” one directory
  • cd <folder> moves you “down” into the folder specified
  • cd takes you back to your “home directory”
  • cd /path/to/your/directory/ takes you straight to that location
  • ls shows you the contents of the directory you’re currently in (and ls -lh gives you heaps more information about the files if you need it)
  • You might also find it useful to use pwd (print working directory), which will print your current location in the file system.
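To see these commands working together, here is a small sketch written as a script (so there are no $ prompts). It assumes a throwaway starting folder created with mktemp, builds the poetry folder and my_poem.txt from the terminal instead of a text editor, and then moves around with cd, pwd and ls.

```shell
cd "$(mktemp -d)"        # throwaway starting point for this demo (assumption)
mkdir poetry

# Write the four-line poem into poetry/my_poem.txt
printf '%s\n' \
  "Roses are red" \
  "Violets are blue" \
  "Try to love your data" \
  "It works hard for you" > poetry/my_poem.txt

cd poetry                # move "down" into the folder
pwd                      # print where we are now
ls -lh                   # list the contents, with sizes and dates
cd ..                    # move back "up" one directory
ls                       # the poetry folder is listed here
```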

For Windows users, note that you’ll want to use forward slashes (/) instead of back slashes (\) when typing directory paths, and when working in git bash you’ll need to refer to the C drive as /c/ instead of c:/.

For more information about how to navigate using bash, check out this guide from Digital Ocean.

Now that we’re in the right directory, we’re going to initialise a Git repository in this folder. Git repositories are just normal folders but with some special hidden files inside, so the process of initialising a repository is just telling Git to add those hidden files for you. To initialise a repository, just type git init. In my terminal, this is what happens:

$ git init
Initialized empty Git repository in /Users/perrystephenson/code/dsp_tmp/poetry/.git/

This operation has created a whole list of hidden files inside my folder to keep track of all of my changes. You don’t need to understand what any of these files do, but to demonstrate that it’s not all magic, this is what is created:

poetry
├── .git
│   ├── HEAD
│   ├── config
│   ├── description
│   ├── hooks
│   │   ├── applypatch-msg.sample
│   │   ├── commit-msg.sample
│   │   ├── fsmonitor-watchman.sample
│   │   ├── post-update.sample
│   │   ├── pre-applypatch.sample
│   │   ├── pre-commit.sample
│   │   ├── pre-push.sample
│   │   ├── pre-rebase.sample
│   │   ├── pre-receive.sample
│   │   ├── prepare-commit-msg.sample
│   │   └── update.sample
│   ├── info
│   │   └── exclude
│   ├── objects
│   │   ├── info
│   │   └── pack
│   └── refs
│       ├── heads
│       └── tags
└── my_poem.txt

You can take a look at the files in your own folder using a combination of ls -la (list all files with details) and cd to navigate down into the .git folder.

Again, you don’t need to know about any of these files, but it is handy to know that everything is stored inside the folder - if you need to move the folder around then everything should still work.
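You can test the “move the folder around” claim for yourself. This is a minimal sketch, assuming a throwaway location and a made-up new folder name (poems): it initialises a repository, renames the folder, and then asks Git whether it still recognises it.

```shell
cd "$(mktemp -d)"                     # throwaway location (assumption)
mkdir poetry
cd poetry
git init -q                           # -q just hides the "Initialized..." message
ls -a                                 # the hidden .git folder is listed

cd ..
mv poetry poems                       # move the whole folder; "poems" is a made-up name
cd poems
git rev-parse --is-inside-work-tree   # prints "true": Git still recognises the repository
```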

If you have followed along and the git init command worked, then congratulations! You have correctly installed Git and learned your first Git command.

3.2.2 Adding Files

Let’s take a quick look at the git log. This is where we can see all of the changes that have been made to your repository.

$ git log
fatal: your current branch 'master' does not have any commits yet

Oh no! Git can be a little dramatic, but fatal just means it’s encountered an error and cannot proceed with the instruction. But it does tell me exactly what I need to do: I need to commit some changes to the “master” branch. We can ignore the master bit for now, and focus on committing some changes.

As far as Git is concerned, my repository is empty, and by adding my poem I will be making a commit, which is basically another word for “change”. In order to make this commit, I need to do two things:

  • stage the changes (i.e. tell Git about what we have changed)
  • commit the staged changes, and provide a commit message

We’ll start with staging changes. We can check what changes are currently staged by using the git status command.

$ git status 
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    my_poem.txt

nothing added to commit but untracked files present (use "git add" to track)

Git is telling me that we’re on the master branch (which we’ll ignore for now), we have made no commits, the my_poem.txt file is currently “untracked”, and we can use the git add command to track it. Let’s do that!

We’ll use the command git add my_poem.txt to tell Git that we’ve made some changes to a file, and we’d like Git to start tracking them.

$ git add my_poem.txt

No response! If Git doesn’t throw any errors, that usually means it worked. We can check that it worked by using git status again:

$ git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

    new file:   my_poem.txt

Now Git is telling us that we have some “changes to be committed”, and listing my_poem.txt as a “new file”. This is exactly what we expected, so let’s make a commit! To commit the changes, we’ll use the git commit command:

git commit -m "Initial Commit"

The -m "Initial Commit" is the “commit message”, and is basically a description of the changes we made. We could have equally said “Adding a poem” or “My first poem” or whatever we like - as long as it describes the changes to you and your team. The -m instruction is simply telling Git that we’re going to provide the commit message straight away (there are other ways to provide commit messages which we’ll learn about later). Let’s see what happens when I make the commit:

$ git commit -m "Initial Commit"
[master (root-commit) 936fa7d] Initial Commit
 1 file changed, 4 insertions(+)
 create mode 100644 my_poem.txt

This is telling us that we’ve made a commit to the master branch, the commit message is Initial Commit, we’ve added 4 lines (insertions) in 1 file, and a bunch of other information that isn’t important right now. We can take a look at git status again to see what has changed.

$ git status
On branch master
nothing to commit, working tree clean

The “working tree” is basically just the folder we’re working in, and Git is telling us that there are no uncommitted changes. Let’s take a look at the git log again to see what changed there:

$ git log
commit 936fa7d267aca5a125546f312105704d63db9ad3 (HEAD -> master)
Author: Perry Stephenson <perrin.stephenson@uts.edu.au>
Date:   Tue Feb 19 22:03:13 2019 +1100

    Initial Commit

(You will probably need to press q to exit the log viewer)

No error this time - instead we’re being presented with a bunch of information about the first commit. You’ll see the details of the person who made the commit (me), the timestamp of when the commit was made, and the commit message. But you’ll also see a “commit hash”, which is a string of 40 hexadecimal characters that acts as a “cryptographic fingerprint” for the commit. This is useful for a number of reasons:

  • It is a unique reference for the changes we just made - it is practically impossible for any other commit to have the same string of characters (there are 2^160 different possible combinations!)
  • It provides a guarantee that (as long as the latest commit hash hasn’t changed) your repository hasn’t had any changes made without your knowledge (Linus Torvalds, the creator of Git, is clearly a bit paranoid!)
  • It gives you a standardised way to refer to commits

You can learn more about git commit hashes, and hashes in general, at the following sites:

You might notice that the first 7 characters of the commit hash above (936fa7d) were displayed when I made the initial commit. Even though Git needs all 40 characters to guarantee a unique reference, it is very unlikely for two commits in the same small repository to start with the same seven characters. You’ll therefore often see commits referenced using these 7-character commit hashes, and you’ll see them everywhere.

You might ask “why not use the commit message to refer to commits?”, which is a reasonable question! If you take a look at all of the commits for these course notes you’ll see that commit messages aren’t always informative or straight forward, and having a shorthand reference for every commit is super handy.
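If you’d like to see the full and short hashes side by side, the sketch below may help. It assumes a fresh throwaway repository with a single commit, and the variable names (full, short) are just for illustration. The hashes you get will be different to mine, because the author and timestamp are part of what gets hashed.

```shell
cd "$(mktemp -d)"                        # throwaway location (assumption)
git init -q poetry
cd poetry
git config user.name "John Doe"          # set just for this repository
git config user.email John.Doe@student.uts.edu.au

echo "Roses are red" > my_poem.txt
git add my_poem.txt
git commit -q -m "Initial Commit"

full="$(git rev-parse HEAD)"             # the full hash of the latest commit
short="$(git rev-parse --short HEAD)"    # the abbreviated form (usually 7 characters)
echo "full:  $full"
echo "short: $short"

# The short hash can be used wherever Git expects a commit
git --no-pager log -1 --oneline "$short"
```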

How often should you commit?

There is no one-size-fits-all answer to this question. It depends on what you’re working on, how often you need feedback from your peers, how many changes you’re working on, how many files you’re working on, etc.

One approach is to commit every time you complete one idea. For example, if you took some messy code and cleaned it up using functions, that whole task could be a single commit. But then if you need to change a configuration value in a YAML file, that could also be a single commit.

It really depends on the granularity you want in your commit history, and how likely you are to need to go back into your commit history. It also depends on how much you trust your backup process - if you don’t run regular backups on your computer then it pays to make commits regularly, and push them to the remote repository many times throughout the day (we’ll cover this below).

It’s also worth keeping in mind that you can squash multiple commits into a single commit later on (an advanced feature that will not be covered in this course), but it is much harder to split one big commit into several smaller ones after the fact, so erring on the side of lots of commits today is preferable.

Also, as you work more with source control tools, you will start to develop your own intuitive feel. Until that happens, it is better to over commit than under commit. This will lead you to commit more often than a seasoned developer would, but you may save yourself grief at some point.

3.2.3 Tracking Changes

This is all well and good, but so far it seems like a whole lot of work for no real benefit. So let’s take a look at Git’s first key feature: tracking changes.

Making the changes is the easy part. Because I’ve committed the changes I have already made, I can make changes to my file without fear of stuffing anything up. Open up my_poem.txt, make a few changes, and save as per usual.

Roses are many colours
I don't even know if I've ever seen a Violet
Try to love your data
It works hard for you

The first cool feature we can use here is git diff, which will tell us the difference between the most recent commit, and what’s currently in our folder.

$ git diff
diff --git a/my_poem.txt b/my_poem.txt
index 779034b..2e11943 100644
--- a/my_poem.txt
+++ b/my_poem.txt
@@ -1,4 +1,4 @@
-Roses are red
-Violets are blue
+Roses are many colours
+I don't even know if I've ever seen a Violet
 Try to love your data
 It works hard for you

(You will probably need to press q to exit the diff viewer)

This isn’t the easiest thing in the world to read, but it’s using - to show you which lines have been removed (the first two lines), using + to show which lines have been added, and then it’s showing you a few lines after the changes to help provide context. If you find yourself wondering “what have I changed since my last commit?” then this is the tool for you. If you want to get fancy, it can actually tell you “what have I changed since commit 936fa7d?” which can be pretty useful in a large project.

It’s actually a bit easier to read than it looks, because most terminals will automatically colour-code the output: removed lines are normally shown in red and added lines in green.
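The “what have I changed since commit 936fa7d?” question can be answered by passing a commit hash to git diff. Here is a rough sketch, assuming a throwaway repository with two commits plus one uncommitted edit. The hash is captured into a variable (first, a name I made up) because yours will not be 936fa7d.

```shell
cd "$(mktemp -d)"                        # throwaway location (assumption)
git init -q poetry
cd poetry
git config user.name "John Doe"
git config user.email John.Doe@student.uts.edu.au

printf '%s\n' "Roses are red" "Violets are blue" > my_poem.txt
git add my_poem.txt
git commit -q -m "Initial Commit"
first="$(git rev-parse --short HEAD)"    # remember the first commit's hash

printf '%s\n' "Roses are Red" "Violets are blue" > my_poem.txt
git add my_poem.txt
git commit -q -m "Capitalise Red"

# One more edit, saved but not committed
printf '%s\n' "Roses are Red" "Violets are Blue" > my_poem.txt

git --no-pager diff             # changes since the latest commit (second line only)
git --no-pager diff "$first"    # changes since the first commit (both lines)
```

The --no-pager option just stops Git from opening the viewer that you would otherwise need to exit with q.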

The next cool feature is git reset, which takes your repository back to how it was at the most recent commit.

  • Using git reset without any arguments will simply un-stage any changes you have added (using git add). Your edits stay in the files.
  • Using git reset --hard will undo any changes you have made to tracked files since the last commit. Be careful, this will delete any changes you have not committed!

To demonstrate this:

$ git reset --hard
HEAD is now at 936fa7d Initial Commit
$ git diff

Nothing! git reset --hard has deleted my changes.
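The difference between the two forms of git reset is easier to see side by side. This is a minimal sketch, assuming a throwaway repository with one commit. It uses git status --short, a compact version of git status, to show what is staged at each step.

```shell
cd "$(mktemp -d)"                        # throwaway location (assumption)
git init -q poetry
cd poetry
git config user.name "John Doe"
git config user.email John.Doe@student.uts.edu.au

echo "Roses are red" > my_poem.txt
git add my_poem.txt
git commit -q -m "Initial Commit"

echo "Roses are many colours" > my_poem.txt
git add my_poem.txt
git status --short        # "M  my_poem.txt": the edit is staged

git reset -q              # un-stage; the edit is still in the file
git status --short        # " M my_poem.txt": modified but not staged
cat my_poem.txt           # still "Roses are many colours"

git reset -q --hard       # discard the edit entirely
cat my_poem.txt           # back to "Roses are red"
```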

I’ll make two more small changes now so that we can make a second commit and see how that works. Firstly, a small change to the poem (capitalising the colour names):

Roses are Red
Violets are Blue
Try to love your data
It works hard for you

and I’ll also create a new file in the same directory: my_name.txt

Perry

Now when I use git status I’ll see both of these files listed:

$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   my_poem.txt

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    my_name.txt

no changes added to commit (use "git add" and/or "git commit -a")

Git knows that we have modified my_poem.txt, and created my_name.txt. We can stage both files at once using:

git add my_poem.txt my_name.txt

or we could even use the shortcut git add --all if we just want to stage everything that has changed since the last commit.

If we call git status again we’ll see both changes staged, ready for commit:

$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    new file:   my_name.txt
    modified:   my_poem.txt

and then we can commit the changes using git commit:

$ git commit -m "Small changes to poem, added my_name.txt"
[master 535df8b] Small changes to poem, added my_name.txt
 2 files changed, 3 insertions(+), 2 deletions(-)
 create mode 100644 my_name.txt

If we take a look at the git log now, we can see both commits:

$ git log
commit 535df8b3aa6ba5db9ee3c4cd33329c7f840dbdef (HEAD -> master)
Author: Perry Stephenson <perrin.stephenson@uts.edu.au>
Date:   Tue Feb 19 22:46:05 2019 +1100

    Small changes to poem, added my_name.txt

commit 936fa7d267aca5a125546f312105704d63db9ad3
Author: Perry Stephenson <perrin.stephenson@uts.edu.au>
Date:   Tue Feb 19 22:03:13 2019 +1100

    Initial Commit

Now we can see details about all commits in reverse chronological order.

By repeating this process, you’ll be able to track your changes across the whole repository, no matter how many files you add:

  • Make changes to files inside the repo, and save them when complete.
  • Stage the new files and edits using git add
  • Commit the changes using git commit
  • Check the status of the repo using git status
  • Un-stage changes using git reset (and delete changes using git reset --hard)

Once you get the hang of it, the process takes no time at all.
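Put together, the loop described above might look like the sketch below. It assumes a throwaway repository and goes around the edit, stage, commit cycle twice. The file contents and commit messages are only examples.

```shell
cd "$(mktemp -d)"                        # throwaway location (assumption)
git init -q poetry
cd poetry
git config user.name "John Doe"
git config user.email John.Doe@student.uts.edu.au

# First time around: edit, stage, commit, check
echo "Roses are red" > my_poem.txt
git add my_poem.txt
git commit -q -m "Initial Commit"
git status                               # nothing to commit

# Second time around: edit one file, create another, stage both at once
echo "Violets are blue" >> my_poem.txt
echo "Perry" > my_name.txt
git add --all
git commit -q -m "Added a line to the poem, added my_name.txt"

git --no-pager log --oneline             # both commits, newest first
```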

One last thing - deleting files. There are two ways to do this:

  1. Use git rm <file> to delete the file from disk and stage the deletion
  2. Delete the file however you like, then remember “oh no I should have used git rm” and then just use git add <file> to tell Git to “add the changes” you have made to the file (which in this case is “adding” a deletion)

The first option is easier, but I always forget to do it, so the second option is useful to know about. Either way, once you’ve staged and committed the changes, you’ll see another commit appear in the commit log, and your file will be gone.

Let’s go ahead and delete my_name.txt:

$ git rm my_name.txt
rm 'my_name.txt'

Too easy! Now we can just go ahead and commit the changes as per usual.

$ git commit -m "Deleting my_name.txt"
[master 83cd1c2] Deleting my_name.txt
 1 file changed, 1 deletion(-)
 delete mode 100644 my_name.txt
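For completeness, here is how the second option (delete the file yourself, then tell Git about it) might look. This is a sketch in a throwaway repository, and it assumes a reasonably recent version of Git (2.0 or later), where git add on a deleted file stages the deletion.

```shell
cd "$(mktemp -d)"                        # throwaway location (assumption)
git init -q poetry
cd poetry
git config user.name "John Doe"
git config user.email John.Doe@student.uts.edu.au

echo "Perry" > my_name.txt
git add my_name.txt
git commit -q -m "Added my_name.txt"

rm my_name.txt            # delete the file without telling Git
git status --short        # " D my_name.txt": deleted, but not staged
git add my_name.txt       # "add" the deletion to the staging area
git status --short        # "D  my_name.txt": deletion staged
git commit -q -m "Deleting my_name.txt"
```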

3.2.4 Recovering Changes

Tracking changes is all well and good, but what use is it if you can’t go back in time and see the old versions? You can do this from the command line too, but in reality you probably won’t. So I’ll show you how to do it quickly so that you know how to do it if you need to, but I’ll show you some easier ways to see old changes (using online tools) later on in this chapter.

To find the commit you want to go back and look at, first use the git log command to find its commit hash.

You can make git log a little bit easier to read by adding the --oneline option.

$ git log --oneline
83cd1c2 (HEAD -> master) Deleting my_name.txt
535df8b Small changes to poem, added my_name.txt
936fa7d Initial Commit

(You will probably need to press q to exit the log viewer)

In this case, I’ve decided I want to go back to my second commit and see how my repository looked at that point in time. Note that I currently have one file in my repo:

poetry
└── my_poem.txt

And when I made the second commit, I had two files in my repo. This means that when I go back in time, git will actually add the extra file for me. This is a key difference to working with undo on individual files - commits apply to the whole repository, not specific files.

To go back to commit 535df8b, we need to use the git checkout command. Note that if you’re following along with these instructions on your own computer you will need to run git log --oneline yourself to find the commit hash you want to check out.

$ git checkout 535df8b
Note: checking out '535df8b'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 535df8b Small changes to poem, added my_name.txt

There is a lot of helpful information here, but you can mostly ignore it for now. Let’s instead take a look at the files in the repo:

poetry
├── my_name.txt
└── my_poem.txt

Git has restored my repository to the same state it was in when I made the second commit, and it’s even re-created a file to do that for me. Pretty cool!

If you look at the git log now, you’ll see that it looks a little different:

$ git log --oneline
535df8b (HEAD) Small changes to poem, added my_name.txt
936fa7d Initial Commit

The log no longer shows the most recent commit, and the HEAD reference has moved to the commit that I have checked out. We’ll cover this later, but for now I just want to show that the 3rd commit hasn’t been lost! We can ask Git to show it again by adding the --all argument to the git log call.

$ git log --oneline --all
83cd1c2 (master) Deleting my_name.txt
535df8b (HEAD) Small changes to poem, added my_name.txt
936fa7d Initial Commit

Now you can see all three commits that we’ve made, so even though we’ve gone back in time we’ve still kept track of all of the changes made in the third commit. To go back to the latest commit, we have to do something slightly different: we need to check out master. We’ll explain what this means in the next section.

$ git checkout master
Previous HEAD position was 535df8b Small changes to poem, added my_name.txt
Switched to branch 'master'
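The whole round trip can be sketched as a script. It assumes a throwaway repository with the same three commits as above, and it stores the hash and the branch name in variables (second and branch, names I made up) because your hashes will differ and your default branch may not be called master.

```shell
cd "$(mktemp -d)"                        # throwaway location (assumption)
git init -q poetry
cd poetry
git config user.name "John Doe"
git config user.email John.Doe@student.uts.edu.au

echo "Roses are red" > my_poem.txt
git add my_poem.txt
git commit -q -m "Initial Commit"
echo "Perry" > my_name.txt
git add my_name.txt
git commit -q -m "Added my_name.txt"
second="$(git rev-parse --short HEAD)"   # remember the second commit
git rm -q my_name.txt
git commit -q -m "Deleting my_name.txt"

branch="$(git symbolic-ref --short HEAD)"   # "master" in these notes
ls                                   # only my_poem.txt
git checkout -q "$second"            # go back in time (-q hides the detached HEAD note)
ls                                   # my_name.txt is back
git --no-pager log --oneline --all   # all three commits are still there
git checkout -q "$branch"            # return to the latest commit
ls                                   # only my_poem.txt again
```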

If you have followed along so far and successfully run the commands above, congratulations!

Let’s quickly review the commands we have used so far:

  • git init - Initialise a new git repository within a folder
  • git add (and git rm) - Stage changes ready for commit
  • git status - Shows the status of your working directory, and lists files which have been added, modified or removed since the last commit
  • git commit - Commit changes to the repository
  • git log - View commit history
  • git diff - See what edits have been made since the last commit
  • git reset - Un-stage all changes
  • git checkout - Check out a commit or a branch

Let’s quickly review how many of the “desirable features” we’ve covered off:

  1. Easily undo changes to your files. (DONE)
    1. Undo changes you made in the past, and then re-apply all the changes you have made since then.
    2. See a list of all of the changes you have made over time (DONE)
    3. See how your document now is different to the same document at a specific point in the past (DONE)
  2. Safely make major changes to your files without having to save them with funny file names. (SORT OF DONE)
    1. Make many big changes at the same time, but not all in the same version of the file
    2. Let your team make changes to your files, but without all the funny file names
    3. Easy way to see the changes your boss made to your file.
  3. Multiple editors working on the same document at the same time.
    1. Multiple offline editors working at the same time.
    2. Manual override for merging when two people have edited the same section of the document.
  4. Flexible ways to review quality before applying changes.
    1. As the document owner, I want to have the final say on all changes
    2. During the early stages of document creation I don’t want to worry about the process.
  5. Standardised naming conventions to identify versions of documents.
    1. Globally recognised naming conventions.

We’re making good progress, but there is still plenty to cover!

3.2.5 Getting Help

As you continue through this chapter, you may run into situations where you want to learn more about a command that you’ve been shown. For all git commands, you can get help by running:

git <command> --help

For example, if you wanted to learn more about the git commit command, you would run

git commit --help

You should be able to scroll through the help documentation using the arrow keys, and you will probably need to press q to exit the help viewer.

You can also just use Google or Stack Overflow.

3.3 Branching

This is one of the harder concepts to convey in a book, so if you find that you are getting lost at any point you might like to check some of these other resources which explain branching in different ways:

The diagrams in this chapter are taken from the Atlassian Git Branching tutorial, which is permitted as per the specified Creative Commons license: CC BY 2.5 AU

Let’s start from the very beginning.

When you make a commit in a repo, the contents of your repo get stored in Git’s internal database (the hidden .git folder inside your repo). Git also stores details about the author of the commit, the timestamp, a commit message, and a cryptographic hash which we use to reference the commit. It also stores information about the parent of the commit, i.e. the commit that came before it.
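If you'd like to see these details for yourself, the following is a minimal sketch that builds a throwaway repo and prints the raw contents of a commit. The temporary folder, the "Example Author" identity and the commit messages are all placeholders made up for this illustration.

```shell
# Throwaway repo so this does not touch any real work
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity

echo "Roses are Red" > my_poem.txt
git add my_poem.txt
git commit -q -m "Initial Commit"
first=$(git rev-parse HEAD)                  # hash of the first commit

echo "Violets are Blue" >> my_poem.txt
git commit -q -am "Second commit"

# Print the stored commit: the tree (snapshot of the files), the parent,
# the author and committer with timestamps, and the commit message
git cat-file -p HEAD

# The hash itself is the name we use to look the commit up
git rev-parse HEAD
```

The parent line in the output should match the hash saved in the first variable, which is how Git links each commit to the one before it.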

In the diagram above, the circles represent commits, and the lines represent relationships between commits. In this case (and all of the diagrams used in this chapter), the circle on the left represents the first commit, and the most recent commits are on the right hand side. This means that the commit on the very left is the parent of the 2nd commit, which is the parent of the 3rd commit, which is the parent of the 4th and final commit. This diagram represents almost exactly what we did in the previous section, except that we only made three commits and the diagram shows four.

Whenever you start a new Git repo, you’ll automatically be working on what is known as the master branch. This is a Git convention where the master branch is the default branch (some newer setups name the default branch main instead, and the name can be changed with the init.defaultBranch setting). The diagram above shows the master branch pointing to the most recent commit, which brings us to the first key thing you need to know about branches:

Branches are just “pointers” to commits.

The diagram above shows this well - when you create a new repo and make a few commits, the master branch is just a pointer to your most recent commit.

Simple, right?
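To check this for yourself, here is a small sketch (using a throwaway repo, a placeholder identity and empty placeholder commits) showing that the branch name and the latest commit resolve to exactly the same hash.

```shell
# Throwaway repo with two empty commits
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
git commit -q --allow-empty -m "Initial Commit"
git commit -q --allow-empty -m "Second commit"

git rev-parse HEAD       # hash of the most recent commit
git rev-parse master     # the same hash: master simply points at that commit

# List every branch pointer together with the commit it points to
git for-each-ref refs/heads
```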

To get a bit of a feel for why this is important to understand, let’s look at what could happen if we had three branches.

There is a lot going on in this diagram! You’ll see three branches here: Little Feature, Master and Big Feature. Each of these branches is a pointer to a commit, pointing to the most recent commit down each pathway.

Also note that each commit in this diagram (apart from the very first) has exactly one parent commit, but some commits have two children - this is really the point of branching. Branching lets you create diverging history for your documents, giving you the ability to work on multiple versions at once. In the example above, there are three “versions” of the repository at any time, each with their own history:

  • The “Little Feature” version has two commits in history
  • The “master” version has four commits in history
  • The “Big Feature” version has six commits in history

This hints at the reasons why Data Scientists will use branching when collaborating on a repository. In the most general sense, you will encounter or create branches whenever:

  • You (or someone else) have made significant changes to a shared repository and want to avoid making changes that impact other people’s work
  • You want to make changes that you can easily throw away if they turn out to be bad changes
  • You want to make changes without impacting the “production” (master) branch

Okay, so now we have an idea about what branches are, and how they relate to commits, let’s work through an example of how to create a branch. The first step in branching is to create a new branch pointer, which points to the latest commit. This means that we’ll have two branch pointers pointing at the same commit - for example if we create a new branch called “Crazy Experiment” it will look like this:

Let’s do this in the poetry repo from the example above. To create the new branch, we’ll use the git checkout command (because we’re checking out a new branch). The full command will be git checkout -b crazy-experiment where -b means we’re creating a new branch, and crazy-experiment is the name of our new branch.

$ git checkout -b crazy-experiment
Switched to a new branch 'crazy-experiment'
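As a rough sketch of what -b is doing for us, the same result can be reached in two separate steps: create the pointer, then move to it. The repo, identity and the second branch name (another-experiment) below are placeholders made up for this illustration.

```shell
# Throwaway repo with a single commit
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
git commit -q --allow-empty -m "Initial Commit"

# Long form: create the branch pointer, then switch to it
git branch crazy-experiment
git checkout -q crazy-experiment

# Short form: create and switch in one step
# (newer versions of Git, 2.23 onwards, also offer `git switch -c <name>`)
git checkout -q -b another-experiment

# All three branch pointers now point at the same commit
git for-each-ref refs/heads
```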

If we use git log --oneline, we can see what’s just happened.

$ git log --oneline
83cd1c2 (HEAD -> crazy-experiment, master) Deleting my_name.txt
535df8b Small changes to poem, added my_name.txt
936fa7d Initial Commit

The most recent commit (where I deleted my_name.txt) has two branch pointers pointing to it: master and crazy-experiment. This matches the diagram above, which shows both pointers pointing to the most recent commit. There is also a reference to HEAD:

HEAD represents where you are right now. It is a pointer which points to either:

  • a branch pointer, or
  • a specific commit

If it’s pointing to a branch pointer, it means that your working directory corresponds to the latest commit on that branch. If it’s pointing to a specific commit, it means that you are not “on a branch” but your working directory corresponds to that specific commit.

Most of the time, HEAD will be pointing to a branch, and in conversation you would say that you are “on” the branch that HEAD is pointing to. In the case above, we could say we are on the crazy-experiment branch.

In the example we’re working through, HEAD -> crazy-experiment means that the working directory is on the most recent commit in the crazy-experiment branch. This is exactly what we wanted to happen: git checkout has created the new branch pointer for us, and has moved HEAD to point at the new branch.
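You can ask Git directly where HEAD is pointing. The sketch below (again in a throwaway repo with placeholder commits) shows both cases described above: HEAD pointing to a branch, and HEAD pointing to a specific commit, which Git calls a "detached HEAD".

```shell
# Throwaway repo with two empty commits and a new branch
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
git commit -q --allow-empty -m "Initial Commit"
git commit -q --allow-empty -m "Second commit"
git checkout -q -b crazy-experiment

# Case 1: HEAD points to a branch pointer
git symbolic-ref HEAD              # prints refs/heads/crazy-experiment
git rev-parse --abbrev-ref HEAD    # prints crazy-experiment

# Case 2: HEAD points to a specific commit (here, the one before the latest)
git checkout -q --detach HEAD~1
git symbolic-ref -q HEAD || echo "HEAD is detached (not on a branch)"
git rev-parse --abbrev-ref HEAD    # prints HEAD
```

To get back onto a branch after trying this, you would run git checkout with the branch name, e.g. git checkout crazy-experiment.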

This branch now allows us to make changes without impacting the master branch. Let’s make a few commits and see what happens. Firstly I’ll make a new file called “attribution.txt” and commit that change.

Thanks for all of the diagrams, Atlassian!

$ git add attribution.txt
$ git commit -m "Added an attribution note"
[crazy-experiment 558e604] Added an attribution note
 1 file changed, 1 insertion(+)
 create mode 100644 attribution.txt

Now I’ll make another commit, adding the creative commons license to the bottom of my poem.

Roses are Red
Violets are Blue
Try to love your data
It works hard for you

License: CC BY-SA 4.0

$ git add my_poem.txt
$ git commit -m "Adding creative commons license"
[crazy-experiment dd0a9f7] Adding creative commons license
 1 file changed, 2 insertions(+)

Very cool! I don’t have a diagram to show where we’re up to at the moment, so let’s take a look at the log.

$ git log --oneline
dd0a9f7 (HEAD -> crazy-experiment) Adding creative commons license
558e604 Added an attribution note
83cd1c2 (master) Deleting my_name.txt
535df8b Small changes to poem, added my_name.txt
936fa7d Initial Commit

This is exactly what we wanted to happen. The master branch is still back where we left it (the 3rd commit) and the two commits we just made are part of the crazy-experiment branch. At the moment it all looks a bit like a straight line, but what happens if we keep making changes to master at the same time?

Consider that we find a bug in master and need to fix it ASAP - we don’t have time to wait until the crazy-experiment is finished and ready for production. We can now just checkout the master branch, make changes, and commit them as per usual. Firstly, we’ll checkout the master branch, and list the files to show that the “attribution.txt” file isn’t there (because we created it on the other branch).

$ git checkout master
$ ls
my_poem.txt

Note that the checkout command here is almost identical to how we created the new branch, but this time we’re not including the -b argument because the branch already exists - we only use -b when we want to change to a new branch and create it at the same time.

We can also look at my_poem.txt and see that it doesn’t have the license we added either.

$ cat my_poem.txt
Roses are Red
Violets are Blue
Try to love your data
It works hard for you

We haven’t used the cat command before, so it deserves a quick explanation before we cover it in more detail later in the course. Using cat <filename> will print the contents of a file to the screen.

The name cat is derived from concatenation, as the tool can also be used to concatenate multiple files into a single output.
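As a quick illustration of the concatenation use case, here is a small sketch that joins two files into one. The file names are made up for this example.

```shell
# Work in a throwaway folder with two small placeholder files
dir=$(mktemp -d) && cd "$dir"
printf 'Roses are Red\n' > first.txt
printf 'Violets are Blue\n' > second.txt

# Concatenate both files, in order, into a third file
cat first.txt second.txt > combined.txt

# Print the result to the screen
cat combined.txt
```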

Everything is exactly as we expected - the two changes that were made on the crazy-experiment branch aren’t present on the master branch. Let’s go ahead and make an edit to the master branch, which we’ll pretend is a “critical bug fix” - we’ll add a “last updated” date to the top of the file because the legal department advised that we needed to do this to stay compliant with strict new poetry laws.

Last updated: 2019-02-21

Roses are Red
Violets are Blue
Try to love your data
It works hard for you

Now we’ll stage and commit the change, this time committing the changes to the master branch.

$ git add my_poem.txt
$ git commit -m "Critical bug fix"
[master 0d10486] Critical bug fix
 1 file changed, 2 insertions(+)

To make sure it’s 100% clear what’s happened here, I’m going to jump ahead a little and use Sourcetree to visualise what’s happening with these two branches. We’ll look at Sourcetree later in the chapter, but for now the git log view below makes it very clear that the three commits we just made are on two different branches, and Sourcetree is using the bold text to let us know that we’re currently working on the master branch.

We can see that the commit history has a clear “branch”, and we’ve got two divergent versions of the same repository. This means that we can continue making changes on either branch without impacting the state of the other one.
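If you don't have a GUI handy, git log can draw a rough text version of the same picture. The sketch below rebuilds a similar history in a throwaway repo (with empty placeholder commits that reuse our commit messages) and then shows the split.

```shell
# Throwaway repo: one shared commit, two on the experiment, one on master
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
git commit -q --allow-empty -m "Initial Commit"

git checkout -q -b crazy-experiment
git commit -q --allow-empty -m "Added an attribution note"
git commit -q --allow-empty -m "Adding creative commons license"

git checkout -q master
git commit -q --allow-empty -m "Critical bug fix"

# Draw the history of every branch as a text graph
git log --oneline --graph --all

# Commits that are on crazy-experiment but not on master
git log --oneline master..crazy-experiment

# Commits that are on master but not on crazy-experiment
git log --oneline crazy-experiment..master
```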

Because branching is so easy in Git, you can have as many branches as you like. This is in fact one of the key differences between Git and older source control systems (e.g. SVN) - Git makes branching easy and you are encouraged to branch often.

You’ll typically want to use branches whenever you are making big changes, working with team members (everyone gets their own branch), or when you’re making changes to a production system.

There are a few popular branching strategies that you’re likely to come across in the workplace - these are essentially different sets of agreements that a team might have which dictate when and how to create a new branch. The Atlassian Git Docs outline the most popular strategies; if you start in a new team it will be helpful to identify which strategy the team is using so you can collaborate effectively.

The best part about these popular strategies is that they are more or less globally recognised - any Git user can immediately recognise that the master branch represents production, the dev branch is probably pretty close to production, and anything starting with feature is probably still being worked on. This shared knowledge makes it easy to collaborate through Git because everyone starts from a shared understanding of what the branch names mean.

For assignments in this course you may use whatever strategy you like, however you’ll likely find the Feature Branch Workflow is the easiest to use in small data science teams.

A final note on branches - you can list all of the branches in your repository using git branch -a.

$ git branch -a
  crazy-experiment
* master

The asterisk tells you which branch you are currently using.

3.4 Merging Changes

What’s the point of branching if we can’t merge the changes back together when we’re ready?

The general idea of a merge is that we’re going to make a new commit which has two parents. The diagram below shows what we currently have:

This is identical to our repository, but with the crazy-experiment branch shown as Some Feature. We want to create a new commit which has the changes from both the master branch and the crazy-experiment branch, and we want this new commit to be on the master branch. We are finished making changes to the crazy experiment, so we don’t really care what happens to that branch pointer. We essentially want this to happen:

Note that the new commit on the right has two parents and is on the master branch, and that we’re leaving the crazy-experiment branch pointer behind as we don’t need it anymore.

There are a few steps required to make this merge happen:

  • Make sure all of your changes have been committed (on both branches)
  • Checkout the master branch
  • Merge the branches using git merge

Let’s run through this on our demonstration repository. We’ll start by assuming we’ve already committed all of our changes, so the next thing to do is checkout the master branch.

$ git checkout master
Switched to branch 'master'

If you’re following along with these examples then you are probably already on the master branch. In this case we’re just including the command because in most cases you’ll need to checkout master before merging.

Now we’re going to merge the crazy-experiment branch into the master branch. The thing to remember about the direction of a merge is that the branch you are on when you start the merge is the branch that receives the new commit (and the branch you will be on when you finish the merge). In most cases you want the new commit to be on the master branch, and therefore we want to make sure we’re on the master branch when we start the merge.

The next step is to do the merge itself. The command for this is unsurprisingly git merge, but we’ll also need to provide two pieces of information:

  1. The name of the branch we’re merging with
  2. The commit message for the new commit we’re making

So the full command is:

$ git merge crazy-experiment -m "Crazy experiment was a success, merging into master"
Auto-merging my_poem.txt
Merge made by the 'recursive' strategy.
 attribution.txt | 1 +
 my_poem.txt     | 2 ++
 2 files changed, 3 insertions(+)
 create mode 100644 attribution.txt

Sourcetree confirms that the new commit was created with two parents, and that the master branch pointer is now pointing to that new commit:
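The same thing can be checked from the command line. The following is a minimal sketch that repeats the merge in a throwaway repo (with a placeholder identity and simplified file contents) and then lists the parents of the new commit. One assumption worth calling out: a two-parent merge commit is created here because both branches have new commits. If master had not moved since the branch was created, Git would normally just slide the master pointer forward (a "fast-forward") without creating a new commit.

```shell
# Throwaway repo with a shared first commit
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
echo "Roses are Red" > my_poem.txt
git add my_poem.txt
git commit -q -m "Initial Commit"

# One commit on the experiment branch...
git checkout -q -b crazy-experiment
echo "Thanks for all of the diagrams, Atlassian!" > attribution.txt
git add attribution.txt
git commit -q -m "Added an attribution note"

# ...and one commit on master, so the two histories diverge
git checkout -q master
echo "Violets are Blue" >> my_poem.txt
git commit -q -am "Critical bug fix"

# Merge the experiment into master
git merge -q crazy-experiment -m "Crazy experiment was a success, merging into master"

# Print the merge commit's hash followed by the hashes of its parents
git rev-list --parents -n 1 HEAD

# Text version of the graph, showing the two lines of history joining up
git log --oneline --graph
```

The rev-list line should print three hashes: the merge commit itself, followed by its two parents.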

It’s also worth noting that Git worked out how to merge the changes! Let’s take a look at how it merged the two changes in the my_poem.txt file:

Last updated: 2019-02-21

Roses are Red
Violets are Blue
Try to love your data
It works hard for you

License: CC BY-SA 4.0

Git has a bunch of clever merging strategies, and it’s successfully worked out how to apply the changes from the two branches to the same file. Pretty cool!

What about when Git can’t auto-merge?

When Git can’t work out how to automatically merge changes, it will ask you how to resolve the issue. The process for dealing with this using the CLI is a bit complicated (and you’ll normally do it using a GUI) so I won’t cover it here.

If you want to learn more about manually resolving conflicts, see the Resolving Conflict section of the Atlassian Git Merge tutorial.
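Just so a conflict is recognisable when you first hit one, here is a rough sketch of the simplest possible case: two branches that change the same line of the same file. The branch name (other-edit), the poem wording and the chosen resolution are all made up for this illustration, and real conflicts are usually messier.

```shell
# Throwaway repo with a one-line poem
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
printf 'Roses are Red\n' > my_poem.txt
git add my_poem.txt
git commit -q -m "Initial Commit"

# Change the line on a hypothetical branch...
git checkout -q -b other-edit
printf 'Roses are Crimson\n' > my_poem.txt
git commit -q -am "Crimson roses"

# ...and change the same line differently on master
git checkout -q master
printf 'Roses are Scarlet\n' > my_poem.txt
git commit -q -am "Scarlet roses"

# Git cannot pick a winner, so the merge stops and reports a conflict
git merge other-edit -m "Merging other-edit" || echo "Merge stopped: conflict in my_poem.txt"

# The file now contains both versions, separated by conflict markers
# (lines starting with <<<<<<<, ======= and >>>>>>>)
cat my_poem.txt

# Resolve it by editing the file to say what you want, then stage and commit
printf 'Roses are Crimson and Scarlet\n' > my_poem.txt
git add my_poem.txt
git commit -q -m "Merging other-edit, resolving the conflict by hand"
```

If you get stuck halfway through a conflicted merge, git merge --abort returns the repo to the state it was in before the merge started.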

The final step in merging is to delete the branch that we were working on. Remember that branches are just pointers, and as the changes we made in the crazy experiment are now part of the master branch, we don’t have any need for the crazy-experiment branch pointer any more. In big projects with lots of collaborators the branch names can quickly get messy, so it’s best to delete branches once they have been merged. The command for this is:

$ git branch -d crazy-experiment
Deleted branch crazy-experiment (was dd0a9f7).
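It's worth knowing that -d is the safe way to delete a branch: it refuses if the branch contains commits that haven't been merged into the branch you are on. The sketch below (a throwaway repo with empty placeholder commits) shows the refusal, then the successful delete after merging, and finally confirms that the commits are still part of master once the pointer is gone.

```shell
# Throwaway repo with one commit on master and one on the experiment
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the first branch is called master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"   # placeholder identity
git commit -q --allow-empty -m "Initial Commit"
git checkout -q -b crazy-experiment
git commit -q --allow-empty -m "Added an attribution note"
experiment=$(git rev-parse HEAD)             # remember the experiment's latest commit
git checkout -q master

# Not merged yet, so -d refuses to delete the branch
git branch -d crazy-experiment || echo "Refused: crazy-experiment is not fully merged"

# After merging, the same command succeeds
git merge -q crazy-experiment -m "Crazy experiment was a success, merging into master"
git branch -d crazy-experiment

# Only the pointer was removed; the commit is still in master's history
git merge-base --is-ancestor "$experiment" master && echo "Commit is still reachable from master"
```

There is also an uppercase -D option which deletes the branch whether or not it has been merged, so treat it with care.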

Let’s take another look at the features we wanted to cover at the start of this chapter:

  1. Easily undo changes to your files. (DONE)
    1. Undo changes you made in the past, and then re-apply all the changes you have made since then.
    2. See a list of all of the changes you have made over time (DONE)
    3. See how your document now is different to the same document at a specific point in the past (DONE)
  2. Safely make major changes to your files without having to save them with funny file names. (DONE)
    1. Make many big changes at the same time, but not all in the same version of the file (DONE)
    2. Let your team make changes to your files, but without all the funny file names
    3. Easy way to see the changes your boss made to your file.
  3. Multiple editors working on the same document at the same time.
    1. Multiple offline editors working at the same time.
    2. Manual override for merging when two people have edited the same section of the document.
  4. Flexible ways to review quality before applying changes.
    1. As the document owner, I want to have the final say on all changes
    2. During the early stages of document creation I don’t want to worry about the process.
  5. Standardised naming conventions to identify versions of documents. (DONE)
    1. Globally recognised naming conventions. (DONE)

We’ll look at how to collaborate with colleagues in the next chapter.

3.5 Graphical User Interfaces

Now that you’re a pro at using the Git CLI, you can take a look at some of the GUI tools available for free online. If you want a full-featured Git client that almost never requires you to use the command line, you’ll want to take a look at GitHub Desktop or Atlassian SourceTree. If you just want a GUI for the main commands (stage, commit, push, pull) then you can probably take a look at your existing Integrated Development Environment - tools like RStudio, Atom and PyCharm have very convenient Git integrations which can make your life much easier when committing regularly.

We won’t cover these tools individually but you should pick one of them and give it a try - and pay attention to how much easier it is to learn to use the tool because you started from the CLI!

You don’t ever need to use a GUI - the websites that host remote repositories are getting much better and a lot of the best features are available online without having to install anything on your computer.

3.6 Other Git Resources

| Resource | Notes | Cost (AUD) |
|---|---|---|
| Atlassian Docs | Atlassian’s Git documentation is aimed at new Git users and is a very comprehensive resource. | Free! |
| DataCamp | DataCamp has an Introduction to Git for Data Science course. | $35/month |
| Oh Shit, Git! | Git documentation has this chicken and egg problem where you can’t search for how to get yourself out of a mess, unless you already know the name of the thing you need to know about in order to fix your problem. Oh Shit, Git! gives solutions to common Git problems in plain English. | Free! |
| The Pro Git Book | The book is a tough read but it’s the definitive guide to all things Git. | Free! |
| Try Github | Try Github started out strong, but they’ve removed some of their best content recently. They still have a few cool interactive learning tools here which might be useful. | Free! |