6 Structured Query Language (SQL)

SQL is the easiest language to learn in data science, and it’s also one of the most powerful. In most cases you’ll be using SQL to query large production databases managed by someone else, where the heavy lifting happens on the database server - which is why SQL is often the best choice when the dataset is very large or when speed is important.

The majority of this topic will be taught using DataCamp - see Learning SQL with DataCamp for more information.

6.1 What is SQL?

SQL isn’t really one language - it is a set of very similar languages which are implemented by database vendors. If you’re working with an Oracle database then you’re likely to be using Oracle SQL, if you’re working with Microsoft SQL Server then you’re using T-SQL, if you’re using Teradata then you’re using Teradata SQL, and so on. The ecosystem has been extended into the “big data” space in recent years, with HiveQL, Presto, SparkSQL and more providing SQL-like query interfaces.

This can be a double-edged sword. Whilst these similarities make it easy to get data out of just about any system due to the shared language, each system implements the advanced features differently. For example, SparkSQL provides a “pivot” clause which does not exist in Teradata SQL, whilst Presto implements a “cosine_similarity” function which isn’t available in SparkSQL. So whilst the basics are the same across all SQL systems, users of advanced functionality can get frustrated when moving between systems which use different SQL dialects.

Taking a step back, SQL is what’s known as a declarative programming language: the syntax lets you declare what you want, without worrying about how the system goes about getting it for you. You describe the result, and the database management system generates an execution plan for you behind the scenes. Importantly, this is different to both R and Python, which are imperative languages where each command is executed sequentially and each command changes the application’s state. A SQL query, on the other hand, is executed as a single unit, and the state of the system matters much less.

Because SQL is a declarative language, most vendors are able to optimise the majority of queries very effectively, which means you usually get your data and/or results quickly regardless of how efficiently you wrote your query. In many cases when working with large datasets (and definitely when working with “big data”) you will find that SQL is many orders of magnitude faster than using R or Python.

SQL is also sometimes referred to as a set language, as SQL is predominantly used to describe a set of records you want to extract. Most of the language syntax is used to describe which records you want, and how you want to transform them. Understanding this is key to understanding how to write and debug SQL.

6.2 How SQL Works

One quirk of SQL is that you will almost never run it on your own computer. The database will typically be managed by someone else (a Database Administrator) and you’ll just be given access to run SQL queries against it. In small companies the data scientist might be tasked with setting up such a database, but this is becoming easier to manage with the advent of cloud computing providers: Amazon Web Services (AWS) provides a managed database service (known as Redshift), and Google Cloud and Microsoft Azure provide similar offerings. Accordingly, we’re not going to cover how to set up a database, and we’ll stick to learning how to use SQL to query a database that someone else is managing for you.

You can run SQL from a few different places:

  • A dedicated SQL client on your desktop (DataGrip, DbVis, etc) which connects to a remote database (see the note on ODBC and JDBC below)
  • A vendor’s custom interface in your web browser (AWS Athena, Google BigQuery) which is managed by the same vendor that provides the database
  • A browser-based analytics tool (Redash, Databricks, etc) which has been set up by someone else in your business, and connects to the corporate database

In each of these cases, you will typically write the SQL code into an editor then run the code and see the results straight away. Some of these tools have rudimentary visualisation capabilities, but if you want to get the data into R or Python for visualisation then most tools will also give you the ability to download the results of your SQL query as a flat file.

In addition to allowing access via these simple query editors, databases commonly offer one or both of the following connectivity interfaces:

  • Open Database Connectivity (ODBC)
  • Java Database Connectivity (JDBC)

Many popular analytics tools (Tableau, Alteryx, RStudio, etc.) use ODBC or JDBC connections to connect to databases - in these cases you will write SQL and run queries from these tools, and the results will be delivered back to the tool. The two protocols are fairly similar in terms of features and performance, and the choice of which one to use often comes down to finding something that works with both your database and your analytics tool. They’re both a bit of a pain to install on your own computer, so it can also come down to choosing whichever one you can get working. There is no single guide for troubleshooting ODBC and JDBC connections; in most cases Google is your friend.

We will cover some of the popular ODBC and JDBC connection packages for R and Python later in this module.

6.3 Setting up a SQL Environment

Optional

This course will be using DataCamp to teach SQL, which gives you access to an excellent interactive SQL environment. If you would like to set up your own environment on your own computer so that you can practice, the instructions in this section will help you.

In most real life situations you’ll be working with an existing database, but that won’t work for teaching in this class. To learn and practice SQL you’ll need to install some software on your computer, and to make things easy we’ll be using dbVis (DbVisualizer). Head over to the download page and grab the installer for your operating system, making sure to choose the “With Java VM” version. Click through the prompts to install dbVis.

We’ll also need a dataset to work with, so we’ll grab the Chinook database from http://www.sqlitetutorial.net/sqlite-sample-database/. Click “Download SQLite sample database” (about 1/3 down the page) to download the database. You may need to unzip the file after downloading.

To test that everything works, open dbVisualizer, then when you see the New Connection Wizard:

  • Give the new connection a name (I used DSP, named after this course)
  • Select “SQLite” as the database driver
  • Click on the field for Database file name and navigate to the chinook.db database file that you have downloaded
  • Click finish

When you get to the main dbVis application, click SQL Commander then New SQL Commander to open a new editor tab. Finally, to test that everything is working, type the following command:

select * from albums;

then click the play button to run the query. In the window down the bottom, you should see a table with 347 records. If this is all working, then you’re ready to go ahead and practice.

6.4 Learning SQL with DataCamp

As a Data Scientist working in a business you would be expected to have a working knowledge of SQL, which includes:

  • Querying data from tables
  • Aggregation functions
  • Joining tables

In order to develop these skills, students will be given two assignments on DataCamp, which are compulsory but will not be assessed.

Subject                          Topics
Intro to SQL for Data Science    Selecting, filtering, aggregating, sorting, grouping
Joining Data                     Joins, unions, subqueries

In addition, the Intermediate SQL course is optional but encouraged, and covers features including case when, subqueries, nested queries, common table expressions and window functions. If you expect to be working regularly with SQL (which includes anyone working with big data) then this course is essential knowledge.

One final note: being a declarative language means that SQL has its own logical order of operations; unlike R and Python, which both run from top to bottom, you will need to keep this order in mind when writing SQL. This blog post has a good primer on how the order of operations works in SQL.
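As a rough guide, the logical order in most systems is from/join, then where, then group by, then having, then select, and finally order by and limit - even though select is the first thing you write. The annotated sketch below (using the Chinook tracks table from the practice environment) shows the order in which each clause is evaluated:

select GenreId, count(*) as n_tracks    -- 5. select
from tracks                             -- 1. from
where Milliseconds > 60000              -- 2. where
group by GenreId                        -- 3. group by
having count(*) > 10                    -- 4. having
order by n_tracks desc                  -- 6. order by
limit 5;                                -- 7. limit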

6.6 Improving Readability

When written well, SQL can be a very easy language to read. When written carelessly it might as well be hieroglyphs - it can be very difficult to interpret what the author wanted. In this section we’re going to focus on good habits that you should incorporate into your own SQL queries.

One theme that I’m going to keep coming back to throughout this course is that code is for humans. If you just want to write something that runs on the computer, you could write assembly or machine code and throw it away once it was done. The reason we use “high level” languages like SQL, R and Python is that we want our code to be human readable. For any code that you’ll use more than once, you’ll need a colleague to review your work, you’ll need someone to maintain it (and repair it if it breaks), and you’ll need someone to rewrite it from time to time as systems and data schemas change. Not only are you making life easier for your colleagues if you think about readability from the beginning, but you’re also reducing risk by making it easier to see what you want to happen, and therefore easier to spot errors.

6.6.1 Code Style

Above all, you should adhere to the style guide that is used in your workplace. In the absence of a corporate style guide it helps to have a sensible personal style guide to fall back on, and for SQL I think the Kickstarter SQL Style Guide (blog and direct link) is a pretty sensible baseline.

In general, there are some pretty easy things you can do to improve readability and we’ll cover some of them in this section.

6.6.2 Indenting

If you’re just writing a short query you’ll probably write it like this:
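(A throwaway sketch, using the albums table from the Chinook database introduced earlier.)

select Title, ArtistId from albums where ArtistId = 1;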

As soon as you move beyond 4 or 5 words, you should be looking to lay out your query more like this:
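(The same sketch query, spread across several lines.)

select
    Title,
    ArtistId
from albums
where ArtistId = 1;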

This is important because as your query gets more complex, with filters, aggregations, joins etc, it quickly becomes unreadable. Take this made up query for example:
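(The query below is a stand-in example, built on the Chinook tracks, albums and artists tables and deliberately squashed together.)

select ar.Name as artist_name, al.Title as album_title, count(t.TrackId) as n_tracks, sum(t.Milliseconds) / 60000.0 as total_minutes from tracks t join albums al on t.AlbumId = al.AlbumId join artists ar on al.ArtistId = ar.ArtistId where t.Milliseconds > 60000 group by ar.Name, al.Title having count(t.TrackId) > 5 order by total_minutes desc;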

Pretty tough to read, right?

With lots of new lines and consistent indentation, it’s much easier to see what is going on in this query.
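(The same stand-in query, reformatted.)

select
    ar.Name as artist_name,
    al.Title as album_title,
    count(t.TrackId) as n_tracks,
    sum(t.Milliseconds) / 60000.0 as total_minutes
from tracks t
join albums al
    on t.AlbumId = al.AlbumId
join artists ar
    on al.ArtistId = ar.ArtistId
where t.Milliseconds > 60000
group by ar.Name, al.Title
having count(t.TrackId) > 5
order by total_minutes desc;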

Some people will place commas on the line below and at the start of the next item, rather than at the end of the line. For example, the select clause of the above query could be written as:
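(Sketched with the same columns as the stand-in query above.)

select
    ar.Name as artist_name
    , al.Title as album_title
    , count(t.TrackId) as n_tracks
    , sum(t.Milliseconds) / 60000.0 as total_minutes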

This is a matter of preference, however the general rule of thumb is the same: fall in line with the prevailing style used by everyone else in your organisation.

6.6.3 Comments

Comments in SQL start with --. You can also use comment blocks in most SQL systems using /* at the start and */ at the end of the block.
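For example (a small sketch against the albums table):

-- a single line comment
select
    Title    -- comments can also sit at the end of a line of code
from albums;

/*
a comment block
spanning multiple lines
*/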

Contrary to what you might have heard elsewhere, you should try to minimise the need for comments. There are a couple of reasons to keep comments to a minimum:

  • Because they’re not part of the code, it’s easy for comments to get forgotten and they quickly become out of date. Whilst you always start out with the best intentions, it’s hard to keep comments up to date when you’re iterating through different ideas.
  • Comments are often a symptom of poorly written code. If it isn’t obvious what your code does, then writing a comment is less useful than rewriting your code to make it clearer.

That is not to say that comments are all bad - the key point is that they should be used to explain why you are doing things, rather than explaining what you are doing. If you find yourself writing a comment to explain what you are doing, consider rewriting the code to make it clearer.

6.6.4 Intermediate Tables and Temporary Views

Perhaps one of the biggest SQL sins is the overuse of nested queries. Nested queries can be useful because they let you explicitly control the order of operations, however they very quickly become unwieldy and difficult to read.

If you haven’t seen nested queries before, they look a bit like this:
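(A small stand-in example that counts albums per artist in the Chinook data.)

select
    artist_name,
    count(*) as n_albums
from (
    select
        ar.Name as artist_name,
        al.Title as album_title
    from albums al
    join artists ar
        on al.ArtistId = ar.ArtistId
) as artist_albums
group by artist_name;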

This one isn’t too hard to read because it’s short, but it’s not uncommon to encounter queries which are over 100 lines long and made up of many nested subqueries, sometimes up to 2 or 3 queries deep. Apart from anything else it forces the reader to read the code “inside out”, where they need to read the middle section of the code before they can understand what is happening at the top. Thankfully there is a better way!

Depending on capabilities and restrictions placed on your system, you might be able to create one of the following to help clean up your query:

  • tables - using create table <table name> as ...
  • temporary views - using create temporary view <view name> as ...
  • volatile tables - using something like create volatile table <table name> as ...

There are some subtle differences here but the general idea is that you can write a query and “save” the results as a new table with a clear name.

Using the above example, you could rewrite the query more clearly like this:
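(A sketch using a temporary view; the view name here is arbitrary.)

create temporary view artist_albums as
select
    ar.Name as artist_name,
    al.Title as album_title
from albums al
join artists ar
    on al.ArtistId = ar.ArtistId;

select
    artist_name,
    count(*) as n_albums
from artist_albums
group by artist_name;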

The exact syntax will differ depending on the SQL dialect that you’re using, and also depending on the restrictions put in place by your database administrators.

A table generally represents data stored on disk. When you create a table you are causing data to be written to disk, and when you read a table you are reading data from disk.

A view generally represents a query stored on disk, except in the case of temporary views which represent queries stored in temporary memory. When you create a view you are saving a query, and when you read from a view you are running a saved query to generate a table “on the fly”. You can think of views as virtual tables and you can generally treat them like tables most of the time.
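As a rough sketch of the difference (with hypothetical names):

-- a table: the results are computed once and written to disk
create table album_counts as
select ArtistId, count(*) as n_albums from albums group by ArtistId;

-- a view: only the query is saved, and it re-runs each time you select from it
create view album_counts_v as
select ArtistId, count(*) as n_albums from albums group by ArtistId;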

You may also encounter stored procedures which are another way of storing queries in SQL, however they are losing relevance these days and are not used in most of the popular “big data” database systems. We will not cover stored procedures in this course.

6.6.5 Common Table Expressions

Common Table Expressions (CTEs) are part of most SQL dialects, although the syntax might vary slightly. They’re very similar to intermediate tables and temporary views, with the subtle difference that they’re never actually saved in the system because they’re part of a single query.

Using the same example as above, a CTE solution would look something like this:
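(The same stand-in query as the temporary view version above.)

with artist_albums as (
    select
        ar.Name as artist_name,
        al.Title as album_title
    from albums al
    join artists ar
        on al.ArtistId = ar.ArtistId
)
select
    artist_name,
    count(*) as n_albums
from artist_albums
group by artist_name;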

Note that there are no semi-colons in the middle of the query - the whole query runs as one without saving any of the intermediate tables. CTEs can often be more efficient than temporary tables because the SQL optimiser can optimise the whole query rather than doing a section at a time, but this may vary depending on the system you’re using.

The general syntax for CTEs in most SQL systems is:
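(A schematic sketch - the table names here are just placeholders.)

with first_table as (
    select ...
),
second_table as (
    select ...    -- can refer to first_table here
)
select ...
from first_table
join second_table
    on ...;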

You can use multiple named tables (separated by commas), and later named tables can refer to ones you’ve already defined in the same query. The query then finishes with a standard select statement that can refer to any of the named tables in its from clause (including joins).