12 Application Programming Interfaces (APIs)

The term API is heavily overloaded and can refer to many different parts of your data science toolkit. For example, consider the Wikipedia definition:

An application programming interface (API) is a set of subroutine definitions, communication protocols, and tools for building software. In general terms, it is a set of clearly defined methods of communication among various components. A good API makes it easier to develop a computer program by providing all the building blocks, which are then put together by the programmer.

Wikipedia: Application Programming Interface

By this definition, the R and Python languages are APIs, every R or Python package is an API, Docker is an API, Bash is an API, and so on. So what do we normally mean when a data scientist talks about an API? We’re almost always talking about RESTful APIs.

Representational State Transfer (REST)

This term carries a lot of history; in practice, however, it means that you are implementing an API that uses the Hypertext Transfer Protocol (HTTP) to send information to, or retrieve information from, another service.

When working as a data scientist this means that you’ll be using HTTP or HTTPS connections to communicate between applications.

So why should a data scientist care about a web protocol that is nearly 20 years old? There are two main things that you’ll do with APIs in your career as a data scientist:

  1. Use them to access data from remote services (e.g. you might access tweets using the Twitter API)
  2. Build them to provide data science capabilities (typically machine learning) within your organisation

We’ll consider each of these use-cases individually, and cover them at a basic level. Just like with containerisation, you could spend years learning all about APIs; this chapter will serve as a basic introduction to the concepts rather than an exhaustive guide.

12.1 Accessing APIs

We’ll demonstrate this use-case using the httr package in R. If you prefer to use Python, the equivalent package for working with RESTful APIs is the Requests package.

The most basic request you can make is a GET request. This is essentially the same request that is made by your web browser every time you visit a webpage. We can demonstrate this by showing what happens when we make a GET request to this textbook using the address https://datasciencepractice.study.

## Response [https://datasciencepractice.study/]
##   Date: 2019-08-21 10:36
##   Status: 200
##   Content-Type: text/html; charset=UTF-8
##   Size: 35.6 kB
## <!DOCTYPE html>
## <html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
## <head>
##   <meta charset="utf-8" />
##   <meta http-equiv="X-UA-Compatible" content="IE=edge" />
##   <title>Data Science Practice</title>
##   <meta name="description" content="Course notes for 94692 Data Science ...
##   <meta name="generator" content="bookdown 0.12 and GitBook 2.6.7" />
## ...
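To make the exchange concrete, the request behind a response like this is just a few lines of text sent over the connection. Here is a minimal sketch of what an HTTP client transmits (the Accept header value is illustrative, not taken from the example above):

```python
# The raw text an HTTP client sends for a simple GET request.
# The blank line at the end marks the end of the headers.
request_text = (
    "GET / HTTP/1.1\r\n"                   # method, path, protocol version
    "Host: datasciencepractice.study\r\n"  # the site we are asking for
    "Accept: text/html\r\n"                # content types we can handle
    "\r\n"
)
print(request_text)
```

The server’s reply carries a status line (here, Status: 200), headers such as Content-Type, and then the HTML body shown above.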

So when we make a GET request to a webserver, it returns the HTML code for the page - if we were trying to build a browser this would be useful, but we’re trying to do data science! How do we do something more useful? Let’s grab some data! We’ll use the REF Impact Case Study database, which contains data about government-funded research programs in the UK.

## # A tibble: 257 x 3
##    CaseStudyId ImpactType   Title                                          
##    <chr>       <chr>        <chr>                                          
##  1 2372        Technologic… "\r\n    Microelectrode Biosensors to Monitor …
##  2 2373        Technologic… "\r\n    Improvement of Seed Vigour and Perfor…
##  3 2374        Environment… "\r\n    Improving Farming Strategies by Model…
##  4 2375        Environment… "\n    Declaration of the world's largest mari…
##  5 2376        Environment… "\r\n    A Novel Way to Detect Infection Statu…
##  6 3140        Environment… "\n    Genetic data optimises conservation of …
##  7 3141        Environment… "\n    Delivering UK policy for river conserva…
##  8 3143        Technologic… "\r\n    New data analysis methods drive trans…
##  9 3144        Technologic… "\r\n    Cardiff research supports the commerc…
## 10 3145        Health       "\n    Improved Diagnostic Technology with the…
## # … with 247 more rows

This raises an important consideration when using RESTful APIs from R or Python: you’ll often need to do a little bit of work processing the data that you receive. In most cases you’ll receive data in XML (eXtensible Markup Language) or JSON (JavaScript Object Notation) format, and you’ll need to use helper functions to reshape it into something you can work with. In this case I’ve used the jsonlite package to convert the data from JSON into a list-of-lists, which can then be treated like a data frame in R. There is no one-size-fits-all approach to processing data received from APIs - you’ll generally need to inspect the data and try different functions and parsing strategies until you find one that reliably transforms the data as required.
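The same reshaping step can be sketched in Python using the standard json module. The response body below is made up to mimic the shape of the REF data - it is not the real payload:

```python
import json

# A fabricated response body with the same shape as the REF results.
raw = """[
  {"CaseStudyId": "2372", "ImpactType": "Technological", "Title": "Microelectrode Biosensors"},
  {"CaseStudyId": "2373", "ImpactType": "Technological", "Title": "Improvement of Seed Vigour"}
]"""

# json.loads gives a list of dictionaries - the Python analogue of
# jsonlite's list-of-lists - which we then pivot into columns.
records = json.loads(raw)
columns = {key: [row[key] for row in records] for key in records[0]}

print(columns["CaseStudyId"])  # ['2372', '2373']
```

From a column-oriented dictionary like this it is a short step to a pandas DataFrame, just as the list-of-lists becomes a tibble in R.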

There is plenty more to learn about using APIs - for R you can read the httr vignette; for Python, the Requests website.

12.2 Building APIs

As a data scientist, building fast and reliable RESTful APIs is probably one of the harder programming tasks you’ll ever work on. For most of your career you can deal with slow computation by either waiting a bit longer or running your code on a faster computer. When it comes to APIs, however, you’re bound by someone else’s performance standards - it’s not uncommon to be expected to handle more than one request per second, and in many cases your service will need to respond within the order of 100ms in order to not hold up other processes. Writing high-performance code is an advanced skill that is not expected of you in this course; it’s something you’ll likely pick up over time, learning bits and pieces as you need them.
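As a rough illustration of what a 100ms budget means, you can time your scoring function before ever exposing it as an API. The predict function here is a made-up stand-in, not a real model:

```python
import time

def predict(x):
    # Stand-in for a model-scoring function (illustrative only).
    return 2 * x + 1

# Average the cost over many calls to get a stable per-call estimate.
n = 1000
start = time.perf_counter()
for _ in range(n):
    predict(10)
per_call_ms = (time.perf_counter() - start) * 1000 / n

within_budget = per_call_ms < 100
print(within_budget)  # True: trivially within a 100ms budget
```

Real services also pay for network transfer, parsing, and serialisation on every request, so the computation itself is only part of the latency budget.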

In addition to the performance requirements, building an API requires a change of mindset from “my code runs in order from top to bottom” to “my code runs in all sorts of different ways depending on who calls my API and what they do with it”. It also requires a service to run continuously, listening for requests and dispatching them to the various methods inside the service. Until recently, hosting such an API presented quite a significant challenge for users of R and Python, but recent developments in both languages have made it easier than ever to build and deploy your own APIs.

We’ll use R again for these examples, leveraging the plumber package to run our API service; if you want to use Python to build an API then you may like to research the flask or CherryPy packages.

In both R and Python you’ll want to deploy your API within a Docker container; this is beyond the scope of the course, and for the examples below we will launch the API service interactively to demonstrate some of the concepts and terms.

12.2.1 Uniform Resource Identifiers (URIs)

The first thing you need for your API is an address - these are referred to as either Uniform Resource Locators (URLs) or Uniform Resource Identifiers (URIs). When you’re developing locally this is really easy - you’ll be hosting your application locally, and your computer provides a shortcut to this local service: localhost. We actually used this in the Docker RStudio example in the previous chapter - by using the browser to navigate to http://localhost:8787 we were telling the browser to connect to the same computer (using port 8787, but we’ll ignore ports for now). The key here is to remember that localhost is the URI for your own computer. You may also see references to 127.0.0.1 - this is an IP address which is effectively the same as localhost.
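You can check this equivalence yourself: localhost is just a name that resolves to the loopback IP address.

```python
import socket

# Resolve the name "localhost" to its IPv4 address.
print(socket.gethostbyname("localhost"))  # 127.0.0.1
```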

If you want to run a REST API on the web, then you’re going to need a URI from somewhere else. If you’re hosting on someone else’s server then they’ll probably create a URI for you - for example if you host an API on RStudio’s Shinyapps service, they’ll give you a URI. If you’re running an API inside your organisation, they’ll also have some way of assigning you a URI. It’s not super important for you to know your URI when creating your service, but it’s a critical piece of information for anyone who wants to use your service - without an address they won’t know how to connect with your API.

12.2.2 Endpoints

When thinking about the service you are building, you will want to think about what you want to expose to your consumers. You might have written all sorts of scripts and helper functions, but generally you only want to expose a small number of capabilities to the consumer. In R and Python, you should think about these as functions which you want to make available to others - you want consumers to be able to connect to your API and run one of your functions. The way that you provide these functions to them is through the use of endpoints.

The way that this usually works (with a GET request) is that users of your API will call your service with a URL that looks something like this:

http://<Uniform Resource Identifier>/<endpoint>?<arguments>

Think of the URI as the name of a restaurant, whilst the endpoint is the menu item they want (and if we want to labour the analogy, the arguments are whether or not they want chips or vegetables).
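Composing such a URL programmatically is straightforward. In this Python sketch the host, endpoint, and argument names are all invented for illustration:

```python
from urllib.parse import urlencode

base = "http://localhost:8000"  # the URI: which "restaurant" to visit
endpoint = "echo"               # the "menu item" we want
arguments = {"msg": "hello"}    # the extras: chips or vegetables

# urlencode handles escaping, so arguments survive the trip intact.
url = f"{base}/{endpoint}?{urlencode(arguments)}"
print(url)  # http://localhost:8000/echo?msg=hello
```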

In plumber APIs, this is really easy because you can simply:

  • write a script with one or more functions (including arguments)
  • use special decorators to tell plumber which functions you want to serve as endpoints
  • launch the script as an API using plumber

Let’s take a look at an example - we’ll call this file plumber.R.

In this script we’ve defined three functions and their corresponding endpoints: echo, plot and sum:

  • echo takes one optional argument (msg) and returns a string
  • plot takes no arguments and returns an image
  • sum takes two arguments and returns a number
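plumber handles all of the listening and dispatching for you, but the underlying mechanics can be sketched in plain Python using only the standard library. This is not the plumber example itself - just a toy service exposing illustrative echo and sum endpoints on an arbitrary local port:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class ApiHandler(BaseHTTPRequestHandler):
    """Dispatches each URL path (endpoint) to a plain Python function."""

    def do_GET(self):
        parsed = urlparse(self.path)
        params = parse_qs(parsed.query)
        if parsed.path == "/echo":
            msg = params.get("msg", [""])[0]
            self._reply({"msg": f"The message is: '{msg}'"})
        elif parsed.path == "/sum":
            total = float(params["a"][0]) + float(params["b"][0])
            self._reply({"sum": total})
        else:
            self._reply({"error": "no such endpoint"}, status=404)

    def _reply(self, payload, status=200):
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# Serve on an ephemeral localhost port, then call our own sum endpoint.
server = HTTPServer(("127.0.0.1", 0), ApiHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/sum?a=1&b=2"
with urllib.request.urlopen(url) as resp:
    result = json.loads(resp.read())

server.shutdown()
print(result)  # {'sum': 3.0}
```

The design is the same as plumber’s: a long-running listener, a mapping from endpoints to functions, and query-string arguments parsed into function inputs.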

Each of these functions/endpoints also includes a few lines of additional documentation which explain how to use the function/endpoint - this information is used by plumber to construct documentation for the API endpoints.

Once we have saved this file, we can use the plumber package to launch the API as a service running on localhost using port 8000. If we visit this address using a browser, we can see that plumber has created documentation for us.

This is a pretty cool feature - using a service called Swagger, the plumber package has not only created an API for us, it has also created interactive documentation that tells you how to use the API.

We can test this API by connecting from bash, from Python, or even from another R session if we like. The key thing here is that because the API uses the HTTP protocol, we can communicate between any pair of languages we like as long as they can both make HTTP requests.

For this example, we’ll call the API from bash using the example code provided by Swagger.

We won’t get into the formatting of URLs for now - again, this is something that you can learn as needed, and both the httr package (R) and the Requests package (Python) have helper functions for formatting URLs. The important thing to note here is that we’ve successfully created an API service using R, and by using the HTTP communication protocol we are able to do the following tasks from bash:

  • call R functions and observe the output
  • send data to R which changes the output (in this case, function arguments)
  • retrieve data from R, including binary data (in this case an image file)

12.3 Use Cases

We’ve only just scratched the surface of what you can do with APIs, but even with these short examples it should be clear that you could use APIs for any of the following data science tasks:

  • responding to requests for the latest data (for example, when you don’t want a production system talking to a SQL database)
  • providing the latest model score for a customer, on demand
  • providing a model score, where the model coefficients are provided as arguments to the API
  • providing advanced analytics functionality for use in a web service (for example, if developers wanted to incorporate a plot into an internal website)