
Making Unreliable APIs Reliable with Python

Small details can make truly dramatic differences in reliability.

Part of modern programming includes making API calls to third-party services, over HTTP or HTTPS. For example, with the Facebook Graph API, you can get a link to your profile picture [1]:

    from urllib.parse import quote
    import requests
    from creds import fb_access_token

    facebook_api_url = 'https://graph.facebook.com/me/picture?redirect=false'
    picture_data = requests.get(facebook_api_url + '&access_token=' + quote(fb_access_token))
    assert picture_data.status_code == 200, 'API call failed: {} {}'.format(
        picture_data.status_code, picture_data.text)
    picture_url = picture_data.json()['data']['url']
    # Now we have the URL for the photo, and can do something with it.

That's easy enough. In practice, you need to do more work than that, because the service is not perfectly reliable. Good APIs usually include thorough documentation, describing which endpoints require which parameters, what the different response status codes mean, and more. Of course, the providers of the API face the same technical challenges anyone does when exposing a service on the web: transient network failures, servers straining under heavy load, rare race conditions in their own code, and so on.

For these and other reasons, if you make 10,000 API calls, it may not be surprising if one of them doesn't respond as documented - especially for a popular service under heavy load.

What are the possible failure modes? There are two I have seen in real-world web APIs. In one, the service returns an error status code - a 500, say - even though your request was well formed; an immediate retry succeeds. In the other, the call appears to succeed, but the response body is missing data it is supposed to contain; again, an immediate retry returns a complete response.

This affects everyone. I routinely use some popular services, including Facebook and Amazon Web Services, whose APIs are staggeringly reliable given the sheer volume they handle. But my colleagues and I have encountered these kinds of transient errors with both services, and with others. With Facebook, a call to get an access token would return a 200, but be missing the "access_token" field; an immediate retry always had it. With AWS, we have made one API call to create a resource and gotten an acknowledgement back with a resource ID, only for a follow-up call (say, to add a tag) to fail to find that resource - until we repeated the call one quarter-second later, when it succeeded without complaint. Both of these happen much less than 1% of the time, but not close enough to 0%.

Of course, web services can fail in other ways too - extended outages, in particular. For now, I'm going to focus on the scenarios above, and how you can make your Python applications deal with them well.

A Simple API

Let's start with a simple imaginary API for a todo-list service. All HTTP requests return a JSON object in the response body; a successful call returns a 200 status code, and an internal server error returns a 500.

The endpoints and actions on them are:

GET /items
Return a list of the items on the todo list, each in the format {"id": <item_id>, "summary": <one-line summary>}
GET /items/<item_id>
Get all available information for a specific todo item, in the format {"id": <item_id>, "summary": <one-line summary>, "description" : <free-form text field>}
POST /items
Create a new todo item. The POST body is a JSON object with two fields: "summary" (must be under 120 characters, no newline), and "description" (free-form text field).
DELETE /items/<item_id>
Mark the item as done. (I.e., strike it off the list, so GET /items will not show it.)

Each user has their own todo list, tied to the authentication layer. (We'll gloss over that part to keep this reasonably short.) For this article, I implemented a simple HTTP server providing this todo-list API, if you'd like to try these out as you keep reading.

Given the above, your code can do things like:

    import requests
    from api_info import API_DOMAIN

    API_URL = 'https://{}'.format(API_DOMAIN)
    resp = requests.get(API_URL + '/items')
    assert resp.status_code == 200, 'Cannot get todo items: {}'.format(resp.status_code)
    todo_items = resp.json()
    for item in todo_items:
        # Do something, like print it out, etc.
        print(item['summary'])

(In this article, for brevity I'm ignoring the authentication headers.)
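The other endpoints work much the same way. Creating an item, for instance, is a single POST. Here is a minimal sketch, reusing requests and API_URL from the example above, and assuming (on my part - the lost status-code list would confirm) that a successful creation also returns a 200:

    # Create a new todo item via POST /items.
    new_item = {
        'summary': 'Pick up milk',  # must be under 120 characters, no newline
        'description': 'Whole milk, one gallon, from the corner store.',
    }
    resp = requests.post(API_URL + '/items', json=new_item)
    assert resp.status_code == 200, 'Cannot create item: {}'.format(resp.status_code)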

Now suppose this is a massive service operating at scale - with dozens of servers behind the endpoint load balancer, heavy use of queues to improve latency, and so on. Because of some rare race condition the service's engineering team has not quite ironed out, a request will fail with a 500 code - an internal server error - about once every 1,000 API calls. You are just a consumer of the service, so you can't do anything about it except file a bug report and wait. What do you do in the meantime?

The GET to /items is idempotent - i.e., has no additional side effects if done more than once. This makes it safe to just retry the call. So let's automate that:

    # Retry this many times before giving up.
    MAX_TRIES = 3

    tries = 0
    resp = None
    while True:
        resp = requests.get(API_URL + '/items')
        if resp.status_code == 500 and tries < MAX_TRIES:
            tries += 1
            continue
        break
    todo_items = resp.json()
    for item in todo_items:
        # Do something, like print it out, etc.
        print(item['summary'])

This is much more robust. We know each request has a 0.1% chance of failing, which we recognize by the status code, and each failure is independent. Allowing up to three retries means up to four attempts in all, so the chance that every one fails is 0.001^4 - one in a trillion. That improves the reliability of this code block from 99.9% (three "nines") to twelve "nines".

Of course, your application probably makes many such API calls. Suppose a user goes through a certain UI flow in their mobile app, which triggers 10 API calls; with the original code, there is nearly a 1% chance of a fatal error at some point. But if you wrap all such calls in a retry loop, that improves to roughly eleven "nines" of reliability - only one "nine" lower than the retried single call!
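A quick back-of-the-envelope check of that arithmetic, assuming independent failures:

    # Each attempt fails 0.1% of the time, independently.
    p_fail = 0.001

    # One initial attempt plus up to three retries: four attempts in all.
    p_call_fails = p_fail ** 4                             # 1e-12: twelve "nines"

    # Ten calls with no retries: nearly a 1% chance something fails.
    p_flow_fails = 1 - (1 - p_fail) ** 10                  # ~0.00996

    # The same ten calls, each wrapped in the retry loop.
    p_flow_fails_retried = 1 - (1 - p_call_fails) ** 10    # ~1e-11: eleven "nines"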

From experience using this API, we know that multiple endpoints are susceptible. So let's make this code reusable. In Python, the best way to do that is with a decorator.

    def retry(func):
        def retried_func(*args, **kwargs):
            MAX_TRIES = 3
            tries = 0
            while True:
                resp = func(*args, **kwargs)
                if resp.status_code == 500 and tries < MAX_TRIES:
                    tries += 1
                    continue
                break
            return resp
        return retried_func

This is then applied to the function that makes the HTTP call:

    @retry
    def get_items():
        return requests.get(API_URL + '/items')

    todo_items = get_items().json()
    for item in todo_items:
        # Do something with the item, like print it out.
        print(item['summary'])

To work properly, the retry() decorator must be applied to a function (or callable) that returns an instance of requests.Response. Such a function minimally encapsulates the action we want to automatically retry. Sometimes your code isn't naturally organized that way. I have found it helpful to factor out methods or functions following a certain naming convention that calls attention to their purpose, for example:

    class TodoHelper:
        def fetch_all_items(self):
            items_resp = self._get_items_response()
            for item in items_resp.json():
                yield item

        # Methods named like "<something>_response" are understood
        # to return an instance of requests.Response. This can be
        # documented in the project's coding style guidelines.
        @retry
        def _get_items_response(self):
            return requests.get(API_URL + '/items')

    for item in TodoHelper().fetch_all_items():
        # Do something with the item, like print it out.
        print(item['summary'])

There's one other topic we need to mention. Look back up at the source for retry() for a moment. What's the algorithm for deciding whether to retry the API call? It's encoded in the check "if resp.status_code == 500". But what if you need different logic? Maybe your application integrates with two different APIs, which infrequently fail in two different ways. In addition to the 500 error, one failure I have seen is a 4xx error returned despite valid authentication headers; the request would succeed when immediately retried with the same credentials. For this one, we may not want to retry on 5xx status codes, as that could mask a different problem.

Customizing The Retry Decorator

What's needed is a way to parameterize the check. We do that by passing a predicate: a function object that takes a Response instance as its sole argument, and returns True if the response is valid, or False if the call needs to be retried.

    # is_valid represents our predicate function. It takes a
    # response object argument, and returns True or False.
    def mk_retryer(is_valid):
        # retry() is the decorator we are building.
        def retry(func):
            def retried_func(*args, **kwargs):
                MAX_TRIES = 3
                tries = 0
                while True:
                    resp = func(*args, **kwargs)
                    if not is_valid(resp) and tries < MAX_TRIES:
                        tries += 1
                        continue
                    break
                return resp
            return retried_func
        return retry

With this tool, we can create custom retry decorators:

    # A simple check that the status code is not 500.
    def status_not_500(resp):
        return resp.status_code != 500

    retry_on_internal_error = mk_retryer(is_valid=status_not_500)

    # We can use a lambda to invert the logic.
    def is_auth_error(resp):
        return 400 <= resp.status_code < 500

    retry_on_auth_failure = mk_retryer(lambda resp: not is_auth_error(resp))

    # Or inline a lambda directly.
    # (This is also how you check the response body.)
    retry_on_missing_id = mk_retryer(lambda resp: 'id' in resp.json())

    # You can create arbitrarily complex predicates, of course.
    def is_fully_valid(resp):
        return resp.status_code < 400 and 'summary' in resp.json()

    retry_on_anything_wrong = mk_retryer(is_fully_valid)

We can then apply these custom retry decorators at will.

    @retry_on_internal_error
    def get_items():
        return requests.get(API_URL + '/items')

    @retry_on_auth_failure
    def get_resources_from_foreign_api():
        return requests.get(FOREIGN_API_URL + '/resources')

Class-Based Decorators Are Even More Flexible

There is an alternative approach: class-based decorators. This has the advantage of letting you vary different aspects of the retry behavior more easily, through subclassing. It relies on the __call__ method [2]:

    # This class creates the standard retry decorator.
    # It only retries on a 500 status code.
    class Retry:
        # By default, retry up to this many times.
        MAX_TRIES = 3

        # This method holds the validation check.
        def is_valid(self, resp):
            # By default, a response is valid unless its status code is 500.
            return resp.status_code != 500

        def __call__(self, func):
            def retried_func(*args, **kwargs):
                tries = 0
                while True:
                    resp = func(*args, **kwargs)
                    if self.is_valid(resp) or tries >= self.MAX_TRIES:
                        break
                    tries += 1
                return resp
            return retried_func

    # This will retry on 4xx failures only.
    class RetryOnAuthFailure(Retry):
        def is_valid(self, resp):
            return not (400 <= resp.status_code < 500)

    # This will retry on *any* 5xx error, and do so up to 5 times.
    class RetryOnServerError(Retry):
        MAX_TRIES = 5

        def is_valid(self, resp):
            return resp.status_code < 500

    # Now we create the decorator "functions" (callables, really).
    retry_on_500 = Retry()
    retry_on_auth_failure = RetryOnAuthFailure()
    retry_on_server_error = RetryOnServerError()

    @retry_on_500
    def get_items():
        return requests.get(API_URL + '/items')

    @retry_on_server_error
    def get_single_item(item_id):
        return requests.get(API_URL + '/items/{}'.format(item_id))

    @retry_on_auth_failure
    def drop_item(item_id):
        return requests.delete(API_URL + '/items/{}'.format(item_id))

Using decorators in these ways gives you a lot of flexibility to make your API integrations more robust. If you need to add time delays between retries, or inject an extra API call to refresh an access token or free some resource, this approach gives you the hooks to do that and more.
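For instance, here is a sketch of what a delay hook might look like - one variation among many, with an illustrative exponential backoff schedule - built by subclassing Retry from above:

    import time

    # A variant of Retry that pauses between attempts, waiting twice
    # as long after each successive failure (0.25s, 0.5s, 1s, ...).
    class RetryWithBackoff(Retry):
        BASE_DELAY = 0.25  # seconds; an illustrative choice

        def __call__(self, func):
            def retried_func(*args, **kwargs):
                tries = 0
                while True:
                    resp = func(*args, **kwargs)
                    if self.is_valid(resp) or tries >= self.MAX_TRIES:
                        break
                    # Sleep before the next attempt.
                    time.sleep(self.BASE_DELAY * (2 ** tries))
                    tries += 1
                return resp
            return retried_func

    retry_with_backoff = RetryWithBackoff()

    @retry_with_backoff
    def get_items():
        return requests.get(API_URL + '/items')

Try it out (all code in this essay is in the public domain), and let me know what you think.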

Footnotes

  1. Code in this essay uses the requests module. It's not part of the Python standard library, but in practice I treat it like it is. Highly recommended for any application that operates over HTTP.

  2. __call__ is one of Python's special methods: defining it on a class makes instances of that class invokable, just like functions.
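For example:

    class Greeter:
        def __call__(self, name):
            return 'Hello, {}!'.format(name)

    greet = Greeter()
    print(greet('world'))  # Prints "Hello, world!"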
