Scaling Python
Programs don't automatically scale forever. Almost any application that processes data can start to massively slooooow down, or break, or even silently corrupt output and miss events. All you have to do is keep feeding it a thousand times more data. Or a million. Or a billion. As you keep increasing the size of the input, at some point, the program will hit internal bottlenecks, and start to misbehave.
That's why it's important to invest engineering effort in making our software scalable: designing applications to gracefully handle increasing orders of magnitude of input, whatever that may be. Broadly, there are three different regimes:
- Scaling to make the most of the memory on the machine. Imagine loading the contents of a large CSV file into memory before processing. What algorithms and designs will handle the largest input size, for a fixed system memory, without paging out?
- Going past the memory capacity on the machine, and making the most effective use of persistent storage. Like an SQL database process running on the server.
- Going past the capabilities of a single machine, to a distributed system - an application running across a cluster of separate servers. Imagine a Fortune 500 company's ecommerce database - something so massive, even the beefiest AWS instance type can't handle it.
The first two are about scaling vertically - getting the most you can out of a single machine - while the last one kicks in once you hit the physical limits of what's possible on a single server. This article focuses on the first regime: designing Python applications to scale as far as possible within available memory. Parts two and three will explore the others.
Modern OSes provide ample virtual memory, which can be many times the available physical memory. One of the performance bottlenecks kicks in when a program uses more physical memory than it has available to it, and the OS moves some of its data to secondary storage. This is called paging out to disk (or SSD, etc.), and dramatically harms both responsiveness and overall performance. For the purposes of this article (part 1), we're focused on maximizing the amount of data our algorithms can handle, before paging out becomes necessary.
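You can get a feel for how much memory a given approach costs using the standard library's tracemalloc module (Python 3.4+). Here's a minimal sketch; the million-row list of dicts is just a hypothetical stand-in for whatever data your program loads:

```python
import tracemalloc

tracemalloc.start()

# Hypothetical stand-in for loading a large data set into memory.
rows = [{"id": i, "name": "customer-%d" % i} for i in range(1_000_000)]

current, peak = tracemalloc.get_traced_memory()  # Both values are in bytes.
print("current: %.1f MB, peak: %.1f MB" % (current / 1e6, peak / 1e6))
tracemalloc.stop()
```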
Cleverly Culling Collections
Python sports some rich and powerful data structures for collections: list, dict, set, and more. As valuable as they are, overusing them can impede scalability, by making memory a bottleneck sooner than it needs to be. Some of these overuses are easy to spot... and some are not. Look at the following code:
```python
# books.csv holds meta information on a collection of books:
# title, author, pub year, etc.
# load_from_file returns a list of dicts.
books = load_from_file('books.csv')
book_summaries = dict()
for book in books:
    book_summaries[book["title"]] = book["summary"]

for title, summary in book_summaries.items():
    # Do something interesting with the summary.
    print(title, summary)
```
After reading book data from the CSV file, this code constructs a table mapping each book's title to its summary, before iterating through that dictionary. From a memory-usage viewpoint, there is nothing wrong with this if books.csv contains a few dozen or a few hundred titles. But what if it holds the entire inventory of a regional bookstore chain... with millions of titles? You have either one or three problems, depending on whether you're using Python 3 or Python 2.
First, you are creating a new data structure called book_summaries. This is strictly speaking unnecessary, but does improve readability: when you read the "for" line, you immediately know the loop is cycling through the summaries. In general, it's a great idea to create new, well-named variables just to make code more readable. Because it can boost maintainability, I often do so myself, and recommend you do as well.

In this case, though, it creates a memory-scalability bottleneck that is not worth the trade-off. The new data structure contains the full summary of every book, which is likely to take up more memory than all the other fields for that book combined. So if books consumes N bytes of memory, the entire block of code above will require at least 1.5 * N bytes, and probably closer to 2 * N. This scales better:
```python
# Python 3.
books = load_from_file('books.csv')
for book in books:
    print(book["title"], book["summary"])
```
That's straightforward enough, but sometimes this can be more subtle. Look at this example:
```python
# load_customer_names returns a list of strings.
customer_names = load_customer_names()
for name in sorted(customer_names):
    # Do something interesting.
    ...
```
Here, sorted(customer_names) creates a new list. If customer_names takes N bytes of memory, the whole block of code takes 2 * N total. Sometimes this is unavoidable: you really do need to sort the list. You can still improve the scalability by sorting in place instead, if you can tolerate losing the original order:
```python
customer_names = load_customer_names()
customer_names.sort()
for name in customer_names:
    # Now the names are sorted.
    ...
```
Sometimes, with a little thought, you can even move the sorting or preprocessing outside the application, to a part of the stack that can do it better. For example, if the customer names are originally read from a database, you may be able to do something like "SELECT name FROM customers ORDER BY name ASC", so that the data is already sorted by the time it gets to your Python code.
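As a sketch of that idea, here's how it might look with the standard library's sqlite3 module; the customers.db file and the customers table are assumptions for illustration. The database does the sorting, and iterating the cursor pulls rows one at a time instead of building a Python list first:

```python
import sqlite3

conn = sqlite3.connect("customers.db")  # Hypothetical database file.
cursor = conn.execute("SELECT name FROM customers ORDER BY name ASC")

# The cursor yields one row (a tuple) at a time, so the full sorted
# list of names never has to exist in Python's memory.
for (name,) in cursor:
    print(name)

conn.close()
```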
Intelligent Iterating
Going back to the book example: in Python 2, you have several other problems. The for line iterates over book_summaries.items(). In Python 2, the items() method creates a list of key-value pairs... effectively creating another copy of book_summaries. Besides using twice the memory, there will be a performance penalty, causing the program to briefly hang right before starting the for loop. Can you see why?
The reason is that the call to book_summaries.items() must allocate and populate a potentially very large list. When Python executes the bytecode behind the for loop, it starts by allocating the memory for this list; populating it with the correct values; and only then starting the first cycle through the loop. This will be hard to notice or even measure if you know the dictionary will never have more than a few hundred key-value pairs. But if the size is unbounded - and it is, more often than you might think - it becomes a hidden speed bump.
Again, this only affects Python 2; Python 3's items() returns something that acts like an iterator, which is faster and does not needlessly make a whole extra copy of the dictionary's contents. But you can get the same benefit in Python 2 by using iteritems instead of items:
```python
# In Python 2, use "iteritems" instead of "items".
from __future__ import print_function  # So print() behaves as in Python 3.

for title, summary in book_summaries.iteritems():
    print(title, summary)
```
In all Python versions, it's your responsibility to carry this pattern to other contexts. Be aware of the difference between creating a list and using an iterator, and design your code for the latter wherever possible. And when implementing your own code, the easiest way to do that is by using generators.
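As a small illustration of that difference, here's a sketch (using a tiny stand-in for the books data) contrasting a list comprehension, which materializes every element up front, with a generator expression, which produces one element at a time:

```python
# Tiny stand-in for the list of dicts loaded from books.csv.
books = [
    {"title": "Book A", "summary": "A short summary."},
    {"title": "Book B", "summary": "Another short summary."},
]

# List comprehension: allocates a whole new list, one entry per book.
summaries_list = [book["summary"] for book in books]

# Generator expression: yields one summary at a time; no new list is built.
summaries_iter = (book["summary"] for book in books)

for summary in summaries_iter:
    print(summary)
```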
Generating Scalability
Imagine developing software that takes in free-form text, splitting it into sentences before doing some kind of grammar analysis. Lexically, sentences are separated by a period followed by one or more whitespace characters:
```python
import re

text = '''This is a body of text. It has several sentences.
Some are long and some are short.'''

sentences = re.split(r'\.\s+', text)
for sentence in sentences:
    print(sentence)
```
This emits something like:
```
This is a body of text
It has several sentences
Some are long and some are short.
```
A little inconsistent on the ending periods, but that seems minor. There's a bigger problem lurking here, which is similar to what we saw above. The input size, so to speak, is the number of bytes in the text variable. But the memory we are using is more than twice that, because we are duplicating every single byte in the list called sentences. That's not a big deal when you have three sentences. But what if text has three thousand? Or three million?
We can do better. Suppose our application only needs to examine one sentence at a time[1]. What's a scalable way to iterate through them, while also keeping your code readable and maintainable?
Let's imagine an interface that might work well for us. Consider a function, called each_sentence, which we can use like this:
```python
for sentence in each_sentence(text):
    # Do something with the sentence.
    ...
```
Imagine it produces one sentence at a time, instead of all sentences up front, like re.split does. Then the maximum extra memory used is equal to that required to store the longest sentence in the text. Unless you are parsing a novel composed of a single, gigantic run-on sentence[2], this will vastly increase the size of input your application can handle.
That's a nice idea, but can we make it work? Yes, using the yield keyword to construct a generator. Here is a simple example of a generator, and how you use it:
```python
def nums():
    n = 0
    while n < 4:
        yield n
        n += 1

for num in nums():
    print(num)
```
Note this function has no return keyword, but instead uses yield. A function using "yield" instead of "return" is called a generator in Python. To help yourself ingrain the syntax, open a new file, type the above in, and run it as a Python program. You will see the following output:
```
0
1
2
3
```
The return type of nums is a generator object[3], which quacks like an iterator. Each time through, it picks up again on the line after the yield (instead of starting at the beginning, like a regular function would).
In other words: In Python, generators are a way to quickly create iterators. They are lazy, in the good sense of the word, allocating memory for each value one at a time (and letting the previous one be garbage-collected as it goes along). The returned object implements the iterator protocol, which is why you can use it in a for loop.
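You can watch the iterator protocol in action by driving a generator object by hand with next(); a quick sketch, reusing nums from above:

```python
def nums():
    n = 0
    while n < 4:
        yield n
        n += 1

gen = nums()      # Calling the generator returns a generator object.
print(next(gen))  # 0 -- runs the body up to the first yield.
print(next(gen))  # 1 -- resumes on the line after the yield.
print(list(gen))  # [2, 3] -- consumes whatever remains.

try:
    next(gen)
except StopIteration:
    # An exhausted generator raises StopIteration; a for loop
    # handles this for you, which is how it knows when to stop.
    print("exhausted")
```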
You can have multiple yield statements in a generator:
```python
def nums2():
    n = 0
    while n < 4:
        yield n
        n += 1
    yield 42  # Second yield

for num in nums2():
    print(num)
```
This time, the program prints:
```
0
1
2
3
42
```
Reason through the code above, and convince yourself it makes sense. The second yield is reached after the while loop exits. When the function reaches the implicit return at the end, the iteration stops.
With these pieces, we can realize the each_sentence function above:
```python
import re

def each_sentence(text):
    start = 0
    for match in re.finditer(r'\.\s+', text):
        end = match.start() + 1
        sentence = text[start:end]
        yield sentence
        start = match.end()
    yield text[start:]
```
There's a little bit of complexity here, but since you're subscribed to a newsletter devoted to advanced Python programming, I know you can handle it. The each_sentence function iterates through the sentence boundaries it finds in the text, using the indices in the regex match object to find the starting and ending character positions of each sentence. It then yields a slice[4] of the original full text containing just the sentence. Since the last sentence of the text may not have any space after the final period, there's an extra yield statement at the end just for that tail.
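Putting the pieces together, the driver loop is exactly the interface we sketched earlier:

```python
text = '''This is a body of text. It has several sentences.
Some are long and some are short.'''

for sentence in each_sentence(text):
    print(sentence)
```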
Now when we run the for-loop using each_sentence, here's what we get:
```
This is a body of text.
It has several sentences.
Some are long and some are short.
```
This efficiently handles a corpus of text of any size, up to available system memory. And it does so with high responsiveness - iterating through the sentences immediately, without an expensive preprocessing step before entering the for loop.
(Bonus: the ending punctuation is consistent now!)
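And because a generator object is just an iterator, it plugs directly into other lazy tools. As one illustrative sketch (not part of the original example), finding the longest sentence stays memory-bounded, since max only ever needs the best candidate seen so far plus the sentence currently being examined:

```python
# Reuses each_sentence and text from above; no list of sentences
# is ever materialized.
longest = max(each_sentence(text), key=len)
print(longest)
```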
This Is Easier In Python 3
Python 3 has many improvements over Python 2, far more than most people realize. One pervasive and subtle change: iterators are used in many places where, previously, full lists were created and allocated. Some of these were mentioned above, and there are more as well:
- In Python 3, range generates its numbers lazily, as needed; in Python 2, it allocates a full list of the numbers before the loop even starts, and you have to use xrange instead to get the lazy behavior. (See the quick sketch after this list.)
- map, filter and zip all return iterators in Python 3. In Python 2, they allocate and return full lists.
- As mentioned, in Python 3 my_dict.items() returns an iterator; in Python 2, it creates a list of all key-value pairs, and you must instead remember to use iteritems() to get an iterator.
- Actually, my_dict.items() returns something called a view. This behaves like an iterator, but is better in ways I'll talk about in another article.
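Here's a quick sketch you can run in a Python 3 session to see the laziness in action (the exact byte count reported varies by platform and Python version):

```python
import sys

# range does not materialize its numbers: a billion-element range
# object is only a few dozen bytes, not gigabytes.
big_range = range(1_000_000_000)
print(sys.getsizeof(big_range))

# map and zip also produce their results lazily, one at a time.
doubled = map(lambda n: n * 2, range(5))
pairs = zip("abc", [1, 2, 3])
print(doubled, pairs)   # Lazy objects, not lists.

# Only converting them forces full lists into memory.
print(list(doubled))    # [0, 2, 4, 6, 8]
print(list(pairs))      # [('a', 1), ('b', 2), ('c', 3)]
```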
Generally speaking, Python 3 uses iterators by default in more places, unless you explicitly convert the result to a list or other sequence. So one advantage of implementing an application in Python 3, or porting an existing one, is that the program may scale better "for free".
Since being able to write scalable software is one of the things that separates average developers from world-class engineers, it's a key topic I'll focus on in the upcoming Advanced Python course, and in future editions of this newsletter.
[1] Or N at a time, i.e. some bounded number. Most applications will actually fit this mold; it's rare that you will need to cross-correlate every record of a giant data set.

[2] Apparently, run-on sentence novels exist. But hopefully you won't... run into them.

[3] Often you will see people use the word "generator" interchangeably to describe the function, or the object returned by the function. It's usually clear enough from the context. If you want to be more explicit, use "generator" to refer to the function object, and "generator object" for what the function returns when called. A generator object can always be used exactly like an iterator; if you want to be exacting, it's not identical to an iterator, but that's a topic for a different article.
[4] A slice of a string is itself a string, created by a concise syntax: my_str[start:end], where start is the zero-based starting character, and end is one past the ending character. So "Hello there, world!"[6:11] == "there".