Scaling Python
Programs don't automatically scale forever. Almost any application that processes data can start to massively slooooow down, or break, or even silently corrupt output and miss events. All you have to do is keep feeding it a thousand times more data. Or a million. Or a billion. As you keep increasing the size of the input, at some point, the program will hit internal bottlenecks, and start to misbehave.
That's why it's important to invest engineering effort in making our software scalable: designing applications to gracefully handle increasing orders of magnitude of input, whatever that may be. Broadly, there are three different regimes:
- Scaling to make the most of the memory on the machine. Imagine loading the contents of a large CSV file into memory before processing. What algorithms and designs will handle the largest input size, for a fixed system memory, without paging out?
- Going past the memory capacity on the machine, and making the most effective use of persistent storage. Like an SQL database process running on the server.
- Going past the capabilities of a single machine, to a distributed system - an application running across a cluster of separate servers. Imagine a Fortune 500 company's ecommerce database - something so massive, even the beefiest AWS instance type can't handle it.
The first two are about scaling vertically - getting the most you can out of a single machine - while the last one kicks in once you hit the physical limits of what's possible on a single server. This article focuses on the first regime: designing Python applications to scale as far as possible within available memory. Parts two and three will explore the others.
Modern OSes provide ample virtual memory, which can be many times the available physical memory. One of the performance bottlenecks kicks in when a program uses more physical memory than it has available to it, and the OS moves some of its data to secondary storage. This is called paging out to disk (or SSD, etc.), and dramatically harms both responsiveness and overall performance. For the purposes of this article (part 1), we're focused on maximizing the amount of data our algorithms can handle, before paging out becomes necessary.
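You can get a feel for how much memory a given approach costs using the standard library's tracemalloc module (Python 3.4+). Here's a minimal sketch; the million-row list of dicts is just a hypothetical stand-in for whatever data your program loads:

```python
import tracemalloc

tracemalloc.start()

# Hypothetical stand-in for loading a large data set into memory.
rows = [{"id": i, "name": "customer-%d" % i} for i in range(1_000_000)]

current, peak = tracemalloc.get_traced_memory()  # Both values are in bytes.
print("current: %.1f MB, peak: %.1f MB" % (current / 1e6, peak / 1e6))
tracemalloc.stop()
```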
Cleverly Culling Collections
Python sports some rich and powerful data structures for collections: list, dict, set, and more. As valuable as they are, overusing them can impede scalability, by making memory a bottleneck sooner than it needs to be. Some of these overuses are easy to spot... and some are not. Look at the following code:
```python
# books.csv holds meta information on a collection of books:
# title, author, pub year, etc.
# load_from_file returns a list of dicts.
books = load_from_file('books.csv')
book_summaries = dict()
for book in books:
    book_summaries[book["title"]] = book["summary"]

for title, summary in book_summaries.items():
    # Do something interesting with the summary.
    print(title, summary)
```
After reading book data from the CSV file, this code constructs a table mapping each book's title to its summary, before iterating through that dictionary. From a memory-usage viewpoint, there is nothing wrong with this if books.csv contains a few dozen or a few hundred titles. But what if it holds the entire inventory of a regional bookstore chain... with millions of titles? You have either one or three problems, depending on whether you're using Python 3 or Python 2.
First, you are creating a new data structure called book_summaries. This is strictly speaking unnecessary, but does improve readability: when you read the "for" line, you immediately know the loop is cycling through the summaries. In general, it's a great idea to create new, well-named variables just to make code more readable. Because it can boost maintainability, I often do so myself, and recommend you do as well.

In this case, though, it creates a memory-scalability bottleneck that is not worth the trade-off. The new data structure contains the full summary of every book, which is likely to take up more memory than all the other fields for that book combined. So if books consumes N bytes of memory, the entire block of code above will require at least 1.5 * N bytes, and probably closer to 2 * N. This scales better:
```python
# Python 3.
books = load_from_file('books.csv')
for book in books:
    print(book["title"], book["summary"])
```
That's straightforward enough, but sometimes this can be more subtle. Look at this example:
```python
# load_customer_names returns a list of strings.
customer_names = load_customer_names()
for name in sorted(customer_names):
    # Do something interesting.
    ...
```
Here, sorted(customer_names) creates a new list. If customer_names takes N bytes of memory, the whole block of code takes 2 * N total. Sometimes this is unavoidable: you really do need to sort the list. You can still improve the scalability by sorting in place instead, if you can tolerate losing the original order:
```python
customer_names = load_customer_names()
customer_names.sort()
for name in customer_names:
    # Now the names are sorted.
    ...
```
Sometimes, with a little thought, you can even move the sorting or preprocessing outside the application, to a part of the stack that can do it better. For example, if the customer names are originally read from a database, you may be able to do something like "SELECT name FROM customers ORDER BY name ASC", so that the data is already sorted by the time it gets to your Python code.
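As a sketch of that idea, here's how it might look with the standard library's sqlite3 module; the customers.db file and the customers table are assumptions for illustration. The database does the sorting, and iterating the cursor pulls rows one at a time instead of building a Python list first:

```python
import sqlite3

conn = sqlite3.connect("customers.db")  # Hypothetical database file.
cursor = conn.execute("SELECT name FROM customers ORDER BY name ASC")

# The cursor yields one row (a tuple) at a time, so the full sorted
# list of names never has to exist in Python's memory.
for (name,) in cursor:
    print(name)

conn.close()
```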
Intelligent Iterating
Going back to the book example: in Python 2, you have several other problems. The for line iterates over book_summaries.items(). In Python 2, the items() method creates a list of key-value pairs... effectively creating another copy of book_summaries. Besides using twice the memory, there will be a performance penalty, causing the program to briefly hang right before starting the for loop. Can you see why?
The reason is that the call to book_summaries.items() must allocate and populate a potentially very large list. When Python executes the bytecode behind the for loop, it starts by allocating the memory for this list; populating it with the correct values; and only then starting the first cycle through the loop. This will be hard to notice or even measure if you know the dictionary will never have more than a few hundred key-value pairs. But if the size is unbounded - and it is, more often than you might think - it becomes a hidden speed bump.
Again, this only affects Python 2; Python 3's items() returns something that acts like an iterator, which is faster and does not needlessly make a whole extra copy of the dictionary's contents. But you can get the same benefit in Python 2 by using iteritems instead of items:
```python
# In Python 2, use "iteritems" instead of "items".
from __future__ import print_function  # So print() behaves as in Python 3.

for title, summary in book_summaries.iteritems():
    print(title, summary)
```
In all Python versions, it's your responsibility to carry this pattern to other contexts. Be aware of the difference between creating a list and using an iterator, and design your code for the latter wherever possible. And when implementing your own code, the easiest way to do that is by using generators.
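As a small illustration of that difference, here's a sketch (using a tiny stand-in for the books data) contrasting a list comprehension, which materializes every element up front, with a generator expression, which produces one element at a time:

```python
# Tiny stand-in for the list of dicts loaded from books.csv.
books = [
    {"title": "Book A", "summary": "A short summary."},
    {"title": "Book B", "summary": "Another short summary."},
]

# List comprehension: allocates a whole new list, one entry per book.
summaries_list = [book["summary"] for book in books]

# Generator expression: yields one summary at a time; no new list is built.
summaries_iter = (book["summary"] for book in books)

for summary in summaries_iter:
    print(summary)
```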
Generating Scalability
Imagine developing software that takes in free-form text, splitting it into sentences before doing some kind of grammar analysis. Lexically, sentences are separated by a period followed by one or more whitespace characters:
```python
import re

text = '''This is a body of text. It has several sentences.
Some are long and some are short.'''

sentences = re.split(r'\.\s+', text)
for sentence in sentences:
    print(sentence)
```
This emits something like:
```
This is a body of text
It has several sentences
Some are long and some are short.
```
A little inconsistent on the ending periods, but that seems minor. There's a bigger problem lurking here, which is similar to what we saw above. The input size, so to speak, is the number of bytes in the text variable. But the memory we are using is more than twice that, because we are duplicating every single byte in the list called sentences. That's not a big deal when you have three sentences. But what if text has three thousand? Or three million?
We can do better. Suppose our application only needs to examine one sentence at a time[1]. What's a scalable way to iterate through them, while also keeping your code readable and maintainable?
Let's imagine an interface that might work well for us. Consider a function, called each_sentence, which we can use like this:
```python
for sentence in each_sentence(text):
    # Do something with the sentence.
    ...
```
Imagine it produces one sentence at a time, instead of all sentences up front, like re.split does. Then the maximum extra memory used is equal to that required to store the longest sentence in the text. Unless you are parsing a novel composed of a single, gigantic run-on sentence[2], this will vastly increase the size of input your application can handle.
That's a nice idea, but can we make it work? Yes, using the yield keyword to construct a generator. Here is a simple example of a generator, and how you use it:
```python
def nums():
    n = 0
    while n < 4:
        yield n
        n += 1

for num in nums():
    print(num)
```
Note this function has no return keyword, but instead uses yield. A function using "yield" instead of "return" is called a generator in Python. To help yourself ingrain the syntax, open a new file, type the above in, and run it as a Python program. You will see the following output:
```
0
1
2
3
```
The return type of nums is a generator object[3], which quacks like an iterator. Each time through, it picks up again on the line after the yield (instead of starting at the beginning, like a regular function would).
In other words: In Python, generators are a way to quickly create iterators. They are lazy, in the good sense of the word, allocating memory for each value one at a time (and letting the previous one be garbage-collected as it goes along). The returned object implements the iterator protocol, which is why you can use it in a for loop.
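You can watch the iterator protocol in action by driving a generator object by hand with next(); a quick sketch, reusing nums from above:

```python
def nums():
    n = 0
    while n < 4:
        yield n
        n += 1

gen = nums()      # Calling the generator returns a generator object.
print(next(gen))  # 0 -- runs the body up to the first yield.
print(next(gen))  # 1 -- resumes on the line after the yield.
print(list(gen))  # [2, 3] -- consumes whatever remains.

try:
    next(gen)
except StopIteration:
    # An exhausted generator raises StopIteration; a for loop
    # handles this for you, which is how it knows when to stop.
    print("exhausted")
```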
You can have multiple yield statements in a generator:
```python
def nums2():
    n = 0
    while n < 4:
        yield n
        n += 1
    yield 42  # Second yield

for num in nums2():
    print(num)
```
This time, the program prints:
```
0
1
2
3
42
```
Reason through the code above, and convince yourself it makes sense. The second yield is reached after the while loop exits. When the function reaches the implicit return at the end, the iteration stops.
With these pieces, we can realize the each_sentence function above:
```python
import re

def each_sentence(text):
    start = 0
    for match in re.finditer(r'\.\s+', text):
        end = match.start() + 1
        sentence = text[start:end]
        yield sentence
        start = match.end()
    yield text[start:]
```
There's a little bit of complexity here, but since you're subscribed to a newsletter devoted to advanced Python programming, I know you can handle it. The each_sentence function iterates through the sentence boundaries it finds in the text, using the indices in the regex match object to find the starting and ending character positions of each sentence. It then yields a slice[4] of the original full text containing just the sentence. Since the last sentence of the text may not have any space after the final period, there's an extra yield statement at the end just for that tail.
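Putting the pieces together, the driver loop is exactly the interface we sketched earlier:

```python
text = '''This is a body of text. It has several sentences.
Some are long and some are short.'''

for sentence in each_sentence(text):
    print(sentence)
```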
Now when we run the for-loop using each_sentence, here's what we get:
```
This is a body of text.
It has several sentences.
Some are long and some are short.
```
This efficiently handles a corpus of text of any size, up to available system memory. And it does so with high responsiveness - iterating through the sentences immediately, without an expensive preprocessing step before entering the for loop.
(Bonus: the ending punctuation is consistent now!)
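And because a generator object is just an iterator, it plugs directly into other lazy tools. As one illustrative sketch (not part of the original example), finding the longest sentence stays memory-bounded, since max only ever needs the best candidate seen so far plus the sentence currently being examined:

```python
# Reuses each_sentence and text from above; no list of sentences
# is ever materialized.
longest = max(each_sentence(text), key=len)
print(longest)
```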
This Is Easier In Python 3
Python 3 has many improvements over Python 2, far more than most people realize. One pervasive and subtle change: iterators are used in many places where, previously, full lists were created and allocated. Some of these were mentioned above, and there are more as well:
- In Python 3, range generates its numbers lazily, as needed; in Python 2, it allocates a full list of the numbers before the loop even starts, and you have to use xrange instead to get the lazy behavior. (See the quick sketch after this list.)
- map, filter and zip all return iterators in Python 3. In Python 2, they allocate and return full lists.
- As mentioned, in Python 3 my_dict.items() returns an iterator; in Python 2, it creates a list of all key-value pairs, and you must instead remember to use iteritems() to get an iterator.
- Actually, my_dict.items() returns something called a view. This behaves like an iterator, but is better in ways I'll talk about in another article.
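Here's a quick sketch you can run in a Python 3 session to see the laziness in action (the exact byte count reported varies by platform and Python version):

```python
import sys

# range does not materialize its numbers: a billion-element range
# object is only a few dozen bytes, not gigabytes.
big_range = range(1_000_000_000)
print(sys.getsizeof(big_range))

# map and zip also produce their results lazily, one at a time.
doubled = map(lambda n: n * 2, range(5))
pairs = zip("abc", [1, 2, 3])
print(doubled, pairs)   # Lazy objects, not lists.

# Only converting them forces full lists into memory.
print(list(doubled))    # [0, 2, 4, 6, 8]
print(list(pairs))      # [('a', 1), ('b', 2), ('c', 3)]
```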
Generally speaking, Python 3 uses iterators by default in more places, unless you explicitly convert the result to a list or other sequence. So one advantage of implementing an application in Python 3, or porting an existing one, is that the program may scale better "for free".
Since being able to write scalable software is one of the things that separates average developers from world-class engineers, it's a key topic I'll focus on in the upcoming Advanced Python course, and in future editions of this newsletter.
[1] Or N at a time, i.e. some bounded number. Most applications will actually fit this mold; it's rare that you will need to cross-correlate every record of a giant data set.

[2] Apparently, run-on sentence novels exist. But hopefully you won't... run into them.

[3] Often you will see people use the word "generator" interchangeably to describe the function, or the object returned by the function. It's usually clear enough from the context. If you want to be more explicit, use "generator" to refer to the function object, and "generator object" for what the function returns when called. A generator object can always be used exactly like an iterator; if you want to be exacting, it's not identical to an iterator, but that's a topic for a different article.
[4] A slice of a string is itself a string, created by a concise syntax: my_str[start:end], where start is the zero-based starting character, and end is one past the ending character. So "Hello there, world!"[6:11] == "there".