The Python Concurrency Story, Part 2

In Part One, we talked about the key dilemma of programming Python in a concurrent world. In a nutshell, for CPU-bound tasks, Python threads will not help. So what do we do instead?

Broadly speaking, there are two kinds of real-world concurrency problems you will face:

  1. CPU-bound tasks, which can genuinely go faster when you throw more CPUs or cores at them; and
  2. juggling many tasks that are not CPU-bound, where extra cores are little help.

In this essay, you're going to learn about the first one: tasks that ARE empowered by having extra CPUs. We do this by writing Python code that runs across several processes. In the best case, having N CPUs will let you run the program N times faster.

(The second kind, which we will cover in part 3, has to do with juggling multiple tasks that are not CPU-bound. For something like this, having multiple CPUs or cores doesn't really help. To win that game, you have to get the absolute most you can out of a single thread... which modern Python gives you some great tools for.)

Two Approaches to Concurrent Programming

Among Python programmers, one thing separating the good from the great is an understanding of the concurrency primitives provided by modern operating systems - processes and threads - and the different patterns for working with them. This is fundamentally independent of language. You may know some of it from your formal education, or your experience implementing concurrent applications in other languages.

Broadly, there are two main approaches to implementing concurrent systems: Shared Memory, and Message Passing. The key difference: shared memory means your program is designed for two or more threads to read from and write to the same block of memory - dancing without stepping on each other's toes. In the message passing approach, any chunk of memory is only accessed by a single thread. When two or more need to synchronize, interact or communicate, data is sent from one to another - i.e., a message is passed.
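To make the contrast concrete, here is a minimal sketch of the shared-memory style, using two threads and a lock. (The names `counter` and `worker` are just illustrative.)

```python
import threading

# Shared memory: both threads read and write this one integer.
counter = 0
counter_lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        # The lock is what keeps the threads from stepping on
        # each other's toes during the read-modify-write.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 20000 - correct only because of the lock
```

Without the lock, the two increments can interleave and silently lose updates; the shared block of memory is the whole design.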

Now, the two threads can be in the same process, or in two different processes - each of which has its own main thread. So you can use shared memory or message passing in either.1 However, typically, people use shared memory only in single-process, multi-threaded systems. It's possible to walk the shared memory path via mmap-like constructs or IPC in a multiprocess program, but it's not as performant, and can be complex to implement.2

I recommend you not go down that path, though, if you can avoid it. Those of you who have done it know why: it's very difficult to do well, and too easy to create subtle race conditions, deadlocks, and other bugs. Sometimes you can't avoid sharing memory. But if you can use message-passing instead, do it.

For multiprocess programs, message passing tends to be a more natural fit anyway, for two reasons. First, shared memory across processes is neither as performant nor as elegant as it is within a single multithreaded process. More importantly, many of the kinds of programs you'd want to implement using multiple processes lend themselves naturally to a message-passing architecture, especially in Python.
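Here is what the message-passing style can look like between processes - a minimal sketch using multiprocessing.Queue, with an illustrative `worker` that squares the numbers it receives:

```python
import multiprocessing

def worker(inbox, outbox):
    # The worker owns no shared state; data arrives as messages.
    for item in iter(inbox.get, None):   # None is our stop sentinel
        outbox.put(item * item)

if __name__ == "__main__":
    inbox = multiprocessing.Queue()
    outbox = multiprocessing.Queue()
    proc = multiprocessing.Process(target=worker, args=(inbox, outbox))
    proc.start()

    for n in range(5):
        inbox.put(n)          # pass a message...
    inbox.put(None)           # ...and finally the sentinel

    results = [outbox.get() for _ in range(5)]
    proc.join()
    print(results)            # [0, 1, 4, 9, 16]
```

No memory is shared; each value is copied (pickled) as it crosses the process boundary, which is exactly what makes this style hard to get wrong.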

Writing Multicore Python

Python essentially forces you to use multiple processes to leverage more than one CPU.3 There are ways to share memory across processes, and modern Python makes it relatively easy to do. But we'll focus on the message-passing approach here.
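As an aside, here is roughly what process-shared memory can look like, sketched with multiprocessing.Value; the `bump` function is just for illustration:

```python
import multiprocessing

def bump(counter, n):
    for _ in range(n):
        # A Value carries its own lock for safe updates.
        with counter.get_lock():
            counter.value += 1

if __name__ == "__main__":
    counter = multiprocessing.Value("i", 0)   # a shared C int, initially 0
    procs = [multiprocessing.Process(target=bump, args=(counter, 1000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)   # 4000
```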

We really have it easy now. Once upon a time, if you wanted Python to use multiple CPUs well, you had to do something horrible, hacky and unportable with os.fork. But now we have the multiprocessing module in the standard library: a nice Pythonic interface for working with multiple processes, making many hard things easy4. There are plenty of articles out there with neat toy examples. But how about we look at something more realistic?

Let me introduce you to Thumper. It's a Python program that generates thumbnails of gigantic image libraries. This is what you want to use if your webapp fetches a new batch of 100,000 images at 2am each night, and you need thumbnails of them ready to view in a reasonable amount of time. This is a perfect use case for multiple CPUs: generating a thumbnail is CPU-bound, and the calculation is completely independent for two different images.

Feast your eyes on this beauty:

This comes from repeatedly running Thumper on a data set of NASA TIFF images, on an AWS c3.2xlarge instance. The horizontal axis is the number of worker processes spawned.

There's a LOT going on in this graph. I'm working on an Advanced Python Mastery course, which will delve into much of it, granting my students amazing Batman-like concurrent programming superpowers. For now, we are going to focus on Thumper's implementation - and how it bypasses the GIL to scale with the number of available CPU cores.

The core is quite simple, thanks to Pillow - the modern, maintained fork of PIL (Python Imaging Library):

    # PIL is actually Pillow. Confusing, I know,
    # but nicely backwards compatible.
    import os

    from PIL import Image

    def create_thumbnail(src_path, dest_path, thumbnail_width, thumbnail_height):
        image = Image.open(src_path)
        image.thumbnail((thumbnail_width, thumbnail_height))
        os.makedirs(os.path.dirname(dest_path), exist_ok=True)
        image.save(dest_path)

Very straightforward. Now, we want to farm this out to several worker processes. How do we dispatch the work?

Python's multiprocessing Pools

The multiprocessing module has a very useful abstraction, in the form of the Pool class. You create it like this:

    import multiprocessing

    pool = multiprocessing.Pool(num_processes)

Whew, that was hard! The word "multiprocessing" has way too many characters... so much typing. Thankfully that's behind us, and what's left is to dispatch images to each of these workers. The Pool provides several different ways to do this. Each takes a callable - which will be create_thumbnail, in our case - and its arguments.

Some of the different pool methods are:

apply_async
Takes a function and some arguments, and sends it to one of the worker processes to be run. Returns immediately with a result object, which is kind of like a future - you can get the return value from it once the job is done.
map and map_async
Similar to Python's built-in map() function, except the work is farmed out over the child processes. map will block until all jobs are done; map_async immediately returns a result object. Limitation: the callable can only take one argument.
imap and imap_unordered
Like map, but returns an iterator instead of a completed sequence - it could have been called lazy_map. imap yields results in input order; imap_unordered yields them in whatever order they finish. (There is no imap_async; the iterator itself is already lazy.)
starmap and starmap_async
Like map and map_async, except the callable can take multiple arguments.

These are most5 of the methods you will use, and as you can see, they are mostly variants of similar ideas. Since our create_thumbnail function takes multiple arguments, Thumper uses starmap_async.
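Before turning to Thumper's code, the variants above can be sketched in miniature. (The `square` and `power` functions are toy stand-ins for real work.)

```python
import multiprocessing

def square(x):
    return x * x

def power(base, exp):
    return base ** exp

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        # apply_async: one job; .get() blocks until it finishes.
        promised = pool.apply_async(square, (7,))
        one_result = promised.get()                      # 49

        # map: blocks until done; the callable takes one argument.
        mapped = pool.map(square, [1, 2, 3])             # [1, 4, 9]

        # starmap: like map, but unpacks each tuple of arguments.
        starred = pool.starmap(power, [(2, 3), (3, 2)])  # [8, 9]

    print(one_result, mapped, starred)
```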

    # src_dir is the folder containing full-size images.
    # dest_dir is where we are going to write the thumbnails,
    # in a parallel file hierarchy.
    def gen_child_args():
        for (dirpath, dirnames, filenames) in os.walk(src_dir):
            for filename in filenames:
                src_path = os.path.join(dirpath, filename)
                dest_path = find_dest_path(src_dir, dest_dir, src_path)
                yield (src_path, dest_path, thumbnail_width, thumbnail_height)

    pool.starmap_async(create_thumbnail, gen_child_args())

As you can infer, the second argument of starmap_async is an iterable. Since we want Thumper to literally work with millions of images, I coded a memory-efficient generator that creates the argument tuples as needed, rather than calculating a huge list (one tuple for each image).
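One detail worth knowing: since starmap_async returns immediately, a real program has to wait on the result object before exiting. A minimal sketch, using a hypothetical `fake_thumbnail` stand-in so it runs without any image files:

```python
import multiprocessing

def fake_thumbnail(src_path, dest_path, width, height):
    # A stand-in for create_thumbnail, so the sketch runs anywhere.
    return (src_path, dest_path)

if __name__ == "__main__":
    args = [(f"img{i}.tif", f"thumb{i}.jpg", 128, 128) for i in range(4)]
    with multiprocessing.Pool(2) as pool:
        result = pool.starmap_async(fake_thumbnail, args)
        # starmap_async returns immediately; .get() blocks until
        # every job is done, and re-raises any worker exception.
        done = result.get()
    print(len(done))   # 4
```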

Understanding Performance

How well does it work? It actually takes some thought to understand. Imagine you are using a machine with eight CPUs (or cores). How many workers will you want Thumper to use? Theoretically, the answer is eight - one per core. In practice, it may be another value, depending on (a) what else is going on in the system, and (b) what's going on in your application. It's common to find that you get the exact same performance using fewer processes than the number of CPUs on the system. (That means you've hit a point where raw CPU is no longer the bottleneck.) And believe it or not, I've also seen cases where too many workers cause contention for resources and actually slow things down.

In my own experience, the only way to know for sure is to test: repeat the run, varying only the number of worker processes. That's what I did above. Let's focus on elapsed time, because minimizing that is the whole point:

This was done on a c3.2xlarge EC2 instance, a hefty fellow with eight CPUs. The horizontal axis is the number of worker processes spawned by Thumper; the vertical is the total elapsed time to thumbnail all images, so lower is better.

Looking at the graph, as you add worker processes, elapsed time keeps decreasing until you hit six. Why not eight? Or seven? There are many possible reasons, which can be very application specific. In general, you have scaled out as much as you can via CPU, and have now hit another bottleneck. Perhaps you are using all available memory, and have started paging. Perhaps you are now saturating CPU cache. Perhaps you are hitting an I/O throughput limit, where you can't load any more bytes of images per second.
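A sweep like the one behind the graph can be sketched with a toy CPU-bound task - `burn` here is a hypothetical stand-in for create_thumbnail:

```python
import multiprocessing
import time

def burn(n):
    # A small CPU-bound stand-in for thumbnail generation.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    jobs = [100_000] * 16
    for workers in (1, 2, 4):
        start = time.perf_counter()
        with multiprocessing.Pool(workers) as pool:
            pool.map(burn, jobs)
        elapsed = time.perf_counter() - start
        print(f"{workers:2d} workers: {elapsed:.2f}s")
```

On a multicore machine you should see elapsed time fall as workers increase - until, as the graph shows, something other than CPU becomes the bottleneck.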

As you can see, learning the classes, methods and functions provided by multiprocessing is only the first step.

For programming in general - in any language - scaling out across CPUs will only get you so far. Plus, throwing more CPUs at a problem gets expensive. If you really want to get the most out of the machine, you also have to master getting the most out of a single thread. And that's what we are going to talk about in part 3. The main key here - the one secret that will let you really kick a single thread's butt - is...

Well, I'll tell you when I publish part 3. Make sure you're subscribed to get the announcement.

  1. For a single-process, multithreaded program, message passing can be implemented through a global, thread-safe queue or other data structure; multiple processes can do this through IPC.

  2. Unless you use Python's multiprocessing module. But, let's not get ahead of ourselves.

  3. This is oversimplified. Using C extensions can let you leverage multiple CPUs in about the way C can, at least within a certain limited portion of your Python application. And the story is different for implementations other than CPython.

    For most Python code that's being written today - in the form of pure Python applications, using the official language interpreter - using more than one CPU means using more than one process.

  4. Or at least, less hard. And with fewer sharp pointy jagged edges.

  5. Plus a few others, like apply - which is just like apply_async, except it blocks until the result is ready. Can't think of a reason to use this in a program that uses multiprocessing? Me neither.