Scaling With Generators

This for loop seems simple:

for item in items:
    do_something_with(item)

And yet, miracles hide here. As you probably know, the act of efficiently going through a collection, one element at a time, is called iteration. But few understand how Python’s iteration system really works…​ how deep and well-thought-out it is. This chapter makes you one of those people, giving you the ability to naturally write highly scalable Python applications…​ able to handle ever-larger data sets in performant, memory-efficient ways.

Iteration is also core to one of Python’s most powerful tools: the generator function. Generator functions are not just a convenient way to create useful iterators. They enable some exquisite patterns of code organization that, by their very nature, intrinsically encourage excellent coding habits.

This chapter is special, because understanding it threatens to make you a permanently better programmer in every language. Mastering Python generators tends to do that, because of the distinctions and insights you gain along the way. Let’s dive in.

Iteration in Python

Python has a built-in function called iter(). When you pass it a collection, you get back an iterator object:

>>> numbers = [7, 4, 11, 3]
>>> iter(numbers)
<list_iterator object at 0x10219dc50>

Just as in other languages, a Python iterator produces the values in a sequence, one at a time. You probably know an iterator is like a moving pointer over the collection:

>>> numbers_iter = iter(numbers)
>>> for num in numbers_iter: print(num)
...
7
4
11
3

You don’t normally need to do this. If you instead write for num in numbers, what Python effectively does under the hood is call iter() on that collection. This happens automatically. Whatever object it gets back is used as the iterator for that for loop:

# This...
for num in numbers:
    print(num)

# ... is effectively just like this:
numbers_iter = iter(numbers)
for num in numbers_iter:
    print(num)

An iterator over a collection is a separate object, with its own identity - which you can verify with id():

>>> # id() returns a unique number for each object.
... # Different objects will always have different IDs.
>>> id(numbers)
4330133896
>>> id(numbers_iter)
4330216640

How does iter() actually get the iterator? It can do this in several ways, but one relies on a magic method called __iter__. This is a method any class (including yours) may define; when called with no arguments, it must return a fresh iterator object. Lists have it, for example:

>>> numbers.__iter__
<method-wrapper '__iter__' of list object at 0x10130e4c8>
>>> numbers.__iter__()
<list_iterator object at 0x1013180f0>

Python makes a distinction between objects which are iterators, and objects which are iterable. We say an object is iterable if and only if you can pass it to iter(), and get a ready-to-use iterator. If that object has an __iter__ method, iter() will call it to get the iterator. Python lists and tuples are iterable. So are strings, which is why you can write for char in my_str: to iterate over my_str's characters. Any container you might use in a for loop is iterable.

A for loop is the most common way to step through a sequence. But sometimes your code needs to step through in a more fine-grained way. For this, use the built-in function next(). You normally call it with a single argument, which is an iterator. Each time you call it, next(my_iterator) fetches and returns the next element:

>>> names = ["Tom", "Shelly", "Garth"]
>>> # Create a fresh iterator...
... names_it = iter(names)
>>> next(names_it)
'Tom'
>>> next(names_it)
'Shelly'
>>> next(names_it)
'Garth'

What happens if you call next(names_it) again? next() will raise a special built-in exception, called StopIteration:

>>> next(names_it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

This is part of Python’s iterator protocol. Raising this specific exception is, by design, how an iterator signals the sequence is done. You rarely have to raise or catch this exception yourself, though we’ll see some patterns later where it’s useful to do so. A good mental model for how a for loop works is to imagine it calling next() each time through the loop, exiting when StopIteration gets raised.
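That mental model can be made concrete. Here is a sketch (not CPython’s actual implementation) of what a for loop does in terms of the iterator protocol, using a hypothetical manual_for() helper:

```python
def manual_for(items, action):
    """Roughly what 'for item in items: action(item)' does."""
    it = iter(items)           # get an iterator from the iterable
    while True:
        try:
            item = next(it)    # fetch the next element
        except StopIteration:  # raised when the iterator is exhausted
            break
        action(item)

collected = []
manual_for([7, 4, 11, 3], collected.append)
print(collected)  # [7, 4, 11, 3]
```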

When using next() yourself, you can provide a second argument, for the default value. If you do, next() will return that instead of raising StopIteration at the end:

>>> names = ["Tom", "Shelly", "Garth"]
>>> new_names_it = iter(names)
>>> next(new_names_it, "Rick")
'Tom'
>>> next(new_names_it, "Rick")
'Shelly'
>>> next(new_names_it, "Rick")
'Garth'
>>> next(new_names_it, "Rick")
'Rick'
>>> next(new_names_it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> next(new_names_it, "Jane")
'Jane'

Now, let’s consider a different situation. What if you aren’t working with a simple sequence of constants? What if you are calculating or reading or otherwise obtaining the sequence elements as you go along?

Let’s start with a simple example (so it’s easy to reason about). Suppose you need to write a function creating a list of square numbers, which will be processed by other code:

def fetch_squares(max_root):
    squares = []
    for n in range(max_root):
        squares.append(n**2)
    return squares

MAX = 5
for square in fetch_squares(MAX):
    do_something_with(square)

This works. But there is a potential problem lurking here. Can you spot it?

Here’s one: what if MAX is not 5, but 10,000,000? Or 10,000,000,000? Or more? Your memory footprint is pointlessly dreadful: the code here creates a massive list, uses it once, then throws it away. On top of that, the second for loop cannot even start until the entire list of squares has been fully calculated. If some poor human is using this program, they’ll wonder if the program is stuck.

Even worse: What if you aren’t doing arithmetic to get each element - which is fast and cheap - but making a truly expensive calculation? Or making an API call over the network? Or reading from a database? Your program is sluggish, perhaps unresponsive, and might even crash with an out-of-memory error. Its users will think you’re a terrible programmer.

The solution is to create an iterator to start with, lazily computing each value only when needed. Then each cycle through the loop happens just in time.

For the record, here is how you create an equivalent iterator class, which fully complies with Python’s iterator protocol:

class SquaresIterator:
    def __init__(self, max_root_value):
        self.max_root_value = max_root_value
        self.current_root_value = 0
    def __iter__(self):
        return self
    def __next__(self):
        if self.current_root_value >= self.max_root_value:
            raise StopIteration
        square_value = self.current_root_value ** 2
        self.current_root_value += 1
        return square_value

# You can use it like this:
for square in SquaresIterator(5):
    print(square)

Holy crap, that’s horrible. There’s got to be a better way.

Good news: there’s a better way. It’s called a generator function, and you’re going to love it!

Generator Functions

Python provides a tool called the generator function, which…​ well, it’s hard to describe everything it gives you in one sentence. Of its many talents, I’ll first focus on how it’s a very useful shortcut for creating iterators.

A generator function looks a lot like a regular function. But instead of saying return, it uses a new and different keyword: yield. Here’s a simple example:

def gen_nums():
    n = 0
    while n < 4:
        yield n
        n += 1

Use it in a for loop like this:

>>> for num in gen_nums():
...     print(num)
...
0
1
2
3

Let’s go through and understand this. When you call gen_nums() like a function, it immediately returns a generator object:

>>> sequence = gen_nums()
>>> type(sequence)
<class 'generator'>

The generator function is gen_nums() - what we define and then call. A function is a generator function if and only if its body uses the yield keyword. The generator object is what that generator function returns when called - sequence, in this case.

MEMORIZE THIS FACT: A generator function will ALWAYS return a generator object. It can never return anything else.

This generator object is an iterator, which means you can iterate through it using next() or a for loop:

>>> sequence = gen_nums()
>>> next(sequence)
0
>>> next(sequence)
1
>>> next(sequence)
2
>>> next(sequence)
3
>>> next(sequence)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

>>> # Or in a for loop:
... for num in gen_nums(): print(num)
...
0
1
2
3

The flow of code works like this: when next() is called the first time, or the for loop first starts, the body of gen_nums starts executing at the beginning, running until it reaches the yield statement, and producing the value to the right of the yield.

So far, this is much like a regular function. But the next time next() is called - or, equivalently, the next time through the for loop - the function doesn’t start at the beginning again. It starts on the line after the yield statement. Look at the source of gen_nums() again:

def gen_nums():
    n = 0
    while n < 4:
        yield n
        n += 1

gen_nums is more general than a function or subroutine. It’s actually a coroutine. You see, a regular function can have several exit points (otherwise known as return statements). But it has only one entry point: each time you call a function, it always starts at the first line of the function body.

A coroutine is like a function, except it has several possible entry points. It starts with the first line, like a normal function. But when it "returns", the coroutine isn’t exiting, so much as pausing. Subsequent calls with next() - or equivalently, the next time through the for loop - start at that yield statement again, right where it left off; the re-entry point is the line after the yield statement.

And that’s the key: Each yield statement simultaneously defines an exit point, and a re-entry point.

For generator objects, each time a new value is requested, the flow of control picks up on the line after the yield statement. In this case, the next line increments the variable n, then continues with the while loop.

Notice we do not raise StopIteration anywhere in the body of gen_nums(). When the function body finally exits - after it exits the while loop, in this case - the generator object automatically raises StopIteration.
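Relatedly, a return statement inside a generator function simply ends the iteration early: the caller sees StopIteration, and the return does not become a yielded value. A small sketch (gen_until_negative is a hypothetical example, not part of the chapter’s running code):

```python
def gen_until_negative(nums):
    for n in nums:
        if n < 0:
            return    # ends the generator; the caller sees StopIteration
        yield n

print(list(gen_until_negative([3, 1, -2, 7])))  # [3, 1]
```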

Again: each yield statement simultaneously defines an exit point, and a re-entry point. In fact, you can have multiple yield statements in a generator:

def gen_extra_nums():
    n = 0
    while n < 4:
        yield n
        n += 1
    yield 42  # Second yield

Here’s the output when you use it:

>>> for num in gen_extra_nums():
...     print(num)
0
1
2
3
42

The second yield is reached after the while loop exits. When the function reaches the implicit return at the end, the iteration stops. Reason through the code above, and convince yourself it makes sense.

Let’s revisit the earlier example, of cycling through a sequence of squares. This is how we first did it:

def fetch_squares(max_root):
    squares = []
    for n in range(max_root):
        squares.append(n**2)
    return squares

MAX = 5
for square in fetch_squares(MAX):
    do_something_with(square)

As an exercise, pause here, open up a new Python file, and see if you can write a gen_squares generator function that accomplishes the same thing.

Done? Great. Here’s what it looks like:

>>> def gen_squares(max_num):
...     for num in range(max_num):
...         yield num ** 2
...
>>> MAX = 5
>>> for square in gen_squares(MAX):
...     print(square)
...
0
1
4
9
16

Now, this gen_squares() has a potential problem. Can you spot it?

And once you do: IS it an actual problem? Why or why not?

Take a moment to think about it…​

Okay, time’s up:

I’m referring to the range() function. What does that return, exactly?

Because imagine range() returns a list. If that’s the case, and MAX is huge, that creates a huge list inside your generator function - completely destroying its scalability. On the other hand, if range() itself returns a scalable iterator…​ then no problem.

Good news: it turns out range() is in fact scalable - it returns a compact, lazily-evaluated object rather than a list. Whew!

The larger point: Generator functions are ONLY as scalable as their LEAST scalable line of code.

Generator functions potentially have a small memory footprint, but only if you code intelligently. When writing generator functions, be watchful for hidden bottlenecks.
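To illustrate what such a hidden bottleneck can look like, compare these two hypothetical generator functions. Both yield the same squares for the inputs shown, but the first quietly builds a complete list in memory before the first value comes out:

```python
def sorted_squares(nums):
    # Bottleneck: sorted() must consume ALL of nums and build a full
    # list before the first value is yielded. Scalability is lost.
    for n in sorted(nums):
        yield n ** 2

def squares(nums):
    # Scalable: only one element is held in memory at a time.
    for n in nums:
        yield n ** 2

print(list(squares([0, 1, 2, 3])))         # [0, 1, 4, 9]
print(list(sorted_squares([2, 0, 3, 1])))  # [0, 1, 4, 9]
```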

Strictly speaking, we don’t need generator functions for iteration. We just want them, because they make certain patterns of scalability far easier. Now that we’re in a position to understand it, let’s look at the SquaresIterator class again - the same one you saw above:

# Same code we saw earlier.
class SquaresIterator:
    def __init__(self, max_root_value):
        self.max_root_value = max_root_value
        self.current_root_value = 0
    def __iter__(self):
        return self
    def __next__(self):
        if self.current_root_value >= self.max_root_value:
            raise StopIteration
        square_value = self.current_root_value ** 2
        self.current_root_value += 1
        return square_value

# You can use it like this:
for square in SquaresIterator(5):
    print(square)

Each value is obtained by invoking its __next__() method, until it raises StopIteration. This produces the same output; but look at the source for the SquaresIterator class, and compare it to the source for the generator above. Which is easier to read? Which is easier to maintain? And when requirements change, which is easier to modify without introducing errors? Most people find the generator solution easier and more natural.

Authors often use the word "generator" by itself, to mean either the generator function, or the generator object returned when you call it. Typically the writer thinks it’s obvious by the context which they are referring to; sometimes it is, but often not. Sometimes the writer is not even clear on the distinction to begin with. But it’s important: just as there is a big difference between a function, and the value it returns when you call it, so is there a big difference between the generator function, and the generator object it returns.

In your own thought and speech, I encourage you to only use the phrases "generator function" and "generator object", so you are always clear inside yourself, and in your communication. (Which also helps your teammates be more clear.) The only exception: when you truly mean "generator functions and objects", referring to the language feature as a broad concept, then it’s okay to just say "generators". I’ll lead by example in this book.

Generator Patterns and Scalable Composability

Here’s a little generator function:

def matching_lines_from_file(path, pattern):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                yield line.rstrip('\n')

matching_lines_from_file() demonstrates several important practices for modern Python, and is worth studying. It does simple substring matching on lines of a text file, yielding lines containing that substring.

The first line opens a read-only file object, called handle. If you haven’t been opening your file objects using with statements, start today. The main benefit is that once the with block is exited, the file object is automatically closed - even if an exception causes a premature exit. It’s similar to:

try:
    handle = open(path)
    # read from handle
finally:
    handle.close()

(The try/finally is explained in the exceptions chapter.) Next we have for line in handle. This useful idiom, which not many people seem to know about, is a special case for text files. Each iteration through the for loop, a new line of text will be read from the underlying text file, and placed in the line variable.

Sometimes people foolishly take another approach, which I have to warn you about:

# Don't do this!!
for line in handle.readlines():

.readlines() (plural) reads in the entire file, parses it into lines, and returns a list of strings - one string per line. By now, you realize how this destroys the generator function’s scalability.

Another approach you will sometimes see, which is scalable, is to use the file object’s .readline() method (singular), which manually returns lines one at a time:

# .readline() reads and returns a single line of text,
# or returns the empty string at end-of-file.
line = handle.readline()
while line != '':
    # do something with line
    # ...
    # At the end of the loop, read the next line.
    line = handle.readline()

But simply writing for line in handle is clearer and easier.

After that, it’s straightforward: matching lines have any trailing \n-character stripped, and are yielded to the consumer. When writing generator functions, you want to ask yourself "what is the maximum memory footprint of this function, and how can I minimize it?" You can think of scalability as inversely proportional to this footprint. For matching_lines_from_file(), it will be about equal to the size of the longest line in the text file. So it’s appropriate for the typical human-readable text file, whose lines are small.

(It’s also possible to point it to, say, a ten-terabyte text file consisting of exactly one line. If you expect something like that, you’ll need a different approach.)
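If you did expect input like that, one hedge is to read fixed-size chunks rather than lines. This chunks_from_file() sketch (a hypothetical companion to matching_lines_from_file(), with an arbitrary 64 KB default) bounds the footprint at the chunk size, no matter how the newlines fall:

```python
def chunks_from_file(path, chunk_size=64 * 1024):
    # Yields blocks of at most chunk_size characters, so memory use
    # stays bounded even for a one-line, multi-terabyte file.
    with open(path) as handle:
        while True:
            chunk = handle.read(chunk_size)
            if chunk == '':   # read() returns '' at end-of-file
                return
            yield chunk
```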

Now, suppose a log file contains lines like this:

WARNING: Disk usage exceeding 85%
DEBUG: User 'tinytim' upgraded to Pro version
INFO: Sent email campaign, completed normally
WARNING: Almost out of beer

... and you exercise matching_lines_from_file() like so:

for line in matching_lines_from_file("log.txt","WARNING:"):
    print(line)

That yields these records:

WARNING: Disk usage exceeding 85%
WARNING: Almost out of beer

Suppose your application needs that data in dict form:

{"level": "WARNING", "message": "Disk usage exceeding 85%"}
{"level": "DEBUG", "message":
    "User 'tinytim' upgraded to Pro version"}

We want to scalably transform the records from one form to another - from strings (lines of the log file), to Python dicts. So let’s make a new generator function to connect them:

def parse_log_records(lines):
    for line in lines:
        level, message = line.split(": ", 1)
        yield {"level": level, "message": message}

Now we can connect the two:

# log_lines is a generator object
log_lines = matching_lines_from_file("log.txt", "WARNING:")
for record in parse_log_records(log_lines):
    # record is a dict
    print(record)

Of course, parse_log_records() can be used on its own:

with open("log.txt") as handle:
    for record in parse_log_records(handle):
        print(record)

matching_lines_from_file() and parse_log_records() are like building blocks. Properly designed, they can be used to build different data processing streams. I call this scalable composability. It goes beyond designing composable functions and types: ask yourself how you can make not just the components scalable, but also whatever is assembled out of them.

Let’s discuss a particular design point. Both matching_lines_from_file() and parse_log_records() produce an iterator (or, more specifically, a generator object). But they have a discrepancy on the input side: parse_log_records() accepts an iterator, but matching_lines_from_file() requires a path to a file to read from. This means matching_lines_from_file() combines two jobs: reading lines of text from a file, then filtering those lines based on some criteria.

Combining functions like this is often what you want in realistic code. But when designing components to flexibly compose together, inconsistent interfaces like this can be limiting. Let’s break up the services in matching_lines_from_file() into two generator functions:

def lines_from_file(path):
    with open(path) as handle:
        for line in handle:
            yield line.rstrip('\n')

def matching_lines(lines, pattern):
    for line in lines:
        if pattern in line:
            yield line

You can compose these like so:

lines = lines_from_file("log.txt")
matching = matching_lines(lines, 'WARNING:')
for line in matching:
    print(line)

Or even redefine matching_lines_from_file() in terms of them:

def matching_lines_from_file(pattern, path):
    lines = lines_from_file(path)
    matching = matching_lines(lines, pattern)
    for line in matching:
        yield line

Conceptually, this factored-out matching_lines does a filtering operation; all lines are read in, and a subset are yielded. parse_log_records() is different. One input record (a str line) maps to exactly one output record (a dict). Mathematically, it’s a mapping operation. Think of it as a transformer or adapter. lines_from_file() is in a third category; instead of taking a stream as input, it takes a completely different parameter. Since it still returns an iterator of records, think of it as a source. And any real program will eventually want to do something with that stream, consuming it without producing another iterator; call that a sink.

You need all these pieces to make a working program. When designing a chainable set of generator functions like this - or even better, a toolkit for constructing internal data pipelines - ask yourself whether each component is a sink, a source, or whether it does filtering, or mapping; or whether it’s some combination of these. Just asking yourself this question leads to a more usable, readable, and maintainable codebase. And if you’re making a library which others will use, you’re more likely to end up with a toolkit so powerfully flexible, people use it to build programs you never imagined.
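To make the categories concrete, here is a hypothetical end-to-end pipeline over in-memory lines (so it stands alone): matching_lines() filters, parse_log_records() maps, and count_by_level() - a sink invented for this sketch - consumes the stream without yielding another one:

```python
def matching_lines(lines, pattern):        # filter
    for line in lines:
        if pattern in line:
            yield line

def parse_log_records(lines):              # mapper
    for line in lines:
        level, message = line.split(": ", 1)
        yield {"level": level, "message": message}

def count_by_level(records):               # sink: consumes the stream,
    counts = {}                            # returns a plain dict
    for record in records:
        counts[record["level"]] = counts.get(record["level"], 0) + 1
    return counts

log_lines = [
    "WARNING: Disk usage exceeding 85%",
    "INFO: Sent email campaign, completed normally",
    "WARNING: Almost out of beer",
]
records = parse_log_records(matching_lines(log_lines, "WARNING:"))
print(count_by_level(records))  # {'WARNING': 2}
```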

I want you to notice something about parse_log_records(). As I said, it fits in the "mapping" category. And notice its mapping is one-to-one: one line of text becomes one dictionary. In other words, each record in the input - a str - becomes exactly one record in the output - a dict.

That isn’t always the case. Sometimes, your generator function needs to consume several input records to create one output record. Or the opposite: one input record yields several output records.

Here’s an example of the latter. Imagine a text file containing lines in a poem:[1]

all night our room was outer-walled with rain
drops fell and flattened on the tin roof
and rang like little disks of metal

Let’s create a generator function, words_in_text(), producing the words one at a time. First approach:

# lines is an iterator of text file lines,
# e.g. returned by lines_from_file()
def words_in_text(lines):
    for line in lines:
        for word in line.split():
            yield word

This generator function takes a fan-out approach. No input records are dropped, which means it doesn’t do any filtering; it’s still purely in the "mapping" category of generator functions. But the mapping isn’t one-to-one. Rather, one input record produces one or more output records. So, when you run the following code:

poem_lines = lines_from_file("poem.txt")
poem_words = words_in_text(poem_lines)
for word in poem_words:
    print(word)

... it produces this output:

all
night
our
room
was
outer-walled
...

That first input record - "all night our room was outer-walled with rain" - yields eight words (output records). Ignoring any blank lines in the poem, every line will produce at least one word - probably several.

The idea of fanning out is interesting, but simple enough. It’s more complex when we go the opposite direction: fanning in. That means the generator function consumes more than one input record to produce each output record. Doing this successfully requires an awareness of the input’s structure, and you’ll typically need to encode some simple parsing logic.

Imagine a text file containing residential house sale data. Each record is a set of key-value pairs, one pair per line, with records separated by blank lines:

address: 1423 99th Ave
square_feet: 1705
price_usd: 340210

address: 24257 Pueblo Dr
square_feet: 2305
price_usd: 170210

address: 127 Cochran
square_feet: 2068
price_usd: 320500

To read this data into a form usable in our code, what we want is a generator function - let’s name it house_records() - which accepts a sequence of strings (lines) and parses them into convenient dictionaries:

>>> lines_of_house_data = lines_from_file("housedata.txt")
>>> houses = house_records(lines_of_house_data)
>>> # Fetch the first record.
... house = next(houses)
>>> house['address']
'1423 99th Ave'
>>> house = next(houses)
>>> house['address']
'24257 Pueblo Dr'

How would you create this? If practical, pause here, open up a code editor, and see if you can implement it.

Okay, time’s up. Here is one approach:

def house_records(lines):
    record = {}
    for line in lines:
        if line == '':
            yield record
            record = {}
            continue
        key, value = line.split(': ', 1)
        record[key] = value
    yield record

Notice where the yield keywords appear. The last two lines of the for loop read individual key-value pairs. Starting with an empty record dictionary, it’s populated with data until lines produces an empty line. That signals the current record is complete, so it’s yielded, and a new record dictionary created. The end of the very last record in housedata.txt is signaled not by an empty line, but by the end of the file; that’s why we need the final yield statement.
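One edge case to be aware of: if the input ends with a blank line, or contains consecutive blank lines, house_records() will yield an empty dictionary. A hedged variant - named house_records_safe here, to keep it distinct from the chapter’s version - only yields non-empty records:

```python
def house_records_safe(lines):
    record = {}
    for line in lines:
        if line == '':
            if record:     # skip empty records from extra blank lines
                yield record
            record = {}
            continue
        key, value = line.split(': ', 1)
        record[key] = value
    if record:             # final record, if input didn't end blank
        yield record

lines = ["address: 1423 99th Ave", "", "", "address: 127 Cochran", ""]
print([r["address"] for r in house_records_safe(lines)])
# ['1423 99th Ave', '127 Cochran']
```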

As defined, house_records() is a bit clunky if we’re normally reading from a text file. It makes sense to define a new generator function accepting just the path to the file:

def house_records_from_file(path):
    lines = lines_from_file(path)
    for house in house_records(lines):
        yield house

# Then in your program:
for house in house_records_from_file("housedata.txt"):
    print(house["address"])

You may have noticed many of these examples have a bit of boilerplate, when one generator function internally calls another. The last two lines of house_records_from_file say:

for house in house_records(lines):
    yield house

Python provides a shortcut, which lets you accomplish this in one line, with the yield from statement:

def house_records_from_file(path):
    lines = lines_from_file(path)
    yield from house_records(lines)

Even though "yield from" is two words, semantically it’s like a single keyword, and distinct from yield. The yield from statement is used specifically in generator functions, when they yield values directly from another generator object (or, equivalently, by calling another generator function). Using it often simplifies your code, as you see in house_records_from_file(). Going back a bit, here’s how it works with matching_lines_from_file():

def matching_lines_from_file(pattern, path):
    lines = lines_from_file(path)
    yield from matching_lines(lines, pattern)

The formal name for what yield from does is "delegating to a sub-generator", and it establishes a deeper connection between the containing and inner generator objects. In particular, generator objects have certain methods - send, throw and close - for passing information back into the context of the running generator function. I won’t cover them in this edition of the book, as they are not currently widely used; you can learn more by reading PEPs 342 and 380.[2] If you do use them, yield from becomes necessary to enable the flow of information back into the scope of the running coroutine.
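Just to give a taste of send() (a quick sketch of a feature this book otherwise skips): calling send(value) resumes the generator like next() does, except the paused yield expression evaluates to the value you sent. The running_total() example below is hypothetical:

```python
def running_total():
    total = 0
    while True:
        amount = yield total   # receives the value passed to send()
        total += amount

totals = running_total()
next(totals)             # advance to the first yield ("priming" it)
print(totals.send(10))   # 10
print(totals.send(5))    # 15
```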

Python is Filled With Iterators

Let’s look at dictionaries:

>>> calories = {
...     "apple": 95,
...     "slice of bacon": 43,
...     "cheddar cheese": 113,
...     "ice cream": 15,  # You wish!
... }
>>> items = calories.items()
>>> type(items)
<class 'dict_items'>

So what is this dict_items object returned by calories.items()? It turns out to be what Python calls a view. There is not any kind of base view type; rather, an object quacks like a dictionary view if it supports three things:

  • len(view) returns the number of items,
  • iter(view) returns an iterator over the key-value pairs, and
  • (key, value) in view returns True if that key-value pair is in the dictionary, else False.

In other words, a dictionary view is iterable, with a couple of bonus features. It also dynamically updates if its source dictionary changes:

>>> items = calories.items()
>>> len(items)
4
>>> calories['orange'] = 50
>>> len(items)
5
>>> ('orange', 50) in items
True
>>> ('orange', 20) in items
False

Dictionaries also have .keys() and .values(). Like .items(), they each return a view. But instead of key-value pairs, they only contain keys or values, respectively:

>>> foods = calories.keys()
>>> counts = calories.values()
>>> 'yogurt' in foods
False
>>> 100 in counts
False
>>> calories['yogurt'] = 100
>>> 'yogurt' in foods
True
>>> 100 in counts
True

Iteration has snuck into many places in Python. The built-in range function returns an iterable:

>>> seq = range(3)
>>> type(seq)
<class 'range'>
>>> for n in seq: print(n)
...
0
1
2

The built-in map, filter, and zip functions all return iterators:

>>> numbers = [1, 2, 3]
>>> big_numbers = [100, 200, 300]
>>> def double(n):
...     return 2 * n
...
>>> def is_even(n):
...     return n % 2 == 0
...
>>> mapped = map(double, numbers)
>>> mapped
<map object at 0x1013ac518>
>>> for num in mapped: print(num)
...
2
4
6

>>> filtered = filter(is_even, numbers)
>>> filtered
<filter object at 0x1013ac668>
>>> for num in filtered: print(num)
...
2

>>> zipped = zip(numbers, big_numbers)
>>> zipped
<zip object at 0x1013a9608>
>>> for pair in zipped: print(pair)
...
(1, 100)
(2, 200)
(3, 300)

Notice that mapped is something called a "map object", rather than a list of the results of the calculation; and similar for filtered and zipped. These are all iterators - giving you all the benefits of iteration, built into the language.
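In fact, you now know enough to sketch scalable versions of map() and filter() yourself as generator functions (simplified - the real map() also accepts multiple iterables):

```python
def my_map(func, iterable):
    # Like map(): applies func lazily, one element at a time.
    for item in iterable:
        yield func(item)

def my_filter(predicate, iterable):
    # Like filter(): yields only elements passing the predicate.
    for item in iterable:
        if predicate(item):
            yield item

print(list(my_map(lambda n: 2 * n, [1, 2, 3])))          # [2, 4, 6]
print(list(my_filter(lambda n: n % 2 == 0, [1, 2, 3])))  # [2]
```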

The Iterator Protocol

This optional section explains Python’s iterator protocol in formal detail, giving you a precise and low-level understanding of how generators, iterators, and iterables all work. For the day-to-day coding of most programmers, it’s not nearly as important as everything else in this chapter. That said, you need this information to implement your own, custom iterable collection types. Personally, I also find knowing the protocol helps me reason through iteration-related issues and edge cases; by knowing these details, I’m able to quickly troubleshoot and fix certain bugs that might otherwise eat up my afternoon. If this all sounds valuable to you, keep reading; otherwise, feel free to skip to the next chapter.

As mentioned, Python makes a distinction between iterators, versus objects that are iterable. The difference is subtle to begin with, and frankly it doesn’t help that the two words sound nearly identical. Keep clear in your mind that "iterator" and "iterable" are distinct but related concepts, and the following will be easier to understand.

Informally, an iterator is something you can pass to next(), or use exactly once in a for loop. More formally, an object in Python is an iterator if it follows the iterator protocol. And an object follows the iterator protocol if it meets the following criteria:

  • It defines a method named __next__, called with no arguments.
  • Each time __next__() is called, it produces the next item in the sequence, until all items have been produced.
  • After that, subsequent calls to __next__() raise StopIteration.
  • It also defines a boilerplate method named __iter__, called with no arguments, and returning the same iterator. Its body is literally return self.
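
To make these rules concrete, here is a minimal hand-written iterator. (CountdownIterator is an invented name for illustration, not something from this chapter.)

```python
class CountdownIterator:
    """An iterator producing n, n-1, ..., 1."""
    def __init__(self, n):
        self.current = n
    def __next__(self):
        # Raise StopIteration once all items have been produced.
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value
    def __iter__(self):
        # The boilerplate method: an iterator returns itself.
        return self

for num in CountdownIterator(3):
    print(num)   # prints 3, then 2, then 1
```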

Any object with these methods can call itself a Python iterator. You are not intended to call the __next__() method directly. Instead, you will use the built-in next() function. To understand better, here is how you might write your own next() function:

  1. _NO_DEFAULT = object()
  2. def next(it, default=_NO_DEFAULT):
  3.     try:
  4.         return it.__next__()
  5.     except StopIteration:
  6.         if default is _NO_DEFAULT:
  7.             raise
  8.         return default

(As a side note, notice how it creates a unique sentinel value, _NO_DEFAULT, rather than defaulting to a built-in value like None. A sentinel value is a value that exists solely for signaling in the algorithm, and is meant to never overlap with a possible value for real data. This way you can pass any value to default that you like without conflict.)
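
In practice, that default argument lets you avoid wrapping next() calls in try/except. A short demonstration:

```python
it = iter([10, 20])
print(next(it))        # prints 10
print(next(it))        # prints 20

# The iterator is now exhausted. With a default supplied,
# next() returns it instead of raising StopIteration:
print(next(it, None))  # prints None
print(next(it, -1))    # prints -1
```

Because the real next() uses a private sentinel internally, even None works fine as an explicit default, with no ambiguity on the caller's side.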

Now, all the above is for the "iterator". Let’s explain the other word, "iterable". Informally, an object is iterable if it knows how to create an iterator over its contents, which you can access with the built-in iter() function. More formally, a Python container object is iterable if it meets one of these two criteria:

  • It defines a method called __iter__(), which creates and returns an iterator over the elements in the container; or
  • it follows the sequence protocol. This means it defines __getitem__ - the magic method for square brackets - and lets you reference foo[0], foo[1], etc., raising an IndexError once you go past the last element.

(Notice "iterator" is a noun, while "iterable" is usually an adjective. This can help you remember which is which.)

When implementing your own container type, you probably want to make it iterable, so you and others can use it in a for loop. Depending on the nature of the container, it’s often easiest to implement the sequence protocol. As an example, consider this simple UniqueList type, which is a kind of hybrid between a list and a set:

  1. class UniqueList:
  2.     def __init__(self, items):
  3.         self.items = []
  4.         for item in items:
  5.             self.append(item)
  6.     def append(self, item):
  7.         if item not in self.items:
  8.             self.items.append(item)
  9.     def __getitem__(self, index):
  10.         return self.items[index]

Use it like this:

  1. >>> u = UniqueList([3,7,2,9,3,4,2])
  2. >>> u.items
  3. [3, 7, 2, 9, 4]
  4. >>> u[3]
  5. 9
  6. >>> u[42]
  7. Traceback (most recent call last):
  8.   File "<stdin>", line 1, in <module>
  9.   File "<stdin>", line 10, in __getitem__
  10. IndexError: list index out of range

The __getitem__ method implements square-bracket access; basically, Python translates u[3] into u.__getitem__(3). We’ve programmed this object’s square brackets to operate much like a normal list, in that the initial element is at index 0, and you get subsequent elements with subsequent integers, not skipping any. And when you go past the end, it raises IndexError. If an object has a __getitem__ method behaving in just this way, iter() knows how to create an iterator over it:

  1. >>> u_iter = iter(u)
  2. >>> type(u_iter)
  3. <class 'iterator'>
  4. >>> for num in u_iter: print(num)
  5. 3
  6. 7
  7. 2
  8. 9
  9. 4

Notice we get a lot of this behavior for free, simply because we’re using an actual list internally (and thus delegating much of the __getitem__ logic to it). That’s a clue for you, whenever you make a custom collection that acts like a list - or one of the other standard collection types. If your object internally stores its data in one of the standard data types, you’ll often have an easier time mimicking its behavior.

Implementing the sequence protocol - i.e., writing a __getitem__ method which acts like a list’s - is one way to make your class iterable. The other involves writing an __iter__ method. When called with no arguments, it must return some object which follows the iterator protocol, described above. In the worst case, you’ll need to implement something like the SquaresIterator class from earlier in this chapter, with its own __next__ and __iter__ methods. But usually you don’t have to work that hard, because you can simply return a generator object instead. That means __iter__ is a generator function itself, or it internally calls some other generator function, returning its value.
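
For instance, here is a sketch of a UniqueList variant (an invented example, with a hypothetical name) whose __iter__ is itself a generator function:

```python
class UniqueListGen:
    """Like UniqueList above, but iterable via a generator."""
    def __init__(self, items):
        self.items = []
        for item in items:
            if item not in self.items:
                self.items.append(item)
    def __iter__(self):
        # Calling this generator function returns a generator
        # object, which already follows the iterator protocol,
        # so there's no separate iterator class to write.
        for item in self.items:
            yield item

u = UniqueListGen([3, 7, 2, 9, 3, 4, 2])
print(list(u))   # prints [3, 7, 2, 9, 4]
```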

Both iterables and iterators must have an __iter__ method. Both versions are called with no arguments, and both return an iterator object. The only difference: an iterator’s __iter__ returns that same iterator (itself), while an iterable’s creates and returns a new iterator. Call an iterable’s __iter__ twice, and you get two different iterators.

This similarity is intentional, to simplify control code that can accept either iterators or iterables. Here’s the mental model you can safely follow: when Python’s runtime encounters a for loop, it will start by invoking iter(sequence). This always returns an iterator: either sequence itself, or (if sequence is only iterable) the iterator created by sequence.__iter__().
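
You can observe this difference directly:

```python
numbers = [7, 4, 11]

# A list is iterable: each call to iter() creates a brand-new iterator.
it1 = iter(numbers)
it2 = iter(numbers)
print(it1 is it2)        # prints False

# An iterator's __iter__ returns the very same object:
print(iter(it1) is it1)  # prints True
```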

Iterables are everywhere in Python. Almost all built-in collection types are iterable: list, tuple, and set, and even dict. (Though you’ll probably want to use dict.items() - a simple for x in some_dict will iterate just over the keys).
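
For example (assuming Python 3.7 or later, where dicts preserve insertion order):

```python
ages = {"alice": 30, "bob": 25}

# Plain iteration walks over the keys only:
for key in ages:
    print(key)           # prints alice, then bob

# .items() produces (key, value) pairs - usually what you want:
for name, age in ages.items():
    print(name, age)     # prints alice 30, then bob 25
```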

In your own custom collection classes, sometimes the easiest way to implement __iter__() actually involves using iter(). For instance, this will not work:

  1. class BrokenInLoops:
  2.     def __init__(self):
  3.         self.items = [7, 3, 9]
  4.     def __iter__(self):
  5.         return self.items

If you try it, you get a TypeError:

  1. >>> items = BrokenInLoops()
  2. >>> for item in items:
  3. ...     print(item)
  4. Traceback (most recent call last):
  5.   File "<stdin>", line 1, in <module>
  6. TypeError: iter() returned non-iterator of type 'list'

It doesn’t work because __iter__() is supposed to return an iterator, but a list object is not an iterator; it is simply iterable. You can fix this with a one-line change: use iter() to create an iterator object inside of __iter__(), and return that object:

  1. class WorksInLoops:
  2.     def __init__(self):
  3.         self.items = [7, 3, 9]
  4.     def __iter__(self):
  5.         # This class is identical to BrokenInLoops,
  6.         # except for this next line.
  7.         return iter(self.items)

This makes WorksInLoops itself iterable: its __iter__ method returns an actual iterator object, which is exactly what the protocol for an iterable requires. And that __iter__ method creates a fresh iterator each time it’s called:

  1. >>> items = WorksInLoops()
  2. >>> for item in items:
  3. ...     print(item)
  4. 7
  5. 3
  6. 9
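
Because each call to __iter__ produces an independent iterator, the same object can be looped over any number of times. (The class definition is repeated here so the snippet stands alone:)

```python
class WorksInLoops:
    def __init__(self):
        self.items = [7, 3, 9]
    def __iter__(self):
        # Each call creates a fresh, independent iterator.
        return iter(self.items)

items = WorksInLoops()
first_pass = list(items)
second_pass = list(items)  # a second, independent iterator
print(first_pass)   # prints [7, 3, 9]
print(second_pass)  # prints [7, 3, 9]
```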
