Scaling With Generators
This for loop seems simple:
- for item in items:
-     do_something_with(item)
And yet, miracles hide here. As you probably know, the act of efficiently going through a collection, one element at a time, is called iteration. But few understand how Python’s iteration system really works… how deep and well-thought-out it is. This chapter makes you one of those people, giving you the ability to naturally write highly scalable Python applications… able to handle ever-larger data sets in performant, memory-efficient ways.
Iteration is also core to one of Python’s most powerful tools: the generator function. Generator functions are not just a convenient way to create useful iterators. They enable some exquisite patterns of code organization - patterns that, by their very nature, encourage excellent coding habits.
This chapter is special, because understanding it threatens to make you a permanently better programmer in every language. Mastering Python generators tends to do that, because of the distinctions and insights you gain along the way. Let’s dive in.
Iteration in Python
Python has a built-in function called iter(). When you pass it a
collection, you get back an iterator object:
- >>> numbers = [7, 4, 11, 3]
- >>> iter(numbers)
- <list_iterator object at 0x10219dc50>
Just as in other languages, a Python iterator produces the values in a sequence, one at a time. You probably know an iterator is like a moving pointer over the collection:
- >>> numbers_iter = iter(numbers)
- >>> for num in numbers_iter: print(num)
- 7
- 4
- 11
- 3
You don’t normally need to do this. If you instead write for num in
numbers, what Python effectively does under the hood is call iter()
on that collection. This happens automatically. Whatever object it
gets back is used as the iterator for that for loop:
- # This...
- for num in numbers:
-     print(num)
-
- # ... is effectively just like this:
- numbers_iter = iter(numbers)
- for num in numbers_iter:
-     print(num)
An iterator over a collection is a separate object, with its own
identity - which you can verify with id():
- >>> # id() returns a unique number for each object.
- ... # Different objects will always have different IDs.
- >>> id(numbers)
- 4330133896
- >>> id(numbers_iter)
- 4330216640
How does iter() actually get the iterator? It can do this in several
ways, but one relies on a magic method called
__iter__. This is a method any class (including yours) may
define; when called with no arguments, it must return a fresh iterator
object. Lists have it, for example:
- >>> numbers.__iter__
- <method-wrapper '__iter__' of list object at 0x10130e4c8>
- >>> numbers.__iter__()
- <list_iterator object at 0x1013180f0>
Python makes a distinction between objects
which are iterators, and objects which are iterable. We say an
object is iterable if and only if you can pass it to iter(), and
get a ready-to-use iterator. If that object has an __iter__
method, iter() will call it to get the iterator. Python lists and
tuples are iterable. So are strings, which is why you can write for
char in my_str: to iterate over my_str's characters. Any
container you might use in a for loop is iterable.
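For instance, here’s what that looks like with a string - a quick interactive sketch (the object address will differ on your machine):
- >>> char_iter = iter("hi")
- >>> char_iter
- <str_iterator object at 0x10219dd30>
- >>> next(char_iter)
- 'h'
- >>> next(char_iter)
- 'i'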
A for loop is the most common way to
step through a sequence. But sometimes your code needs to step through in
a more fine-grained way. For this, use the built-in function
next(). You normally call it with a single argument, which is an
iterator. Each time you call it, next(my_iterator) fetches and
returns the next element:
- >>> names = ["Tom", "Shelly", "Garth"]
- >>> # Create a fresh iterator...
- >>> names_it = iter(names)
- >>> next(names_it)
- 'Tom'
- >>> next(names_it)
- 'Shelly'
- >>> next(names_it)
- 'Garth'
What happens if you call next(names_it) again? next() will raise a
special built-in exception, called StopIteration:
- >>> next(names_it)
- Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- StopIteration
This is part of Python’s iterator
protocol. Raising this specific exception is, by design, how an
iterator signals the sequence is done. You rarely have to raise or
catch this exception yourself, though we’ll see some patterns later
where it’s useful to do so. A good mental model for how a for loop
works is to imagine it calling next() each time through the loop,
exiting when StopIteration gets raised.
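To make that mental model concrete, here is a rough sketch of what a for loop over names does behind the scenes (simplified - the real machinery lives inside the interpreter):
- # Roughly equivalent to: for name in names: print(name)
- names_it = iter(names)
- while True:
-     try:
-         name = next(names_it)
-     except StopIteration:
-         break
-     print(name)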
When using next() yourself, you can provide a second argument, for
the default value. If you do, next() will return that instead of
raising StopIteration at the end:
- >>> names = ["Tom", "Shelly", "Garth"]
- >>> new_names_it = iter(names)
- >>> next(new_names_it, "Rick")
- 'Tom'
- >>> next(new_names_it, "Rick")
- 'Shelly'
- >>> next(new_names_it, "Rick")
- 'Garth'
- >>> next(new_names_it, "Rick")
- 'Rick'
- >>> next(new_names_it)
- Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- StopIteration
- >>> next(new_names_it, "Jane")
- 'Jane'
Now, let’s consider a different situation. What if you aren’t working with a simple sequence of constants? What if you are calculating or reading or otherwise obtaining the sequence elements as you go along?
Let’s start with a simple example (so it’s easy to reason about). Suppose you need to write a function creating a list of square numbers, which will be processed by other code:
- def fetch_squares(max_root):
-     squares = []
-     for n in range(max_root):
-         squares.append(n**2)
-     return squares
-
- MAX = 5
- for square in fetch_squares(MAX):
-     do_something_with(square)
This works. But there’s a potential problem lurking here. Can you spot it?
Here’s one: what if MAX is not 5, but 10,000,000? Or
10,000,000,000? Or more? Your memory footprint is pointlessly
dreadful: the code here creates a massive list, uses it once, then
throws it away. On top of that, the second for loop cannot even
start until the entire list of squares has been fully calculated.
If some poor human is using this program, they’ll wonder if the
program is stuck.
Even worse: What if you aren’t doing arithmetic to get each element - which is fast and cheap - but making a truly expensive calculation? Or making an API call over the network? Or reading from a database? Your program is sluggish, perhaps unresponsive, and might even crash with an out-of-memory error. Its users will think you’re a terrible programmer.
The solution is to create an iterator to start with, lazily computing each value only when needed. Then each cycle through the loop happens just in time.
For the record, here is how you create an equivalent iterator class, which fully complies with Python’s iterator protocol:
- class SquaresIterator:
-     def __init__(self, max_root_value):
-         self.max_root_value = max_root_value
-         self.current_root_value = 0
-     def __iter__(self):
-         return self
-     def __next__(self):
-         if self.current_root_value >= self.max_root_value:
-             raise StopIteration
-         square_value = self.current_root_value ** 2
-         self.current_root_value += 1
-         return square_value
-
- # You can use it like this:
- for square in SquaresIterator(5):
-     print(square)
Holy crap, that’s horrible. There’s got to be a better way.
Good news: there’s a better way. It’s called a generator function, and you’re going to love it!
Generator Functions
Python provides a tool called the generator function, which… well, it’s hard to describe everything it gives you in one sentence. Of its many talents, I’ll first focus on how it’s a very useful shortcut for creating iterators.
A generator function looks a lot like a regular function. But instead
of saying return, it uses a new and different keyword:
yield. Here’s a simple example:
- def gen_nums():
-     n = 0
-     while n < 4:
-         yield n
-         n += 1
Use it in a for loop like this:
- >>> for num in gen_nums():
- ...     print(num)
- 0
- 1
- 2
- 3
Let’s go through and understand this. When you call gen_nums() like
a function, it immediately returns a generator object:
- >>> sequence = gen_nums()
- >>> type(sequence)
- <class 'generator'>
The generator function is
gen_nums() - what we define and then call. A function is a generator
function if and only if it uses "yield" instead of "return". The
generator object is what that generator function returns when called
- sequence, in this case.
MEMORIZE THIS FACT: A generator function will ALWAYS return a generator object. It can never return anything else.
This generator object is an iterator, which means you can iterate
through it using next() or a for loop:
- >>> sequence = gen_nums()
- >>> next(sequence)
- 0
- >>> next(sequence)
- 1
- >>> next(sequence)
- 2
- >>> next(sequence)
- 3
- >>> next(sequence)
- Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- StopIteration
- >>> # Or in a for loop:
- ... for num in gen_nums(): print(num)
- ...
- 0
- 1
- 2
- 3
The flow of code works like this: when next() is called the first
time, or the for loop first starts, the body of gen_nums starts
executing at the beginning, running until it reaches the yield
statement; the value to the right of that yield is what gets returned.
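You can see this laziness for yourself: merely calling a generator function runs none of its body; only the first next() does. A quick sketch, using a hypothetical noisy_gen():
- >>> def noisy_gen():
- ...     print("Starting!")
- ...     yield 1
- ...
- >>> gen = noisy_gen()  # Nothing is printed yet.
- >>> next(gen)          # Now the body runs, up to the first yield.
- Starting!
- 1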
So far, this is much like a regular function. But the next time
next() is called - or, equivalently, the next time through the for
loop - the function doesn’t start at the beginning again. It starts on
the line after the yield statement. Look at the source of gen_nums()
again:
- def gen_nums():
-     n = 0
-     while n < 4:
-         yield n
-         n += 1
gen_nums is more general than a function or subroutine. It’s actually
a coroutine. You see, a regular function can have several exit
points (otherwise known as return statements). But it has only one
entry point: each time you call a function, it always starts at the
first line of the function body.
A coroutine is like a function, except it has several possible
entry points. It starts with the first line, like a normal
function. But when it "returns", the coroutine isn’t exiting, so much
as pausing. Subsequent calls with next() - or equivalently, the
next time through the for loop - start at that yield statement
again, right where it left off; the re-entry point is the line after
the yield statement.
And that’s the key: Each yield statement simultaneously defines an exit
point, and a re-entry point.
For generator objects, each time a new value is requested,
the flow of control picks up on the line after the yield
statement. In this case, the next line increments the variable n,
then continues with the while loop.
Notice we do not raise StopIteration anywhere in the body of
gen_nums(). When the function body finally exits - after it exits
the while loop, in this case - the generator object automatically
raises StopIteration.
Again: each yield statement simultaneously defines an exit point,
and a re-entry point. In fact, you can have multiple yield
statements in a generator:
- def gen_extra_nums():
-     n = 0
-     while n < 4:
-         yield n
-         n += 1
-     yield 42  # Second yield
Here’s the output when you use it:
>>> for num in gen_extra_nums():
...     print(num)
...
0
1
2
3
42
The second yield is reached after the while loop exits. When the
function reaches the implicit return at the end, the iteration
stops. Reason through the code above, and convince yourself it makes
sense.
Let’s revisit the earlier example, of cycling through a sequence of squares. This is how we first did it:
- def fetch_squares(max_root):
-     squares = []
-     for n in range(max_root):
-         squares.append(n**2)
-     return squares
-
- MAX = 5
- for square in fetch_squares(MAX):
-     do_something_with(square)
As an exercise, pause here, open up a new Python file, and see if you
can write a gen_squares generator function that accomplishes the
same thing.
Done? Great. Here’s what it looks like:
- >>> def gen_squares(max_num):
- ...     for num in range(max_num):
- ...         yield num ** 2
- ...
- >>> MAX = 5
- >>> for square in gen_squares(MAX):
- ...     print(square)
- 0
- 1
- 4
- 9
- 16
Now, this gen_squares() has a potential problem. Can you spot it?
And once you do: IS it an actual problem? Why or why not?
Take a moment to think about it…
Okay, time’s up:
I’m referring to the range() function. What does
that return, exactly?
Because imagine range() returns a list. If that’s the case, and MAX
is huge, that creates a huge list inside your generator function -
completely destroying its scalability. On the other hand, if range()
returns some kind of lazy, iterable object, producing its values one
at a time on demand… then there’s no problem.
Good news: in Python 3, range() returns exactly that kind of lazy object. Whew!
The larger point: Generator functions are ONLY as scalable as their LEAST scalable line of code.
Generator functions potentially have a small memory footprint, but only if you code intelligently. When writing generator functions, be watchful for hidden bottlenecks.
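For instance, here is a hypothetical rewrite of gen_squares() - it still yields lazily, but quietly gives up the memory advantage by building the full list of squares up front:
- # Don't do this: the list comprehension materializes every
- # square in memory before the first value is ever yielded.
- def gen_squares_bottleneck(max_num):
-     squares = [num ** 2 for num in range(max_num)]
-     for square in squares:
-         yield square
With MAX at ten million, this version has roughly the same dreadful memory footprint as the original fetch_squares().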
Now, strictly speaking, we don’t need generator functions for
iteration. We just want them, because they make certain patterns of
scalability far easier. Now that we’re in a position to understand
it, let’s look at the SquaresIterator class again - the same one you
saw above:
- # Same code we saw earlier.
- class SquaresIterator:
-     def __init__(self, max_root_value):
-         self.max_root_value = max_root_value
-         self.current_root_value = 0
-     def __iter__(self):
-         return self
-     def __next__(self):
-         if self.current_root_value >= self.max_root_value:
-             raise StopIteration
-         square_value = self.current_root_value ** 2
-         self.current_root_value += 1
-         return square_value
-
- # You can use it like this:
- for square in SquaresIterator(5):
-     print(square)
Each value is obtained by invoking its __next__() method, until it
raises StopIteration. This produces the same output; but look at the
source for the SquaresIterator class, and compare it to the source
for the generator above. Which is easier to read? Which is easier to
maintain? And when requirements change, which is easier to modify
without introducing errors? Most people find the generator solution
easier and more natural.
Authors often use the word "generator" by itself, to mean either the generator function, or the generator object returned when you call it. Typically the writer thinks it’s obvious by the context which they are referring to; sometimes it is, but often not. Sometimes the writer is not even clear on the distinction to begin with. But it’s important: just as there is a big difference between a function, and the value it returns when you call it, so is there a big difference between the generator function, and the generator object it returns.
In your own thought and speech, I encourage you to only use the phrases "generator function" and "generator object", so you are always clear inside yourself, and in your communication. (Which also helps your teammates be more clear.) The only exception: when you truly mean "generator functions and objects", referring to the language feature as a broad concept, then it’s okay to just say "generators". I’ll lead by example in this book.
Generator Patterns and Scalable Composability
Here’s a little generator function:
- def matching_lines_from_file(path, pattern):
-     with open(path) as handle:
-         for line in handle:
-             if pattern in line:
-                 yield line.rstrip('\n')
matching_lines_from_file() demonstrates several important practices
for modern Python, and is worth studying. It does simple substring
matching on lines of a text file, yielding lines containing that
substring.
The first line opens a read-only file object, called handle. If you
haven’t been opening your file objects using with statements, start
today. The main benefit is that once the with block is exited, the
file object is automatically closed - even if an exception causes a
premature exit. It’s similar to:
- handle = open(path)
- try:
-     # read from handle
- finally:
-     handle.close()
(The try/finally is explained in the exceptions chapter.) Next we
have for line in handle. This useful idiom, which not many people
seem to know about, is a special case for text files. Each iteration
through the for loop, a new line of text will be read from the
underlying text file, and placed in the line variable.
Sometimes people foolishly take another approach, which I have to warn you about:
- # Don't do this!!
- for line in handle.readlines():
.readlines() (plural) reads in the entire file, parses it into
lines, and returns a list of strings - one string per line. By now,
you realize how this destroys the generator
function’s scalability.
Another approach you will sometimes see, which is scalable, is to
use the file object’s .readline() method (singular), which manually
returns lines one at a time:
- # .readline() reads and returns a single line of text,
- # or returns the empty string at end-of-file.
- line = handle.readline()
- while line != '':
-     # do something with line
-     # ...
-     # At the end of the loop, read the next line.
-     line = handle.readline()
But simply writing for line in handle is clearer and easier.
After that, it’s straightforward: matching lines have any trailing
\n-character stripped, and are yielded to the consumer. When writing
generator functions, you want to ask yourself "what is the maximum
memory footprint of this function, and how can I minimize it?" You
can think of scalability as inversely proportional to this
footprint. For matching_lines_from_file(), it will be about equal to
the size of the longest line in the text file. So it’s appropriate for
the typical human-readable text file, whose lines are small.
(It’s also possible to point it to, say, a ten-terabyte text file consisting of exactly one line. If you expect something like that, you’ll need a different approach.)
Now, suppose a log file contains lines like this:
WARNING: Disk usage exceeding 85%
DEBUG: User 'tinytim' upgraded to Pro version
INFO: Sent email campaign, completed normally
WARNING: Almost out of beer
... and you exercise matching_lines_from_file() like so:
for line in matching_lines_from_file("log.txt", "WARNING:"):
    print(line)
That yields these records:
WARNING: Disk usage exceeding 85%
WARNING: Almost out of beer
Suppose your application needs that data in dict form:
{"level": "WARNING", "message": "Disk usage exceeding 85%"}
{"level": "DEBUG", "message":
"User 'tinytim' upgraded to Pro version"}
We want to scalably transform the records from one form to another - from strings (lines of the log file), to Python dicts. So let’s make a new generator function to connect them:
- def parse_log_records(lines):
-     for line in lines:
-         level, message = line.split(": ", 1)
-         yield {"level": level, "message": message}
Now we can connect the two:
- # log_lines is a generator object
- log_lines = matching_lines_from_file("log.txt", "WARNING:")
- for record in parse_log_records(log_lines):
-     # record is a dict
-     print(record)
Of course, parse_log_records() can be used on its own:
- with open("log.txt") as handle:
- for record in parse_log_records(handle):
- print(record)
matching_lines_from_file() and parse_log_records() are like building
blocks. Properly designed, they can be used to build different data
processing streams. I call this scalable composability.
It goes beyond designing composable functions and types.
Ask yourself how you can make each component scalable, so that
whatever is assembled out of them is scalable too.
Let’s discuss a particular design point. Both
matching_lines_from_file() and parse_log_records() produce an
iterator. (Or, more specifically, a generator object). But they have a
discrepancy on the input side: parse_log_records() accepts an
iterator, but matching_lines_from_file() requires a path to a file to
read from. This means matching_lines_from_file() combines two
functions: read lines of text from a file, then filter those lines
based on some criteria.
Combining functions like this is often what you want in realistic
code. But when designing components to flexibly compose together,
inconsistent interfaces like this can be limiting. Let’s break up the
services in matching_lines_from_file() into two generator functions:
- def lines_from_file(path):
-     with open(path) as handle:
-         for line in handle:
-             yield line.rstrip('\n')
-
- def matching_lines(lines, pattern):
-     for line in lines:
-         if pattern in line:
-             yield line
You can compose these like so:
- lines = lines_from_file("log.txt")
- matching = matching_lines(lines, 'WARNING:')
- for line in matching:
-     print(line)
Or even redefine matching_lines_from_file() in terms of them:
- def matching_lines_from_file(path, pattern):
-     lines = lines_from_file(path)
-     matching = matching_lines(lines, pattern)
-     for line in matching:
-         yield line
Conceptually, this factored-out matching_lines does a filtering
operation; all lines are read in, and a subset are
yielded. parse_log_records() is different. One input record (a str
line) maps to exactly one output record (a dict). Mathematically,
it’s a mapping operation. Think of it as a transformer or
adapter. lines_from_file() is in a third category; instead of taking
a stream as input, it takes a completely different parameter. Since it
still returns an iterator of records, think of it as a source. And
any real program will eventually want to do something with that
stream, consuming it without producing another iterator; call that a
sink.
You need all these pieces to make a working program. When designing a chainable set of generator functions like this - or even better, a toolkit for constructing internal data pipelines - ask yourself whether each component is a sink, a source, or whether it does filtering, or mapping; or whether it’s some combination of these. Just asking yourself this question leads to a more usable, readable, and maintainable codebase. And if you’re making a library which others will use, you’re more likely to end up with a toolkit so powerfully flexible, people use it to build programs you never imagined.
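To make those categories concrete, here is one way the pieces might fit together - lines_from_file() as the source, matching_lines() as a filter, parse_log_records() as a mapper, and a small count_by_level() function (hypothetical, not defined earlier) acting as the sink:
- # A sink: consumes a stream of record dicts, and produces
- # a summary instead of another iterator.
- def count_by_level(records):
-     counts = {}
-     for record in records:
-         level = record["level"]
-         counts[level] = counts.get(level, 0) + 1
-     return counts
-
- lines = lines_from_file("log.txt")            # source
- warnings = matching_lines(lines, "WARNING:")  # filter
- records = parse_log_records(warnings)         # mapping
- print(count_by_level(records))                # sink
For the sample log above, this would print {'WARNING': 2}.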
I want you to notice something about parse_log_records(). As I said,
it fits in the "mapping" category. And notice its mapping is
one-to-one: one line of text becomes one dictionary. In other words,
each record in the input - a str - becomes exactly one record in
the output - a dict.
That isn’t always the case. Sometimes, your generator function needs to consume several input records to create one output record. Or the opposite: one input record yields several output records.
Here’s an example of the latter. Imagine a text file containing lines in a poem:[1]
all night our room was outer-walled with rain
drops fell and flattened on the tin roof
and rang like little disks of metal
Let’s create a generator function, words_in_text(), producing the
words one at a time. First approach:
- # lines is an iterator of text file lines,
- # e.g. returned by lines_from_file()
- def words_in_text(lines):
-     for line in lines:
-         for word in line.split():
-             yield word
This generator function takes a fan out approach. No input records are dropped, which means it doesn’t do any filtering; it’s still purely in the "mapping" category of generator functions. But the mapping isn’t one to one. Rather, one input record produces one or more output records. So, when you run the following code:
- poem_lines = lines_from_file("poem.txt")
- poem_words = words_in_text(poem_lines)
- for word in poem_words:
-     print(word)
... it produces this output:
all
night
our
room
was
outer-walled
...
That first input record - "all night our room was outer-walled with rain" - yields eight words (output records). Ignoring any blank lines in the poem, every line will produce at least one word - probably several.
The idea of fanning out is interesting, but simple enough. It’s more complex when we go the opposite direction: fanning in. That means the generator function consumes more than one input record to produce each output record. Doing this successfully requires an awareness of the input’s structure, and you’ll typically need to encode some simple parsing logic.
Imagine a text file containing residential house sale data. Each record is a set of key-value pairs, one pair per line, with records separated by blank lines:
address: 1423 99th Ave
square_feet: 1705
price_usd: 340210

address: 24257 Pueblo Dr
square_feet: 2305
price_usd: 170210

address: 127 Cochran
square_feet: 2068
price_usd: 320500
To read this data into a form usable in our code, what we want is a
generator function - let’s name it house_records() - which accepts
a sequence of strings (lines) and parses them into convenient dictionaries:
>>> lines_of_house_data = lines_from_file("housedata.txt")
>>> houses = house_records(lines_of_house_data)
>>> # Fetch the first record.
... house = next(houses)
>>> house['address']
'1423 99th Ave'
>>> house = next(houses)
>>> house['address']
'24257 Pueblo Dr'
How would you create this? If practical, pause here, open up a code editor, and see if you can implement it.
Okay, time’s up. Here is one approach:
- def house_records(lines):
-     record = {}
-     for line in lines:
-         if line == '':
-             yield record
-             record = {}
-             continue
-         key, value = line.split(': ', 1)
-         record[key] = value
-     yield record
Notice where the yield keywords appear. The last two lines of the
for loop parse individual key-value pairs. Starting with an empty
record dictionary, it’s populated with data until lines produces an
empty line. That signals the current record is complete, so it’s
yield-ed, and a new record dictionary is created. The end of the
very last record in housedata.txt is signaled not by an empty line,
but by the end of the file; that’s why we need the final yield
statement.
As defined, house_records() is a bit clunky if we’re normally reading
from a text file. It makes sense to define a new generator function
accepting just the path to the file:
- def house_records_from_file(path):
-     lines = lines_from_file(path)
-     for house in house_records(lines):
-         yield house
-
- # Then in your program:
- for house in house_records_from_file("housedata.txt"):
-     print(house["address"])
You may have noticed many of these examples have a bit of
boilerplate, when one generator function internally calls another.
The last two lines of house_records_from_file say:
- for house in house_records(lines):
-     yield house
Python provides a shortcut, which lets you accomplish this in one
line, with the yield from statement:
- def house_records_from_file(path):
-     lines = lines_from_file(path)
-     yield from house_records(lines)
Even though "yield from" is two words, semantically it’s like a single
keyword, and distinct from yield. The yield from statement is used
specifically in generator functions, when they yield values directly
from another generator object (or, equivalently, by calling another
generator function). Using it often simplifies your code, as you see
in house_records_from_file(). Going back a bit, here’s how it works
with matching_lines_from_file():
- def matching_lines_from_file(path, pattern):
-     lines = lines_from_file(path)
-     yield from matching_lines(lines, pattern)
The formal name for what yield from does is "delegating to a
sub-generator", and instills a deeper connection between the
containing and inner generator objects. In particular, generator
objects have certain methods - send, throw and close - for
passing information back into the context of the running
generator function. I won’t cover them in this edition of the book, as
they are not currently widely used; you can learn more by reading PEPs
342 and 380.[2] If you
do use them, yield from becomes necessary to enable the flow of
information back into the scope of the running coroutine.
Python is Filled With Iterators
Let’s look at dictionaries:
- >>> calories = {
- ... "apple": 95,
- ... "slice of bacon": 43,
- ... "cheddar cheese": 113,
- ... "ice cream": 15, # You wish!
- ... }
- >>> items = calories.items()
- >>> type(items)
- <class 'dict_items'>
So what is this dict_items
object returned by calories.items()? It turns out to be what Python
calls a view. There is not any kind of base view type; rather, an
object quacks like a dictionary view if it supports three things:
- len(view) returns the number of items,
- iter(view) returns an iterator over the key-value pairs, and
- (key, value) in view returns True if that key-value pair is in the dictionary, else False.
In other words, a dictionary view is iterable, with a couple of bonus features. It also dynamically updates if its source dictionary changes:
- >>> items = calories.items()
- >>> len(items)
- 4
- >>> calories['orange'] = 50
- >>> len(items)
- 5
- >>> ('orange', 50) in items
- True
- >>> ('orange', 20) in items
- False
Dictionaries also have .keys() and .values(). Like .items(),
they each return a view. But instead of key-value pairs, they only
contain keys or values, respectively:
- >>> foods = calories.keys()
- >>> counts = calories.values()
- >>> 'yogurt' in foods
- False
- >>> 100 in counts
- False
- >>> calories['yogurt'] = 100
- >>> 'yogurt' in foods
- True
- >>> 100 in counts
- True
Iteration has snuck into many places in Python. The built-in range
function returns an iterable:
- >>> seq = range(3)
- >>> type(seq)
- <class 'range'>
- >>> for n in seq: print(n)
- 0
- 1
- 2
The built-in map, filter, and zip functions all return
iterators:
- >>> numbers = [1, 2, 3]
- >>> big_numbers = [100, 200, 300]
- >>> def double(n):
- ...     return 2 * n
- >>> def is_even(n):
- ...     return n % 2 == 0
- >>> mapped = map(double, numbers)
- >>> mapped
- <map object at 0x1013ac518>
- >>> for num in mapped: print(num)
- 2
- 4
- 6
- >>> filtered = filter(is_even, numbers)
- >>> filtered
- <filter object at 0x1013ac668>
- >>> for num in filtered: print(num)
- 2
- >>> zipped = zip(numbers, big_numbers)
- >>> zipped
- <zip object at 0x1013a9608>
- >>> for pair in zipped: print(pair)
- (1, 100)
- (2, 200)
- (3, 300)
Notice that mapped is something called a "map object", rather than a
list of the results of the calculation; and similar for filtered and
zipped. These are all iterators - giving you all the benefits of
iteration, built into the language.
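They are also lazy: the mapped function isn’t even called until you ask for a value. A quick sketch, using a hypothetical chatty_double():
- >>> def chatty_double(n):
- ...     print("doubling", n)
- ...     return 2 * n
- ...
- >>> lazy = map(chatty_double, [1, 2, 3])  # No output yet.
- >>> next(lazy)
- doubling 1
- 2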
The Iterator Protocol
This optional section explains Python’s iterator protocol in formal detail, giving you a precise and low-level understanding of how generators, iterators, and iterables all work. For the day-to-day coding of most programmers, it’s not nearly as important as everything else in this chapter. That said, you need this information to implement your own, custom iterable collection types. Personally, I also find knowing the protocol helps me reason through iteration-related issues and edge cases; by knowing these details, I’m able to quickly troubleshoot and fix certain bugs that might otherwise eat up my afternoon. If this all sounds valuable to you, keep reading; otherwise, feel free to skip to the next chapter.
As mentioned, Python makes a distinction between iterators, versus objects that are iterable. The difference is subtle to begin with, and frankly it doesn’t help that the two words sound nearly identical. Keep clear in your mind that "iterator" and "iterable" are distinct but related concepts, and the following will be easier to understand.
Informally, an iterator is something you can pass to next(), or use
exactly once in a for loop. More formally, an object in Python is
an iterator if it follows the iterator protocol. And an object follows
the iterator protocol if it meets the following criteria:
- It defines a method named __next__, called with no arguments.
- Each time __next__() is called, it produces the next item in the sequence.
- Until all items have been produced. Then, subsequent calls to __next__() raise StopIteration.
- It also defines a boilerplate method named __iter__, called with no arguments, and returning the same iterator. Its body is literally return self.
Any object with these methods can call itself a Python iterator. You
are not intended to call the __next__() method directly. Instead,
you will use the built-in next() function. To understand better,
here is how you might write your own next() function:
- _NO_DEFAULT = object()
- def next(it, default=_NO_DEFAULT):
-     try:
-         return it.__next__()
-     except StopIteration:
-         if default is _NO_DEFAULT:
-             raise
-         return default
(As a side note, notice how it creates a unique sentinel value,
_NO_DEFAULT, rather than defaulting to a built-in value like None. A
sentinel value is a value that exists solely for signaling in the
algorithm, and is meant to never overlap with a possible value for
real data. This way you can pass any value to default that you like
without conflict.)
Now, all the above is for the "iterator". Let’s explain the other
word, "iterable". Informally, an object is iterable if it knows how
to create an iterator over its contents, which you can access with the
built-in iter() function. More formally,
a Python container object is iterable if it meets one of these two criteria:
- It defines a method called __iter__(), which creates and returns an iterator over the elements in the container; or
- it follows the sequence protocol. This means it defines __getitem__ - the magic method for square brackets - and lets you reference foo[0], foo[1], etc., raising an IndexError once you go past the last element.
(Notice "iterator" is a noun, while "iterable" is usually an adjective. This can help you remember which is which.)
When implementing your own container type, you probably want to make
it iterable, so you and others can use it in a for loop. Depending
on the nature of the container, it’s often easiest to implement the
sequence protocol. As an example, consider this simple UniqueList
type, which is a kind of hybrid between a list and a set:
- class UniqueList:
-     def __init__(self, items):
-         self.items = []
-         for item in items:
-             self.append(item)
-     def append(self, item):
-         if item not in self.items:
-             self.items.append(item)
-     def __getitem__(self, index):
-         return self.items[index]
Use it like this:
- >>> u = UniqueList([3,7,2,9,3,4,2])
- >>> u.items
- [3, 7, 2, 9, 4]
- >>> u[3]
- 9
- >>> u[42]
- Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- File "<stdin>", line 10, in __getitem__
- IndexError: list index out of range
The __getitem__ method implements square-bracket access;
basically, Python translates u[3] into u.__getitem__(3). We’ve
programmed this object’s square brackets to operate much like a normal
list, in that the initial element is at index 0, and you get
subsequent elements with subsequent integers, not skipping any. And
when you go past the end, it raises IndexError. If an object has a
__getitem__ method behaving in just this way, iter() knows how
to create an iterator over it:
- >>> u_iter = iter(u)
- >>> type(u_iter)
- <class 'iterator'>
- >>> for num in u_iter: print(num)
- 3
- 7
- 2
- 9
- 4
Notice we get a lot of this behavior for free, simply because we’re
using an actual list internally (and thus delegating much of the
__getitem__ logic to it). That’s a clue for you, whenever you make
a custom collection that acts like a list - or one of the other
standard collection types. If your object internally stores its data
in one of the standard data types, you’ll often have an easier time
mimicking its behavior.
Implementing the sequence protocol - i.e., writing a __getitem__
method which acts like a list’s - is one way to make your class
iterable. The other involves writing an __iter__ method. When
called with no arguments, it must return some object which follows the
iterator protocol, described above. In the worst case, you’ll need to
implement something like the SquaresIterator class from earlier in
this chapter, with its own __next__ and __iter__ methods. But
usually you don’t have to work that hard, because you can simply
return a generator object instead. That means __iter__ is a
generator function itself, or it internally calls some other generator
function, returning its value.
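For example, here is a hypothetical collection class whose __iter__ is itself a generator function - calling it returns a generator object, which is a perfectly good iterator:
- class SquaresCollection:
-     def __init__(self, max_root_value):
-         self.max_root_value = max_root_value
-     def __iter__(self):
-         # The yield makes this a generator function, so calling
-         # it returns a fresh generator object each time.
-         for root in range(self.max_root_value):
-             yield root ** 2
-
- for square in SquaresCollection(5):
-     print(square)  # prints 0, 1, 4, 9, 16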
Notice that every iterator must have an __iter__ method, and most
iterables have one too. Both are called with no arguments, and both
return an iterator object. The only difference: an iterator’s
__iter__ returns the iterator itself, while an iterable’s will create
and return a new iterator - call it twice, and you get two different
iterators.
This similarity is intentional, to simplify control code that can
accept either iterators or iterables. Here’s the mental model you can
safely follow: when Python’s runtime encounters a for loop, it will
start by invoking iter(sequence). This always returns an iterator:
either sequence itself, or (if sequence is only iterable) the
iterator created by sequence.__iter__().
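You can verify both halves of that model interactively - a quick sketch (iter() on a list creates a new object each time, while iter() on an iterator hands back that same iterator):
- >>> numbers = [7, 4, 11, 3]
- >>> iter(numbers) is iter(numbers)    # A fresh iterator each call.
- False
- >>> numbers_it = iter(numbers)
- >>> iter(numbers_it) is numbers_it    # The iterator returns itself.
- True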
Iterables are everywhere in Python. Almost all built-in collection
types are iterable: list, tuple, and set, and even dict.
(Though you’ll probably want to use dict.items() - a simple for x
in some_dict will iterate just over the keys).
In your own custom collection classes, sometimes the easiest way to
implement __iter__() actually involves using iter(). For
instance, this will not work:
- class BrokenInLoops:
-     def __init__(self):
-         self.items = [7, 3, 9]
-     def __iter__(self):
-         return self.items
If you try it, you get a TypeError:
- >>> items = BrokenInLoops()
- >>> for item in items:
- ...     print(item)
- Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- TypeError: iter() returned non-iterator of type 'list'
It doesn’t work because __iter__() is supposed to return an iterator, but a
list object is not an iterator; it is simply iterable. You can fix
this with a one-line change: use iter() to create an iterator object
inside of __iter__(), and return that object:
- class WorksInLoops:
-     def __init__(self):
-         self.items = [7, 3, 9]
-     def __iter__(self):
-         # This class is identical to BrokenInLoops,
-         # except for this next line.
-         return iter(self.items)
This makes WorksInLoops itself iterable, because __iter__
returns an actual iterator object - which is exactly what an
iterable is required to provide. That __iter__ method generates a
fresh iterator each time it’s called:
- >>> items = WorksInLoops()
- >>> for item in items:
- ...     print(item)
- 7
- 3
- 9