Creating Collections with Comprehensions

A list comprehension is a high-level, declarative way to create a list in Python. It looks like this:

    >>> squares = [ n*n for n in range(6) ]
    >>> print(squares)
    [0, 1, 4, 9, 16, 25]

This is exactly equivalent to the following:

    >>> squares = []
    >>> for n in range(6):
    ...     squares.append(n*n)
    ...
    >>> print(squares)
    [0, 1, 4, 9, 16, 25]

Notice that in the first example, what you type is declaring what kind of list you want, while the second is specifying how to create it. That’s why we say it is high-level and declarative: it’s as if you are stating what kind of list you want created, and then letting Python figure out how to build it.

Python lets you write comprehensions for collections other than lists. Here’s a simple dictionary comprehension, for example:

    >>> blocks = { n: "x" * n for n in range(5) }
    >>> print(blocks)
    {0: '', 1: 'x', 2: 'xx', 3: 'xxx', 4: 'xxxx'}

This is exactly equivalent to the following:

    >>> blocks = dict()
    >>> for n in range(5):
    ...     blocks[n] = "x" * n
    ...
    >>> print(blocks)
    {0: '', 1: 'x', 2: 'xx', 3: 'xxx', 4: 'xxxx'}

The main benefits of comprehensions are readability and maintainability. Most people find them very readable; even developers encountering a comprehension for the first time will usually find their first guess about what it means to be correct. You can’t get more readable than that.

And there is a deeper, cognitive benefit: once you’ve practiced with them a bit, you will find you can write them with very little mental effort - keeping more of your attention free for other tasks.

Beyond lists and dictionaries, there are several other forms of comprehension you will learn about in this chapter. As you become comfortable with them, you will find them to be versatile and very Pythonic - meaning, you’ll find they fit well into many other Python idioms and constructs, lending new expressiveness and elegance to your code.

List Comprehensions

A list comprehension is the most widely used and useful kind of comprehension, and is essentially a way to create and populate a list. Its structure looks like:

    [ EXPRESSION for VARIABLE in SEQUENCE ]

EXPRESSION is any Python expression, though in useful comprehensions, the expression typically has some variable in it. That variable is stated in the VARIABLE field. SEQUENCE defines the source values the variable enumerates through, creating the final sequence of calculated values.

Here’s the simple example we glimpsed earlier:

    >>> squares = [ n*n for n in range(6) ]
    >>> type(squares)
    <class 'list'>
    >>> print(squares)
    [0, 1, 4, 9, 16, 25]

Notice the result is just a regular list. In squares, the expression is n*n; the variable is n; and the source sequence is range(6). The sequence is a range object; in fact, it can be any iterable: another list or tuple, a generator object, or something else.
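To make "any iterable" concrete, here is a minimal sketch that feeds comprehensions from a few different kinds of sources. The countdown generator function and the variable names are invented for this illustration; they are not part of the chapter's running examples.

```python
# The source of a comprehension can be any iterable, not only a range.
def countdown(start):
    """Hypothetical generator function, used here only as a source."""
    while start > 0:
        yield start
        start -= 1

from_tuple = [n * n for n in (1, 2, 3)]             # a tuple
from_generator = [n * n for n in countdown(3)]      # a generator object
from_string = [ch.upper() for ch in "abc"]          # a string, one character at a time
from_dict_keys = [key for key in {"x": 1, "y": 2}]  # a dict iterates over its keys

print(from_tuple)       # [1, 4, 9]
print(from_generator)   # [9, 4, 1]
print(from_string)      # ['A', 'B', 'C']
print(from_dict_keys)   # ['x', 'y']
```

In each case the result is still an ordinary list; only the source changes.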

The expression part can be anything that reduces to a value:

  • Arithmetic expressions like n+3
  • A function call like f(m), using m as the variable
  • A slice operation (like s[::-1], to reverse a string)
  • Method calls (like pet.lower(), when iterating over a sequence of objects)
  • And more.

Some complete examples:

    >>> # First define some source sequences...
    ... pets = ["dog", "parakeet", "cat", "llama"]
    >>> numbers = [ 9, -1, -4, 20, 11, -3 ]
    >>> # And a helper function...
    ... def repeat(s):
    ...     return s + s
    ...
    >>> # Now, some list comprehensions:
    ... [ 2*m+3 for m in range(10, 20, 2) ]
    [23, 27, 31, 35, 39]
    >>> [ abs(num) for num in numbers ]
    [9, 1, 4, 20, 11, 3]
    >>> [ 10 - x for x in numbers ]
    [1, 11, 14, -10, -1, 13]
    >>> [ pet.lower() for pet in pets ]
    ['dog', 'parakeet', 'cat', 'llama']
    >>> [ "The " + pet for pet in sorted(pets) ]
    ['The cat', 'The dog', 'The llama', 'The parakeet']
    >>> [ repeat(pet) for pet in pets ]
    ['dogdog', 'parakeetparakeet', 'catcat', 'llamallama']

Notice how all these fit the same structure. They all have the keywords "for" and "in"; those are required in Python, for any kind of comprehension you may write. These are interleaved among three fields: the expression; the variable (i.e., the identifier from which the expression is composed); and the source sequence.

The order of elements in the final list is determined by the order of the source sequence. But you can filter out elements by adding an "if" clause:

    >>> def is_palindrome(s):
    ...     return s == s[::-1]
    ...
    >>> pets = ["dog", "parakeet", "cat", "llama"]
    >>> numbers = [ 9, -1, -4, 20, 11, -3 ]
    >>> words = ["bib", "bias", "dad", "eye", "deed", "tooth"]
    >>>
    >>> [ n*2 for n in numbers if n % 2 == 0 ]
    [-8, 40]
    >>>
    >>> [pet.upper() for pet in pets if len(pet) == 3]
    ['DOG', 'CAT']
    >>>
    >>> [n for n in numbers if n > 0]
    [9, 20, 11]
    >>>
    >>> [word for word in words if is_palindrome(word)]
    ['bib', 'dad', 'eye', 'deed']

The structure is

    [ EXPRESSION for VARIABLE in SEQUENCE if CONDITION ]

where CONDITION is an expression that evaluates to True or False, depending on the variable.[3] Note that it can be either a function applied to the variable (is_palindrome(word)), or a more complex expression. Choosing to use a function can improve readability, and also let you apply filter logic whose code won’t fit on one line.
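As a rough sketch of that trade-off, the example below filters the same word list twice: once with a short inline condition, and once with a helper whose logic would be awkward to squeeze into the comprehension itself. The helper is_long_palindrome is hypothetical, made up for this illustration.

```python
words = ["bib", "bias", "dad", "eye", "deed", "tooth"]

def is_long_palindrome(word):
    """Hypothetical multi-step filter: at least four letters, and a palindrome."""
    if len(word) < 4:
        return False
    return word == word[::-1]

# A short condition reads fine inline...
three_letter_b_words = [word for word in words
                        if len(word) == 3 and word[0] == "b"]

# ...while multi-step logic is easier to follow behind a descriptive name.
long_palindromes = [word for word in words if is_long_palindrome(word)]

print(three_letter_b_words)  # ['bib']
print(long_palindromes)      # ['deed']
```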

A list comprehension must always have the "for" keyword, even if the beginning expression is just the variable itself. For example, consider:

    >>> [word for word in words if is_palindrome(word)]
    ['bib', 'dad', 'eye', 'deed']

Sometimes people think word for word in words seems redundant (it does), and try to shorten it... but that doesn’t work:

    >>> [word in words if is_palindrome(word)]
      File "<stdin>", line 1
        [word in words if is_palindrome(word)]
                                             ^
    SyntaxError: invalid syntax
    >>>

Formatting For Readability (And More)

Realistic list comprehensions tend to be too long to fit nicely on a single line. And they are composed of distinct logical parts, which can vary independently as the code evolves. This creates a couple of inconveniences, which are solved by a very convenient fact: Python’s normal rules of whitespace are suspended inside the square brackets. You can exploit this to make them more readable and maintainable, splitting them across multiple lines:

    def double_short_words(words):
        return [ word + word
                 for word in words
                 if len(word) < 5 ]

Another variation, which some people prefer:

    def double_short_words(words):
        return [
            word + word
            for word in words
            if len(word) < 5
        ]

What I’ve done here is split the comprehension across separate lines. You can, and should, do this with any substantial comprehension. It’s great for several reasons, the most important being the instant gain in readability. This comprehension has three separate ideas expressed inside the square brackets: the expression (word + word); the sequence (for word in words); and the filtering clause (if len(word) < 5). These are logically separate aspects, and by splitting them across different lines, it takes less cognitive effort for a human to read and understand than the one-line version. It’s effectively pre-parsed for you, as you read the code.

There’s another benefit: version control and code review diffs are more pin-pointed. Imagine you and I are on the same development team, working on this code base in different feature branches. In my branch, I change the expression to "word + word + word"; in yours, you change the threshold to "len(word) < 7". If the comprehension is on one line, version control tools will perceive this as a merge conflict, and whoever merges last will have to manually fix it.[4] But since this list comprehension is split across three lines, our source control tool can automatically merge both our branches. And if we’re doing code reviews like we should be, the reviewer can identify the precise change immediately, without having to scan the line and think.

Multiple Sources and Filters

You can have several for VAR in SEQUENCE clauses. This lets you construct lists based on pairs, triplets, etc., from two or more source sequences:

    >>> colors = ["orange", "purple", "pink"]
    >>> toys = ["bike", "basketball", "skateboard", "doll"]
    >>>
    >>> [ color + " " + toy
    ...   for color in colors
    ...   for toy in toys ]
    ['orange bike', 'orange basketball', 'orange skateboard',
     'orange doll', 'purple bike', 'purple basketball',
     'purple skateboard', 'purple doll', 'pink bike',
     'pink basketball', 'pink skateboard', 'pink doll']

Every pair from the two sources, colors and toys, is used to calculate a value in the final list. That final list has 12 elements, the product of the lengths of the two source lists.

I want you to notice that the two for clauses are independent of each other; colors and toys are two unrelated lists. Using multiple for clauses can sometimes take a different form, where they are more interdependent. Consider this example:

    >>> ranges = [range(1,7), range(4,12,3), range(-5,9,4)]
    >>> [ float(num)
    ...   for subrange in ranges
    ...   for num in subrange ]
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 4.0, 7.0, 10.0, -5.0,
     -1.0, 3.0, 7.0]

The source sequence - "ranges" - is a list of range objects.[5] Now, this list comprehension has two for clauses again. But notice one depends on the other. The source of the second is the variable for the first!

It’s not like the colorful-toys example, whose for clauses are independent of each other. When chained together this way, order matters:

    >>> [ float(num)
    ...   for num in subrange
    ...   for subrange in ranges ]
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
    NameError: name 'subrange' is not defined

Python evaluates the for clauses of a list comprehension from left to right. If the first clause is for num in subrange, at that point subrange is not defined. So you have to put for subrange in ranges first. You can chain more than two for clauses together like this; the first one just needs to reference a previously-defined source, and each later clause can use variables defined by the for clauses before it, as subrange is here.
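Here is a small sketch of a three-clause chain, where each for clause draws from the variable introduced by the one before it. The nested data and the names grid_sets, grid, row, and cell are made up for this illustration.

```python
# A list of grids; each grid is a list of rows; each row is a list of cells.
grid_sets = [
    [[1, 2], [3]],
    [[4], [5, 6]],
]

# Each clause uses the variable defined by the clause above it,
# so the clauses must appear in this order.
flattened = [
    cell
    for grid in grid_sets
    for row in grid
    for cell in row
]

print(flattened)  # [1, 2, 3, 4, 5, 6]
```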

Now, that’s for chained for clauses. If the clauses are independent, does the order matter at all? It does, just in a different way. What’s the difference between these two list comprehensions:

    >>> colors = ["orange", "purple", "pink"]
    >>> toys = ["bike", "basketball", "skateboard", "doll"]
    >>>
    >>> [ color + " " + toy
    ...   for color in colors
    ...   for toy in toys ]
    ['orange bike', 'orange basketball', 'orange skateboard',
     'orange doll', 'purple bike', 'purple basketball',
     'purple skateboard', 'purple doll', 'pink bike',
     'pink basketball', 'pink skateboard', 'pink doll']
    >>>
    >>> [ color + " " + toy
    ...   for toy in toys
    ...   for color in colors ]
    ['orange bike', 'purple bike', 'pink bike',
     'orange basketball', 'purple basketball', 'pink basketball',
     'orange skateboard', 'purple skateboard', 'pink skateboard',
     'orange doll', 'purple doll', 'pink doll']

The order here doesn’t matter in the sense it does for chained for clauses, where you must put things in a certain order, or your program won’t run. Here, you have a choice. And that choice does affect the order of elements in the final list. The first element in each is "orange bike". And notice the second element is different. Think a moment, and ask yourself: why? Why is the first element the same in both comprehensions? And why is it only the second element that’s different?

It has to do with which sequence is held constant while the other varies. It’s the same logic that applies when nesting regular for loops:

    >>> # Nested one way...
    ... build_colors_toys = []
    >>> for color in colors:
    ...     for toy in toys:
    ...         build_colors_toys.append(color + " " + toy)
    ...
    >>> build_colors_toys[0]
    'orange bike'
    >>> build_colors_toys[1]
    'orange basketball'
    >>>
    >>> # And nested the other way.
    ... build_toys_colors = []
    >>> for toy in toys:
    ...     for color in colors:
    ...         build_toys_colors.append(color + " " + toy)
    ...
    >>> build_toys_colors[0]
    'orange bike'
    >>> build_toys_colors[1]
    'purple bike'

The second for clause in the list comprehension corresponds to the innermost for loop. Its values vary through their range more rapidly than the outer one.
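If it helps to see that ordering from another angle, the sketch below compares a two-clause comprehension with itertools.product from the standard library, which cycles its rightmost argument fastest in the same way. The variable names are only for this illustration.

```python
import itertools

colors = ["orange", "purple", "pink"]
toys = ["bike", "basketball", "skateboard", "doll"]

# The last for clause (toys) plays the role of the innermost loop.
pairs_from_comprehension = [(color, toy) for color in colors for toy in toys]

# itertools.product also advances its rightmost iterable fastest.
pairs_from_product = list(itertools.product(colors, toys))

print(pairs_from_comprehension == pairs_from_product)  # True
print(pairs_from_comprehension[:2])
# [('orange', 'bike'), ('orange', 'basketball')]
```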

In addition to using many for clauses, you can have more than one if clause, for multiple levels of filtering. Just write several of them in sequence:

    >>> numbers = [ 9, -1, -4, 20, 17, -3 ]
    >>> odd_positives = [
    ...     num for num in numbers
    ...     if num > 0
    ...     if num % 2 == 1
    ... ]
    >>> print(odd_positives)
    [9, 17]

Here, I’ve placed each if clause on its own line, for readability - but I could have put both on one line. When you have more than one if clause, each element must meet the criteria of all of them to make it into the final list. In other words, if clauses are "and-ed" together, not "or-ed" together.

What if you want to do "or" - to include elements matching at least one of the criteria, omitting only those matching none? Separate if clauses can’t express that, because they are always and-ed; the alternatives have to be combined inside a single condition. More generally, the comprehension mini-language is not as expressive as Python itself, and there are lists you might need to construct which cannot be expressed as a comprehension.

But sometimes you can cheat a bit by defining helper functions. For example, here’s how you can filter based on whether the number is a multiple of 2 or 3:

    >>> numbers = [ 9, -1, -4, 20, 11, -3 ]
    >>> def is_mult_of_2_or_3(num):
    ...     return (num % 2 == 0) or (num % 3 == 0)
    ...
    >>> [
    ...     num for num in numbers
    ...     if is_mult_of_2_or_3(num)
    ... ]
    [9, -4, 20, -3]

We discuss this more in the "Limitations" section, later in the chapter.
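Note that the and-ing applies to separate if clauses; a single if clause can hold any boolean expression, including one that uses "or". As a minimal sketch, the two comprehensions below select the same elements, one through a helper function and one with the condition written inline:

```python
numbers = [9, -1, -4, 20, 11, -3]

def is_mult_of_2_or_3(num):
    return (num % 2 == 0) or (num % 3 == 0)

# Filter through the helper function...
with_helper = [num for num in numbers if is_mult_of_2_or_3(num)]

# ...or write the same "or" inside a single if clause.
with_inline_or = [num for num in numbers
                  if num % 2 == 0 or num % 3 == 0]

print(with_helper)                    # [9, -4, 20, -3]
print(with_helper == with_inline_or)  # True
```

The helper earns its keep when the condition grows past what reads comfortably on one line.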

You can use multiple for and if clauses together:

    >>> weights = [0.2, 0.5, 0.9]
    >>> values = [27.5, 13.4]
    >>> offsets = [4.3, 7.1, 9.5]
    >>>
    >>> [ (weight, value, offset)
    ...   for weight in weights
    ...   for value in values
    ...   for offset in offsets
    ...   if offset > 5.0
    ...   if weight * value < offset ]
    [(0.2, 27.5, 7.1), (0.2, 27.5, 9.5), (0.2, 13.4, 7.1),
     (0.2, 13.4, 9.5), (0.5, 13.4, 7.1), (0.5, 13.4, 9.5)]

The only rule is that the first for clause must come before the first if clause. Other than that, you can interleave for and if clauses in any order, though most people seem to find it more readable to group all the for clauses together at first, then the if clauses together at the end.
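To show what interleaving looks like, here is a small sketch with the same filters written in grouped and interleaved form. The thresholds are arbitrary values chosen for this illustration. An if clause placed between two for clauses is checked before the later loop starts, so it can only use variables from the for clauses that precede it.

```python
weights = [0.2, 0.5, 0.9]
values = [27.5, 13.4]

# All for clauses first, then all if clauses.
grouped = [
    (weight, value)
    for weight in weights
    for value in values
    if weight < 0.9
    if weight * value < 7.0
]

# The weight filter moved up, directly after the clause that defines weight.
interleaved = [
    (weight, value)
    for weight in weights
    if weight < 0.9
    for value in values
    if weight * value < 7.0
]

print(grouped)
# [(0.2, 27.5), (0.2, 13.4), (0.5, 13.4)]
print(grouped == interleaved)  # True
```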

Comprehensions and Generators

List comprehensions create lists:

    >>> squares = [ n*n for n in range(6) ]
    >>> type(squares)
    <class 'list'>

When you need a list, that’s great, but sometimes you don’t need a list, and you’d prefer something more scalable. It’s like the situation near the start of the generators chapter:

    # This again.
    NUM_SQUARES = 10*1000*1000
    many_squares = [ n*n for n in range(NUM_SQUARES) ]
    for number in many_squares:
        do_something_with(number)

The entire many_squares list must be fully created - all memory for it must be allocated, and every element calculated - before do_something_with is called even once. And memory usage goes through the roof.

You know one solution: write a generator function, and call it. But there’s an easier option: write a generator expression. This is the official name for it, but it really should be called a "generator comprehension". Syntactically, it looks just like a list comprehension - except you use parentheses instead of square brackets:

    >>> generated_squares = ( n*n for n in range(NUM_SQUARES) )
    >>> type(generated_squares)
    <class 'generator'>

This "generator expression" creates a generator object, in the exact same way a list comprehension creates a list. Any list comprehension you can write, you can use to create an equivalent generator object, just by swapping "(" and ")" for "[" and "]".
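For a rough feel of the memory difference, the sketch below compares the two forms with sys.getsizeof. It uses a smaller count than the ten million above so it runs quickly, and the exact byte counts vary by Python version and platform, so only the comparison is shown.

```python
import sys

NUM_SQUARES = 100_000  # smaller than the chapter's figure, to keep this quick

squares_list = [n * n for n in range(NUM_SQUARES)]
squares_gen = (n * n for n in range(NUM_SQUARES))

# sys.getsizeof reports only each object's own footprint. For the list,
# that is its array of references, not the integers themselves, so the
# real gap is even larger than this suggests.
list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)

print(list_bytes > gen_bytes)  # True
```

The generator object stays the same small size no matter how large NUM_SQUARES is, because it computes one value at a time.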

And you’re creating the object directly, without having to define a generator function to call. In other words, a generator expression is a convenient shortcut when you need a quick generator object:

    # This...
    many_squares = ( n*n for n in range(NUM_SQUARES) )
    # ... is EXACTLY EQUIVALENT to this:
    def gen_many_squares(limit):
        for n in range(limit):
            yield n * n
    many_squares = gen_many_squares(NUM_SQUARES)

As far as Python is concerned, there is no difference.
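One way to watch the lazy behavior is to record when each value actually gets computed. In this sketch, noisy_square and the calls list are hypothetical helpers that exist only to make the evaluation visible.

```python
calls = []

def noisy_square(n):
    """Hypothetical helper: records each call so we can see when it happens."""
    calls.append(n)
    return n * n

lazy_squares = (noisy_square(n) for n in range(5))

# Creating the generator object computes nothing yet.
calls_after_creation = list(calls)

# Values are produced one at a time, on demand.
first = next(lazy_squares)
calls_after_first = list(calls)

# Consuming the rest runs the remaining calls.
rest = list(lazy_squares)

print(calls_after_creation)  # []
print(first)                 # 0
print(calls_after_first)     # [0]
print(rest)                  # [1, 4, 9, 16]
```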

Everything you know about list comprehensions applies to generator expressions: multiple for clauses, if clauses, etc. You only need to type the parentheses.

In fact, sometimes you can even omit them. When passing a generator expression as an argument to a function, you will sometimes find yourself typing "((" followed by "))". In that situation, Python lets you omit the inner pair. Imagine, for example, you are sorting a list of customer email addresses, looking at only those customers whose status is "active":

    >>> # User is a class with "email" and "is_active" fields.
    ... # all_users is a list of User objects.
    >>> # Sorted list of active users' email addresses.
    ... # Passing in a generator expression.
    >>> sorted((user.email for user in all_users
    ...         if user.is_active))
    ['', '', '']
    >>>
    >>> # Omitting the inner parentheses.
    ... # Still passing in a generator expression!
    >>> sorted(user.email for user in all_users
    ...        if user.is_active)
    ['', '', '']

Notice how readable and natural this is (or will be, once you’ve practiced a bit). One thing to watch out for: you can only inline a generator expression this way when it is the sole argument in the call. Otherwise, you get a syntax error:

  1. >>>
  2. >>> # Reverse that list. Whoops...
  3. ... sorted( for user in all_users
  4. ... if user.is_active, reverse=True)
  5. File "<stdin>", line 2
  6. SyntaxError: Generator expression must be parenthesized if not sole argument

Python can’t unambiguously interpret what you mean here, so you must use the inner parentheses:

  1. >>> # Okay, THIS will get the reversed list.
  2. ... sorted(( for user in all_users
  3. ... if user.is_active), reverse=True)
  4. ['', '', '']

And of course, sometimes it’s more readable to assign the generator expression to a variable:

    >>> active_emails = (
    ...     user.email for user in all_users
    ...     if user.is_active
    ... )
    >>> sorted(active_emails, reverse=True)
    ['', '', '']

Generator expressions without parentheses suggest a unified way of thinking about comprehensions, which links generator expressions and list comprehensions together. Here’s a generator expression for a sequence of squares:

    ( n**2 for n in range(10) )

And here it is again, passed to the built-in list() function:

    list( n**2 for n in range(10) )

And here it is as a list comprehension:

    [ n**2 for n in range(10) ]

When you understand generator expressions, it’s easy to see list comprehensions as a derivative data structure. And the same applies for dictionary and set comprehensions (covered next). With this insight, you start seeing new opportunities to use all of them in your own code, improving its readability, maintainability, and performance in the process.

Generator Expression or List Comprehension?

If generator expressions are so great, why would you use list comprehensions? Generally speaking, when deciding which to use, your code will be more scalable and responsive if you use a generator expression. Except, of course, when you actually need a list. There are several considerations.

First, if the sequence is unlikely to be very big - and by big, I mean a minimum of thousands of elements long - you probably won’t benefit from using a generator expression. That’s just not big enough for scalability to matter. Generator objects are also single-pass, and can’t be indexed or modified. If you need random access, or to go through the sequence twice, or you might need to append or remove elements, generator expressions won’t work.
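A short sketch of those limitations, using throwaway names: a generator object can be consumed only once, and it does not support indexing.

```python
squares = (n * n for n in range(4))

first_pass = list(squares)   # consumes every value
second_pass = list(squares)  # nothing left the second time around

print(first_pass)   # [0, 1, 4, 9]
print(second_pass)  # []

# Random access is not supported either.
try:
    (n * n for n in range(4))[2]
except TypeError:
    indexing_failed = True
else:
    indexing_failed = False

print(indexing_failed)  # True
```

When you need any of these behaviors, build a list (with a list comprehension, or by passing the generator object to list()).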

This is especially important when writing methods or functions whose return value is a sequence. Do you return a generator expression, or a list comprehension? In theory, there’s no reason to ever return a list instead of a generator object; a list can be trivially created by passing it to list(). In practice, the interface may be such that the caller will really want an actual list. Also, if you are constructing the return value as a list within the function, it’s silly to return a generator expression over it - just return the actual list.

And if your intention is to create a library usable by people who may not be advanced Pythonistas, that can be an argument for returning lists. Almost all programmers are familiar with list-like data structures. But fewer are familiar with how generators work in Python, and may - quite reasonably - get confused when confronted with a generator object.

Dictionaries, Sets, and Tuples

Just like a list comprehension creates a list, a dictionary comprehension creates a dictionary. You saw an example at the beginning of this chapter; here’s another. Suppose you have this Student class:

    class Student:
        def __init__(self, name, gpa, major):
            self.name = name
            self.gpa = gpa
            self.major = major

Given a list students of student objects, we can write a dictionary comprehension mapping student names to their GPAs:

    >>> { student.name: student.gpa for student in students }
    {'Jim Smith': 3.6, 'Ryan Spencer': 3.1,
     'Penny Gilmore': 3.9, 'Alisha Jones': 2.5,
     'Todd Reynolds': 3.4}

The syntax differs from that of list comprehensions in two ways. Instead of square brackets, you’re using curly brackets - which makes sense, since this creates a dictionary. The other difference is the expression field, whose format is "key: value", since a dict has key-value pairs. So the structure is

    { KEY: VALUE for VARIABLE in SEQUENCE }

These are the only differences. Everything else you learned about list comprehensions applies, including filtering with if clauses:

    >>> def invert_name(name):
    ...     first, last = name.split(" ", 1)
    ...     return last + ", " + first
    ...
    >>> # Get "lastname, firstname" of high-GPA students.
    ... { invert_name(student.name): student.gpa
    ...   for student in students
    ...   if student.gpa > 3.5 }
    {'Smith, Jim': 3.6, 'Gilmore, Penny': 3.9}

You can create sets too. A set comprehension looks exactly like a list comprehension, but with curly braces instead of square brackets:

    >>> # A list of student majors...
    ... [ student.major for student in students ]
    ['Computer Science', 'Economics', 'Computer Science',
     'Economics', 'Basket Weaving']
    >>> # And the same as a set:
    ... { student.major for student in students }
    {'Economics', 'Computer Science', 'Basket Weaving'}
    >>> # You can also use the set() built-in.
    ... set(student.major for student in students)
    {'Economics', 'Computer Science', 'Basket Weaving'}

(How does Python distinguish between a set and dict comprehension? Because the dict’s expression is a key-value pair, while a set’s has a single value.)
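A tiny sketch of that distinction, using a made-up list of majors: the only thing separating the two forms is whether the leading expression contains a colon.

```python
majors = ["Economics", "Computer Science", "Economics"]

# A single value before "for": this builds a set.
major_set = {major for major in majors}

# A "key: value" pair before "for": this builds a dict.
major_lengths = {major: len(major) for major in majors}

print(type(major_set).__name__)      # set
print(type(major_lengths).__name__)  # dict
print(major_lengths)
# {'Economics': 9, 'Computer Science': 16}
```

Duplicates collapse in both cases: the set keeps one copy of each major, and the dict keeps one entry per key.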

What about tuple comprehensions? This is fun: strictly speaking, Python doesn’t support them. However, you can pretend it does by using tuple():

    >>> tuple(student.gpa for student in students
    ...       if student.major == "Computer Science")
    (3.6, 2.5)

This creates a tuple, but it’s not a tuple comprehension. You’re calling the tuple constructor, and passing it a single argument. What’s that argument? A generator expression! In other words, you’re doing this:

    >>> cs_students = (
    ...     student.gpa for student in students
    ...     if student.major == "Computer Science"
    ...     )
    >>> type(cs_students)
    <class 'generator'>
    >>> tuple(cs_students)
    (3.6, 2.5)
    >>>
    >>> # Same as:
    ... tuple((student.gpa for student in students
    ...        if student.major == "Computer Science"))
    (3.6, 2.5)
    >>> # But you can omit the inner parentheses.

tuple’s constructor takes an iterable as an argument. cs_students is a generator object (created by the generator expression), and a generator object is an iterator, which is itself iterable. So you can pretend Python has tuple comprehensions, using "tuple(" as the opener and ")" as the closer. In fact, this also gives you alternate ways to create dictionaries and sets:

    >>> # Same as:
    ... # { student.name: student.gpa for student in students }
    >>> dict((student.name, student.gpa)
    ...      for student in students)
    {'Jim Smith': 3.6, 'Penny Gilmore': 3.9,
     'Alisha Jones': 2.5, 'Ryan Spencer': 3.1,
     'Todd Reynolds': 3.4}
    >>> # Same as:
    ... # { student.major for student in students }
    >>> set(student.major for student in students)
    {'Computer Science', 'Basket Weaving', 'Economics'}

Remember, when you pass a generator expression into a function, you can omit the inner parentheses. That’s why you can, for example, type

    tuple(f(x) for x in numbers)

instead of

    tuple((f(x) for x in numbers))

One last point. Generator expressions are a scalable analogue of list comprehensions; is there any such equivalent for dicts or sets? No, it turns out. If you need to lazily generate key-value pairs or unique elements, your best bet is to write a generator function.
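As a minimal sketch of that approach, the two generator functions below lazily produce unique elements and key-value pairs. The names unique_items and name_length_pairs are invented for this illustration. Note that unique_items still has to remember what it has already seen, so it saves the cost of building the full result up front, not the cost of tracking duplicates.

```python
def unique_items(items):
    """Lazily yield each distinct item once, in first-seen order."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item

def name_length_pairs(names):
    """Lazily yield (key, value) pairs, one at a time."""
    for name in names:
        yield name, len(name)

majors = ["Economics", "Computer Science", "Economics", "Basket Weaving"]

lazy_unique = unique_items(majors)
first_major = next(lazy_unique)       # only the first item is examined so far
remaining_majors = list(lazy_unique)

# The pairs can be consumed lazily, or handed to dict() when you do
# want the whole dictionary at once.
lengths = dict(name_length_pairs(["dog", "llama"]))

print(first_major)       # Economics
print(remaining_majors)  # ['Computer Science', 'Basket Weaving']
print(lengths)           # {'dog': 3, 'llama': 5}
```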

[3] Technically, the condition doesn’t have to depend on the variable, but it’s hard to imagine building a useful list comprehension this way.

[4] I like to think future version control tools will automatically resolve this kind of situation. I believe it will require the tool to have some knowledge of the language syntax, so it can parse and reason about different clauses in a line of code.

[5] Refresher: The range built-in returns a range object, an iterable over a sequence of integers, and can be called with 1, 2, or 3 arguments. The most general form is range(start, stop, step), beginning at start, going up to but not including stop, in increments of step. Called with 2 arguments, the step-size defaults to one; with one argument, that argument is the stop, and the sequence starts at zero.
