Pandas' secret mini-language
The DataFrame class (from Pandas) is a work of art. Even if you never "do data", priceless lessons can be gleaned by studying this class.
It starts simple enough. Usually you will create a DataFrame by ingesting from a CSV file or database table or something. But you can whip up a small one like this:
- import pandas as pd
-
- df = pd.DataFrame({
- 'A': [-137, 22, -3, 4, 5],
- 'B': [10, 11, 121, 13, 14],
- 'C': [3, 6, 91, 12, 15],
- })
This gives you a DataFrame with three columns, labeled A, B and C. With rows of data, like so:
- >>> print(df)
-      A    B   C
- 0 -137   10   3
- 1   22   11   6
- 2   -3  121  91
- 3    4   13  12
- 4    5   14  15
The first thing to notice is that DataFrame is a class. Once upon a time, there was no such thing as a DataFrame. Someone imagined it, and then coded it up. And just look how it changed the world.
If that is not an argument for learning OOP, I do not know what is.
But this article is about something else. Because once you have a DataFrame, you can filter it down to the rows that match certain criteria, by typing magic characters inside square brackets:
- >>> positive_a = df[df.A > 0]
- >>> print(positive_a)
-     A   B   C
- 1  22  11   6
- 3   4  13  12
- 4   5  14  15
Breaking this down:
- You can refer to column A of the DataFrame "df" by this handle: df.A
- You can use square brackets, e.g. "df[...]", to say: "give me a new DataFrame with only certain rows, skipping others."
- What rows are kept? Well, that gets defined by an expression like "df.A > 0" in the brackets. Meaning, in this case: keep those rows where the 'A' value is positive.
And you can get much more complex:
- # Relationships between multiple columns:
- df[df.A + df.B == 19]
- df[df.B - df.C >= 3]
-
- # Logical "and" and "or":
- df[(df.A > 0) & (df.B >= 12)]
- df[(df.A >= 3) | (df.B == 11)]
-
- # Equation-like filters, e.g.:
- df[df.C + 2 < df.B]
Which is super cool.
But it also makes no sense, when you think about it...
Because "df.A > 0" ought to evaluate to a boolean, True or False. Right? Exactly one bit of information. So how in the name of Guido does this work?
It works because "df.A > 0" is not a boolean. Sneakily, it returns neither True nor False. It evaluates to something else:
- >>> comparison = (df.A > 0)
- >>> type(comparison)
- <class 'pandas.core.series.Series'>
- >>> print(comparison)
- 0    False
- 1     True
- 2    False
- 3     True
- 4     True
- Name: A, dtype: bool
Well, how about that. Rather than bool, its type is something called Series, which is kind of like a list of bools, one for each row. Now it makes more sense, doesn't it?
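To make that concrete, here is a small sketch using the same df as above. It shows that the comparison result is ordinary data, one bool per row, and that a hand-written list of bools picks out the same rows. The names mask and by_hand are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [-137, 22, -3, 4, 5],
    'B': [10, 11, 121, 13, 14],
    'C': [3, 6, 91, 12, 15],
})

# The comparison yields one bool per row, not a single True/False.
mask = df.A > 0
print(mask.tolist())   # [False, True, False, True, True]

# A plain list of bools, one per row, filters the same way.
by_hand = df[[False, True, False, True, True]]
print(by_hand.equals(df[mask]))   # True
```

So the square brackets do not care where the bools came from. They just need one per row.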
This is the core principle which allows DataFrame's square bracket trick to work.
And while I do not have room in this little essay to fully explain the details...
I will point out that it relies on Python's particular object model. Meaning, it relies on OOP. But also on details of how Python uniquely does OOP. And when you invest in mastering OOP in Python, it helps you make powerful classes like DataFrame too.
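One concrete detail of that object model, sketched here with the same df: a class can redefine operators like ">" and "&", but it cannot redefine the keyword "and". Python's "and" asks each operand for a single truth value, and a multi-row Series declines to give one. The and_failed flag and the try/except exist only to show the error without crashing.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [-137, 22, -3, 4, 5],
    'B': [10, 11, 121, 13, 14],
    'C': [3, 6, 91, 12, 15],
})

# "&" is an operator a class can redefine, so Series combines row by row.
both = (df.A > 0) & (df.B >= 12)
print(both.tolist())   # [False, False, False, True, True]

# "and" cannot be redefined. It needs one truth value per operand,
# and a multi-row Series refuses to collapse into a single bool.
and_failed = False
try:
    (df.A > 0) and (df.B >= 12)
except ValueError as err:
    and_failed = True
    print("ValueError:", err)
```

This is why the filters in this article use "&" and "|" with parentheses, rather than "and" and "or".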
Something to think about.
Of course, this is an advanced technique. Which is why it is covered at the end of the OOP chapter in the Powerful Python book.
But see what good ideas all this sparks for you. How can you apply what you learned today in your own code?
(If you want a hint as to how (df.A > 0) "returns" something other than a bool, search Python's docs for this magic method name: __gt__().)
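Following that hint, here is a minimal sketch of the idea in plain Python. It is not how Pandas is implemented. ToyColumn and ToyTable are hypothetical names invented for illustration, and they skip nearly everything a real DataFrame does. The point is only that __gt__ may return a list of bools instead of a single bool, and that __getitem__ can use that list to pick rows.

```python
class ToyColumn:
    """A hypothetical stand-in for a single DataFrame column."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, other):
        # Python calls this for "column > other". Nothing forces the
        # result to be a bool, so we return one bool per value.
        return [value > other for value in self.values]


class ToyTable:
    """A hypothetical stand-in for a DataFrame: a dict of equal-length columns."""

    def __init__(self, data):
        self.data = {name: list(values) for name, values in data.items()}

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so "table.A"
        # lands here and returns the column named "A".
        data = self.__dict__.get('data', {})
        if name in data:
            return ToyColumn(data[name])
        raise AttributeError(name)

    def __getitem__(self, mask):
        # Called for "table[mask]". Keep the rows whose mask entry is True.
        return ToyTable({
            name: [value for value, keep in zip(values, mask) if keep]
            for name, values in self.data.items()
        })


table = ToyTable({'A': [-137, 22, -3, 4, 5], 'B': [10, 11, 121, 13, 14]})
positive_a = table[table.A > 0]
print(positive_a.data)   # {'A': [22, 4, 5], 'B': [11, 13, 14]}
```

Three magic methods, and "table[table.A > 0]" reads just like the Pandas version.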