Rebellious Magic Methods and Extending Python Syntax
Let me show you something really interesting. Imagine a library for working with large data sets. I'll call it fakepandas, because on the surface it looks a little like the excellent Pandas data analysis library. Fakepandas has a class called Dataset:
- >>> from fakepandas import Dataset
- >>> ds = Dataset({
- ... 'A': [-137, 22, -3, 4, 5],
- ... 'B': [10, 11, 121, 13, 14],
- ... 'C': [3, 6, 91, 12, 15],
- ... })
- >>> ds.pprint()
- -------------------
- | A | B | C |
- -------------------
- | -137 | 10 | 3 |
- | 22 | 11 | 6 |
- | -3 | 121 | 91 |
- | 4 | 13 | 12 |
- | 5 | 14 | 15 |
- -------------------
Notice in the constructor, you're passing in a dictionary which maps column labels to the data for that column. So in your mind, you rotate each horizontal list into a vertical column - got that? Now here's the really interesting part:
- # Create a new Dataset called "positive_a".
- >>> positive_a = ds[ds.A > 0]
- >>> positive_a.pprint()
- ----------------
- | A | B | C |
- ----------------
- | 22 | 11 | 6 |
- | 4 | 13 | 12 |
- | 5 | 14 | 15 |
- ----------------
- >>> big_b = ds[ds.B >= 14]
- >>> big_b.pprint()
- -----------------
- | A | B | C |
- -----------------
- | -3 | 121 | 91 |
- | 5 | 14 | 15 |
- -----------------
- >>> coupled = ds[ds.A + ds.B < 20]
- >>> coupled.pprint()
- ------------------
- | A | B | C |
- ------------------
- | -137 | 10 | 3 |
- | 4 | 13 | 12 |
- | 5 | 14 | 15 |
- ------------------
In an expression like ds[ds.A > 0], the part in the brackets becomes a filter; you get a new Dataset with only the matching rows.
But when you think about it, that's a little weird. Boolean expressions are supposed to evaluate to either True or False. That means everything I typed above should become either ds[True] or ds[False] at runtime... encoding no information about the rows you actually want.
So what's going on? How in the world does this work?
Magic Methods
The secret relies on Python's "magic methods". You won't find that phrase in the official Python docs; it's a term the wider community invented to describe methods your objects can define, which hook into Python's built-in operators and behavior. For example, imagine a class that represents angles in degrees:
- class Angle:
-     def __init__(self, degrees):
-         # Wrap it around so that the value is always
-         # in the interval of 0 and 359.999...
-         self.degrees = degrees % 360
-     # Less-than
-     def __lt__(self, other):
-         return self.degrees < other.degrees
-     # Less-than-or-equals
-     def __le__(self, other):
-         return self.degrees <= other.degrees
-     # Greater-than
-     def __gt__(self, other):
-         return self.degrees > other.degrees
-     # Greater-than-or-equals
-     def __ge__(self, other):
-         return self.degrees >= other.degrees
-     # Equals
-     def __eq__(self, other):
-         return self.degrees == other.degrees
__lt__, __ge__, etc. are magic methods. When a class defines them, objects of that class can be used with the normal comparison operators, like "<" and ">=":
- >>> a = Angle(20)
- >>> b = Angle(380)
- >>> c = Angle(45)
- >>>
- >>> b == a
- True
- >>> c >= a
- True
- >>> b < c
- True
- >>> a < b
- False
There are many more magic methods than these five. Each one's name starts and ends with a double underscore; when speaking, people often say "dunder foo" to mean __foo__, because (with __gt__, for example) "dunder gee tee" takes much less time to say than "underscore underscore gee tee underscore underscore".
Normally these comparison magic methods - __lt__, __ge__, __eq__, and so on - are programmed to return either True or False. But as far as Python's runtime is concerned, they can return anything. In fact, you can have magic methods return an instance of a class you define. Let's see how we can exploit this for the Dataset class.
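To make that last point concrete, here is a small sketch of a comparison method that returns an object instead of a boolean. The Symbol and Recorded classes are hypothetical names invented for this illustration; they are not part of fakepandas.

```python
class Recorded:
    """Hypothetical class: remembers a comparison instead of evaluating it."""

    def __init__(self, description):
        self.description = description

    def __repr__(self):
        return "Recorded({!r})".format(self.description)


class Symbol:
    """Hypothetical class whose __gt__ returns a Recorded, not a bool."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # Python hands back whatever this method returns, unchanged.
        return Recorded("{} > {!r}".format(self.name, other))


x = Symbol("x")
result = x > 0
print(type(result).__name__)  # Recorded
print(result)                 # Recorded('x > 0')
```

Python only forces the result to True or False when the expression is used somewhere that needs a truth value, such as an if statement. Assigned to a variable or passed inside square brackets, the object survives intact.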
The Dataset Class
The constructor is straightforward enough. Its single argument is a dictionary, mapping column labels (strings) to the data in that column (as a list):
- class Dataset:
-     def __init__(self, data: dict):
-         self.data = data
-         self.length = num_rows(data)
-         self.labels = sorted(data.keys())
num_rows is a helper function returning the number of rows in the columns. (It also validates all columns have the same number of rows, raising a ValueError if they're inconsistent. I'll show you the source later.) Notice we auto-sort the column labels, which will make certain things easier.
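In the meantime, a helper matching that description could plausibly look like the sketch below. This is a guess based only on the behavior described here, not the actual fakepandas source; in particular, returning 0 for an empty dictionary is an assumption.

```python
def num_rows(data: dict) -> int:
    """Return the shared column length; raise ValueError if columns differ.

    A sketch of the described behavior, not the real fakepandas helper.
    """
    # Collect the distinct column lengths; consistent data yields one value.
    lengths = {len(column) for column in data.values()}
    if len(lengths) > 1:
        raise ValueError(
            "columns have inconsistent lengths: {}".format(sorted(lengths)))
    # Assumption: a dataset with no columns has zero rows.
    return lengths.pop() if lengths else 0


print(num_rows({'A': [-137, 22, -3], 'B': [10, 11, 121]}))  # 3
```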
Now, to make an expression like ds[ds.A > 0] work, we must make ds.A meaningful. We do that with the __getattr__ magic method:
- class Dataset:
-     # ...
-     def __getattr__(self, label):
-         if label not in self.data:
-             raise AttributeError("'{}' object has no attribute '{}'".format(self.__class__.__name__, label))
-         return LabelReference(label)
When we say ds.A, Python first looks for an ordinary attribute named A; finding none, it falls back to calling ds.__getattr__("A"). Generally speaking, when you use __getattr__, you will want to only accept certain values, and raise AttributeError otherwise. In this case, since "A" is a valid column label, ds.A returns an instance of a class called LabelReference:
- >>> ds.A
- <fakepandas.LabelReference object at 0x1014180f0>
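One detail worth seeing in action: __getattr__ is only a fallback, called when ordinary attribute lookup fails. That is why Dataset can still reach its own data, length and labels attributes normally. The Probe class below is a hypothetical example made up to show this; it is not part of fakepandas.

```python
class Probe:
    """Hypothetical class used only to show when __getattr__ runs."""

    def __init__(self):
        self.real = "found normally"
        self.misses = []

    def __getattr__(self, name):
        # Reached only when normal attribute lookup has already failed.
        self.misses.append(name)
        return "fallback for {}".format(name)


p = Probe()
print(p.real)    # found normally
print(p.misses)  # []
print(p.A)       # fallback for A
print(p.misses)  # ['A']
```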
LabelReference represents a column label, and the operations that work on it. The code for LabelReference looks like this:
- import operator
- class LabelReference:
-     def __init__(self, label: str):
-         self.label = label
-     def __gt__(self, value):
-         # Return a Comparison object, not True or False.
-         return Comparison(self.label, value, operator.gt)
Notice how LabelReference's rebellious dunder-gee-tee method does not return a boolean. It instead returns a decidedly non-boolean object, of a type called Comparison. This creative misuse is exactly what lets this whole thing work!
Comparisons
In essence, Comparison represents lazily comparing a specific row's value to some threshold:
- class Comparison:
-     def __init__(self, label, value, operate):
-         self.label = label
-         self.value = value
-         self.operate = operate
-     def apply(self, data, row_number):
-         other_value = data[self.label][row_number]
-         return self.operate(other_value, self.value)
Breaking this down, for ds.A > 0:
- self.label is the column label. "A", in this case.
- self.value is what each row in the column is being compared to. In this case, that's zero.
- self.operate is a function that takes two arguments, and returns either True or False. In this case, that would be operator.gt in the standard library, which is the function version of ">".
- Applying this to the actual rows in the data set is done by calling Comparison.apply, which gives the final True/False result for a particular row. We'll see where that is called later.
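To see the lazy evaluation at work, the sketch below builds the Comparison for ds.A > 0 by hand and applies it to each row of column A from the opening example. The class body is repeated from above only so the snippet runs on its own.

```python
import operator


class Comparison:
    # Same shape as the class above, repeated so this snippet is standalone.
    def __init__(self, label, value, operate):
        self.label = label
        self.value = value
        self.operate = operate

    def apply(self, data, row_number):
        other_value = data[self.label][row_number]
        return self.operate(other_value, self.value)


data = {'A': [-137, 22, -3, 4, 5]}

# Nothing is compared here; the object just stores what to compare.
positive = Comparison('A', 0, operator.gt)

# The comparison happens only now, once per row.
results = [positive.apply(data, row) for row in range(5)]
print(results)  # [False, True, False, True, True]
```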
To review: ds.A > 0 is translated by Python into ds.__getattr__("A").__gt__(0). The value returned by ds.__getattr__("A") is of type LabelReference; when we invoke __gt__(0) on that object, what we get back is of type Comparison.
Now all that's left is to make the square brackets work on the Dataset object. We do that with a magic method called __getitem__. If you're not familiar with it, it works like this:
- >>> class Petstore:
- ...     def __init__(self, inventory: dict):
- ...         # A dict mapping pet species (str)
- ...         # to number in the store (int).
- ...         self.inventory = inventory
- ...     def __getitem__(self, pet: str):
- ...         # Return how many of that pet we have.
- ...         return self.inventory[pet]
- ...
- >>> pet_store = Petstore({
- ... "turtle": 3,
- ... "dog": 7,
- ... "cat": 2,
- ... "elephant": 1,
- ... })
- >>>
- >>> num_turtles = pet_store["turtle"]
- >>> print("We have {} turtles in stock.".format(num_turtles))
- We have 3 turtles in stock.
In other words, Python automatically translates pet_store["turtle"] into pet_store.__getitem__("turtle"). Neat, huh? This lets us put the last Dataset piece into place:
- # In class Dataset:
-     def __getitem__(self, comparison):
-         filtered_data = dict((label, [])
-                              for label in self.labels)
-         # Internal helper function.
-         def append_row(row_number):
-             for label in self.labels:
-                 value = self.data[label][row_number]
-                 filtered_data[label].append(value)
-         # Now add in rows.
-         for row_number in range(self.length):
-             if comparison.apply(self.data, row_number):
-                 append_row(row_number)
-         return Dataset(filtered_data)
This is where we use the apply method, towards the end of this code block. Importantly, its form is very general - which means we can support more complex expressions simply by creating more sophisticated comparison classes, with their own apply methods.
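Putting the pieces so far together, a minimal end-to-end sketch might look like the following. It supports only the > operator, and it swaps the num_rows helper for a plain len() call on one column, which is a simplification assumed here for brevity rather than how fakepandas does it.

```python
import operator


class Comparison:
    def __init__(self, label, value, operate):
        self.label = label
        self.value = value
        self.operate = operate

    def apply(self, data, row_number):
        return self.operate(data[self.label][row_number], self.value)


class LabelReference:
    def __init__(self, label):
        self.label = label

    def __gt__(self, value):
        return Comparison(self.label, value, operator.gt)


class Dataset:
    def __init__(self, data):
        self.data = data
        self.labels = sorted(data.keys())
        # Stand-in for num_rows(): assumes all columns share one length.
        self.length = len(data[self.labels[0]]) if self.labels else 0

    def __getattr__(self, label):
        if label not in self.data:
            raise AttributeError(label)
        return LabelReference(label)

    def __getitem__(self, comparison):
        filtered = {label: [] for label in self.labels}
        for row_number in range(self.length):
            # Keep the whole row only if the comparison holds for it.
            if comparison.apply(self.data, row_number):
                for label in self.labels:
                    filtered[label].append(self.data[label][row_number])
        return Dataset(filtered)


ds = Dataset({
    'A': [-137, 22, -3, 4, 5],
    'B': [10, 11, 121, 13, 14],
    'C': [3, 6, 91, 12, 15],
})
positive_a = ds[ds.A > 0]
print(positive_a.data)
# {'A': [22, 4, 5], 'B': [11, 13, 14], 'C': [6, 12, 15]}
```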
Richer Syntax
With this foundation in place for greater-than comparisons, we can easily add the others: less-than, greater-than-or-equals, etc. All we have to do is add more methods to LabelReference:
- # The full version, supporting
- # ==, <, >, <= and >=.
- class LabelReference:
-     def __init__(self, label: str):
-         self.label = label
-     def __gt__(self, value):
-         return Comparison(self.label, value, operator.gt)
-     def __lt__(self, value):
-         return Comparison(self.label, value, operator.lt)
-     def __le__(self, value):
-         return Comparison(self.label, value, operator.le)
-     def __ge__(self, value):
-         return Comparison(self.label, value, operator.ge)
-     def __eq__(self, value):
-         return Comparison(self.label, value, operator.eq)
But this is just the start. We can go much further in the interface we provide. Using the same principles we've covered so far, we can express things like:
- Relationships between multiple columns:
  ds[ds.A + ds.B == 19]
  ds[ds.B - ds.C >= 3]
- Logical "and" and "or":
  ds[(ds.A > 0) & (ds.B >= 12)]
  ds[(ds.A >= 3) | (ds.B == 11)]
- Equation-like filters, e.g.
  ds[ds.C + 2 < ds.B]
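As one illustration of how the logical "and" and "or" forms could be built on the same idea, here is a rough sketch. The Filter and Combined classes are hypothetical names chosen for this example, and the real fakepandas source may be organized quite differently.

```python
import operator


class Filter:
    """Hypothetical base class: anything with an apply(data, row_number)."""

    def __and__(self, other):
        return Combined(self, other, operator.and_)

    def __or__(self, other):
        return Combined(self, other, operator.or_)


class Comparison(Filter):
    def __init__(self, label, value, operate):
        self.label = label
        self.value = value
        self.operate = operate

    def apply(self, data, row_number):
        return self.operate(data[self.label][row_number], self.value)


class Combined(Filter):
    """Hypothetical class joining two filters with & or |."""

    def __init__(self, left, right, combine):
        self.left = left
        self.right = right
        self.combine = combine

    def apply(self, data, row_number):
        return self.combine(self.left.apply(data, row_number),
                            self.right.apply(data, row_number))


data = {'A': [-137, 22, -3, 4, 5], 'B': [10, 11, 121, 13, 14]}

# Equivalent to (ds.A > 0) & (ds.B >= 12)
both = Comparison('A', 0, operator.gt) & Comparison('B', 12, operator.ge)
print([both.apply(data, r) for r in range(5)])
# [False, False, False, True, True]

# Equivalent to (ds.A >= 3) | (ds.B == 11)
either = Comparison('A', 3, operator.ge) | Comparison('B', 11, operator.eq)
print([either.apply(data, r) for r in range(5)])
# [False, True, False, True, True]
```

The interface uses & and | because Python's and and or keywords cannot be overloaded; they always reduce their operands to a truth value. The parentheses around each comparison are needed because & and | bind more tightly than > and ==.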
These are all demonstrated in the full source of fakepandas. Having read this far, you now can understand the rest on your own. Take a look at the source, and let me know what you think.