Rebellious Magic Methods and Extending Python Syntax

Let me show you something really interesting. Imagine a library for working with large data sets - I'll call it fakepandas, because on the surface it looks a little like the excellent Pandas data analysis library. Fakepandas has a class called Dataset:

>>> from fakepandas import Dataset
>>> ds = Dataset({
...     'A': [-137, 22, -3, 4, 5],
...     'B': [10, 11, 121, 13, 14],
...     'C': [3, 6, 91, 12, 15],
... })
>>> ds.pprint()
-------------------
|    A |   B |  C |
-------------------
| -137 |  10 |  3 |
|   22 |  11 |  6 |
|   -3 | 121 | 91 |
|    4 |  13 | 12 |
|    5 |  14 | 15 |
-------------------

Notice that the constructor takes a dictionary mapping each column label to the data for that column. So in your mind, you rotate each horizontal list into a vertical column - got that? Now here's the really interesting part:

# Create a new Dataset called "positive_a".
>>> positive_a = ds[ds.A > 0]
>>> positive_a.pprint()
----------------
|  A |  B |  C |
----------------
| 22 | 11 |  6 |
|  4 | 13 | 12 |
|  5 | 14 | 15 |
----------------
>>> big_b = ds[ds.B >= 14]
>>> big_b.pprint()
-----------------
|  A |   B |  C |
-----------------
| -3 | 121 | 91 |
|  5 |  14 | 15 |
-----------------
>>> coupled = ds[ds.A + ds.B < 20]
>>> coupled.pprint()
------------------
|    A |  B |  C |
------------------
| -137 | 10 |  3 |
|    4 | 13 | 12 |
|    5 | 14 | 15 |
------------------

In an expression like ds[ds.A > 0], the part in the brackets becomes a filter; you get a new Dataset with only the matching rows.

But when you think about it, that's a little weird. Boolean expressions are supposed to evaluate to either True or False. That means everything I typed above should become either ds[True] or ds[False] at runtime... encoding no information about the rows you actually want.

So what's going on? How in the world does this work?

Magic Methods

The secret relies on Python's "magic methods". The official Python docs mostly call them "special methods"; "magic methods" is the informal name the wider community uses for methods your objects can define, which hook into Python's built-in operators and behavior. For example, imagine a class that represents angles in degrees:

class Angle:
    def __init__(self, degrees):
        # Wrap it around so that the value is always
        # in the interval of 0 and 359.999...
        self.degrees = degrees % 360
    # Less-than
    def __lt__(self, other):
        return self.degrees < other.degrees
    # Less-than-or-equals
    def __le__(self, other):
        return self.degrees <= other.degrees
    # Greater-than
    def __gt__(self, other):
        return self.degrees > other.degrees
    # Greater-than-or-equals
    def __ge__(self, other):
        return self.degrees >= other.degrees
    # Equals
    def __eq__(self, other):
        return self.degrees == other.degrees

__lt__, __ge__, etc. are magic methods. When a class defines them, objects of that class can be used with the normal comparison operators, like "<" and ">=":

>>> a = Angle(20)
>>> b = Angle(380)
>>> c = Angle(45)
>>>
>>> b == a
True
>>> c >= a
True
>>> b < c
True
>>> a < b
False

There are many more magic methods than these five. Each magic method's name begins and ends with two underscores; when speaking, people often say "dunder foo" to mean __foo__, because (with __gt__, for example) "dunder gee tee" takes much less time to say than "underscore underscore gee tee underscore underscore".
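
To give a feel for magic methods beyond comparisons, here is a minimal sketch of an Angle that also supports "+" and a readable repr. The __add__ and __repr__ behavior shown is my own illustration, not part of the Angle class above:

```python
class Angle:
    def __init__(self, degrees):
        # Same wrap-around rule as before.
        self.degrees = degrees % 360

    # Hooks into the "+" operator.
    def __add__(self, other):
        return Angle(self.degrees + other.degrees)

    # Hooks into repr(), which is also what the REPL displays.
    def __repr__(self):
        return "Angle({})".format(self.degrees)

    # Hooks into "==".
    def __eq__(self, other):
        return self.degrees == other.degrees


total = Angle(350) + Angle(20)
print(repr(total))  # Angle(10)
```

Because the constructor wraps the value, adding 350 and 20 degrees lands on 10 rather than 370.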

Normally these comparison magic methods - __lt__, __ge__, __eq__, and so on - are programmed to return either True or False. But as far as Python's runtime is concerned, they can return anything. In fact, you can have magic methods return an instance of a class you define. Let's see how we can exploit this for the Dataset class.
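
To make that concrete, here is a tiny sketch using a made-up class (Chatty is hypothetical, purely for illustration) whose ">" returns a string instead of a boolean:

```python
class Chatty:
    """Hypothetical class whose ">" returns a string, not a bool."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # Nothing forces this to be True or False.
        return "{} was compared to {!r}".format(self.name, other)


result = Chatty("x") > 10
print(result)        # x was compared to 10
print(type(result))  # <class 'str'>
```

Python hands back whatever __gt__ returns, untouched. It only coerces the result to a boolean when the comparison is used somewhere that needs one, such as an if statement.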

The Dataset Class

The constructor is straightforward enough. Its single argument is a dictionary, mapping column labels (strings) to the data in that column (as a list):

class Dataset:
    def __init__(self, data: dict):
        self.data = data
        self.length = num_rows(data)
        self.labels = sorted(data.keys())

num_rows is a helper function returning the number of rows in the columns. (It also validates all columns have the same number of rows, raising a ValueError if they're inconsistent. I'll show you the source later.) Notice we auto-sort the column labels, which will make certain things easier.
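
In the meantime, here is a rough sketch of what such a helper could look like. This is my guess based on the description above, not the actual fakepandas source; in particular, treating an empty dictionary as zero rows is an assumption:

```python
def num_rows(data: dict) -> int:
    """Return the shared column length, or raise ValueError.

    Assumption: an empty dict counts as zero rows.
    """
    # Collect the distinct column lengths; a consistent
    # data set has exactly one.
    lengths = {len(column) for column in data.values()}
    if len(lengths) > 1:
        raise ValueError(
            "Columns have inconsistent lengths: {}".format(sorted(lengths)))
    return lengths.pop() if lengths else 0


print(num_rows({"A": [1, 2, 3], "B": [4, 5, 6]}))  # 3
```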

Now, to make an expression like ds[ds.A > 0] work, we must make ds.A meaningful. We do that with the __getattr__ magic method:

class Dataset:
    # ... 
    def __getattr__(self, label):
        if label not in self.data:
            raise AttributeError("'{}' object has no attribute '{}'".format(self.__class__.__name__, label))
        return LabelReference(label)

When we say ds.A, Python first tries the normal attribute lookup; since the Dataset has no real attribute named A, it falls back to calling ds.__getattr__("A"). Generally speaking, when you use __getattr__, you will want to only accept certain values, and raise AttributeError otherwise. In this case, since "A" is a valid column label, ds.A returns an instance of a class called LabelReference:
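
One detail worth seeing in action: __getattr__ is only consulted when the normal attribute lookup fails. The sketch below uses a hypothetical stand-in class (Columns, not the real Dataset) to show the lookup order:

```python
class Columns:
    """Hypothetical stand-in for Dataset, just to show lookup order."""

    def __init__(self, data):
        self.data = data

    def __getattr__(self, label):
        # Only reached when normal attribute lookup fails.
        if label not in self.data:
            raise AttributeError(label)
        return "column " + label


cols = Columns({"A": [1, 2], "data": [3, 4]})
print(cols.A)              # column A
print(cols.data)           # {'A': [1, 2], 'data': [3, 4]}
print(hasattr(cols, "Z"))  # False
```

The practical consequence is that a column whose label matches a real attribute (here "data") cannot be reached through dot syntax, because the real attribute wins.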

>>> ds.A
<fakepandas.LabelReference object at 0x1014180f0>

LabelReference represents a column label, and the operations that work on it. The code for LabelReference looks like this:

import operator

class LabelReference:
    def __init__(self, label: str):
        self.label = label
    def __gt__(self, value):
        return Comparison(self.label, value, operator.gt)

Notice how LabelReference's rebellious dunder-gee-tee method does not return a boolean. It instead returns a decidedly non-boolean object, of a type called Comparison. This creative misuse is exactly what lets this whole thing work!

Comparisons

In essence, Comparison represents lazily comparing a specific row's value to some threshold:

class Comparison:
    def __init__(self, label, value, operate):
        self.label = label
        self.value = value
        self.operate = operate
    def apply(self, data, row_number):
        other_value = data[self.label][row_number]
        return self.operate(other_value, self.value)

Breaking this down, for ds.A > 0:

- self.label is the column label: "A", in this case.
- self.value is what each row in the column is compared to. In this case, that's zero.
- self.operate is a function that takes two arguments and returns either True or False. In this case, that's operator.gt from the standard library, which is the function version of ">".

Applying this to the actual rows in the data set is done by calling Comparison.apply, which gives the final True/False result for a particular row. We'll see where that is called later.

To review: ds.A > 0 is translated by Python into ds.__getattr__("A").__gt__(0). The value returned by ds.__getattr__("A") is of type LabelReference; when we invoke __gt__(0) on that object, what we get back is of type Comparison.
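
You can walk through that chain by hand. The sketch below repeats the two classes from above so it runs on its own, and skips the Dataset wrapper by constructing the LabelReference directly:

```python
import operator


class Comparison:
    def __init__(self, label, value, operate):
        self.label = label
        self.value = value
        self.operate = operate

    def apply(self, data, row_number):
        other_value = data[self.label][row_number]
        return self.operate(other_value, self.value)


class LabelReference:
    def __init__(self, label):
        self.label = label

    def __gt__(self, value):
        return Comparison(self.label, value, operator.gt)


data = {"A": [-137, 22, -3, 4, 5]}

# What ds.A > 0 builds, minus the Dataset wrapper.
comparison = LabelReference("A") > 0
print(type(comparison).__name__)  # Comparison

# Evaluate the stored comparison one row at a time.
matches = [comparison.apply(data, row) for row in range(5)]
print(matches)  # [False, True, False, True, True]
```

Nothing is compared when the ">" expression runs; the real work is deferred until apply is called with actual data.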

Now all that's left is to make the square brackets work on the Dataset object. We do that with a magic method called __getitem__. If you're not familiar with it, it works like this:

>>> class Petstore:
...     def __init__(self, inventory: dict):
...         # A dict mapping pet species (str)
...         # to number in the store (int).
...         self.inventory = inventory
...     def __getitem__(self, pet: str):
...         # Return how many of that pet we have.
...         return self.inventory[pet]
...
>>> pet_store = Petstore({
...     "turtle": 3,
...     "dog": 7,
...     "cat": 2,
...     "elephant": 1,
... })
>>>
>>> num_turtles = pet_store["turtle"]
>>> print("We have {} turtles in stock.".format(num_turtles))
We have 3 turtles in stock.

In other words, Python automatically translates pet_store["turtle"] into pet_store.__getitem__("turtle"). Neat, huh? This lets us put the last Dataset piece into place:

# In class Dataset:
    def __getitem__(self, comparison):
        filtered_data = dict((label, [])
                             for label in self.labels)
        # Internal helper function.
        def append_row(row_number):
            for label in self.labels:
                value = self.data[label][row_number]
                filtered_data[label].append(value)
        # Now add in rows.
        for row_number in range(self.length):
            if comparison.apply(self.data, row_number):
                append_row(row_number)
        return Dataset(filtered_data)

This is where we use the apply method, towards the end of this code block. Importantly, its form is very general - which means we can support more complex expressions simply by creating more sophisticated comparison classes, with their own apply methods.
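
As one possible illustration of a more sophisticated comparison class, here is a sketch that combines two comparisons with "&". The AndComparison name and design are my own assumptions; the real fakepandas source may do this differently:

```python
import operator


class Comparison:
    def __init__(self, label, value, operate):
        self.label = label
        self.value = value
        self.operate = operate

    def apply(self, data, row_number):
        return self.operate(data[self.label][row_number], self.value)

    # "&" between two comparisons builds a combined filter.
    def __and__(self, other):
        return AndComparison(self, other)


class AndComparison:
    """Hypothetical: matches a row only if both parts match."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def apply(self, data, row_number):
        return (self.left.apply(data, row_number)
                and self.right.apply(data, row_number))


data = {"A": [-137, 22, -3, 4, 5], "B": [10, 11, 121, 13, 14]}
both = Comparison("A", 0, operator.gt) & Comparison("B", 12, operator.ge)
print([both.apply(data, row) for row in range(5)])
# [False, False, False, True, True]
```

The sketch uses "&" because Python does not let a class overload the "and" and "or" keywords. Since "&" binds more tightly than ">" or ">=", each comparison has to be wrapped in parentheses when written with the operator syntax.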

Richer Syntax

With this foundation in place for greater-than comparisons, we can easily add the others: less-than, greater-than-or-equals, etc. All we have to do is add more methods to LabelReference:

# The full version, supporting
# ==, <, >, <= and >=.
class LabelReference:
    def __init__(self, label: str):
        self.label = label
    def __gt__(self, value):
        return Comparison(self.label, value, operator.gt)
    def __lt__(self, value):
        return Comparison(self.label, value, operator.lt)
    def __le__(self, value):
        return Comparison(self.label, value, operator.le)
    def __ge__(self, value):
        return Comparison(self.label, value, operator.ge)
    def __eq__(self, value):
        return Comparison(self.label, value, operator.eq)
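
Overriding __eq__ like this has two side effects worth knowing about. The sketch below trims the classes down to the relevant parts to demonstrate them:

```python
import operator


class Comparison:
    def __init__(self, label, value, operate):
        self.label, self.value, self.operate = label, value, operate


class LabelReference:
    def __init__(self, label):
        self.label = label

    def __eq__(self, value):
        return Comparison(self.label, value, operator.eq)


ref = LabelReference("A")

# Defining __eq__ without __hash__ makes instances unhashable.
print(LabelReference.__hash__)  # None

# No __ne__ is defined, so Python falls back to "not (ref == 3)".
# A Comparison object is truthy, so this is a plain False, not a filter.
print(ref != 3)  # False
```

If you wanted "!=" to work as a filter, one option would be to add a __ne__ method that returns a Comparison built on operator.ne, following the same pattern as the other methods.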

But this is just the start. We can go much further in the interface we provide. Using the same principles we've covered so far, we can express things like:

Relationships between multiple columns:
- ds[ds.A + ds.B == 19]
- ds[ds.B - ds.C >= 3]
Logical "and" and "or":
- ds[(ds.A > 0) & (ds.B >= 12)]
- ds[(ds.A >= 3) | (ds.B == 11)]
Equation-like filters, e.g. ds[ds.C + 2 < ds.B]
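
To hint at how multi-column filters could work, here is a sketch in which column arithmetic builds a small expression object that is evaluated per row. All the class and function names here (Expression, Column, BinaryExpression, ExpressionComparison, resolve) are hypothetical; this is one possible design, not the fakepandas implementation:

```python
import operator


class Expression:
    """Hypothetical base: anything that yields one value per row."""

    def value_at(self, data, row_number):
        raise NotImplementedError

    def __add__(self, other):
        return BinaryExpression(self, other, operator.add)

    def __lt__(self, other):
        return ExpressionComparison(self, other, operator.lt)


class Column(Expression):
    def __init__(self, label):
        self.label = label

    def value_at(self, data, row_number):
        return data[self.label][row_number]


class BinaryExpression(Expression):
    def __init__(self, left, right, operate):
        self.left, self.right, self.operate = left, right, operate

    def value_at(self, data, row_number):
        return self.operate(resolve(self.left, data, row_number),
                            resolve(self.right, data, row_number))


class ExpressionComparison:
    def __init__(self, left, right, operate):
        self.left, self.right, self.operate = left, right, operate

    def apply(self, data, row_number):
        return self.operate(resolve(self.left, data, row_number),
                            resolve(self.right, data, row_number))


def resolve(operand, data, row_number):
    # Expressions are evaluated per row; plain numbers pass through.
    if isinstance(operand, Expression):
        return operand.value_at(data, row_number)
    return operand


data = {"A": [-137, 22, -3, 4, 5], "B": [10, 11, 121, 13, 14]}
coupled = Column("A") + Column("B") < 20
print([coupled.apply(data, row) for row in range(5)])
# [True, False, False, True, True]
```

The matching rows are the same three that appear in the "coupled" example near the top. Because ExpressionComparison exposes an apply method with the same signature as Comparison, Dataset.__getitem__ would not need to change.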

These are all demonstrated in the full source of fakepandas. Having read this far, you can now understand the rest on your own. Take a look at the source, and let me know what you think.
