CherryPicker

Flatten complex data.

cherrypicker aims to make common ETL tasks (filtering data and restructuring it into flat tables) easier, by taking inspiration from jQuery and applying it in a Pythonic way to generic data objects.

pip install cherrypicker

cherrypicker provides a chainable filter and extraction interface that lets you pick out objects from complex structures and place them in a flat table. It fills a similar role to jQuery in JavaScript, letting you navigate nested structures without writing lots of nested for loops or list comprehensions.

Behold…

>>> from cherrypicker import CherryPicker
>>> import json
>>> with open('climate.json', 'r') as fp:
...     data = json.load(fp)
>>> picker = CherryPicker(data)
>>> picker['id', 'city'].get()
[[1, 'Amsterdam'], [2, 'Athens'], [3, 'Atlanta GA'], ...]

This example is equivalent to the list comprehension [[item['id'], item['city']] for item in data]. cherrypicker really starts to become useful when you combine it with filtering:

>>> picker(city='B*')['id', 'city'].get()
[[6, 'Bangkok'], [7, 'Barcelona'], [8, 'Beijing'], ...]

The equivalent list comprehension would be: [[item['id'], item['city']] for item in data if item['city'].startswith('B')]. As you can see, CherryPicker does filtering and extraction with chained operations rather than list comprehensions. Filtering is done with parentheses () and extraction is done with square brackets []. Chaining can make it easier to build complex operations:

>>> picker(city='B*')['info'](
...     population=lambda n: n > 2000000,
...     area=lambda a: a < 2000
... )['area', 'population'].get()
[[1568, 8300000], [891, 3700000], [203, 2800000]]

Note that the above example searches for a population > 2000000 and an area of < 2000. If you wanted to search for population > 2000000 or an area of < 2000, simply add an extra how='any' parameter along with your predicates:

>>> picker(city='B*')['info'](
...     population=lambda n: n > 2000000,
...     area=lambda a: a < 2000,
...     how='any'
... )['area', 'population'].get()
[[1568, 8300000], [102, 1615000], [16808, 21540000], ...]

This job is already getting too unwieldy for a list comprehension; to write the example above without cherrypicker, we would probably reach for a for loop:

table = []
for item in data:
    if item['city'].startswith('B'):
        info = item['info']
        if info['population'] > 2000000 or info['area'] < 2000:
            table.append([info['area'], info['population']])

Without cherrypicker, the amount of code we need to write increases pretty rapidly! There are many different types of filtering predicate you can use with cherrypicker, including exact matches, wildcards, regex and custom functions. Read all about them in the Filtering documentation.
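
To make the predicate kinds concrete, here is a rough plain-Python sketch of how exact values, wildcards, regular expressions and custom functions could each be tested against a value, and how results could be combined in the style of how='all' or how='any'. It uses only the standard library. The helpers matches and item_passes are hypothetical names invented for this illustration; they are not part of cherrypicker, whose actual matching rules are described in the Filtering documentation.

```python
import re
from fnmatch import fnmatchcase


def matches(value, predicate):
    """Test one value against one predicate (illustrative helper)."""
    if callable(predicate):
        # Custom function: call it and use the truthiness of the result.
        return bool(predicate(value))
    if isinstance(predicate, re.Pattern):
        # Compiled regex: must match from the start of the string.
        return isinstance(value, str) and predicate.match(value) is not None
    if isinstance(predicate, str) and isinstance(value, str):
        # Shell-style wildcard; a pattern without wildcard characters
        # behaves like an exact match.
        return fnmatchcase(value, predicate)
    # Anything else: exact equality.
    return value == predicate


def item_passes(item, how="all", **predicates):
    """Combine the per-key results with all() or any()."""
    results = (
        key in item and matches(item[key], pred)
        for key, pred in predicates.items()
    )
    return all(results) if how == "all" else any(results)
```

With this sketch, item_passes(info, how='any', population=lambda n: n > 2000000, area=lambda a: a < 2000) expresses the same condition as the if statement in the for loop above.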

Of course, it would be nice if we could extract data in the example above from both the base level and the info sub-level of each item and put them into a flat table, ready to load into your favourite data analysis package. We can do this conveniently in cherrypicker with the flat property of our picker which gives us a flattened view of the data. Let’s say that each item in our data list has a city name and a list of average low/high temperatures for each month of the year:

[
    {
        "id": 1,
        "city": "Amsterdam",
        "country": "Netherlands",
        "monthlyAvg": [
            {
                "high": 7,
                "low": 3,
                "dryDays": 19,
                "snowDays": 4,
                "rainfall": 68
            },
            {
                "high": 6,
                "low": 3,
                "dryDays": 13,
                "snowDays": 2,
                "rainfall": 47
            },
            ...
        ]
    }
]

By flattening the data before filtering/extracting, we can get the name and monthly temperatures alongside each other:

>>> picker.flat(
...     monthlyAvg_0_high=lambda tmp: tmp > 30
... )['city', 'monthlyAvg_0_high'].get()
[['Bangkok', 33], ['Brasilia', 31], ['Ho Chi Minh City', 33], ...]

>>> picker.flat(
...     monthlyAvg_0_high=lambda tmp: tmp < 0
... )['city', 'monthlyAvg_0_high'].get()
[['Calgary', -1], ['Montreal', -4], ['Moscow', -4], ...]
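
For comparison, the same two queries written without flattening have to index into the nested list by hand. The sketch below uses a three-record sample whose city names and first-month highs are taken from the examples in this section; every other field is omitted, so it stands in for climate.json rather than reproducing it.

```python
# Sample records: only the fields needed here.
sample = [
    {"city": "Amsterdam", "monthlyAvg": [{"high": 7}]},
    {"city": "Bangkok", "monthlyAvg": [{"high": 33}]},
    {"city": "Calgary", "monthlyAvg": [{"high": -1}]},
]

# Cities whose first-month average high is above 30.
hot = [
    [item["city"], item["monthlyAvg"][0]["high"]]
    for item in sample
    if item["monthlyAvg"][0]["high"] > 30
]

# Cities whose first-month average high is below 0.
cold = [
    [item["city"], item["monthlyAvg"][0]["high"]]
    for item in sample
    if item["monthlyAvg"][0]["high"] < 0
]

print(hot)
print(cold)
```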

If you would like to customise the flattening behaviour, for example to change the delimiter used in the flattened key names, use the cherrypicker.CherryPickerMapping.flatten() method instead of the flat property:

>>> picker.flat[0].get()
{'id': 1, 'city': 'Amsterdam', 'country': 'Netherlands', 'monthlyAvg_0_high': 7, ...}

>>> picker.flatten(delim='.')[0].get()
{'id': 1, 'city': 'Amsterdam', 'country': 'Netherlands', 'monthlyAvg.0.high': 7, ...}
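
To give a feel for where key names such as monthlyAvg_0_high come from, here is a minimal pure-Python sketch that flattens nested dicts and lists by joining keys and indices with a delimiter. flatten_item is a hypothetical helper written for this illustration; it is not cherrypicker's implementation and may differ from it in edge cases such as empty containers.

```python
def flatten_item(obj, delim="_", prefix=""):
    """Flatten nested dicts/lists into a single-level dict (illustrative)."""
    flat = {}
    if isinstance(obj, dict):
        children = obj.items()
    elif isinstance(obj, (list, tuple)):
        children = enumerate(obj)
    else:
        # Leaf value: store it under the key path built so far.
        flat[prefix] = obj
        return flat
    for key, value in children:
        # Join the parent path and this key/index with the delimiter.
        path = f"{prefix}{delim}{key}" if prefix else str(key)
        flat.update(flatten_item(value, delim, path))
    return flat


# Record abbreviated from the Amsterdam example above.
amsterdam = {
    "id": 1,
    "city": "Amsterdam",
    "country": "Netherlands",
    "monthlyAvg": [{"high": 7, "low": 3}, {"high": 6, "low": 3}],
}
flat = flatten_item(amsterdam)
dotted = flatten_item(amsterdam, delim=".")
print(flat["monthlyAvg_0_high"], dotted["monthlyAvg.0.high"])
```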

One final point to note is that cherrypicker understands data by looking at its interfaces rather than its types. This means that it isn’t just limited to JSON data: as long as it can act like a dict or list, you can start cherrypicking from it!
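
One way to picture interface-based dispatch is through the abstract base classes in collections.abc: anything that registers as a Mapping can be treated like a dict, and any non-string Sequence like a list. The sketch below illustrates the idea only; kind_of is a hypothetical helper, and the checks cherrypicker actually performs may differ.

```python
from collections import OrderedDict, deque
from collections.abc import Mapping, Sequence
from types import MappingProxyType


def kind_of(obj):
    """Classify an object by the interface it offers, not its concrete type."""
    if isinstance(obj, Mapping):
        return "mapping"
    # Strings and bytes are sequences too, but are better treated as leaves.
    if isinstance(obj, Sequence) and not isinstance(obj, (str, bytes, bytearray)):
        return "sequence"
    return "leaf"


print(kind_of(OrderedDict(city="Amsterdam")))          # dict-like
print(kind_of(MappingProxyType({"city": "Athens"})))   # dict-like, read-only
print(kind_of(deque([1, 2, 3])))                       # list-like
print(kind_of("Amsterdam"))                            # plain value
```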

Parallel Processing

As well as making complex queries easier, CherryPicker also allows you to easily use parallel processing to crunch through large datasets quickly:

>>> picker = CherryPicker(data, n_jobs=4)
>>> picker(city='B*')['id', 'city'].get()

Everything is the same as before, except you supply an n_jobs parameter to specify the number of CPUs you wish to use (a value of -1 will mean all CPUs are used).

Note that for small datasets, you will probably get better performance without parallel processing, as the benefits of using multiple CPUs will be outweighed by the overhead of setting up multiple processes. For large datasets with long lists though, parallel processing can significantly speed up your operations.
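
The trade-off can be seen in a generic standard-library sketch: filtering a list in chunks across worker processes only pays off once each chunk carries enough work to cover the cost of starting processes and pickling data between them. This is not how cherrypicker implements n_jobs; filter_by_prefix, _filter_chunk and the chunking strategy are assumptions made purely for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def _filter_chunk(chunk, prefix):
    """Keep [id, city] for items whose city starts with prefix."""
    return [
        [item["id"], item["city"]]
        for item in chunk
        if item["city"].startswith(prefix)
    ]


def filter_by_prefix(data, prefix, n_jobs=1):
    """Filter serially when n_jobs == 1, otherwise across worker processes."""
    if n_jobs == -1:
        # Mirror the convention described above: -1 means use every CPU.
        n_jobs = os.cpu_count() or 1
    if n_jobs == 1 or len(data) < n_jobs:
        # Small inputs: skip the process start-up overhead entirely.
        return _filter_chunk(data, prefix)
    # Split the list into at most n_jobs contiguous chunks (order is preserved).
    size = -(-len(data) // n_jobs)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        parts = pool.map(partial(_filter_chunk, prefix=prefix), chunks)
        return [row for part in parts for row in part]
```

On platforms that start worker processes by spawning rather than forking, calls with n_jobs greater than 1 should be made from inside an if __name__ == "__main__": guard.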
