
Converting a Django ForeignKey to a GenericForeignKey

I’m currently working on a Django project called the SWAP that contains lots of disease outbreak time series data. Previously, the model looked something like this (I’m using Django 1.9 with Python 3.4+):

from django.db import models
 
class Outbreak(models.Model):
    ...
 
class TimeStep(models.Model):
    outbreak = models.ForeignKey(Outbreak)
    timestamp = models.DateTimeField()
    count = models.PositiveIntegerField(blank=True, null=True)
 
    class Meta:
        ordering = ['timestamp']
 
    def __str__(self):
        # compare against None so that a legitimate count of 0 isn't shown as '[empty]'
        return '{} - {}'.format(self.timestamp, self.count if self.count is not None else '[empty]')

We’re working on incorporating disease spread forecasts into the project. Forecasts actually require the same sort of time series infrastructure that outbreaks do. It seemed wasteful to create a new TimeStep class just for forecasts, so I started doing some research and quickly stumbled upon generic relations.

Django’s documentation on this topic is somewhat lacking, so it took a good amount of digging around to figure out how to structure things and migrate my existing models. The following resources were really useful in my search:

The last link was the most useful because it discussed how to create the migrations; however, it was written for South, which is now outdated because South has been absorbed into Django as the Migrations framework. I’m writing this post to describe how I migrated a model using a ForeignKey to a GenericForeignKey.

1. Add the necessary fields

Two primary fields are needed in the TimeStep class, content_type and object_id:

from django.contrib.contenttypes.models import ContentType
from django.db import models
 
class TimeStep(models.Model):
    outbreak = models.ForeignKey(Outbreak)
 
    content_type = models.ForeignKey(ContentType, null=True, blank=True)
    object_id = models.PositiveIntegerField(null=True, blank=True)
 
    timestamp = models.DateTimeField()
    count = models.PositiveIntegerField(blank=True, null=True)
 
    class Meta:
        ordering = ['timestamp']
 
    def __str__(self):
        # compare against None so that a legitimate count of 0 isn't shown as '[empty]'
        return '{} - {}'.format(self.timestamp, self.count if self.count is not None else '[empty]')

A GenericForeignKey is essentially a tuple, (content_type, object_id). The content type and object ID are all Django needs to perform a lookup. We’ll actually add the GenericForeignKey field later.
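To make the tuple idea concrete, here's a minimal plain-Python sketch of the lookup. It doesn't use Django at all; the dictionaries stand in for database tables, and every name in it (REGISTRY, resolve, the sample rows) is made up for illustration.

```python
# Plain-Python model of a generic relation; no Django required.
# All names here (REGISTRY, resolve, the sample rows) are hypothetical.

# Stand-ins for database tables, keyed by primary key.
OUTBREAKS = {1: "Outbreak #1", 2: "Outbreak #2"}
FORECASTS = {1: "ForecastSeries #1"}

# Stand-in for the content type table: one entry per model.
REGISTRY = {
    "outbreak": OUTBREAKS,
    "forecastseries": FORECASTS,
}


def resolve(content_type, object_id):
    """Follow a (content_type, object_id) pair to the object it points at."""
    table = REGISTRY[content_type]  # first half of the tuple: which model?
    return table.get(object_id)     # second half: which row? None if missing


# The same object_id resolves to different objects depending on content_type.
outbreak_target = resolve("outbreak", 1)
forecast_target = resolve("forecastseries", 1)
missing_target = resolve("forecastseries", 2)
```

The point is that object_id alone is ambiguous (both tables have a row 1); only the pair identifies a single object.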

Note that null=True, blank=True is added to the new fields temporarily so that the migration step doesn’t complain about them being blank. We’ll remove those requirements later.

Now, make and run the migration:

./manage.py makemigrations
./manage.py migrate

2. Populate the new fields using a data migration

The fields now exist, but they’re empty. We need to create a data migration. To do this, use the following command to create an empty migration, replacing swap with your app’s name:

./manage.py makemigrations --empty swap

For me, this created a new file under swap/migrations/ called 0039_auto_20160307_1408.py. The final migration looks like this:

# -*- coding: utf-8 -*-
# Generated by Django 1.9.2 on 2016-03-07 14:08
from __future__ import unicode_literals
 
from django.db import migrations
 
 
def migrate_foreign_key(apps, schema_editor):
    """
        Data migration to populate the GenericForeignKey fields.
    """
    TimeStep = apps.get_model('swap', 'TimeStep')
    ContentType = apps.get_model('contenttypes', 'ContentType')
 
    outbreak_content_type = ContentType.objects.get(app_label='swap', model='outbreak')
 
    for timestep in TimeStep.objects.all():
        timestep.content_type = outbreak_content_type
        timestep.object_id = timestep.outbreak.pk
        timestep.save()
 
 
class Migration(migrations.Migration):
 
    dependencies = [
        ('swap', '0038_auto_20160307_1408'),
    ]
 
    operations = [
        migrations.RunPython(migrate_foreign_key),
    ]

migrate_foreign_key is a very simple function that properly specifies the content_type of all TimeStep objects as an Outbreak. It then copies each outbreak’s primary key into the object_id.

Note that the call to apps.get_model is necessary. You cannot just import your model and use it. As the official docs say, this is so that we get the correct versioned model in this context.
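If it helps to see the net effect at the table level, here's a rough sketch using Python's built-in sqlite3 module. The table and column names and the content type ID of 7 are assumptions made for the example, not what Django generated for my project. It also includes the sanity check worth running before step 3: no row should be left with an empty content type or object ID.

```python
import sqlite3

# In-memory stand-in for the TimeStep table after step 1 (names are assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE timestep (
        id INTEGER PRIMARY KEY,
        outbreak_id INTEGER NOT NULL,
        content_type_id INTEGER,
        object_id INTEGER
    );
    INSERT INTO timestep (outbreak_id) VALUES (10), (10), (11);
""")

OUTBREAK_CONTENT_TYPE_ID = 7  # hypothetical ID of the Outbreak content type

# Same effect as the loop in migrate_foreign_key: every row points at
# the Outbreak content type, and object_id copies the old foreign key.
conn.execute(
    "UPDATE timestep SET content_type_id = ?, object_id = outbreak_id",
    (OUTBREAK_CONTENT_TYPE_ID,),
)

rows = conn.execute(
    "SELECT outbreak_id, content_type_id, object_id FROM timestep ORDER BY id"
).fetchall()

# Sanity check before dropping null=True: nothing should be left unfilled.
unfilled = conn.execute(
    "SELECT COUNT(*) FROM timestep"
    " WHERE content_type_id IS NULL OR object_id IS NULL"
).fetchone()[0]
```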

Run the migration:

./manage.py migrate

3. Cleanup

Finally, I need to do 3 things:

  1. Remove the old ForeignKey
  2. Add the new GenericForeignKey
  3. Remove the null and blank requirements

After I do these 3 things, my TimeStep model now looks like this:

from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.db import models
 
class TimeStep(models.Model):
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    object_id = models.PositiveIntegerField()
    content_object = GenericForeignKey('content_type', 'object_id')
 
    timestamp = models.DateTimeField()
    count = models.PositiveIntegerField(blank=True, null=True)
 
    class Meta:
        ordering = ['timestamp']
 
    def __str__(self):
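        # the field is named count (there is no value attribute); compare against None so 0 still prints
        return '{} - {}'.format(self.timestamp, self.count if self.count is not None else '[empty]')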

Finally, make and run the migrations:

./manage.py makemigrations
./manage.py migrate

Running makemigrations may cause Django to prompt you for a default value for the content_type and object_id fields, since they are no longer nullable. If it does, there should be an option to skip the default because the existing rows were already handled by a data migration; select that option.

I added a couple new methods to TimeStep to make querying easier:

from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.db import models
 
class Outbreak(models.Model):
    ...
 
class ForecastSeries(models.Model):
    ...
 
class TimeStep(models.Model):
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    object_id = models.PositiveIntegerField()
    content_object = GenericForeignKey('content_type', 'object_id')
 
    timestamp = models.DateTimeField()
    count = models.PositiveIntegerField(blank=True, null=True)  # we allow for a null blank count so that we can allow the analyst to specify the start/end of the interval
 
    # store the content types for each type of object that will use TimeStep
    _content_types = dict()
    _content_types[Outbreak] = ContentType.objects.get_for_model(Outbreak)
    _content_types[ForecastSeries] = ContentType.objects.get_for_model(ForecastSeries)
 
    class Meta:
        ordering = ['timestamp']
 
    def __str__(self):
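        # the field is named count (there is no value attribute); compare against None so 0 still prints
        return '{} - {}'.format(self.timestamp, self.count if self.count is not None else '[empty]')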
 
    @staticmethod
    def get_outbreak_timesteps(outbreak):
        """
            Get a particular Outbreak's time series. You *cannot* do `TimeStep.objects.filter(outbreak=o)` because this class
            uses a GenericForeignKey.
        """
        return TimeStep.objects.filter(content_type=TimeStep._content_types[Outbreak], object_id=outbreak.pk)
 
    @staticmethod
    def get_forecast_series_timesteps(forecast_series):
        """
            Get a particular ForecastSeries' time series.
        """
        return TimeStep.objects.filter(content_type=TimeStep._content_types[ForecastSeries], object_id=forecast_series.pk)

Because I can no longer do TimeStep.objects.filter(outbreak=o), querying is a tad more complex, but it doesn’t have to be. These methods allow me to do TimeStep.get_outbreak_timesteps(o) or TimeStep.get_forecast_series_timesteps(f) to pull a particular model’s time series.

That’s it! Nice and easy. Let me know if you find this useful or have any questions.

Installing the requirements for Pillow 3 on Debian

I’m installing Mezzanine, a CMS written in Django, for a project. Mezzanine requires Pillow, an imaging library for Python. Pillow requires/recommends a number of libraries. It took me a little bit to figure out how to get (mostly) everything working on Debian 8.2 (Jessie). Here’s a command to install all of Pillow’s requirements in one fell swoop:

sudo aptitude install libjpeg62-turbo-dev libopenjpeg-dev libfreetype6-dev libtiff5-dev liblcms2-dev libwebp-dev tk8.6-dev

When I run pip install -v pillow, I see this in the output:

PIL SETUP SUMMARY
--------------------------------------------------------------------
version  Pillow 3.0.0
platform linux 3.4.2 (default, Oct  8 2014, 10:45:20)
 [GCC 4.9.1]
--------------------------------------------------------------------
*** TKINTER support not available
--- JPEG support available
*** OPENJPEG (JPEG2000) support not available
--- ZLIB (PNG/ZIP) support available
--- LIBTIFF support available
--- FREETYPE2 support available
--- LITTLECMS2 support available
--- WEBP support available
--- WEBPMUX support available
--------------------------------------------------------------------
To add a missing option, make sure you have the required
library, and set the corresponding ROOT variable in the
setup.py script.

Unfortunately, it doesn’t seem like Pillow recognizes TCL/TK (despite the fact that I installed tk8.6-dev, which includes tcl8.6-dev), so I can’t get Tkinter support working. I also installed OpenJPEG via libopenjpeg-dev, but the version in Debian seems to be too old:

$ aptitude show libopenjpeg-dev
Package: libopenjpeg-dev                 
State: installed
Automatically installed: no
Multi-Arch: same
Version: 1:1.5.2-3
Priority: extra
Section: libdevel
Maintainer: Debian PhotoTools Maintainers <pkg-phototools-devel@lists.alioth.debian.org>
Architecture: amd64
Uncompressed Size: 111 k
Depends: libopenjpeg5 (= 1:1.5.2-3)
Description: development files for OpenJPEG, a JPEG 2000 image library - dev
 OpenJPEG is a library for handling the JPEG 2000 image compression format. JPEG 2000 is a wavelet-based image compression standard and permits progressive transmission by pixel and resolution accuracy for progressive downloads of an encoded image. It supports lossless and lossy compression, supports higher compression than JPEG 1991, and has resilience to
 errors in the image. 
 
 This is the development package
Homepage: http://www.openjpeg.org

Tags: devel::library, role::devel-lib

The Pillow docs state that versions 2.0.0 and 2.1.0 are supported, so 1.5.2-3 must be too old.

If someone knows how to get Tkinter or OpenJPEG working, please let me know in the comments! I don’t think it’ll matter much in the end, but it’d be nice to have all of Pillow’s functionality available.

UPDATE: I was able to get Tkinter working with the help of the Pillow devs. Issue #1473 has the full discussion, but the main takeaway is that I had to install python3-tk, which enables Tkinter support.

Additionally, the Pillow docs actually contain a Building on Linux section that I missed before. It more or less echoes what I lay out in this blog post. This is the final command I had to use:

sudo aptitude install libjpeg62-turbo-dev libopenjpeg-dev libfreetype6-dev libtiff5-dev liblcms2-dev libwebp-dev tk8.6-dev python3-tk

Unfortunately, OpenJPEG still isn’t supported, but that’s just because Pillow requires a newer version than is contained in the Debian repos; build it from source if you need it, and you should be good to go.

yelpapi updated with Phone Search API support

Recently, Yelp added a new API, a Phone Search API. This allows the user to look up businesses by phone number. I just finished adding Phone Search API support to my yelpapi Python project. As far as I can tell from looking around, none of the other Yelp v2.0 API implementations support the Phone Search API yet. It also appears as though all the other Yelp API implementations still rely on pre-defined classes to represent search results, so I’m sure it’ll be a while before they add support for the new API.

pyHarmonySearch now supports Python 3+

As promised yesterday, pyHarmonySearch, my open source pure Python implementation of the harmony search algorithm, now fully supports Python 3. As with yelpapi, it was actually a really simple process. Only a few lines of code needed to change.

Also of note, pyHarmonySearch now properly handles KeyboardInterrupt exceptions. pyHarmonySearch uses Python’s multiprocessing.Pool to run multiple searches simultaneously. multiprocessing.Pool doesn’t natively handle KeyboardInterrupt exceptions, so special care must be given to ensure proper termination of the pool. The solution I used comes from this Stack Overflow question.
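For anyone curious what that "special care" can look like, here's a minimal sketch of one common pattern. It isn't necessarily the exact code pyHarmonySearch uses; square, ignore_sigint, and run_all are hypothetical names, and square just stands in for a long-running search. Worker processes ignore Ctrl-C, so only the parent sees the KeyboardInterrupt and can shut the pool down.

```python
import multiprocessing
import signal


def ignore_sigint():
    """Pool initializer: workers ignore Ctrl-C so only the parent handles it."""
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def square(x):
    """Hypothetical stand-in for one long-running search."""
    return x * x


def run_all(inputs, processes=2, timeout=60):
    """Run square over inputs in a pool, cleaning up on Ctrl-C or any error."""
    pool = multiprocessing.Pool(processes, initializer=ignore_sigint)
    try:
        # Passing a timeout to get() is the usual workaround for older
        # Python versions where a bare get() could swallow Ctrl-C.
        results = pool.map_async(square, inputs).get(timeout)
    except BaseException:
        pool.terminate()  # KeyboardInterrupt, timeout, or a worker error
        raise
    else:
        pool.close()      # normal completion: no more work will be submitted
    finally:
        pool.join()       # wait for the worker processes to exit
    return results


if __name__ == "__main__":
    print(run_all(range(5)))
```

The important part is that the pool is always either closed or terminated before join() is called, so the script never hangs on leftover workers.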

yelpapi now supports Python 3+

My Yelp v2.0 API Python implementation, yelpapi, is now fully Python 3+ compliant. In my work, I still mostly use Python 2.7 (although I’m starting to think very seriously about migrating to Python 3), so that’s what I develop for. However, I got the urge today to make yelpapi Python 3+ compliant to reach a broader audience. Turns out it was pretty easy. I really only had to make a handful of small changes (see the commit log for exact details). If you find this project useful, be sure to let me know!

Tomorrow, I’ll probably spend a little time making pyHarmonySearch Python 3+ compatible.

Using requests_oauthlib in yelpapi

Just yesterday, I announced a new open source project called yelpapi that implements the Yelp v2.0 API in Python. I was doing a little looking around (digging around the requests documentation), and I discovered the official requests extension for OAuth, requests-oauthlib. requests-oauthlib inherits from requests and has all of the same power and functionality, plus OAuth support. It just so happens that Yelp’s API uses OAuth 1. I decided to migrate from using a combination of requests and python-oauth2 to the simpler requests-oauthlib. The end result is that yelpapi is now 20 lines shorter. requests-oauthlib is a really slick way to deal with OAuth.

Additionally, I migrated yelpapi, pyxDamerauLevenshtein, and pyHarmonySearch from distutils to setuptools for installation. setuptools offers some nice additions to distutils, such as install_requires, a directive that pip uses to ensure you have all dependencies. I’m bundling ez_setup.py to manage setuptools installation if necessary.

Introducing yelpapi, a pure Python implementation of the Yelp v2.0 API!

I just released yelpapi on GitHub. yelpapi is a pure Python implementation of the Yelp v2.0 API. I created yelpapi because I wanted an implementation that was completely flexible with regard to both input (i.e., it uses **kwargs) and output (i.e., it returns a dynamically created dict from the JSON). Most API implementations tend to (in my opinion) be over-designed by specifying classes and functions for every possible input/output. I chose to go with a much simpler view; I recognize that the programmer is capable of understanding and designing the data that need to be passed to the service but just doesn’t want the hassle of dealing with networking and error-handling. The result is that I was able to implement the entire Yelp v2.0 API in only 125 lines (most of which are white space or comments). Additionally, this means that my implementation is robust to many changes that Yelp might implement in the future. For example, if Yelp decides to add a new parameter to the search API, or if they decide to return results in a slightly different manner, my code won’t need to change.

My hope is that more people design robust API implementations like what I’ve done here. My Yelp API implementation design can pretty easily be extended to other APIs without much effort.
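To illustrate the design (not yelpapi's actual code), here's a small sketch of the two halves: keyword arguments go straight into the query string, and the JSON response comes straight back as a dict. BASE_URL, build_url, and parse_response are hypothetical names, the URL is a placeholder rather than Yelp's real endpoint, and the networking and OAuth signing are left out entirely.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.com/v2/"  # placeholder, not Yelp's real endpoint


def build_url(endpoint, **kwargs):
    """Turn arbitrary keyword arguments into a query string.

    Nothing is validated here: a parameter the service adds later
    passes through without any change to this code.
    """
    query = urlencode(sorted(kwargs.items()))  # sorted for a stable URL
    return "{}{}?{}".format(BASE_URL, endpoint, query)


def parse_response(body):
    """Return whatever JSON the service sent, as plain dicts and lists."""
    return json.loads(body)


url = build_url("search", term="coffee", location="Santa Fe")
result = parse_response('{"total": 1, "businesses": [{"name": "Example Cafe"}]}')
```

Because neither function knows anything about specific parameters or response fields, a new field in the response simply shows up as a new key in the returned dict.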

If you find this implementation useful, let me know!

GitHub: https://github.com/gfairchild/yelpapi
PyPI: https://pypi.python.org/pypi/yelpapi

Simple Unix find/replace using Python

Find/replace in Unix isn’t very friendly. Sure, you can use sed, but it uses fairly nasty syntax that I always forget:

sed -i.bak 's/STRING_TO_FIND/STRING_TO_REPLACE/g' filename

I wanted something really simple that’s more user-friendly. I turn to Python:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#!/usr/bin/env python
 
"""
    Replace all instances of a string in the specified file.
"""
 
import argparse
import fileinput
 
#deal with command line arguments
argparser = argparse.ArgumentParser(description='Find/replace strings in a file.')
argparser.add_argument('file', type=str, help='file on which to perform the find/replace')
argparser.add_argument('find_string', type=str, help='string to find')
argparser.add_argument('replace_string', type=str, help='string that replaces find_string')
args = argparser.parse_args()
 
for line in fileinput.input(args.file, inplace=1):
    print line.replace(args.find_string, args.replace_string), #trailing comma prevents newline

That’s it. Toss this into a file called find_replace.py and optionally put it on your PATH. Here’s an example where I replace all instances of <br> with <br/> in an HTML file:

find_replace.py index.html "<br>" "<br/>"

Here’s an example where I use GNU Parallel to do the same find/replace on all HTML files in a directory:

find . -name '*.html' | parallel "find_replace.py {} '<br>' '<br/>'"

Much more user-friendly than sed!

This certainly works, and the code is incredibly simple, but fileinput is really geared towards reading lots of files. Perhaps more important is that there’s no error handling here. I could (and probably should) surround lines 17 and 18 (the for loop) with try-except, but I much prefer using with for file I/O. Unfortunately, with support for fileinput wasn’t added until Python 3.2 (I’m using 2.7). And personally, I think that while the inplace parameter is pretty cool, it’s dangerous because it’s not particularly intuitive. A better, although slightly longer, solution is to manually read the file line by line, write the changes to a temp file, and then copy the temp file’s contents back over the original. Here’s a more “proper” solution:

#!/usr/bin/env python
 
"""
    Replace all instances of a string in the specified file.
"""
 
import argparse
import tempfile
from os import fsync
 
#deal with command line arguments
argparser = argparse.ArgumentParser(description='Find/replace strings in a file.')
argparser.add_argument('file', type=str, help='file on which to perform the find/replace')
argparser.add_argument('find_string', type=str, help='string to find')
argparser.add_argument('replace_string', type=str, help='string that replaces find_string')
args = argparser.parse_args()
 
#open 2 files - args.file for reading, and a temporary file for writing
with open(args.file, 'r+') as input, tempfile.TemporaryFile(mode='w+') as output:
    #write replaced content to temp file
    for line in input:
        output.write(line.replace(args.find_string, args.replace_string))
    #write all cached content to disk - flush followed by fsync
    output.flush()
    fsync(output.fileno())
    #go back to beginning to copy data over
    input.seek(0)
    output.seek(0)
    #copy output lines to input
    for line in output:
        input.write(line)
    #remove any excess stuff from input
    input.truncate()

This code uses with, so both files are closed automatically even if an error occurs, and it’s written specifically to handle a single file (unlike fileinput), so it should be more efficient.

Compared to sed, this doesn’t currently allow for regular expressions, but that would be fairly trivial to add in; perhaps an extra command-line argument indicating that find_string is a regular expression should be added.
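Here's a rough sketch of what that extra argument could look like. The --regex flag and the replace_text helper are hypothetical names, not part of the script above, and the file handling is left out so that only the replacement logic is shown.

```python
import argparse
import re


def replace_text(text, find_string, replace_string, use_regex=False):
    """Replace find_string in text, literally or as a regular expression."""
    if use_regex:
        return re.sub(find_string, replace_string, text)
    return text.replace(find_string, replace_string)


# Hypothetical wiring: an optional --regex switch next to the existing arguments.
argparser = argparse.ArgumentParser(description='Find/replace strings in a file.')
argparser.add_argument('find_string', type=str, help='string to find')
argparser.add_argument('replace_string', type=str, help='string that replaces find_string')
argparser.add_argument('--regex', action='store_true',
                       help='treat find_string as a regular expression')

# An explicit argument list is parsed so the sketch runs without a real command line.
args = argparser.parse_args([r'(?i)<br\s*>', '<br/>', '--regex'])

converted = replace_text('a<br>b<BR >', args.find_string, args.replace_string, args.regex)
literal = replace_text('a<br>b<br>', '<br>', '<br/>')
```

One thing to keep in mind with this approach: in regex mode, backslashes in replace_string are interpreted by re.sub (for group references like \1), so the replacement isn't purely literal anymore.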

pyxDamerauLevenshtein is now live on GitHub!

I’ve released another open source Python project on GitHub today called pyxDamerauLevenshtein. pyxDamerauLevenshtein implements the Damerau-Levenshtein (DL) edit distance algorithm for Python in Cython for high performance. The DL edit distance algorithm is a widely used approximate string matching (aka fuzzy matching) algorithm. It’s useful for determining if two strings are “similar enough” to each other.

This project is based on Michael Homer’s pure Python implementation. It runs in O(N*M) time using O(M) space and supports unicode characters. Since it’s implemented in C via Cython, it is two orders of magnitude faster than the equivalent pure Python implementation.
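For reference, here's a short pure-Python sketch of how the O(M) space bound can be achieved: only three rows of the distance table are kept at any time. It computes the restricted (optimal string alignment) form of the distance; whether that matches the library's behavior in every edge case is an assumption on my part, and osa_distance is a name I made up for this example.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Runs in O(N*M) time and keeps only three rows of length M+1.
    """
    two_ago = None
    one_ago = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i in range(1, len(a) + 1):
        this_row = [i] + [0] * len(b)  # first column: delete i characters of a
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            this_row[j] = min(
                one_ago[j] + 1,         # deletion
                this_row[j - 1] + 1,    # insertion
                one_ago[j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                this_row[j] = min(this_row[j], two_ago[j - 2] + 1)
        two_ago, one_ago = one_ago, this_row
    return one_ago[len(b)]
```

For example, osa_distance('smtih', 'smith') is 1, because swapping the adjacent "t" and "i" counts as a single edit, whereas plain Levenshtein distance would count it as 2.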

Let me know if you find this code useful!

pyHarmonySearch is now available on PyPI

I spent most of my Saturday refactoring pyHarmonySearch. Before, it was a little un-Pythonic. A call looked like this:

python harmony_search.py test_continuous_seed ObjectiveFunction

This was done because I originally wrote pyHarmonySearch with research in mind. I had harmony_search.py in the same directory as the objective function implementation, so it just made sense. However, since releasing pyHarmonySearch to the public, I knew I needed to refactor it to be more Pythonic. The same call now looks like this:

python test_continuous_seed.py

The harmony search implementation itself now lives in a class completely separated from everything else. This allowed me to write a setup.py file to install the code globally. Furthermore, it allowed me to publish the code on PyPI, the official Python package index. pyHarmonySearch can now be installed using pip, the package manager everyone uses for Python:

pip install pyHarmonySearch

https://pypi.python.org/pypi/pyHarmonySearch is the official location of pyHarmonySearch on PyPI. These changes really simplified how pyHarmonySearch is used, and I hope it’ll make it more user-friendly.

As a side note, I’m honestly really impressed with how easy it was to publish pyHarmonySearch on PyPI. It was literally as easy as:

python setup.py register
python setup.py sdist --formats=gztar upload
python setup.py bdist_wininst upload
python setup.py build --plat-name=win32 bdist_wininst upload

This uploads the source as well as 32-bit and 64-bit installers for Windows. Bravo, Python, bravo.