
yelpapi now supports Python 3+

My Yelp v2.0 API Python implementation, yelpapi, is now fully Python 3+ compliant. In my work, I still mostly use Python 2.7 (although I’m starting to think very seriously about migrating to Python 3), so that’s what I develop for. However, I got the urge today to make yelpapi Python 3+ compliant to reach a broader audience. Turns out it was pretty easy. I really only had to make a handful of small changes (see the commit log for exact details). If you find this project useful, be sure to let me know!

Tomorrow, I’ll probably spend a little time making pyHarmonySearch Python 3+ compatible.

Using requests_oauthlib in yelpapi

Just yesterday, I announced a new open source project called yelpapi that implements the Yelp v2.0 API in Python. While digging around the requests documentation, I discovered the official requests extension for OAuth, requests-oauthlib. requests-oauthlib builds directly on requests, so it has all of the same power and functionality, plus OAuth support. It just so happens that Yelp’s API uses OAuth 1. I decided to migrate from a combination of requests and python-oauth2 to the simpler requests-oauthlib. The end result is that yelpapi is now 20 lines shorter. requests-oauthlib is a really slick way to deal with OAuth.
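To give a sense of how little code a signed request takes, here’s a rough sketch using requests-oauthlib’s OAuth1Session (the endpoint URL and parameters are illustrative placeholders based on Yelp’s documentation, not yelpapi’s actual code):

from requests_oauthlib import OAuth1Session

#Yelp's v2.0 API uses OAuth 1.0a; the four credentials come from Yelp's
#developer site (placeholders here)
session = OAuth1Session('CONSUMER_KEY',
                        client_secret='CONSUMER_SECRET',
                        resource_owner_key='TOKEN',
                        resource_owner_secret='TOKEN_SECRET')

#an OAuth1Session behaves just like a regular requests session, so a signed
#API call is a single get()
response = session.get('http://api.yelp.com/v2/search',
                       params={'term': 'coffee', 'location': 'Albuquerque, NM'})
results = response.json() #plain dict parsed from the returned JSON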

Additionally, I migrated yelpapi, pyxDamerauLevenshtein, and pyHarmonySearch from distutils to setuptools for installation. setuptools offers some nice additions to distutils, such as install_requires, a directive that pip uses to ensure you have all dependencies. I’m bundling ez_setup.py to manage setuptools installation if necessary.
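For reference, the skeleton of such a setup.py looks roughly like this (the metadata below is a placeholder sketch rather than any of these projects’ actual files):

#setup.py -- minimal setuptools sketch; ez_setup.py (bundled alongside)
#bootstraps setuptools if it isn't already installed
from ez_setup import use_setuptools
use_setuptools()

from setuptools import setup

setup(
    name='yelpapi',
    version='1.0', #placeholder version
    packages=['yelpapi'],
    #pip reads install_requires and pulls in dependencies automatically
    install_requires=['requests-oauthlib'],
)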

Introducing yelpapi, a pure Python implementation of the Yelp v2.0 API!

I just released yelpapi on GitHub. yelpapi is a pure Python implementation of the Yelp v2.0 API. I created yelpapi because I wanted an implementation that is completely flexible with regard to both input (i.e., it passes **kwargs straight through) and output (i.e., it returns a dict built dynamically from the JSON response). Most API implementations tend to (in my opinion) be over-designed, specifying classes and functions for every possible input and output. I chose a much simpler view: I recognize that programmers are capable of understanding and designing the data that need to be passed to the service but just don’t want to deal with the networking and error handling. The result is that I was able to implement the entire Yelp v2.0 API in only 125 lines (most of which are whitespace or comments). Additionally, this means my implementation is robust to many changes Yelp might make in the future. For example, if Yelp adds a new parameter to the search API, or if they decide to return results in a slightly different manner, my code won’t need to change.
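To give a feel for the design, here’s a hypothetical usage sketch (the constructor and method names are illustrative, and the response fields follow Yelp’s documented JSON format; see the GitHub README for the real interface):

from yelpapi import YelpAPI

#the constructor and method names here are illustrative -- check the README
#for the exact interface
yelp_api = YelpAPI('CONSUMER_KEY', 'CONSUMER_SECRET', 'TOKEN', 'TOKEN_SECRET')

#every Search API parameter is passed straight through as a keyword argument
response = yelp_api.search_query(term='ice cream', location='Austin, TX', limit=5)

#the response is just the parsed JSON, so Yelp's documentation maps directly
#onto the returned dict
for business in response['businesses']:
    print business['name']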

My hope is that more people design robust API implementations like this one; the design can be extended to other APIs with little effort.

If you find this implementation useful, let me know!

GitHub: https://github.com/gfairchild/yelpapi
PyPI: https://pypi.python.org/pypi/yelpapi

Simple Unix find/replace using Python

Find/replace in Unix isn’t very friendly. Sure, you can use sed, but it uses fairly nasty syntax that I always forget:

sed -i.bak 's/STRING_TO_FIND/STRING_TO_REPLACE/g' filename

I wanted something really simple that’s more user-friendly. I turn to Python:

#!/usr/bin/env python
 
"""
    Replace all instances of a string in the specified file.
"""
 
import argparse
import fileinput
 
#deal with command line arguments
argparser = argparse.ArgumentParser(description='Find/replace strings in a file.')
argparser.add_argument('file', type=str, help='file on which to perform the find/replace')
argparser.add_argument('find_string', type=str, help='string to find')
argparser.add_argument('replace_string', type=str, help='string that replaces find_string')
args = argparser.parse_args()
 
for line in fileinput.input(args.file, inplace=1):
    print line.replace(args.find_string, args.replace_string), #trailing comma prevents newline

That’s it. Toss this into a file called find_replace.py, make it executable, and optionally put it on your PATH. Here’s an example where I replace all instances of <br> with <br/> in an HTML file:

find_replace.py index.html "<br>" "<br/>"

Here’s an example where I use GNU Parallel to do the same find/replace on all HTML files in a directory:

find . -name '*.html' | parallel "find_replace.py {} '<br>' '<br/>'"

Much more user-friendly than sed!

This certainly works, and the code is incredibly simple, but fileinput is really geared toward reading many files. Perhaps more important is that there’s no error handling here. I could (and probably should) wrap the final fileinput loop in a try/except, but I much prefer using with for file I/O. Unfortunately, with support for fileinput wasn’t added until Python 3.2 (I’m using 2.7). And personally, I think that while the inplace parameter is pretty cool, it’s dangerous because it’s not particularly intuitive. A better, although slightly longer, solution is to read the file, write the changed content to a temporary file, and then copy the temporary file’s contents back over the original. Here’s a more “proper” solution:

#!/usr/bin/env python
 
"""
    Replace all instances of a string in the specified file.
"""
 
import argparse
import tempfile
from os import fsync
 
#deal with command line arguments
argparser = argparse.ArgumentParser(description='Find/replace strings in a file.')
argparser.add_argument('file', type=str, help='file on which to perform the find/replace')
argparser.add_argument('find_string', type=str, help='string to find')
argparser.add_argument('replace_string', type=str, help='string that replaces find_string')
args = argparser.parse_args()
 
#open 2 files - args.file for reading, and a temporary file for writing
with open(args.file, 'r+') as input, tempfile.TemporaryFile(mode='w+') as output:
    #write replaced content to temp file
    for line in input:
        output.write(line.replace(args.find_string, args.replace_string))
    #write all cached content to disk - flush followed by fsync
    output.flush()
    fsync(output.fileno())
    #go back to beginning to copy data over
    input.seek(0)
    output.seek(0)
    #copy output lines to input
    for line in output:
        input.write(line)
    #remove any excess stuff from input
    input.truncate()

This code uses with, so the files are closed properly even if an exception occurs, and it’s written specifically to handle a single file (unlike fileinput), so it should be a bit more efficient.

Compared to sed, this doesn’t currently support regular expressions, but that would be fairly trivial to add; perhaps an extra command-line flag indicating that find_string is a regular expression.
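For example, a regex mode might look something like the sketch below. For brevity, this version reads the whole file into memory and writes it back in place rather than going through the temp file dance above; treat it as a starting point, not a drop-in replacement.

#!/usr/bin/env python

"""
    Find/replace in a file, with an optional --regex flag (a sketch only).
"""

import argparse
import re

#deal with command line arguments
argparser = argparse.ArgumentParser(description='Find/replace strings in a file.')
argparser.add_argument('file', type=str, help='file on which to perform the find/replace')
argparser.add_argument('find_string', type=str, help='string (or regular expression) to find')
argparser.add_argument('replace_string', type=str, help='string that replaces find_string')
argparser.add_argument('--regex', action='store_true', help='treat find_string as a regular expression')
args = argparser.parse_args()

#read the whole file, perform the replacement, and write the result back
with open(args.file, 'r') as input_file:
    content = input_file.read()

if args.regex:
    content = re.sub(args.find_string, args.replace_string, content)
else:
    content = content.replace(args.find_string, args.replace_string)

with open(args.file, 'w') as output_file:
    output_file.write(content)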

Securing against BEAST/CRIME/BREACH attacks

July 11, 2016 update: Simplify your life and just use Let’s Encrypt. It’s brain dead simple to use and automatically configures everything for you. The default security settings are essentially identical to Mozilla’s intermediate compatibility TLS settings (see options-ssl-apache.conf).

October 18, 2014 update: This information is outdated. Mozilla’s Security/Server Side TLS guide is much more comprehensive and should be used instead. It addresses BEAST, CRIME, BREACH, and POODLE and is consistently updated as new vulnerabilities are discovered.

I maintain a domain that requires SSL. It had been using the standard 1024-bit keys that OpenSSL generates, along with stock Apache VirtualHost entries. After the various TLS exploits that have been revealed over the last few years, I spent some time looking into locking down my site.

First, I generate strong RSA keys. Very strong. 2048-bit keys are the current standard, but I opted for 4096-bit keys. No attack has been shown on 2048-bit keys, and 4096-bit keys do carry slightly more overhead, but I don’t mind; luckily, Linode (my host) just recently upgraded all of its CPUs. Security is all I care about, and a little CPU overhead is worth it.

To start, I create the 4096-bit key and self-signed certificate that Apache will use:

cd /etc/apache2
sudo mkdir ssl
cd ssl
sudo openssl req -x509 -nodes -days 365 -newkey rsa:4096 -keyout [private_key_name].key -out [certificate_name].pem
sudo chmod 600 *

Then, I instruct Apache to use them. My VirtualHost file looks like this:

<VirtualHost [IPv4_address]:80 [IPv6_address]:80>
	ServerName [domain].com
	Redirect permanent / https://[domain].com/
</VirtualHost>
 
<VirtualHost [IPv4_address]:443 [IPv6_address]:443>
	ServerAdmin [admin_email]
	ServerName [domain].com
	DocumentRoot /srv/www/[domain].com/public_html/
	ErrorLog /srv/www/[domain].com/logs/error.log
	CustomLog /srv/www/[domain].com/logs/access.log combined
 
	SSLEngine On
	SSLCertificateFile /etc/apache2/ssl/[certificate_name].pem
	SSLCertificateKeyFile /etc/apache2/ssl/[private_key_name].key
	SSLHonorCipherOrder On
	SSLCipherSuite ECDHE-RSA-AES128-SHA256:AES128-GCM-SHA256:RC4:HIGH:!MD5:!aNULL:!EDH
	SSLProtocol -ALL +TLSv1
	SSLCompression Off
</VirtualHost>

That’s it. Restart Apache, and the new configuration takes effect. It’s the last few lines that really lock it down:

SSLHonorCipherOrder On
SSLCipherSuite ECDHE-RSA-AES128-SHA256:AES128-GCM-SHA256:RC4:HIGH:!MD5:!aNULL:!EDH
SSLProtocol -ALL +TLSv1
SSLCompression Off

These lines specify the cipher suites (the combinations of encryption and authentication algorithms the browser and server are allowed to use) as well as the SSL/TLS protocols permitted.

The SSLHonorCipherOrder and SSLCipherSuite recommendation comes from http://blog.ivanristic.com/2011/10/mitigating-the-beast-attack-on-tls.html (Ivan Ristić was the original developer of mod_security and is very active in the SSL world). These directives define which cipher suites may be used, in order of preference, and tell Apache to enforce that order rather than deferring to the client’s preference. As browser security improves (most browsers still lag behind in TLS 1.2 support, for example), this list and its ordering will likely change to favor stronger cipher suites.

The SSLProtocol line is a common way to disable SSL v2 and v3 and allow only TLS. SSL v2 is flawed in several serious ways and should be disallowed, and SSL v3 is considered less secure than TLS v1+. All modern browsers support TLS v1, so I’m not alienating any users here.

The SSLCompression line is important for preventing the CRIME and BREACH attacks, which exploit SSL compression. Note that this directive is only available in relatively recent Apache releases (late 2.2.x and 2.4.x).
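As a quick local sanity check of the protocol restrictions (a sketch assuming Python 2.7 with an ssl module built against an OpenSSL that still offers SSLv3 handshakes; the hostname is a placeholder), something like this will show SSLv3 being refused while TLS v1 still works:

import socket
import ssl

HOST = 'www.example.com' #placeholder

#attempt a handshake with each protocol version; SSLv3 should be refused
#and TLSv1 should succeed if the SSLProtocol line is doing its job
for name, version in [('SSLv3', ssl.PROTOCOL_SSLv3), ('TLSv1', ssl.PROTOCOL_TLSv1)]:
    sock = socket.create_connection((HOST, 443), timeout=5)
    try:
        ssl.wrap_socket(sock, ssl_version=version)
        print name, 'handshake succeeded'
    except (ssl.SSLError, socket.error):
        print name, 'handshake refused'
    finally:
        sock.close()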

Finally, when all is said and done, you can visit Qualys SSL Labs to test the security of your site. If you’re using a self-signed certificate like mine, you’ll always get a failing grade because the certificate isn’t trusted. That isn’t a big deal for my purposes; what’s important are the protocol support, key exchange, and cipher strength ratings. Using the configuration above, I currently score at least 90 on all three of those ratings.

Ivan’s recent post, Configuring Apache, Nginx, and OpenSSL for Forward Secrecy, is also worth reading here. Of special note is the section on RC4 vs. BEAST:

Today, only TLS 1.2 with GCM suites offer fully robust security. All other suites suffer from one problem or another (e.g., RC4, Lucky 13, BEAST), but most are difficult to exploit in practice. Because GCM suites are not yet widely supported, most communication today is carried out using one of the slightly flawed cipher suites. It is not possible to do better if you’re running a public web site.

The one choice you can make today is whether to prioritize RC4 in most cases. If you do, you will be safe against the BEAST attack, but vulnerable to the RC4 attacks. On the other hand, if you remove RC4, you will be vulnerable against BEAST, but the risk is quite small. Given that both issues are relatively small, the choice isn’t clear.

However, the trend is clear. Over time, RC4 attacks are going to get better, and the number of users vulnerable to the BEAST attack is going to get smaller.

I don’t use Ivan’s new suggestions because they require Apache 2.4+. I’m using Ubuntu 12.04 LTS, which ships with Apache 2.2. When 14.04 LTS comes out, I’ll likely transition to his crypto scheme.

2010 Census KML updates

Just a few days ago, I published some 2010 Census KML files. Today, I realized I left off the Placemark name field. This field is primarily useful when browsing in Google Earth:

Google Earth 2010 Census KML example

QGIS doesn’t have (obvious) functionality to map an attribute to the Placemark name field, but ogr2ogr, a command-line GIS format conversion tool, does. As an example, converting the 2010 CBSA shapefile is very simple:

ogr2ogr -f KML CBSA_2010Census_DP1.kml CBSA_2010Census_DP1.shp -dsco NameField=NAMELSAD10

I determined the NAMELSAD10 attribute name by inspecting the layer’s attribute table in QGIS.

I’ve also added a KML file for tracts to provide one more level in the hierarchy. Again, the larger files (the ZCTA and tract files) won’t currently open in Google Earth. These files are very big when uncompressed (1.5 GB+), and Google Earth, at least currently, is a 32-bit-only program, so it simply doesn’t have access to the memory necessary to deal with KML files this large. QGIS, though, can display them just fine.

Let me know if you find these useful!

2010 Census KML files

A while back, I generated some KML files and SQLite databases containing ZCTA and county boundaries, as well as some selected attributes (some crude population statistics), for the 2000 U.S. Census.

There were some issues with those data files; primarily, I didn’t handle multi-polygons correctly (a single entity, such as a ZCTA or county, can be made up of multiple polygons). Population data were correct, but boundary data weren’t. Today, I decided to take down the erroneous 2000 Census KML files and replace them with new KML files based on the 2010 Census. Here’s how I generated these:

  1. I downloaded the 2010 Census Demographic Profile 1 state, county, ZCTA, and CBSA shapefiles from the U.S. Census Bureau’s website: https://www.census.gov/geo/maps-data/data/tiger-data.html
  2. I converted each shapefile to KML using Quantum GIS.
  3. I zipped up each KML file to create a much smaller KMZ file.

There are several advantages this method has over my previous method of manually creating KML files from Census data (thank you Census Bureau for greatly improving your data for the 2010 Census!):

  1. Each KML file will have, in most cases, all the demographic information you’ll need (population counts of race, gender, age, etc.).
  2. Each KML file will have correct and complete boundary information, including multi-polygons.

There are a couple disadvantages compared to shapefiles:

  1. KML files have to be read sequentially and can be slow to read.
  2. Google Earth has a hard time dealing with really big KML files (such as the ZCTA KML file). On my work machine (OS X 10.7.5, i7 Sandy Bridge, 8 GB RAM), Google Earth (v7.1.1.1888) crashes. QGIS, though, can read and display the KML files just fine.

Without further ado, here are the KML files. Let me know if you find these useful!

2010 state KML: http://www.gfairchild.com/public/State_2010Census_DP1.kmz (4.5 MB)
2010 county KML: http://www.gfairchild.com/public/County_2010Census_DP1.kmz (41 MB)
2010 ZCTA KML: http://www.gfairchild.com/public/ZCTA_2010Census_DP1.kmz (293 MB)
2010 CBSA KML: http://www.gfairchild.com/public/CBSA_2010Census_DP1.kmz (17 MB)

Remember that a KMZ file is just a zipped-up KML file.

Each KML file contains the same attributes. Attributes are shorthand (e.g., DP0010033 is total male population age 60-64), so a lookup table is necessary to decipher them. http://www.gfairchild.com/public/DP_TableDescriptions.xls contains a description of each attribute.

pyxDamerauLevenshtein is now live on GitHub!

I’ve released another open source Python project on GitHub today called pyxDamerauLevenshtein. pyxDamerauLevenshtein implements the Damerau-Levenshtein (DL) edit distance algorithm for Python in Cython for high performance. The DL edit distance algorithm is a widely used approximate string matching (aka fuzzy matching) algorithm. It’s useful for determining if two strings are “similar enough” to each other.

This project is based on Michael Homer’s pure Python implementation. It runs in O(N*M) time using O(M) space and supports Unicode characters. Since it’s implemented in C via Cython, it is two orders of magnitude faster than the equivalent pure Python implementation.
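Usage is a single function call. The module and function names below are as I recall them from the project README, so double-check them against GitHub:

from pyxdameraulevenshtein import damerau_levenshtein_distance

#one transposition turns 'smtih' into 'smith', so the DL distance is 1
print damerau_levenshtein_distance('smtih', 'smith')

#a simple "similar enough" check for fuzzy matching
def is_similar(a, b, max_distance=2):
    return damerau_levenshtein_distance(a, b) <= max_distance

print is_similar('jellyfish', 'smellyfish') #distance 2 -> True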

Let me know if you find this code useful!

pyHarmonySearch is now available on PyPI

I spent most of my Saturday refactoring pyHarmonySearch. Before, it was a little un-Pythonic. A call looked like this:

python harmony_search.py test_continuous_seed ObjectiveFunction

This was done because I originally wrote pyHarmonySearch with research in mind. I had harmony_search.py in the same directory as the objective function implementation, so it just made sense. However, after releasing pyHarmonySearch to the public, I knew I needed to refactor it to be more Pythonic. The same call now looks like this:

python test_continuous_seed.py

The harmony search implementation itself now lives in a class that is completely separate from everything else. This let me write a setup.py file to install the code globally, and it also let me publish the code on PyPI, the official Python package index. pyHarmonySearch can now be installed using pip, the package manager everyone uses for Python:

pip install pyHarmonySearch

https://pypi.python.org/pypi/pyHarmonySearch is the official location of pyHarmonySearch on PyPI. These changes really simplified how pyHarmonySearch is used, and I hope they make it more user-friendly.
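To give a rough idea of the new shape of things, a class-based usage looks something like the sketch below. The import path, call signature, and objective function class here are assumptions for illustration only; the README on GitHub has the authoritative example.

#NOTE: the import and call signature below are assumptions, not the
#documented interface -- see the pyHarmonySearch README for the real thing.
from pyharmonysearch import harmony_search

#a user-written class that bundles the search parameters (variable bounds,
#HMS, HMCR, PAR, etc.) together with the fitness function
from my_objective import MyObjectiveFunction #hypothetical module

obj_fun = MyObjectiveFunction()
best_solution = harmony_search(obj_fun) #assumed signature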

As a side note, I’m honestly really impressed with how easy it was to publish pyHarmonySearch on PyPI. It was literally as easy as:

python setup.py register
python setup.py sdist --formats=gztar upload
python setup.py bdist_wininst upload
python setup.py build --plat-name=win32 bdist_wininst upload

This uploads the source as well as 32-bit and 64-bit installers for Windows. Bravo, Python, bravo.

pyHarmonySearch supports Python’s multiprocessing module

Yesterday, I added support for Python’s multiprocessing module to pyHarmonySearch. I use multiprocessing to run multiple harmony searches simultaneously. Since harmony search is stochastic, different results may be returned on each run. Some runs will be luckier than others, so I figured it makes sense to take advantage of multiple cores by running several independent searches in parallel. The resulting solution is the best one found across all of those runs.
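The underlying pattern is easy to sketch. The code below is not pyHarmonySearch’s actual implementation; it just illustrates the best-of-N-independent-runs idea with a toy stochastic search:

import multiprocessing
import random

def random_search(seed):
    """One independent stochastic run: a toy random search for the maximum
    of -(x - 3)**2 over [-100, 100]."""
    rng = random.Random(seed)
    best_x, best_value = None, float('-inf')
    for _ in xrange(10000):
        x = rng.uniform(-100, 100)
        value = -(x - 3) ** 2
        if value > best_value:
            best_x, best_value = x, value
    return best_value, best_x

if __name__ == '__main__':
    #run one search per core and keep the luckiest result
    pool = multiprocessing.Pool()
    results = pool.map(random_search, range(multiprocessing.cpu_count()))
    pool.close()
    pool.join()
    best_value, best_x = max(results)
    print 'best x found:', best_x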

I don’t have a rigorous proof demonstrating this, but I’ve seen better results in the test objective functions I’ve included on GitHub. On machines with many cores, better results could come at a minimal extra wall time cost. Furthermore, instead of running a single HS instance with a very large number of improvisations, it may be beneficial to simultaneously run multiple HS instances, each with a smaller number of improvisations. This is all conjecture, though, and it likely depends on the objective function being studied.