Adventures of a computer scientist.

My Ph.D. dissertation is live!

As I mentioned in my last post, I wanted to post when my dissertation became publicly available on Iowa’s website. It’s officially up!

Improving disease surveillance: sentinel surveillance network design and novel uses of Wikipedia

Definitely check it out, and let me know if you have any questions!

Call me Dr. Geoff

I don’t often write personal blog entries, but this warranted it. As of just a couple weeks ago, I am officially not a student anymore. I am not a student. I’ve been a student for, what, 25 years straight? To suddenly not be a student and have the freedoms (and salary) that come with that is jarring. And to top it off, not only am I not a student, but I now have a Ph.D. People can call me Dr. Geoff.

Here are some stats on my dissertation, titled Improving Disease Surveillance: Sentinel Surveillance Network Design and Novel Uses of Wikipedia:

  • page count: 151
  • word count: 34,573
  • character count (with spaces): 222,941
  • number of references: 198
  • number of tables: 10
  • number of figures: 16

The dissertation will be posted on The University of Iowa’s Institutional Repository some time soon. It’ll be open access. I’m really proud of it. Once it’s published, I’ll post a link here in case anyone wants to read it.

My defense couldn’t have gone better. All the publicity our Global Disease Monitoring and Forecasting with Wikipedia paper has gotten, which just so happens to be chapter 3 of my dissertation, couldn’t have been timed better. I may be a little biased, but chapter 4 of my dissertation, which uses some natural language processing techniques to elicit disease information from article content, is pretty damn cool stuff too. That paper should be submitted in about a month.

I’ll be sticking around Los Alamos National Laboratory (LANL). I’ve become quite fond of this place. I work with some amazingly talented people on some extremely cool work. I mean, we did a Reddit AMA that hit the front page! Besides that, LANL is located in a really neat little town that suits me perfectly; it has one of the best climates I’ve ever experienced, and it’s great for biking in the summer and snowboarding in the winter.

Overall, I feel like a tremendous stress has been lifted from my shoulders. I can now work with fewer distractions and more tenacity. Perhaps more importantly, I no longer feel guilty for doing fun things in my off time. Grad school has this inherent ability to make you feel guilty when you’re not working. It’s certainly nothing my advisor (Alberto Segre) or LANL mentor (Sara Del Valle) pushed on me; it’s just something all grad students feel. I’ve always maintained that it’s incredibly important to separate work from life, but when you’re in grad school, that’s often easier said than done.

During the decompression phase after my defense, I realized that I’ve never gone on a vacation. Sure, I’ve done little weekend snowboarding trips or backpacking trips, but I’ve never taken a real vacation. How could I? I’ve been a student practically since I was born! In the short term, I’m going to be snowboarding a lot. I’m going on a cruise with my sister and some good friends in late January. I want to travel a lot next summer; I’m thinking about a long motorcycle trip with my buddy Rajeev.

Whatever happens, I am done with school. Forever. Here’s to the next phase of life!

How to fix the Home/End/Page Up/Page Down keys for OS X terminal and vim

People all over the internet complain about Apple’s (incorrect) mapping of the Home, End, Page Up, and Page Down keys. I spend a lot of time in the terminal and in vim, and it’s important to me that these keys function properly. Here’s what I needed to do in order to get these keys working properly in the terminal and in vim:

  1. Open up the terminal preferences.
  2. Go to the Settings tab, and select the desired profile.
  3. Go to the profile’s Keyboard tab.
  4. Add (or edit) the Home key so that it sends this text to the shell: \033OH
  5. Add (or edit) the End key so that it sends this text to the shell: \033OF
  6. Add (or edit) the Page Up key so that it sends this text to the shell: \033[5~
  7. Add (or edit) the Page Down key so that it sends this text to the shell: \033[6~

There are some other commonly recommended sequences (e.g., \033[1~ instead of \033OH), but the sequences above are the only sequences I’ve found that work in both the terminal and vim.
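For anyone wondering what those strings actually are: \033 is the octal escape for the ESC character (decimal 27), and the rest of each sequence is plain ASCII. Here’s a small Python sketch that spells this out; the dictionary name is just something I made up for illustration.

```python
# The four sequences from the steps above, keyed by the key that sends them.
# KEY_SEQUENCES is an illustrative name, not part of any terminal API.
KEY_SEQUENCES = {
    'Home': '\033OH',
    'End': '\033OF',
    'Page Up': '\033[5~',
    'Page Down': '\033[6~',
}

for key, sequence in KEY_SEQUENCES.items():
    # repr() shows the leading ESC as \x1b, the hex spelling of octal \033
    print('{:<10}{!r}'.format(key, sequence))
```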

How to change a remote repository URL in Git

I just ran into a situation where I needed to change a remote URL for a personal repository in Git. The project lived on a server at work, but I’m going to be going out of town for several weeks starting tomorrow. I need this project, and unfortunately, I can’t access it from home due to the work firewall. What I decided to do is just move the repo to my personal server for now. Here’s how I did it (if it’s not obvious, I work over SSH).

First, I just wanted to see the current configuration:

~/Documents/project> git remote show origin
* remote origin
  Fetch URL: olduser@oldserver.com:/path/to/project.git
  Push  URL: olduser@oldserver.com:/path/to/project.git
  HEAD branch: master
  Remote branch:
    master tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (up to date)

Next, I need to SSH into the new server and create a new bare repo into which I’ll push my project. Since I store my git projects in /srv/git, I need to make sure I give the appropriate ownership to the project.

~$ cd /srv/git/
/srv/git$ sudo mkdir project.git
/srv/git$ sudo chown newuser:newuser project.git/
/srv/git$ cd project.git/
/srv/git/project.git$ git init --bare
Initialized empty Git repository in /srv/git/project.git/

The new server is now ready. All that’s left is for me to change the remote repo URL of the project on my local machine and then just push the project to the new server.

~/Documents/project> git remote set-url origin newuser@newserver.com:/srv/git/project.git
~/Documents/project> git push
Counting objects: 37567, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (37556/37556), done.
Writing objects: 100% (37567/37567), 88.91 MiB | 3.76 MiB/s, done.
Total 37567 (delta 4931), reused 0 (delta 0)
To newuser@newserver.com:/srv/git/project.git
 * [new branch]      master -> master

That’s it! All pushes/pulls from now on will happen with the new server. Pretty easy!
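If you ever need to do this for more than one project, the local half of the process could be scripted. The following is only a rough sketch, assuming Python 3 and a git new enough to have git remote get-url (2.7 or later); the function names are mine, and I push with --all here rather than the plain git push I used above so that every local branch goes to the new server.

```python
import subprocess


def remote_move_commands(new_url, remote='origin'):
    """Return the git argument lists for repointing a remote and pushing."""
    return [
        ['remote', 'set-url', remote, new_url],  # point the remote at the new server
        ['remote', 'get-url', remote],           # print the URL to confirm the change
        ['push', remote, '--all'],               # push every local branch
    ]


def move_remote(repo_dir, new_url, remote='origin'):
    """Run the commands inside repo_dir, stopping at the first failure."""
    for args in remote_move_commands(new_url, remote):
        subprocess.run(['git'] + args, cwd=repo_dir, check=True)
```

Calling move_remote('/path/to/project', 'newuser@newserver.com:/srv/git/project.git') would then do the same thing as the two commands above. The bare repo on the new server still has to exist first.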

eclipse-hasher updated and re-released on GitHub

Several years ago, I released an Eclipse plugin called Hasher. Hasher’s goal is to output values of common hash algorithms (MD5, SHA-1, SHA-256, SHA-384, and SHA-512 right now) of files selected in Eclipse. Hasher had fallen by the wayside and last worked under an early version of Eclipse 3. I recently had a need for it for a personal project and decided to update it. Turns out, quite a bit has changed. Most notably, Eclipse actions are now deprecated in favor of Eclipse commands. Code using commands is much cleaner, but it’s quite a bit different, so I essentially had to rewrite the entire plugin. Also, I had some dependency issues that plagued me for far too long, but thanks to Stack Overflow, I was finally able to get things straightened out.
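To give a feel for what Hasher computes, here’s a quick Python illustration of hashing one file with the same five algorithms. To be clear, this is not Hasher’s code (the plugin is written in Java); the function name and chunk size are just assumptions for the sketch.

```python
import hashlib

# hashlib names for the five algorithms listed above
ALGORITHMS = ('md5', 'sha1', 'sha256', 'sha384', 'sha512')


def hash_file(path, algorithms=ALGORITHMS, chunk_size=65536):
    """Return {algorithm name: hex digest} for the file at path."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, 'rb') as f:
        # read in chunks so large files never have to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b''):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```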

Hasher is now live on GitHub (https://github.com/gfairchild/eclipse-hasher), freshly tagged with v1.2. One of the things I learned during this rewrite is that there’s a lack of good examples and documentation out there for modern Eclipse plugins. I’m hopeful that Hasher can be useful to someone wanting to get into writing Eclipse plugins. Hasher is pretty simple right now, but it’s non-trivial (has external dependencies, interacts meaningfully with Eclipse – more than just Hello World). If you find yourself using it to learn, please let me know!

There’s still a to-do list. I want to make the output prettier using a custom view. A tree view or a table view (or perhaps some hybrid) would probably be ideal. I don’t know how to do a custom view yet, though, so that’ll add to the learning process. Also, I want to make use of Eclipse’s Jobs API. Right now, I’m just manually creating a new thread to do computations. This works and leaves the UI free to do its work, but it’s not elegant and doesn’t take advantage of several nice features Eclipse offers for background jobs.

If you use Hasher, let me know what you think!

pyHarmonySearch now supports Python 3+

As promised yesterday, pyHarmonySearch, my open source pure Python implementation of the harmony search algorithm, now fully supports Python 3. As with yelpapi, it was actually a really simple process. Only a few lines of code needed to change.

Also of note, pyHarmonySearch now properly handles KeyboardInterrupt exceptions. pyHarmonySearch uses Python’s multiprocessing.Pool to run multiple searches simultaneously. multiprocessing.Pool doesn’t natively handle KeyboardInterrupt exceptions, so special care must be given to ensure proper termination of the pool. The solution I used comes from this Stack Overflow question.
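For the curious, here’s a sketch of one commonly used pattern for this. It is not necessarily line-for-line what pyHarmonySearch does, and the names ignore_sigint, square, and run_all are made up for illustration. The idea is that the workers ignore Ctrl+C, so only the parent process sees the KeyboardInterrupt and can shut the pool down cleanly.

```python
import multiprocessing
import signal


def ignore_sigint():
    """Pool initializer: make each worker process ignore Ctrl+C."""
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def square(x):
    """Stand-in for one (hypothetical) search run."""
    return x * x


def run_all(func, inputs, num_processes=2, timeout=3600):
    """Run func over inputs in a pool; tear the pool down if anything goes wrong."""
    pool = multiprocessing.Pool(num_processes, initializer=ignore_sigint)
    try:
        # Waiting with a timeout was the usual workaround on Python 2, where a
        # bare get() could block the KeyboardInterrupt from being delivered.
        results = pool.map_async(func, inputs).get(timeout)
    except BaseException:
        # Covers KeyboardInterrupt, timeouts, and errors raised by func.
        pool.terminate()
        raise
    else:
        pool.close()
    finally:
        # join() is only legal after close() or terminate(), which the
        # branches above guarantee.
        pool.join()
    return results
```

To try it, call something like run_all(square, range(4)) from under an if __name__ == '__main__': guard and hit Ctrl+C while it runs; the guard matters on platforms that start workers by re-importing the main module.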

yelpapi now supports Python 3+

My Yelp v2.0 API Python implementation, yelpapi, is now fully Python 3+ compliant. In my work, I still mostly use Python 2.7 (although I’m starting to think very seriously about migrating to Python 3), so that’s what I develop for. However, I got the urge today to make yelpapi Python 3+ compliant to reach a broader audience. Turns out it was pretty easy. I really only had to make a handful of small changes (see the commit log for exact details). If you find this project useful, be sure to let me know!

Tomorrow, I’ll probably spend a little time making pyHarmonySearch Python 3+ compatible.

Using requests_oauthlib in yelpapi

Just yesterday, I announced a new open source project called yelpapi that implements the Yelp v2.0 API in Python. I was doing a little looking around (digging around the requests documentation), and I discovered the official requests extension for OAuth, requests-oauthlib. requests-oauthlib builds on requests (its OAuth1Session class extends requests’ own Session), so it has all of the same power and functionality, plus OAuth support. It just so happens that Yelp’s API uses OAuth 1. I decided to migrate from using a combination of requests and python-oauth2 to the simpler requests-oauthlib. The end result is that yelpapi is now 20 lines shorter. requests-oauthlib is a really slick way to deal with OAuth.

Additionally, I migrated yelpapi, pyxDamerauLevenshtein, and pyHarmonySearch from distutils to setuptools for installation. setuptools offers some nice additions to distutils, such as install_requires, a directive that pip uses to ensure you have all dependencies. I’m bundling ez_setup.py to manage setuptools installation if necessary.

Introducing yelpapi, a pure Python implementation of the Yelp v2.0 API!

I just released yelpapi on GitHub. yelpapi is a pure Python implementation of the Yelp v2.0 API. I created yelpapi because I wanted an implementation that was completely flexible with regard to both input (i.e., it uses **kwargs) and output (i.e., it returns a dynamically created dict from the JSON). Most API implementations tend to (in my opinion) be over-designed, specifying classes and functions for every possible input/output. I chose to go with a much simpler view; I recognize that the programmer is capable of understanding and designing the data that need to be passed to the service but just doesn’t want the hassle of dealing with networking and error-handling. The result is that I was able to implement the entire Yelp v2.0 API in only 125 lines (most of which are white space or comments). Additionally, this means that my implementation is robust to many changes that Yelp might implement in the future. For example, if Yelp decides to add a new parameter to the search API, or if they decide to return results in a slightly different manner, my code won’t need to change.
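To make the design concrete, here’s a stripped-down sketch of the idea using only the standard library. This is not yelpapi’s actual code: the class name, the example.com URL, and the method names are all hypothetical, and the networking and OAuth parts are left out entirely. The point is just that arbitrary keyword arguments go in and a plain dict comes out.

```python
import json
from urllib.parse import urlencode


class ThinClient:
    """Hypothetical thin API wrapper: **kwargs in, plain dict out."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip('/')

    def build_url(self, endpoint, **kwargs):
        # Every keyword argument becomes a query parameter, so a parameter the
        # service adds later needs no code change here. Sorting just keeps the
        # output reproducible.
        query = urlencode(sorted(kwargs.items()))
        url = '{}/{}'.format(self.base_url, endpoint.lstrip('/'))
        return '{}?{}'.format(url, query) if query else url

    @staticmethod
    def parse(body):
        # Hand back whatever JSON the service returned as a dict; no
        # per-field classes to maintain.
        return json.loads(body)
```

A call like ThinClient('https://api.example.com/v2').build_url('search', term='ice cream', location='Austin, TX') produces a complete query URL without the client knowing anything about which parameters the search endpoint accepts.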

My hope is that more people design robust API implementations like what I’ve done here. My Yelp API implementation design can pretty easily be extended to other APIs without much effort.

If you find this implementation useful, let me know!

GitHub: https://github.com/gfairchild/yelpapi
PyPI: https://pypi.python.org/pypi/yelpapi

Simple Unix find/replace using Python

Find/replace in Unix isn’t very friendly. Sure, you can use sed, but it uses fairly nasty syntax that I always forget:

sed -i.bak 's/STRING_TO_FIND/STRING_TO_REPLACE/g' filename

I wanted something really simple that’s more user-friendly. I turn to Python:

#!/usr/bin/env python
 
"""
    Replace all instances of a string in the specified file.
"""
 
import argparse
import fileinput
 
#deal with command line arguments
argparser = argparse.ArgumentParser(description='Find/replace strings in a file.')
argparser.add_argument('file', type=str, help='file on which to perform the find/replace')
argparser.add_argument('find_string', type=str, help='string to find')
argparser.add_argument('replace_string', type=str, help='string that replaces find_string')
args = argparser.parse_args()
 
for line in fileinput.input(args.file, inplace=1):
    print line.replace(args.find_string, args.replace_string), #trailing comma prevents newline

That’s it. Toss this into a file called find_replace.py and optionally put it on your PATH. Here’s an example where I replace all instances of <br> with <br/> in an HTML file:

find_replace.py index.html "<br>" "<br/>"

Here’s an example where I use GNU Parallel to do the same find/replace on all HTML files in a directory:

find . -name '*.html' | parallel "find_replace.py {} '<br>' '<br/>'"

Much more user-friendly than sed!

This certainly works, and the code is incredibly simple, but fileinput is really geared towards reading lots of files. Perhaps more important is that there’s no error handling here. I could (and probably should) surround the final for loop (lines 17 and 18) with try-except, but I much prefer using with for file I/O. Unfortunately, with support for fileinput wasn’t added until Python 3.2 (I’m using 2.7). And personally, I think that while the inplace parameter is pretty cool, it’s dangerous because it’s not particularly intuitive. A better, although slightly longer, solution is to read the file manually, write the changes out to a temp file, and then copy the temp file’s contents back over the original. Here’s a more “proper” solution:

#!/usr/bin/env python
 
"""
    Replace all instances of a string in the specified file.
"""
 
import argparse
import tempfile
from os import fsync
 
#deal with command line arguments
argparser = argparse.ArgumentParser(description='Find/replace strings in a file.')
argparser.add_argument('file', type=str, help='file on which to perform the find/replace')
argparser.add_argument('find_string', type=str, help='string to find')
argparser.add_argument('replace_string', type=str, help='string that replaces find_string')
args = argparser.parse_args()
 
#open 2 files - args.file for reading, and a temporary file for writing
with open(args.file, 'r+') as input, tempfile.TemporaryFile(mode='w+') as output:
    #write replaced content to temp file
    for line in input:
        output.write(line.replace(args.find_string, args.replace_string))
    #write all cached content to disk - flush followed by fsync
    output.flush()
    fsync(output.fileno())
    #go back to beginning to copy data over
    input.seek(0)
    output.seek(0)
    #copy output lines to input
    for line in output:
        input.write(line)
    #remove any excess stuff from input
    input.truncate()

This code uses with, so both files are closed automatically even if an exception is raised partway through, and it’s written specifically to handle a single file (unlike fileinput), so it should be more efficient.

Compared to sed, this doesn’t currently allow for regular expressions, but that would be fairly trivial to add in; perhaps an extra command-line argument indicating that find_string is a regular expression should be added.
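Here’s a rough sketch of what that extension might look like. It’s written for Python 3 and kept separate from the file handling above; the --regex flag and the helper names are just my assumptions about how I’d wire it in.

```python
import argparse
import re


def replace_text(text, find_string, replace_string, use_regex=False):
    """Replace find_string in text, optionally treating it as a regex."""
    if use_regex:
        # Note: re.sub also interprets backslashes and group references
        # (e.g. \1) in replace_string.
        return re.sub(find_string, replace_string, text)
    return text.replace(find_string, replace_string)


def build_parser():
    """Same arguments as before, plus an optional --regex switch."""
    parser = argparse.ArgumentParser(description='Find/replace strings in a file.')
    parser.add_argument('file', help='file on which to perform the find/replace')
    parser.add_argument('find_string', help='string (or pattern) to find')
    parser.add_argument('replace_string', help='string that replaces find_string')
    parser.add_argument('-r', '--regex', action='store_true',
                        help='treat find_string as a regular expression')
    return parser
```

In the script above, the line.replace(...) call would then become replace_text(line, args.find_string, args.replace_string, args.regex). Because the file is processed line by line, patterns that span multiple lines wouldn’t match.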