## eclipse-hasher updated and re-released on GitHub

Several years ago, I released an Eclipse plugin called Hasher. Hasher’s goal is to output values of common hash algorithms (MD5, SHA-1, SHA-256, SHA-384, and SHA-512 right now) of files selected in Eclipse. Hasher had fallen by the wayside and last worked under an early version of Eclipse 3. I recently needed it for a personal project and decided to update it. Turns out, quite a bit has changed. Most notably, Eclipse actions are now deprecated in favor of Eclipse commands. Code using commands is much cleaner, but it’s quite a bit different, so I essentially had to rewrite the entire plugin. Also, I had some dependency issues that plagued me for far too long, but thanks to Stack Overflow, I was finally able to get things straightened out.

Hasher is now live on GitHub (https://github.com/gfairchild/eclipse-hasher), freshly tagged with v1.2. One of the things I learned during this rewrite is that there’s a lack of good examples and documentation out there for modern Eclipse plugins. I’m hopeful that Hasher can be useful to someone wanting to get into writing Eclipse plugins. Hasher is pretty simple right now, but it’s non-trivial (has external dependencies, interacts meaningfully with Eclipse – more than just Hello World). If you find yourself using it to learn, please let me know!

There’s still a to-do list. I want to make the output prettier using a custom view. A tree view or a table view (or perhaps some hybrid) would probably be ideal. I don’t know how to do a custom view yet, though, so that’ll add to the learning process. Also, I want to make use of Eclipse’s Jobs API. Right now, I’m just manually creating a new thread to do computations. This works and leaves the UI free to do its work, but it’s not elegant and doesn’t take advantage of several nice features Eclipse offers for background jobs.

If you use Hasher, let me know what you think!

## aptitude install php5-sqlite

This is kind of a mental note, but in order to get SQLite to work with PHP5 under Ubuntu 10.04 LTS, it’s necessary to install the required libraries:

```
aptitude install php5-sqlite
```

I’ve been trying to figure out an error regarding my sentinel surveillance site calculator. It uses an SQLite database on the back-end (the same one I provide in this post), and the page was only half loading. As soon as it got to the SQLite calls, it’d just die. The code doesn’t run on my server, and I couldn’t view the logs, so it was kind of tricky to diagnose it. After moving the code over to my server, I very quickly discovered that it was just a lack of the proper PHP SQLite libraries causing the issue. Part of the problem is that the PHP documentation on SQLite3 is extremely vague and makes it sound like PHP ships with SQLite support, so I never thought that might be the issue. Had I just done a simple phpinfo() lookup, this would’ve been painfully obvious. Oops!

Also, I wholeheartedly endorse the Xerial SQLiteJDBC library written by Taro L. Saito for using SQLite databases in Java. It’s very fast, and I haven’t had any problems with bugs. And best of all, unlike Zentus’ SQLiteJDBC library, it’s regularly maintained and updated.

## New KML Census Population Updates

The 2000 Census KML files are deprecated in favor of the 2010 Census KML files.

In a few previous posts, I created a KML map of ZCTAs (ZIP codes, basically) that contained their boundaries and population counts (from the 2000 Census). This map contains data for all 50 states in the US plus the District of Columbia and Puerto Rico. These posts can be found here and here. A fellow grad student required county-level population data, so I figured I could help. Modifying my old Java code that created the KML files for ZCTAs would be easiest. I discovered a really interesting fact along the way that I wanted to share. At the end of this post, I’ll provide some new download links.

The census data isn’t perfect! As it turns out, several ZIP codes are listed under multiple states! Obviously, this can’t be true in real life as ZIP codes must be unique by definition, but it’s true in the data. If you go to http://www.census.gov/geo/www/cob/z52000.html, and analyze the ASCII formatted boundary data, this becomes clear if you know what to look for. Just as an example, it’s easy to see by downloading the Iowa and Illinois boundary files that both Iowa and Illinois apparently share the ZIP code 52761! A quick Google search will make it clear that this ZIP code belongs to Iowa and not Illinois.

This becomes problematic when the main objective is to study population data. When creating a KML file of ZCTAs and their boundaries and populations, it now becomes the case that some ZIP codes will be included multiple times, and as a result, so will the population counts for those ZIP codes! So, what does this mean? Well, it basically means that some of the state population counts in the KML file for ZCTAs I previously created are slightly off. For example, the 30286 people living in ZIP code 52761 are counted in both Iowa’s and Illinois’ total state population.
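To make the double counting concrete, here’s a minimal Java sketch that flags ZCTAs listed under more than one state and computes a total that counts each ZCTA only once. Everything in it is hypothetical (the `ZctaDuplicates` class, the `ZctaRecord` type, and the made-up third sample row); only the 52761 figures come from the example above. It is not the code used to build the KML file.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only.
final class ZctaDuplicates {
    /** One row of ZCTA data: the state file that listed it, the ZCTA code, and its population. */
    static final class ZctaRecord {
        final String state;
        final String zcta;
        final int population;

        ZctaRecord(String state, String zcta, int population) {
            this.state = state;
            this.zcta = zcta;
            this.population = population;
        }
    }

    /** Maps each ZCTA that appears under two or more states to the states that list it. */
    static Map<String, Set<String>> findDuplicates(List<ZctaRecord> records) {
        Map<String, Set<String>> statesByZcta = new LinkedHashMap<>();
        for (ZctaRecord record : records) {
            statesByZcta.computeIfAbsent(record.zcta, key -> new LinkedHashSet<>()).add(record.state);
        }
        // Drop every ZCTA that is listed under a single state.
        statesByZcta.values().removeIf(states -> states.size() < 2);
        return statesByZcta;
    }

    /** Sums populations, counting each ZCTA only the first time it is seen. */
    static long totalCountingEachZctaOnce(List<ZctaRecord> records) {
        Set<String> seen = new HashSet<>();
        long total = 0;
        for (ZctaRecord record : records) {
            if (seen.add(record.zcta)) {
                total += record.population;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<ZctaRecord> records = new ArrayList<>();
        records.add(new ZctaRecord("Iowa", "52761", 30286));
        records.add(new ZctaRecord("Illinois", "52761", 30286));
        records.add(new ZctaRecord("Iowa", "99999", 100)); // made-up row
        System.out.println(findDuplicates(records));
        System.out.println(totalCountingEachZctaOnce(records));
    }
}
```

Note that this only finds the duplicates and avoids adding them twice to a grand total. Deciding which state a duplicated ZCTA really belongs to still requires looking it up.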

I’ve decided not to change/fix the ZCTA KML file. The main reason is that it’d be a royal pain in the ass. As it turns out, there are actually quite a few ZIP codes in the dataset listed multiple times. In order to fix this, I’d have to go through each of these duplicated ZIP codes by hand, Google them to see where they really are, and delete the incorrect ones. The other reason I won’t be fixing it is that a download I provide later in this post is really better than the KML files. I mention all of this as a warning to anyone using this KML file.

Now, on to the new stuff. Below, I provide a county-level population KML file. Like the ZCTA-level KML file, it contains the county boundaries, population counts, and state population counts. Note that the population counts in this KML file are accurate because the Census Bureau didn’t mess these up. There is one quirk, though: in some cases, the same county is listed multiple times in the same state in the county boundary data. For example, Honolulu County is listed 19 times in the boundary data for Hawaii. I believe this is because the county requires multiple separate polygons to represent it (perhaps a body of water not part of the county divides it somehow?). For simplicity’s sake, I just picked the first instance of a county and used its boundaries. This means that while I do represent each of the counties in the US, I may not be drawing the entire boundary. For my purposes, this isn’t a big deal because I don’t particularly need the boundaries in the first place. All I need for my research is a point interior to the county and population counts. The boundaries are just included for simple visualization purposes. Note that this does not mess up the population counts!

And one final thing. This is probably the most useful census file I’ve generated yet: an SQLite database containing the ZCTA population counts and boundaries AND county population counts and boundaries. Unfortunately, this database suffers from the same problem as the ZCTA KML file (that duplicate-ZCTA problem), but the benefit is that if you’re doing any actual population calculations, this is going to be much easier and faster to use than the KML file. Here’s the schema:

```sql
CREATE TABLE states (
    id INTEGER NOT NULL,
    state TEXT NOT NULL,
    population INT NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE zctas (
    id INTEGER NOT NULL,
    state_id INTEGER NOT NULL,
    zcta TEXT NOT NULL,
    population INTEGER NOT NULL,
    centroid_longitude REAL NOT NULL,
    centroid_latitude REAL NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (state_id) REFERENCES states(id)
);

CREATE TABLE zcta_boundaries (
    id INTEGER NOT NULL,
    zcta_id INTEGER NOT NULL,
    longitude REAL NOT NULL,
    latitude REAL NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (zcta_id) REFERENCES zctas(id)
);

CREATE TABLE counties (
    id INTEGER NOT NULL,
    state_id INTEGER NOT NULL,
    county TEXT NOT NULL,
    population INTEGER NOT NULL,
    centroid_longitude REAL NOT NULL,
    centroid_latitude REAL NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (state_id) REFERENCES states(id)
);

CREATE TABLE county_boundaries (
    id INTEGER NOT NULL,
    county_id INTEGER NOT NULL,
    longitude REAL NOT NULL,
    latitude REAL NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (county_id) REFERENCES counties(id)
);
```

That’ll give you an idea of how it’s structured. And without further ado, here are the files:

Population ZCTA KML: http://www.gfairchild.com/public/populationZCTA.kmz (21 MB) – Remember that a KMZ file is just a zipped up KML file.
Population County KML: http://www.gfairchild.com/public/populationCounty.kmz (4.9 MB)
Population SQLite3 database: http://www.gfairchild.com/public/populationDB.7z (25 MB) – This is a 7-Zip file, so you’ll need an archiver that can open 7-Zip archives to get to the database.

And yes, as soon as the 2010 census data is out in its entirety, I plan on updating these files. I’m hopeful they will have corrected any multiple-ZCTA issues with the most recent data, but only time will tell.

## Hasher Update (and how to obtain the hash of a file in Java)

I updated Hasher today. The plugin still functions the same, but I fixed a bug in the hash generation algorithm (thanks to Vladimir Kozlov for pointing the bug out to me). I also greatly increased modularity, which in turn reduced the code size significantly. Here is how I calculate the hash of a file (the String parameter algorithm can be any algorithm listed at http://java.sun.com/javase/6/docs/technotes/guides/security/StandardNames.html#MessageDigest):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// The enclosing class name is only illustrative.
public final class HashUtil {
    private static final char[] HEX_CHARS = "0123456789abcdef".toCharArray();

    public static String generateHash(File file, String algorithm) throws NoSuchAlgorithmException, IOException {
        MessageDigest messageDigest = MessageDigest.getInstance(algorithm);
        updateWithFileContents(messageDigest, file);
        byte[] digestBytes = messageDigest.digest();

        // this chunk of code was taken from http://echochamber.me/viewtopic.php?f=11&t=16666&p=553936#p459685
        char[] hash = new char[2 * digestBytes.length];
        for (int i = 0; i < digestBytes.length; ++i) {
            hash[2 * i] = HEX_CHARS[(digestBytes[i] & 0xF0) >>> 4];
            hash[2 * i + 1] = HEX_CHARS[digestBytes[i] & 0x0F];
        }

        return new String(hash);
    }

    // Feeds the file to the digest in chunks. available() is only an estimate and a
    // single read() may return fewer bytes than requested, so neither can be relied
    // on to load a whole file in one call.
    private static void updateWithFileContents(MessageDigest messageDigest, File file) throws IOException {
        // try-with-resources (Java 7+) closes the stream even if reading fails
        try (InputStream inputStream = new FileInputStream(file)) {
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = inputStream.read(buffer)) != -1) {
                messageDigest.update(buffer, 0, bytesRead);
            }
        }
    }
}
```

## Another KML ZCTA Population Map

The 2000 Census KML files are deprecated in favor of the 2010 Census KML files.

I edited the KML ZCTA population map that I posted on 2/22/09. For my uses, it’s important that I be able to calculate distances between ZCTAs. To calculate these distances, I use the center of each ZCTA. Since the Census Bureau’s ZCTA boundary data already includes the centers, there’s no need for complex centroid calculations, so I figured I’d include them.

I did this by changing the KML geometry from a single LinearRing to a MultiGeometry of LinearRings and Points.  Let me know if this is useful to you!
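As a rough illustration of that geometry change, here’s a minimal Java sketch that builds a Placemark whose geometry is a MultiGeometry holding a center Point and a boundary LinearRing. The class name, method signature, and sample coordinates are all hypothetical, and the actual file may contain more elements than this. KML coordinates are written as longitude,latitude tuples separated by whitespace.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
final class ZctaPlacemark {
    /**
     * Builds a KML Placemark with a MultiGeometry containing the center Point and
     * one boundary LinearRing. Each boundary entry is a {longitude, latitude} pair;
     * the caller is expected to pass a closed ring (first pair equal to the last).
     * The name is not XML-escaped, so it should be a plain code such as a ZCTA.
     */
    static String toKml(String name, double centerLon, double centerLat, double[][] boundary) {
        StringBuilder kml = new StringBuilder();
        kml.append("<Placemark>\n");
        kml.append("  <name>").append(name).append("</name>\n");
        kml.append("  <MultiGeometry>\n");
        kml.append("    <Point><coordinates>")
           .append(coordinate(centerLon, centerLat))
           .append("</coordinates></Point>\n");
        kml.append("    <LinearRing><coordinates>");
        for (int i = 0; i < boundary.length; i++) {
            if (i > 0) {
                kml.append(' ');
            }
            kml.append(coordinate(boundary[i][0], boundary[i][1]));
        }
        kml.append("</coordinates></LinearRing>\n");
        kml.append("  </MultiGeometry>\n");
        kml.append("</Placemark>\n");
        return kml.toString();
    }

    // Locale.ROOT keeps the decimal separator a period regardless of system locale.
    private static String coordinate(double lon, double lat) {
        return String.format(Locale.ROOT, "%.6f,%.6f", lon, lat);
    }

    public static void main(String[] args) {
        // Made-up square around a made-up center point.
        double[][] ring = {
            {-91.1, 41.4}, {-90.9, 41.4}, {-90.9, 41.6}, {-91.1, 41.6}, {-91.1, 41.4}
        };
        System.out.print(toKml("52761", -91.0, 41.5, ring));
    }
}
```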

Also, if you’re curious, the formula for computing the distance (in miles) between two latitude/longitude points is fairly simple. Courtesy of http://www.meridianworlddata.com/Distance-Calculation.asp, here is the great circle distance formula (if your coordinates are in degrees, convert them to radians before applying the trig functions):

$3963 \cdot \arccos[\sin(lat_1) \cdot \sin(lat_2) + \cos(lat_1) \cdot \cos(lat_2) \cdot \cos(long_2 - long_1)]$
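Here’s a minimal Java sketch of that formula. The class and method names are hypothetical. It takes coordinates in degrees and converts them to radians first, since Java’s trig functions work in radians, and it clamps the intermediate value to guard against floating-point rounding (my own addition, not part of the formula).

```java
// Hypothetical helper for illustration only.
final class GreatCircle {
    // Earth radius in miles, as used in the formula above.
    static final double EARTH_RADIUS_MILES = 3963.0;

    /** Great circle distance in miles between two points given in degrees. */
    static double distanceMiles(double lat1Deg, double lon1Deg, double lat2Deg, double lon2Deg) {
        double lat1 = Math.toRadians(lat1Deg);
        double lon1 = Math.toRadians(lon1Deg);
        double lat2 = Math.toRadians(lat2Deg);
        double lon2 = Math.toRadians(lon2Deg);

        double cosAngle = Math.sin(lat1) * Math.sin(lat2)
                + Math.cos(lat1) * Math.cos(lat2) * Math.cos(lon2 - lon1);

        // Rounding can push the value slightly outside [-1, 1], where acos returns NaN.
        cosAngle = Math.max(-1.0, Math.min(1.0, cosAngle));

        return EARTH_RADIUS_MILES * Math.acos(cosAngle);
    }

    public static void main(String[] args) {
        // A quarter of the way around the equator.
        System.out.println(distanceMiles(0.0, 0.0, 0.0, 90.0));
    }
}
```

One caveat: the arccosine form loses precision for points that are very close together, so a different formulation (such as haversine) may be a better fit if you need accurate short distances.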

Here’s what this new file looks like in Google Earth:

## KML Map with ZCTA Boundaries and Population Counts from the 2000 Census

The 2000 Census KML files are deprecated in favor of the 2010 Census KML files.

My research thus far in grad school has been concentrated around using computer science and GIS tools/techniques to help solve medical problems. For one of the problems I’m working on, I need access to ZCTA boundaries and population counts for each ZCTA. For those of you who don’t know, ZCTA stands for ZIP Code Tabulation Area; basically, it’s what the Census Bureau uses to split the US population up geographically. ZCTAs typically correspond to ZIP codes, but unfortunately, that doesn’t always happen (there is a ZCTA 506HH, for example). Some ZCTAs have populations of 0. However, as far as I can tell, ZCTAs are the best thing we have for geographically representing the US population.

Anyway, what I was doing was using the shapefiles publicly available at http://www.census.gov/geo/www/cob/z52000.html to get the ZCTA boundary information. I also used the 2000 census population data available from http://factfinder.census.gov/servlet/DownloadDatasetServlet?_lang=en. This is all fine and dandy, but reading shapefiles can be kind of a pain. GeoTools works quite well, but it’s Java-only. I’m in the process of converting a Java-based application into a PHP web-based application for some of my research. PHP’s shapefile readers, at least from what I’ve heard, can be kind of hit or miss. Also, having to get this data from multiple sources starts to make code look kinda nasty. Ideally, since population information goes with ZCTAs, there should be a one-stop shop that gives me both the boundary information and population counts for each ZCTA in the US. And if it’s not too much to ask for, I’d like a human-readable text format. So what do I do?

A KML map seemed like the most obvious answer.  It’s XML, so it’s easily readable, and it’s highly extensible and powerful.  And since Google’s name is behind it, it’s very widely supported.  What’s not to like?  Surely, a KML file like this exists, right?  Well, after quite a bit of Googling, I came to the conclusion that one didn’t exist.  So, I decided to make my own.  I wrote a Java program to read in the population information and ASCII coordinates of each ZCTA listed at the bottom of the Census Bureau’s page I linked to in the above paragraph and spit out a KML map.

Because this is so useful to me (and really, anyone doing anything that requires ZCTA-level population counts), I figured I’d upload it here and let anyone that wants it grab it.  This took me a decent chunk of time to get done, so maybe I can save someone else some time by sharing this.  If you use this, please let me know by leaving a comment.  Also, if you’re interested in the Java code I used to create the KML file, feel free to ask.