Adventures of a computer scientist.

Posts in category SQL

ZIP codes shouldn’t be represented as integers!

This is just a quick post to point out that ZIP codes should never be stored or represented as integers. ZIP codes should always be stored and represented as strings. Take the ZIP code 07755 (Oakhurst, NJ), for example. Any reasonable parsing algorithm will correct read in 07755, but 07755 will be immediately converted to 7755 as leading zeros mean nothing in the context of integers. The problem doesn’t become apparent until it becomes time to output the ZIP code (for example, in a KML file). 7755 is not a valid ZIP code.

A common place where this problem pops up is in web apps where user ZIP codes are stored. In Django, for example, you may be tempted to add the following field to a model:

zip = models.IntegerField(max_length=5)

While this will certainly work, if you ever need to display user ZIP codes (e.g., to let a user view their current address), leading zeros won’t be displayed. Instead, the correct way to represent a ZIP code would be:

zip = models.CharField(max_length=5)

Now, leading zeros won’t get cut off.

How to Rename Columns in an SQLite Database

I just solved a problem where I needed to rename some columns in an SQLite3 database. Unfortunately, SQLite3’s ALTER TABLE command doesn’t have the ability to rename columns in tables. Most Google results offered solutions such as creating a new table, copying the data over, and then dropping the old table. While this is fine for probably 99% of SQLite3 databases, the table I’m dealing with has almost 3 billion rows (yes, I have an SQLite3 database with 3×109 rows). Recreating the table (specifically, the indices on the columns) would take too long. It took me about 24 hours to create the indices on a supercomputer with 512gb of RAM, and I didn’t want to go through that again.

I stumbled on two separate posts that seemed to offer solutions, and I got it working using them. I just wanted to write up this blog entry to clarify a few things and to show this in action.

Backup your database! Seriously, back it up. The first time I tried this, I corrupted mine because I forgot to update the corresponding index. Were it not for the backup, I’d be spending another several days recreating it from scratch.

I’m on a *nix box, so I’m interacting with the database using sqlite3 from the command line. I think this is simplest and safest because you can double-check your progress as you go before you commit the changes to the database. As an example, I’ll show you the table I was working with:

CREATE TABLE people (
	id INTEGER PRIMARY KEY,
	state_id INTEGER NOT NULL,
	county INTEGER NOT NULL,
	age INTEGER NOT NULL,
	gender TEXT NOT NULL, -- M or F
	income INTEGER NOT NULL
);
CREATE INDEX people_county_index ON people(county);

What I realized is that I incorrectly named the county column. It should’ve been county_id. In order to change the name of the column, I must change the index as well. Here is how it’s done:

1
2
3
4
5
6
7
8
9
10
11
12
13
BEGIN; --start a transaction; this ensures that your changes won't be written until you COMMIT
.schema --this will show you your current schema
PRAGMA writable_schema=1; --this enables you to edit the schema
SELECT * FROM sqlite_master; --show the raw data SQLite saves related to your schema
UPDATE sqlite_master SET SQL=REPLACE(SQL, 'county', 'county_id') WHERE name='people'; --simple string replace
.schema --verify CREATE TABLE is correct
UPDATE sqlite_master SET SQL='CREATE INDEX people_county_id_index on people(county_id)' WHERE name='people_county_index'; --update index SQL
UPDATE sqlite_master SET name='people_county_id_index' WHERE name='people_county_index'; --update index name
.schema --verify index is correct
SELECT * FROM sqlite_master; --double-check all raw data once more
PRAGMA writable_schema=0; --disable editing the schema
COMMIT; --save the changes to the database
.exit

After you verify that things look right, run ANALYZE to update the query planner’s statistics. This hack seemed to mess up the query planner for my database, but ANALYZE fixed it. If you’re still paranoid, run PRAGMA integrity_check; after ANALYZE.

That’s all there is to it. sqlite_master is just a regular SQL table, so you can modify it like any other. Note that the call to replace(x, y, z) will replace all occurrences of y with z in the string x. This means that if “county” existed anywhere else in the CREATE TABLE people statement, it would be replaced. As a result, be careful when using replace().

Also, if anyone has any questions regarding big/huge/gigantic SQLite3 databases, feel free to ask. Mine is 280gb and contains about 3.5 billion records split between 4 tables. Performance is quite good, but it took quite a bit of work to get there, and there were a lot of hard lessons learned along the way.

SQL Query for Finding ZCTAs Present in Multiple States

Using the previously described 2010 US Census SQLite database I created, there are a whole slew of neat things that can be learned fairly easily. For example, I was curious which ZCTAs are contained in more than one state. Hammerite from Something Awful created a really nice SQL query for solving this problem in this post.

1
2
3
4
5
6
7
8
9
10
SELECT DISTINCT
	zctas.zcta, states.state
FROM
	states_zctas
	JOIN states_zctas AS states_zctas_different ON
		states_zctas.zcta_id = states_zctas_different.zcta_id AND
		states_zctas.state_id <> states_zctas_different.state_id
	JOIN zctas ON states_zctas.zcta_id = zctas.id
	JOIN states ON states_zctas.state_id = states.id
ORDER BY zctas.zcta ASC;

The results look like this:

zcta state
02861 Rhode Island
02861 Massachusetts
03579 New Hampshire
03579 Maine
20135 West Virginia
20135 Virginia
21912 Maryland
21912 Delaware
24604 West Virginia
24604 Virginia
30165 Georgia
30165 Alabama

So, as it turns out, there are 103 ZCTAs that are present in more than 1 state. Unsurprisingly, there’s no ZCTA present in more than two states.

Update to 2010 Census SQLite Population Database

About a week ago, I posted an SQLite database containing 2010 US Census population data. However, I discovered a slight problem recently and am releasing an update today. The problem is that I made the assumption that there is a one-to-many relationship between states and ZCTAs. In other words, I made the assumption that states contain many ZCTAs. While this relationship certainly exists, it’s actually too simple. The real relationship is many-to-many. There are a few instances of ZCTAs that are actually contained in multiple states (the ZCTA 30165, for example, is located in both Georgia and Alabama).

This updated schema accounts for this many-to-many relationship:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
CREATE TABLE states (
	id INTEGER NOT NULL,
	state TEXT NOT NULL,
	centroid_longitude REAL NOT NULL,
	centroid_latitude REAL NOT NULL,
	population_total INTEGER NOT NULL,
	population_male_total INTEGER NOT NULL,
	population_male_lt5 INTEGER NOT NULL,
	population_male_5to9 INTEGER NOT NULL,
	population_male_10to14 INTEGER NOT NULL,
	population_male_15to17 INTEGER NOT NULL,
	population_male_18to19 INTEGER NOT NULL,
	population_male_20 INTEGER NOT NULL,
	population_male_21 INTEGER NOT NULL,
	population_male_22to24 INTEGER NOT NULL,
	population_male_25to29 INTEGER NOT NULL,
	population_male_30to34 INTEGER NOT NULL,
	population_male_35to39 INTEGER NOT NULL,
	population_male_40to44 INTEGER NOT NULL,
	population_male_45to49 INTEGER NOT NULL,
	population_male_50to54 INTEGER NOT NULL,
	population_male_55to59 INTEGER NOT NULL,
	population_male_60to61 INTEGER NOT NULL,
	population_male_62to64 INTEGER NOT NULL,
	population_male_65to66 INTEGER NOT NULL,
	population_male_67to69 INTEGER NOT NULL,
	population_male_70to74 INTEGER NOT NULL,
	population_male_75to79 INTEGER NOT NULL,
	population_male_80to84 INTEGER NOT NULL,
	population_male_ge85 INTEGER NOT NULL,
	population_female_total INTEGER NOT NULL,
	population_female_lt5 INTEGER NOT NULL,
	population_female_5to9 INTEGER NOT NULL,
	population_female_10to14 INTEGER NOT NULL,
	population_female_15to17 INTEGER NOT NULL,
	population_female_18to19 INTEGER NOT NULL,
	population_female_20 INTEGER NOT NULL,
	population_female_21 INTEGER NOT NULL,
	population_female_22to24 INTEGER NOT NULL,
	population_female_25to29 INTEGER NOT NULL,
	population_female_30to34 INTEGER NOT NULL,
	population_female_35to39 INTEGER NOT NULL,
	population_female_40to44 INTEGER NOT NULL,
	population_female_45to49 INTEGER NOT NULL,
	population_female_50to54 INTEGER NOT NULL,
	population_female_55to59 INTEGER NOT NULL,
	population_female_60to61 INTEGER NOT NULL,
	population_female_62to64 INTEGER NOT NULL,
	population_female_65to66 INTEGER NOT NULL,
	population_female_67to69 INTEGER NOT NULL,
	population_female_70to74 INTEGER NOT NULL,
	population_female_75to79 INTEGER NOT NULL,
	population_female_80to84 INTEGER NOT NULL,
	population_female_ge85 INTEGER NOT NULL,
	PRIMARY KEY (id)
);
 
CREATE TABLE counties (
	id INTEGER NOT NULL,
	state_id INTEGER NOT NULL,
	county TEXT NOT NULL,
	centroid_longitude REAL NOT NULL,
	centroid_latitude REAL NOT NULL,
	population_total INTEGER NOT NULL,
	population_male_total INTEGER NOT NULL,
	population_male_lt5 INTEGER NOT NULL,
	population_male_5to9 INTEGER NOT NULL,
	population_male_10to14 INTEGER NOT NULL,
	population_male_15to17 INTEGER NOT NULL,
	population_male_18to19 INTEGER NOT NULL,
	population_male_20 INTEGER NOT NULL,
	population_male_21 INTEGER NOT NULL,
	population_male_22to24 INTEGER NOT NULL,
	population_male_25to29 INTEGER NOT NULL,
	population_male_30to34 INTEGER NOT NULL,
	population_male_35to39 INTEGER NOT NULL,
	population_male_40to44 INTEGER NOT NULL,
	population_male_45to49 INTEGER NOT NULL,
	population_male_50to54 INTEGER NOT NULL,
	population_male_55to59 INTEGER NOT NULL,
	population_male_60to61 INTEGER NOT NULL,
	population_male_62to64 INTEGER NOT NULL,
	population_male_65to66 INTEGER NOT NULL,
	population_male_67to69 INTEGER NOT NULL,
	population_male_70to74 INTEGER NOT NULL,
	population_male_75to79 INTEGER NOT NULL,
	population_male_80to84 INTEGER NOT NULL,
	population_male_ge85 INTEGER NOT NULL,
	population_female_total INTEGER NOT NULL,
	population_female_lt5 INTEGER NOT NULL,
	population_female_5to9 INTEGER NOT NULL,
	population_female_10to14 INTEGER NOT NULL,
	population_female_15to17 INTEGER NOT NULL,
	population_female_18to19 INTEGER NOT NULL,
	population_female_20 INTEGER NOT NULL,
	population_female_21 INTEGER NOT NULL,
	population_female_22to24 INTEGER NOT NULL,
	population_female_25to29 INTEGER NOT NULL,
	population_female_30to34 INTEGER NOT NULL,
	population_female_35to39 INTEGER NOT NULL,
	population_female_40to44 INTEGER NOT NULL,
	population_female_45to49 INTEGER NOT NULL,
	population_female_50to54 INTEGER NOT NULL,
	population_female_55to59 INTEGER NOT NULL,
	population_female_60to61 INTEGER NOT NULL,
	population_female_62to64 INTEGER NOT NULL,
	population_female_65to66 INTEGER NOT NULL,
	population_female_67to69 INTEGER NOT NULL,
	population_female_70to74 INTEGER NOT NULL,
	population_female_75to79 INTEGER NOT NULL,
	population_female_80to84 INTEGER NOT NULL,
	population_female_ge85 INTEGER NOT NULL,
	PRIMARY KEY (id),
	FOREIGN KEY (state_id) REFERENCES states(id)
);
 
CREATE TABLE zctas (
	id INTEGER NOT NULL,
	zcta TEXT NOT NULL,
	centroid_longitude REAL NOT NULL,
	centroid_latitude REAL NOT NULL,
	population_total INTEGER NOT NULL,
	population_male_total INTEGER NOT NULL,
	population_male_lt5 INTEGER NOT NULL,
	population_male_5to9 INTEGER NOT NULL,
	population_male_10to14 INTEGER NOT NULL,
	population_male_15to17 INTEGER NOT NULL,
	population_male_18to19 INTEGER NOT NULL,
	population_male_20 INTEGER NOT NULL,
	population_male_21 INTEGER NOT NULL,
	population_male_22to24 INTEGER NOT NULL,
	population_male_25to29 INTEGER NOT NULL,
	population_male_30to34 INTEGER NOT NULL,
	population_male_35to39 INTEGER NOT NULL,
	population_male_40to44 INTEGER NOT NULL,
	population_male_45to49 INTEGER NOT NULL,
	population_male_50to54 INTEGER NOT NULL,
	population_male_55to59 INTEGER NOT NULL,
	population_male_60to61 INTEGER NOT NULL,
	population_male_62to64 INTEGER NOT NULL,
	population_male_65to66 INTEGER NOT NULL,
	population_male_67to69 INTEGER NOT NULL,
	population_male_70to74 INTEGER NOT NULL,
	population_male_75to79 INTEGER NOT NULL,
	population_male_80to84 INTEGER NOT NULL,
	population_male_ge85 INTEGER NOT NULL,
	population_female_total INTEGER NOT NULL,
	population_female_lt5 INTEGER NOT NULL,
	population_female_5to9 INTEGER NOT NULL,
	population_female_10to14 INTEGER NOT NULL,
	population_female_15to17 INTEGER NOT NULL,
	population_female_18to19 INTEGER NOT NULL,
	population_female_20 INTEGER NOT NULL,
	population_female_21 INTEGER NOT NULL,
	population_female_22to24 INTEGER NOT NULL,
	population_female_25to29 INTEGER NOT NULL,
	population_female_30to34 INTEGER NOT NULL,
	population_female_35to39 INTEGER NOT NULL,
	population_female_40to44 INTEGER NOT NULL,
	population_female_45to49 INTEGER NOT NULL,
	population_female_50to54 INTEGER NOT NULL,
	population_female_55to59 INTEGER NOT NULL,
	population_female_60to61 INTEGER NOT NULL,
	population_female_62to64 INTEGER NOT NULL,
	population_female_65to66 INTEGER NOT NULL,
	population_female_67to69 INTEGER NOT NULL,
	population_female_70to74 INTEGER NOT NULL,
	population_female_75to79 INTEGER NOT NULL,
	population_female_80to84 INTEGER NOT NULL,
	population_female_ge85 INTEGER NOT NULL,
	PRIMARY KEY (id)
);
 
CREATE TABLE states_zctas (
	state_id INTEGER NOT NULL,
	zcta_id INTEGER NOT NULL,
	PRIMARY KEY (state_id, zcta_id),
	FOREIGN KEY (state_id) REFERENCES states(id),
	FOREIGN KEY (zcta_id) REFERENCES zctas(id)
);
 
-- need an index on (state_id, zcta_id) and (zcta_id, state_id)
CREATE INDEX states_zctas_index2 ON states_zctas(zcta_id, state_id);

I haven’t changed a lot, but a few things have changed. First, note the states_zctas table which now accounts for the many-to-many relationship between states and ZCTAs. I’ve also renamed the population fields in the tables. *_population is now population_* for clarity reasons.

Accessing data is a little trickier as an INNER JOIN is required. For example, to access the total population for each ZCTA in Iowa, this query does the trick:

1
2
3
4
5
SELECT zctas.zcta, zctas.population_total
FROM zctas, states
INNER JOIN states_zctas
	ON states_zctas.zcta_id=zctas.id AND states_zctas.state_id=states.id
WHERE states.state="Iowa";

Download: 2010CensusPopulation.7z (size: 3mb)

2010 Census SQLite Database

In a previous post, I described an SQLite database I generated containing 2000 US Census data population info. Now that the 2010 Census data are out, I decided to update this database. The 2010 Census data are nice and clean (much better than the 2000 data!), but in order to get access to all of the data I’ve generated here, you need lots of different files, and it’s kind of a pain to combine it all into a simple useable format (I’ve spent the better part of an entire day doing it, and I’m working on a PhD in computer science…). Hopefully, this’ll make someone’s life simpler.

First, the data I used to generate this database can be found in two places:

The data I present here contain state, county, and ZCTA (which is mostly the same as ZIP code) population info for every state in the United States plus Puerto Rico and Washington DC. I provide centroids (latitude, longitude) and population by age/sex for each geographic unit. This is the schema I used to create the database:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
CREATE TABLE states (
	id INTEGER NOT NULL,
	state TEXT NOT NULL,
	centroid_longitude REAL NOT NULL,
	centroid_latitude REAL NOT NULL,
	total_population INTEGER NOT NULL,
	male_total_population INTEGER NOT NULL,
	male_lt5_population INTEGER NOT NULL,
	male_5to9_population INTEGER NOT NULL,
	male_10to14_population INTEGER NOT NULL,
	male_15to17_population INTEGER NOT NULL,
	male_18to19_population INTEGER NOT NULL,
	male_20_population INTEGER NOT NULL,
	male_21_population INTEGER NOT NULL,
	male_22to24_population INTEGER NOT NULL,
	male_25to29_population INTEGER NOT NULL,
	male_30to34_population INTEGER NOT NULL,
	male_35to39_population INTEGER NOT NULL,
	male_40to44_population INTEGER NOT NULL,
	male_45to49_population INTEGER NOT NULL,
	male_50to54_population INTEGER NOT NULL,
	male_55to59_population INTEGER NOT NULL,
	male_60to61_population INTEGER NOT NULL,
	male_62to64_population INTEGER NOT NULL,
	male_65to66_population INTEGER NOT NULL,
	male_67to69_population INTEGER NOT NULL,
	male_70to74_population INTEGER NOT NULL,
	male_75to79_population INTEGER NOT NULL,
	male_80to84_population INTEGER NOT NULL,
	male_ge85_population INTEGER NOT NULL,
	female_total_population INTEGER NOT NULL,
	female_lt5_population INTEGER NOT NULL,
	female_5to9_population INTEGER NOT NULL,
	female_10to14_population INTEGER NOT NULL,
	female_15to17_population INTEGER NOT NULL,
	female_18to19_population INTEGER NOT NULL,
	female_20_population INTEGER NOT NULL,
	female_21_population INTEGER NOT NULL,
	female_22to24_population INTEGER NOT NULL,
	female_25to29_population INTEGER NOT NULL,
	female_30to34_population INTEGER NOT NULL,
	female_35to39_population INTEGER NOT NULL,
	female_40to44_population INTEGER NOT NULL,
	female_45to49_population INTEGER NOT NULL,
	female_50to54_population INTEGER NOT NULL,
	female_55to59_population INTEGER NOT NULL,
	female_60to61_population INTEGER NOT NULL,
	female_62to64_population INTEGER NOT NULL,
	female_65to66_population INTEGER NOT NULL,
	female_67to69_population INTEGER NOT NULL,
	female_70to74_population INTEGER NOT NULL,
	female_75to79_population INTEGER NOT NULL,
	female_80to84_population INTEGER NOT NULL,
	female_ge85_population INTEGER NOT NULL,
	PRIMARY KEY (id)
);
 
CREATE TABLE counties (
	id INTEGER NOT NULL,
	state_id INTEGER NOT NULL,
	county TEXT NOT NULL,
	centroid_longitude REAL NOT NULL,
	centroid_latitude REAL NOT NULL,
	total_population INTEGER NOT NULL,
	male_total_population INTEGER NOT NULL,
	male_lt5_population INTEGER NOT NULL,
	male_5to9_population INTEGER NOT NULL,
	male_10to14_population INTEGER NOT NULL,
	male_15to17_population INTEGER NOT NULL,
	male_18to19_population INTEGER NOT NULL,
	male_20_population INTEGER NOT NULL,
	male_21_population INTEGER NOT NULL,
	male_22to24_population INTEGER NOT NULL,
	male_25to29_population INTEGER NOT NULL,
	male_30to34_population INTEGER NOT NULL,
	male_35to39_population INTEGER NOT NULL,
	male_40to44_population INTEGER NOT NULL,
	male_45to49_population INTEGER NOT NULL,
	male_50to54_population INTEGER NOT NULL,
	male_55to59_population INTEGER NOT NULL,
	male_60to61_population INTEGER NOT NULL,
	male_62to64_population INTEGER NOT NULL,
	male_65to66_population INTEGER NOT NULL,
	male_67to69_population INTEGER NOT NULL,
	male_70to74_population INTEGER NOT NULL,
	male_75to79_population INTEGER NOT NULL,
	male_80to84_population INTEGER NOT NULL,
	male_ge85_population INTEGER NOT NULL,
	female_total_population INTEGER NOT NULL,
	female_lt5_population INTEGER NOT NULL,
	female_5to9_population INTEGER NOT NULL,
	female_10to14_population INTEGER NOT NULL,
	female_15to17_population INTEGER NOT NULL,
	female_18to19_population INTEGER NOT NULL,
	female_20_population INTEGER NOT NULL,
	female_21_population INTEGER NOT NULL,
	female_22to24_population INTEGER NOT NULL,
	female_25to29_population INTEGER NOT NULL,
	female_30to34_population INTEGER NOT NULL,
	female_35to39_population INTEGER NOT NULL,
	female_40to44_population INTEGER NOT NULL,
	female_45to49_population INTEGER NOT NULL,
	female_50to54_population INTEGER NOT NULL,
	female_55to59_population INTEGER NOT NULL,
	female_60to61_population INTEGER NOT NULL,
	female_62to64_population INTEGER NOT NULL,
	female_65to66_population INTEGER NOT NULL,
	female_67to69_population INTEGER NOT NULL,
	female_70to74_population INTEGER NOT NULL,
	female_75to79_population INTEGER NOT NULL,
	female_80to84_population INTEGER NOT NULL,
	female_ge85_population INTEGER NOT NULL,
	PRIMARY KEY (id),
	FOREIGN KEY (state_id) REFERENCES states(id)
);
 
CREATE TABLE zctas (
	id INTEGER NOT NULL,
	state_id INTEGER NOT NULL,
	zcta TEXT NOT NULL,
	centroid_longitude REAL NOT NULL,
	centroid_latitude REAL NOT NULL,
	total_population INTEGER NOT NULL,
	male_total_population INTEGER NOT NULL,
	male_lt5_population INTEGER NOT NULL,
	male_5to9_population INTEGER NOT NULL,
	male_10to14_population INTEGER NOT NULL,
	male_15to17_population INTEGER NOT NULL,
	male_18to19_population INTEGER NOT NULL,
	male_20_population INTEGER NOT NULL,
	male_21_population INTEGER NOT NULL,
	male_22to24_population INTEGER NOT NULL,
	male_25to29_population INTEGER NOT NULL,
	male_30to34_population INTEGER NOT NULL,
	male_35to39_population INTEGER NOT NULL,
	male_40to44_population INTEGER NOT NULL,
	male_45to49_population INTEGER NOT NULL,
	male_50to54_population INTEGER NOT NULL,
	male_55to59_population INTEGER NOT NULL,
	male_60to61_population INTEGER NOT NULL,
	male_62to64_population INTEGER NOT NULL,
	male_65to66_population INTEGER NOT NULL,
	male_67to69_population INTEGER NOT NULL,
	male_70to74_population INTEGER NOT NULL,
	male_75to79_population INTEGER NOT NULL,
	male_80to84_population INTEGER NOT NULL,
	male_ge85_population INTEGER NOT NULL,
	female_total_population INTEGER NOT NULL,
	female_lt5_population INTEGER NOT NULL,
	female_5to9_population INTEGER NOT NULL,
	female_10to14_population INTEGER NOT NULL,
	female_15to17_population INTEGER NOT NULL,
	female_18to19_population INTEGER NOT NULL,
	female_20_population INTEGER NOT NULL,
	female_21_population INTEGER NOT NULL,
	female_22to24_population INTEGER NOT NULL,
	female_25to29_population INTEGER NOT NULL,
	female_30to34_population INTEGER NOT NULL,
	female_35to39_population INTEGER NOT NULL,
	female_40to44_population INTEGER NOT NULL,
	female_45to49_population INTEGER NOT NULL,
	female_50to54_population INTEGER NOT NULL,
	female_55to59_population INTEGER NOT NULL,
	female_60to61_population INTEGER NOT NULL,
	female_62to64_population INTEGER NOT NULL,
	female_65to66_population INTEGER NOT NULL,
	female_67to69_population INTEGER NOT NULL,
	female_70to74_population INTEGER NOT NULL,
	female_75to79_population INTEGER NOT NULL,
	female_80to84_population INTEGER NOT NULL,
	female_ge85_population INTEGER NOT NULL,
	PRIMARY KEY (id),
	FOREIGN KEY (state_id) REFERENCES states(id)
);

Previously, I included boundary data (done very poorly as I made several mistakes regarding ZCTAs with multiple boundaries and ZCTAs present in multiple states). I also provided KML files containing this data. I’ve chosen not to do either of these with the 2010 data. All I provide here are population stats and the centroid for each geographic unit. It’s trivial to convert TIGER/Line® shapefiles to KML using free tools (such as QuantumGIS), and using the SQLite database I provide, it’d be fairly simple to add population attributes. Plus, it’s not super practical to use a KML file for attribute accesses; an SQLite database makes much more sense (and is much faster).

Without further ado:
Download: 2010CensusPopulation.7z (size: 2.9mb)

You’ll need a program like 7-Zip to extract the database. You can use a number of free programs to view the data (my favorite for Windows is SQLite Expert).

Feel free to contact me with any questions or comments you may have regarding this database. Hope you find it useful!

New KML Census Population Updates

The 2000 Census KML files are deprecated in favor of the 2010 Census KML files.

In a few previous posts, I created a KML map of ZCTAs (ZIP codes, basically) that contained their boundaries and population counts (from the 2000 Census). This map contains data for all 50 states in the US plus the District of Columbia and Puerto Rico. These posts can be found here and here. A fellow grad student required county-level population data, so I figured I could help. Modifying my old Java code that created the KML files for ZCTAs would be easiest. I discovered a really interesting fact along the way that I wanted to share. At the end of this post, I’ll provide some new download links.

The census data isn’t perfect! As it turns out, several ZIP codes are listed under multiple states! Obviously, this can’t be true in real life as ZIP codes must be unique by definition, but it’s true in the data. If you go to http://www.census.gov/geo/www/cob/z52000.html, and analyze the ASCII formatted boundary data, this becomes clear if you know what to look for. Just as an example, it’s easy to see by downloading the Iowa and Illinois boundary files that both Iowa and Illinois apparently share the ZIP code 52761! A quick Google search will make it clear that this ZIP code belongs to Iowa and not Illinois.

This becomes problematic when the main objective is to study population data. When creating a KML file of ZCTAs and their boundaries and populations, it now becomes the case that some ZIP codes will be included multiple times, and as a result, so will the population counts for those ZIP codes! So, what does this mean? Well, it basically means that some of the state population counts in the KML file for ZCTAs I previously created are slightly off. For example, the 30286 people living in ZIP code 52761 are counted in both Iowa’s and Illinois’ total state population.

I’ve decided not to change/fix the ZCTA KML file. The reason for this is primarily that it’d be a royal pain in the ass. As it turns out, there are actually quite a few ZIP codes in the dataset listed multiple times. In order to fix this, I’d have to manually (by hand) go through each of these duplicated ZIP codes, Google them to see where they really are, and delete the incorrect ones. The other reason I won’t be fixing it is because a download I provide later in this post is really better than the KML files. I mention all of this as a warning to anyone using this KML file.

Now, onto the new stuff. Below, I provide a county-level population KML file. Like the ZCTA-level KML file, it contains the county boundaries, population counts, and state population counts. Note that the population counts in this KML file are accurate because the census department didn’t mess these up. As it turns out, in some cases, the same county is listed multiple times in the same state in the county boundary data. For example, Honolulu county is listed 19 times in the boundary data for Hawaii. I believe this is because the county requires multiple separate polygons to represent it (perhaps a body of water not part of the county divides it somehow?). For simplicity’s sake, I just picked the first instance of a county and used its boundaries. This means that while I do represent each of the counties in the US, I may not be drawing the entire boundary. For my purposes, this isn’t a big deal because I don’t particularly need the boundaries in the first place. All I need for my research is a point interior to the county and population counts. The boundaries are just included for simple visualization purposes. Note that this does not mess up the population counts!

And one final thing. This is probably the most useful census file I’ve generated yet: an SQLite database containing the ZCTA population counts and boundaries AND county population counts and boundaries. Unfortunately, this database does succumb to the problem of the ZCTA KML file (that duplicitous ZCTA problem), but the benefits are that if you’re doing any actual population calculations this is obviously going to be much easier and faster to use than the KML file. Here’s the schema:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
CREATE TABLE states (
	id INTEGER NOT NULL,
	state TEXT NOT NULL,
	population INT NOT NULL,
	PRIMARY KEY (id)
);
 
CREATE TABLE zctas (
	id INTEGER NOT NULL,
	state_id INTEGER NOT NULL,
	zcta TEXT NOT NULL,
	population INTEGER NOT NULL,
	centroid_longitude REAL NOT NULL,
	centroid_latitude REAL NOT NULL,
	PRIMARY KEY (id),
	FOREIGN KEY (state_id) REFERENCES states(id)
);
 
CREATE TABLE zcta_boundaries (
	id INTEGER NOT NULL,
	zcta_id INTEGER NOT NULL,
	longitude REAL NOT NULL,
	latitude REAL NOT NULL,
	PRIMARY KEY (id),
	FOREIGN KEY (zcta_id) REFERENCES zctas(id)
);
 
CREATE TABLE counties (
	id INTEGER NOT NULL,
	state_id INTEGER NOT NULL,
	county TEXT NOT NULL,
	population INTEGER NOT NULL,
	centroid_longitude REAL NOT NULL,
	centroid_latitude REAL NOT NULL,
	PRIMARY KEY (id),
	FOREIGN KEY (state_id) REFERENCES states(id)
);
 
CREATE TABLE county_boundaries (
	id INTEGER NOT NULL,
	county_id INTEGER NOT NULL,
	longitude REAL NOT NULL,
	latitude REAL NOT NULL,
	PRIMARY KEY (id),
	FOREIGN KEY (county_id) REFERENCES counties(id)
);

That’ll give you an idea of how it’s structured. And without further ado, here are the files:

Population ZCTA KML: http://www.gfairchild.com/public/populationZCTA.kmz (21mb) – Remember that a KMZ file is just a zipped up KML file.
Population County KML: http://www.gfairchild.com/public/populationCounty.kmz (4.9mb)
Population SQLite3 database: http://www.gfairchild.com/public/populationDB.7z (25mb) – This is a 7-Zip file, so you’ll need an archiver that can open 7-Zip archives to get to the database.

And yes, as soon as the 2010 census data is out in its entirety, I plan on updating these files. I’m hopeful they will have corrected any multiple-ZCTA issues with the most recent data, but only time will tell.