In this post I will simultaneously have Fun With Data and Fun With Maps. I will use public APIs to turn my Isle of Alameda into a “choropleth”, a map which displays areas that are colored or patterned in relation to data.
To do this I will need to find boundaries within Alameda that I can associate with data of some kind. For this I turn to the ultimate source of geographical data within the United States: the U.S. Census Bureau. To do its work the Census Bureau divides the country into regions, states, counties, cities, tracts, and block groups and gathers data at each level. The main island of Alameda is divided into fourteen tracts, which in turn are divided into fifty block groups.
All this data is free to the public and accessible via public APIs, but the government web sites are so sprawling and complex that most people access them through intermediary sites like the Knight Foundation-funded Census Reporter. These sites do a great job of producing pre-made tables and choropleths, but I want to learn how to do it myself.
After much hunting I find raw tract boundaries in a downloadable CSV file from the Alameda County Data Sharing Initiative. Using the Census Reporter site to identify tract numbers, I reduce the 372 tracts in Alameda County to the 14 on the main island of Alameda. Each tract boundary is defined by a long list of longitudes and latitudes.
I can now pull up my own Alameda map outline in NodeBox. When I convert each list to X,Y coordinates and overlay the paths, they don’t quite fit at first. I spend hours muttering and pulling my hair until I realize that the formula I’m using does not properly account for the curvature of the earth. There are different ways of projecting coordinates onto a flat surface and when you get down to the street level you need to get everything exactly right – especially when overlaying boundaries from different sources.
I got my original island boundary from OpenStreetMap. OSM stores its data using the same geographic coordinate system (EPSG:4326) used by GPS devices, but uses a different projected coordinate system (EPSG:3857) when creating its map tiles. To convert from one to the other you need to use a spherical pseudo-Mercator projection (not the true oblate ellipsoid Mercator projection). Confused? I was – as were many others before me. Fortunately the correct formula (in many different programming languages) appears on the OpenStreetMap Wiki:
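Note: the constant 6,378,137 in the formula is the equatorial radius of the earth in meters (the WGS 84 semi-major axis), which the spherical projection treats as the radius of a perfect sphere. Formulas without this value did not work for me.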
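For readers who want to try the conversion themselves, here is a minimal Python sketch of the standard spherical Mercator formulas. It is not copied from the wiki; the function names and the sample coordinate are my own choices for illustration.

```python
import math

# WGS 84 equatorial radius in meters, used as the radius of the sphere.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project EPSG:4326 degrees to EPSG:3857 meters (spherical Mercator)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


def web_mercator_to_lonlat(x, y):
    """Inverse projection: EPSG:3857 meters back to EPSG:4326 degrees."""
    lon_deg = math.degrees(x / EARTH_RADIUS_M)
    lat_deg = math.degrees(
        2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2
    )
    return lon_deg, lat_deg


# An approximate point near Alameda, for illustration only.
x, y = lonlat_to_web_mercator(-122.26, 37.77)
print(f"x = {x:.1f} m, y = {y:.1f} m")
```

Dropping `EARTH_RADIUS_M` leaves the result in radians on a unit sphere rather than in meters, which may be why formulas without the constant refuse to line up with map tiles.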
With the right formula in place, the tract boundaries snap perfectly into position. The boundaries extend beyond the shore, but for now are sufficient to verify proper alignment:
Choropleths convey more information (and look cooler) when you divide the map into smaller pieces. So having cracked the code for tracts, I now turn my attention to the smallest unit used for census data: block groups.
Finding block group boundaries expressed in pure latitudes and longitudes proves to be more difficult. I finally turn to the government’s TIGER site (Topologically Integrated Geographic Encoding and Referencing), but here I run into another problem. The boundary data is only available as .shp files (shapefiles), which are usually opened with a powerful application called ArcGIS. This is what the big boys use, but I want to draw the boundaries myself.
Parsing the data is a multi-step process. The original file contains 31,647 boundaries, one for each block fragment in Alameda County. Using zip code data, I identify the 14 tracts on the island and use that list to filter the data down to a mere 1,298 block fragments. I then group those block fragments into 282 blocks and those blocks into 50 block groups. Here is the NodeBox network I made to do the parsing:
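This is not the NodeBox network itself, but the same filter-and-group idea can be sketched in a few lines of Python. The sketch assumes each fragment record carries the standard 15-digit census block GEOID (2 digits of state, 3 of county, 6 of tract, 4 of block, with the first block digit naming the block group). The sample IDs, the tract list, and the function names are made up for illustration.

```python
from collections import defaultdict


def split_block_geoid(geoid):
    """Split a 15-digit census block GEOID into its component codes."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {geoid!r}")
    return {
        "state": geoid[0:2],
        "county": geoid[2:5],
        "tract": geoid[5:11],
        "block": geoid[11:15],
        "block_group": geoid[11],  # first digit of the block code
    }


def group_fragments(fragments, island_tracts):
    """Keep fragments in the listed tracts, nested by block group then block.

    fragments: iterable of (block_geoid, boundary) pairs.
    Returns {block_group_geoid: {block_geoid: [boundary, ...]}}.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for geoid, boundary in fragments:
        if split_block_geoid(geoid)["tract"] not in island_tracts:
            continue  # fragment lies in a tract we are not mapping
        block_group_geoid = geoid[:12]  # state + county + tract + group digit
        groups[block_group_geoid][geoid].append(boundary)
    return groups


# Hypothetical sample records; the boundaries are placeholders.
sample_fragments = [
    ("060014271001000", "fragment A"),
    ("060014271001000", "fragment B"),
    ("060014271001001", "fragment C"),
    ("060014271002000", "fragment D"),
    ("060014001001000", "fragment E"),  # a tract that is not in the list
]
sample_groups = group_fragments(sample_fragments, island_tracts={"427100"})
for group_id, blocks in sorted(sample_groups.items()):
    print(group_id, {block: len(parts) for block, parts in blocks.items()})
```

Because the block group is just a prefix of the block ID, the grouping needs no lookup table; the only outside input is the list of tracts to keep.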
And here is what all those block fragments look like before I group them into block groups and trim them to fit the outline of the island:
The final step, trimming the boundaries, takes some time and patience. The basic technique is to take the intersection of one shape, like a block group, with a second shape, like the island outline. But when you look closely you see that some of the defined boundaries only approximate the true shape of the island, leaving little slivers of leftover space here and there. To fix this I have to increase the area by doing a union with an arbitrary rectangle and then do an intersection to trim it back to the exact shoreline.
There are also some peculiarities. A tiny triangular corner of the island actually resides in San Francisco County. Since this is uninhabited marshland it cannot affect the data so I add it to the nearest block group. Another group has an absurdly narrow tongue which sticks up along the median of a street separating two other groups, creating a distracting mess. I quietly trim it away. This is the kind of thing you have to do when cleaning any dataset. The difference here is that instead of correcting numbers, you are correcting shapes. Here is the final result, slightly exploded to better show each block group boundary:
Now that I finally have the boundaries of my choropleth it’s time to find data to color them with. The place to find that is a government site called American FactFinder. This site contains many different data sources or “programs”. In addition to the Decennial Census, there are housing surveys, commodity flow surveys, employer statistics, and much more.
I choose a source called the American Community Survey, the largest household survey the Census Bureau administers. Unlike the decennial census, it does not count everyone; it uses a statistical sample to estimate information based on surveys sent to 3.5 million households per year. So it’s not as accurate as the full census, but is more up to date. The estimates are quite reliable on a large scale, but can have a significant margin of error when applied at the block group level.
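To get a feel for how much that margin of error matters, here is a small Python sketch. It relies on the fact that the Census Bureau publishes ACS margins of error at the 90 percent confidence level, so dividing by 1.645 gives the standard error. The estimate and margin below are invented numbers, not values from any real table, and the function names are my own.

```python
# ACS margins of error are published at the 90% confidence level.
Z_90 = 1.645


def standard_error(margin_of_error):
    """Convert a published 90% margin of error to a standard error."""
    return margin_of_error / Z_90


def coefficient_of_variation(estimate, margin_of_error):
    """Standard error relative to the estimate; smaller is more reliable."""
    return standard_error(margin_of_error) / estimate


def confidence_interval(estimate, margin_of_error):
    """The 90% confidence interval implied by the published margin."""
    return estimate - margin_of_error, estimate + margin_of_error


# Hypothetical block group: 1,200 residents, plus or minus 329.
estimate, moe = 1200, 329
low, high = confidence_interval(estimate, moe)
cv = coefficient_of_variation(estimate, moe)
print(f"90% interval: {low} to {high}, CV = {cv:.1%}")
```

A coefficient of variation that climbs well into double digits is a hint that neighboring block groups with slightly different colors may not really differ.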
I select the 2015 5-year survey applied at the block group level in Alameda County. This reduces the available data to 342 separate tables including Median Age by Sex, Travel Time to Work, Household Size, School Enrollment, Median Income, Number of Bedrooms, Aggregate Rent, and on and on. You can only download forty tables at a time so I pick a few at random.
Using NodeBox I can easily read these tables, look up individual values for each block group, and then color my block groups accordingly. Within minutes I am producing one choropleth after another:
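Outside NodeBox, the lookup-and-color step might look something like the following Python sketch. It reads a table keyed by block group ID and maps each value to a color by sliding linearly between a light and a dark shade. The column names, the three rows of data, and the color choices are all invented for illustration.

```python
import csv
import io


def lerp_color(low_rgb, high_rgb, t):
    """Blend two RGB colors; t=0 gives low_rgb, t=1 gives high_rgb."""
    t = max(0.0, min(1.0, t))  # clamp so out-of-range values stay valid
    return tuple(round(a + (b - a) * t) for a, b in zip(low_rgb, high_rgb))


def choropleth_colors(values, low_rgb=(255, 255, 255), high_rgb=(55, 105, 155)):
    """Map {area_id: value} to {area_id: rgb} on a linear min-to-max scale."""
    lowest, highest = min(values.values()), max(values.values())
    span = (highest - lowest) or 1.0  # avoid dividing by zero if all equal
    return {
        area_id: lerp_color(low_rgb, high_rgb, (value - lowest) / span)
        for area_id, value in values.items()
    }


# Hypothetical table; a real download would be read from a file instead.
SAMPLE_TABLE = """geoid,median_age
060014271001,30.0
060014271002,40.0
060014271003,50.0
"""

rows = csv.DictReader(io.StringIO(SAMPLE_TABLE))
median_age = {row["geoid"]: float(row["median_age"]) for row in rows}
colors = choropleth_colors(median_age)
for geoid, rgb in colors.items():
    print(geoid, rgb)
```

A linear scale is the simplest choice; tables with a few extreme values may read better with quantile bins, which this sketch does not attempt.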
These maps may not seem all that interesting to you, but for someone like me who has lived on this island for twenty-five years they are fascinating. I can imagine hundreds of possible investigations. But for now I bask in the sheer power of effortlessly turning any random spreadsheet I come across into a gleaming, perfect choropleth.
As a final flourish I will take this idea to the next level. Instead of coloring each block group a solid color, I can fill it with randomly scattered colored dots, one dot for each person on the island. There are 63,043 people living on the main island and I can assign each one a color based on their self-identified race. Our racial diversity is one of the things I like about Alameda; it’s nice to finally see it in a single image:
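One simple way to scatter dots inside an irregular shape is rejection sampling: pick random points in the shape's bounding box and keep only those that fall inside the boundary. The Python sketch below shows that approach, which may differ from how the dots in my image were placed. The L-shaped polygon, the group names, and the counts are invented stand-ins for a real block group.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def scatter_dots(polygon, counts, rng):
    """Return shuffled (x, y, category) dots, one per counted person.

    Assumes the polygon has a nonzero area; otherwise no point is ever
    accepted and the loop would not finish.
    """
    xs = [px for px, _ in polygon]
    ys = [py for _, py in polygon]
    dots = []
    for category, count in counts.items():
        placed = 0
        while placed < count:
            x = rng.uniform(min(xs), max(xs))
            y = rng.uniform(min(ys), max(ys))
            if point_in_polygon(x, y, polygon):
                dots.append((x, y, category))
                placed += 1
    rng.shuffle(dots)  # so no category is always drawn on top
    return dots


# Hypothetical block group outline and population counts.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
sample_counts = {"group A": 30, "group B": 20}
sample_dots = scatter_dots(l_shape, sample_counts, random.Random(42))
print(len(sample_dots), "dots placed")
```

Passing in a seeded `random.Random` keeps the scatter reproducible from one run to the next, which is handy when comparing versions of the same map.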
I hope you have enjoyed this experiment. Open source mapping has become an energetic worldwide movement with a supportive community and many powerful tools. I have packed this post with links to help you get started on your own projects; here are some more.