d3 + maps = webGIS

What is GIS

A geographic information system (GIS) is a system designed to capture, store, manipulate, analyze, manage, and present all types of spatial or geographical data. (wikipedia)

A GIS is great at finding places based on location, and helps to tie data to the location where that data occurred. Since it’s also a database, it can do things like grouping, aggregations and lookups.

Modern GIS Stack

postgres - relational db
postgis - geospatial extensions
express - api
geojson / topojson - geography/topology encoding schemes, GIS data structure
d3 - data-driven documents, visualization

Example

Create a visualization of HCC Risk scores across the country, aggregated at county level. What’s the HCC Risk Score? The basis of the HHS-HCC risk adjustment model is using health plan enrollee diagnoses (and demographics) to predict medical expenditure risk. It’s a model that predicts cost for that patient for the period covered by the score (in this case, 2013).

Data

Risk score data

We’ll use the Medicare physician utilization dataset for 2013 for this project. This dataset contains information the demographic make-up of patients for about 1M healthcare providers across the country. We’re interested in the HCC Risk Score field in this dataset, named avg_hcc_risk_score.

Geography data

Postgis comes pre-built with loading scripts to import TIGER data. We’ll use these scripts to load our postgis instance with all places in the US, including counties, states, blocks, block groups, and zip codes (ZCTAs).

We’ll use the TIGER data to roll-up zip code-level HCC scores to the county level, and then export the aggregated data.

TIGER = Topologically Integrated Geographic Encoding and Referencing

These are spatial extracts from the Census Bureau’s MAF/TIGER database, containing features such as roads, railroads, rivers, as well as legal and statistical geographic areas.

This is the Census Bureau’s best mapping data, and it lets us query US geographies using meta data (like zip code, or country name, or state abbreviation) and retrieve their respective geographies and relationships with other entities (like counties within a state).

ZCTA

ZIP Code Tabulation Areas (ZCTAs) are generalized areal representations of United States Postal Service (USPS) ZIP Code service areas.

The USPS ZIP Codes identify the individual post office or metropolitan area delivery station associated with mailing addresses. USPS ZIP Codes are not areal features but a collection of mail delivery routes.

Census geographies

Census geographies have a neat ID convention. You can take the first n characters and derrive the parent of the current geography. Let’s take Oregon as an Example

State - Oregon - FIPS 41
County - Multnomah County - FIPS 41 051
Tract (~4000 inhabitants) - Census Tract 20 (near my home, Inner SE Portland) - ID 41 051 002000
Block Group (~700 inhabitants) - Census Block Group 5 (even closer to my home) - ID 41 051 002000 5
Block (approx 40 blocks / block group on average!) - Census Block 5001 (wow, hello, neighbor) - ID 41 051 002000 5 001

Configuration

Install Postgres & Postgis
Configure TIGER Data

Steps

Loading and joining data

We need to get the zip codes associated with counties. We’ll use the TIGER-supplied ZCTA entities in place of zip codes. These are a generally pretty good representation of USPS Zip codes, and change only every 10 years (each census).

Create ZIP <> FIPS (County) Mapping

drop table if exists gis.zip_lookup;
create table gis.zip_lookup (
  zip      CHAR(5), -- zip
  statefp  CHAR(2), -- state code (FIPS)
  countyfp CHAR(5)  -- county code (FIPS)
);
-- create a look up table for counties and what zip codes each county contains
with zips as (
  select * from tiger.zcta5 zip
  where st_isvalid(zip.the_geom)='t'
), 
-- sometimes geometries are invalid, let's make sure we only consider valid geos
counties as (
  select * from tiger.county county
  where st_isvalid(county.the_geom)='t'
)
-- we don't really care about times where a zip code is in multiple counties, so we'll only
-- use the centroid of the zip for the lookup
insert into gis.zip_lookup
select zips.zcta5ce, counties.statefp, counties.cntyidfp from counties, zips
  where ST_Intersects(counties.the_geom, st_centroid(zips.the_geom));
-- create lookup indexes
create index on gis.zip_lookup(zip);
create index on gis.zip_lookup(countyfp);

Aggregate the HCC Risk scores by county (using our new lookup table)

Since we now have a lookup table for county <> zip mappings, and a table of patient demographic data, we’ll now select the average of HCC scores, grouped by county, using our new zip_lookup table.

-- check avg score by zip code
with zip_hcc_scores as (
    SELECT
      avg(avg_hcc_risk_score) hcc_risk_score,
      left(zip, 5) zip
    FROM cms_charges.physician_demographics
    GROUP BY left(zip, 5)
)
-- use aggregated by zip to further aggregate by county
select countyfp, avg(hcc_risk_score) from gis.zip_lookup zips
  join zip_hcc_scores hcc on hcc.zip = zips.zip
group by countyfp;

So we now have a county-aggregated list of HCC Scores, let’s add geographies to it!

-- get avg score by zip code
with zip_hcc_scores as (
    SELECT
      avg(avg_hcc_risk_score) hcc_risk_score,
      left(zip, 5) zip
    FROM cms_charges.physician_demographics
    GROUP BY left(zip, 5)
), 
-- let's aggregate by County FIPS code
county_hcc as (
    SELECT
      countyfp,
      avg(hcc_risk_score) hcc_risk_score
    FROM gis.zip_lookup zips
      JOIN zip_hcc_scores hcc ON hcc.zip = zips.zip
    GROUP BY countyfp
), 
-- another CTE, that uses the county_hcc mappings and adds geo to it
county_hcc_geo as (
    select 
        county_hcc.countyfp, 
        county_hcc.hcc_risk_score, 
        county.the_geom 
    from tiger.county
        join county_hcc on county_hcc.countyfp = tiger.county.cntyidfp
)

Output as GEOJSON

GeoJSON is a format for encoding a variety of geographic data structures. It’s JSON for geography. For us, it will be the protocol we’ll use to export from Postgres. Postgis comes with a handy way to express geometry and geography data as GeoJSON, and export it, directly within the database!

So, finally, we’ll take our HCC Risk data, aggregated by county, and export is as a GeoJSON FeatureCollection, making sure to take the necessary metadata to make our visualization useful (like county name), and even # of providers in each county.

-- add this to the end of the CTEs above.
-- NOTE: you should run this in the command-line, this will output a large file, which you should pipe into a text file:
-- psql -f this.sql > file.json
-- you will have to clean up the json file (get rid of default header)
SELECT row_to_json(fc)
 FROM ( 
     SELECT 'FeatureCollection' As type, array_to_json(array_agg(f)) As features
     FROM (SELECT 'Feature' As type
    , ST_AsGeoJSON(county_hcc_geo.the_geom)::json As geometry
    , row_to_json(
        (SELECT l FROM (SELECT countyfp, hcc_risk_score, 'county_name') As l)
      ) As properties
   FROM county_hcc_geo   
 ) As f 
) As fc;

Convert to TOPOJSON

TopoJSON is an extension to GeoJSON that encodes topology. Its primary benefit is that it greatly reduces the size of the geometry files by encoding same “arc” segments in a single place, and then using references to those segments wherever the segments appear. In our case, the GeoJSON file was 2.1MB, the TopoJSON file was 1.1MB. topojson -p -o counties.topo.json counties.geojson

Putting it all together in D3

With the TopoJSON document we created in the previous step, we’re now ready to generate our map. But first, a little about projections.

Projections

A projection is how spherical latitude and longitude points translate onto a 2d plane. There are a lot of projections out there, today we’ll use the Albers USA projection, which is a familiar one to most of us, and is supplied by D3 out of the box.

One of its benefits is that it places Alaska and Hawaii in the left bottom of the screen, making maps created with this projection easy to parse and compare. There are many projections in D3, explore them. Each projection may have different settings for scale and position, so be aware of that (in case the map is off the screen).

Loading Topojson, displaying it

We’re going to use queue.js to load our data asynchronously and draw the map once we have all the data in place. Notice we use d3.json for the .defer call to load our topojson data. d3.json will process this resource as JSON when it’s done loading.

Adding data

While we can map and colorize directly from TopoJSON, in this example, I wanted to load the data separately from the geography, so we can join the geographies with ANY data (as long as it’s grouped by FIPS code). I am loading 2 datasets in this example. The first is our FIPS/HCC Risk score CSV, the second is a list of states with their own FIPS codes, with abbreviations and names. We’ll use the second dataset to make our tooltips more user friendly.

// load topojson + hcc risk score csv
queue()
  .defer(d3.json, 'counties.topo.json')
  .defer(d3.csv, 'hcc.csv', function (d) {
    // load a look up table with fips / hcc codes (to use for choropleth)
    hccByFips.set(d.fips, +d.hcc);
  })
  .defer(d3.csv, 'states.csv', function (d) {
    // load a states table, so we can use FIPS to look up the state
    statesByFips.set(d.statefp, {
    abbr: d.abbrev,
    name: d.name
    });
  })
  .await(drawMap);

I am using d3.map to create two lookup tables (maps) – one for FIPS (County) <> HCC Risk Scores, and the second to FIPS (State) <> State details. See the Census ID conventions (above) for a little more color on this.

Add a legend

We’ll use the same color scale we created in the drawMap method to use to create the legend. First, we’ll use the range of the scale to feed data into the data method, and we’ll create an LI for each item.

Make our map into a choropleth map

A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per-capita income, in our case, HCC Risk Score.

We’ll use the EXCELLENT colorbrewer2 set of colors to create our map. You should explore the various types of color schemes that Cynthia Brewer put together for making great looking, readable maps. Since we have a sequential domain for our data, we’ll use the sequential set of colors to represent it as our range.

Sometimes you might have divergent data – for example, voting trends, Democrat vs. Republican. For these cases, a diverging color scheme works best.

  // threshold scale, so we can fine-tune how the colors turn out.
  var colors2 = d3.scale.threshold()
    .domain([0.9, 1, 1.1, 1.2, 1.3, 1.5, 2, 4])
    .range(colorbrewer.YlOrRd[9]);
  // add paths to SVG, put the class "county" on them, and then hook them up to the event handler showCountyDetails
  .attr("fill", function (d) {
    var fillColor = colors2(hccByFips.get(d.properties["FIPS Code"]));
    return fillColor ? fillColor : 'transparent';
  })

Add tooltips

You could pull in the d3 tooltip library, I decided to simply use the on('click') handler to show a little message. You get the idea. Here, we’ll use the two lookup tables we created previously to show the County name, state name and the HCC Risk score.

function showCountyDetails (d) {
  if (!d && !d.properties) { return; }
  d3.select('#details').html(
    d.properties['County Name'] + ', ' + statesByFips.get(d.properties["FIPS Code"].substring(0,2)).name 
    + ' ' + ' average HCC Risk score is <b>' + (hccByFips.get(d.properties["FIPS Code"]) || 'Not available') + '</b>'
  );
  d3.selectAll("path").attr("class", "county");
  d3.select(this).attr("class", "selected");
}

HCC Risk scores by county, D3 Portland Meetup, Oct 27, 2015