A geographic information system (GIS) is a system designed to capture, store, manipulate, analyze, manage, and present all types of spatial or geographical data. (wikipedia)
A GIS is great at finding places based on location, and helps to tie data to the location where that data occurred. Since it’s also a database, it can do things like grouping, aggregations and lookups.
Create a visualization of HCC Risk scores across the country, aggregated at county level. What’s the HCC Risk Score? The basis of the HHS-HCC risk adjustment model is using health plan enrollee diagnoses (and demographics) to predict medical expenditure risk. It’s a model that predicts cost for that patient for the period covered by the score (in this case, 2013).
We’ll use the Medicare physician utilization dataset for 2013 for this project. This dataset contains information the demographic make-up of patients for about 1M healthcare providers across the country. We’re interested in the HCC Risk Score field in this dataset, named avg_hcc_risk_score
.
Postgis comes pre-built with loading scripts to import TIGER data. We’ll use these scripts to load our postgis instance with all places in the US, including counties, states, blocks, block groups, and zip codes (ZCTAs).
We’ll use the TIGER data to roll-up zip code-level HCC scores to the county level, and then export the aggregated data.
These are spatial extracts from the Census Bureau’s MAF/TIGER database, containing features such as roads, railroads, rivers, as well as legal and statistical geographic areas.
This is the Census Bureau’s best mapping data, and it lets us query US geographies using meta data (like zip code, or country name, or state abbreviation) and retrieve their respective geographies and relationships with other entities (like counties within a state).
ZIP Code Tabulation Areas (ZCTAs) are generalized areal representations of United States Postal Service (USPS) ZIP Code service areas.
The USPS ZIP Codes identify the individual post office or metropolitan area delivery station associated with mailing addresses. USPS ZIP Codes are not areal features but a collection of mail delivery routes.
Census geographies have a neat ID convention. You can take the first n characters and derrive the parent of the current geography. Let’s take Oregon as an Example
We need to get the zip codes associated with counties. We’ll use the TIGER-supplied ZCTA entities in place of zip codes. These are a generally pretty good representation of USPS Zip codes, and change only every 10 years (each census).
drop table if exists gis.zip_lookup;
create table gis.zip_lookup (
zip CHAR(5), -- zip
statefp CHAR(2), -- state code (FIPS)
countyfp CHAR(5) -- county code (FIPS)
);
-- create a look up table for counties and what zip codes each county contains
with zips as (
select * from tiger.zcta5 zip
where st_isvalid(zip.the_geom)='t'
),
-- sometimes geometries are invalid, let's make sure we only consider valid geos
counties as (
select * from tiger.county county
where st_isvalid(county.the_geom)='t'
)
-- we don't really care about times where a zip code is in multiple counties, so we'll only
-- use the centroid of the zip for the lookup
insert into gis.zip_lookup
select zips.zcta5ce, counties.statefp, counties.cntyidfp from counties, zips
where ST_Intersects(counties.the_geom, st_centroid(zips.the_geom));
-- create lookup indexes
create index on gis.zip_lookup(zip);
create index on gis.zip_lookup(countyfp);
Since we now have a lookup table for county <> zip mappings, and a table of patient demographic data, we’ll now select the average of HCC scores, grouped by county, using our new zip_lookup
table.
-- check avg score by zip code
with zip_hcc_scores as (
SELECT
avg(avg_hcc_risk_score) hcc_risk_score,
left(zip, 5) zip
FROM cms_charges.physician_demographics
GROUP BY left(zip, 5)
)
-- use aggregated by zip to further aggregate by county
select countyfp, avg(hcc_risk_score) from gis.zip_lookup zips
join zip_hcc_scores hcc on hcc.zip = zips.zip
group by countyfp;
So we now have a county-aggregated list of HCC Scores, let’s add geographies to it!
-- get avg score by zip code
with zip_hcc_scores as (
SELECT
avg(avg_hcc_risk_score) hcc_risk_score,
left(zip, 5) zip
FROM cms_charges.physician_demographics
GROUP BY left(zip, 5)
),
-- let's aggregate by County FIPS code
county_hcc as (
SELECT
countyfp,
avg(hcc_risk_score) hcc_risk_score
FROM gis.zip_lookup zips
JOIN zip_hcc_scores hcc ON hcc.zip = zips.zip
GROUP BY countyfp
),
-- another CTE, that uses the county_hcc mappings and adds geo to it
county_hcc_geo as (
select
county_hcc.countyfp,
county_hcc.hcc_risk_score,
county.the_geom
from tiger.county
join county_hcc on county_hcc.countyfp = tiger.county.cntyidfp
)
GeoJSON is a format for encoding a variety of geographic data structures. It’s JSON for geography. For us, it will be the protocol we’ll use to export from Postgres. Postgis comes with a handy way to express geometry and geography data as GeoJSON, and export it, directly within the database!
So, finally, we’ll take our HCC Risk data, aggregated by county, and export is as a GeoJSON FeatureCollection, making sure to take the necessary metadata to make our visualization useful (like county name), and even # of providers in each county.
-- add this to the end of the CTEs above.
-- NOTE: you should run this in the command-line, this will output a large file, which you should pipe into a text file:
-- psql -f this.sql > file.json
-- you will have to clean up the json file (get rid of default header)
SELECT row_to_json(fc)
FROM (
SELECT 'FeatureCollection' As type, array_to_json(array_agg(f)) As features
FROM (SELECT 'Feature' As type
, ST_AsGeoJSON(county_hcc_geo.the_geom)::json As geometry
, row_to_json(
(SELECT l FROM (SELECT countyfp, hcc_risk_score, 'county_name') As l)
) As properties
FROM county_hcc_geo
) As f
) As fc;
TopoJSON is an extension to GeoJSON that encodes topology. Its primary benefit is that it greatly reduces the size of the geometry files by encoding same “arc” segments in a single place, and then using references to those segments wherever the segments appear. In our case, the GeoJSON file was 2.1MB, the TopoJSON file was 1.1MB. topojson -p -o counties.topo.json counties.geojson
With the TopoJSON document we created in the previous step, we’re now ready to generate our map. But first, a little about projections.
A projection is how spherical latitude and longitude points translate onto a 2d plane. There are a lot of projections out there, today we’ll use the Albers USA projection, which is a familiar one to most of us, and is supplied by D3 out of the box.
One of its benefits is that it places Alaska and Hawaii in the left bottom of the screen, making maps created with this projection easy to parse and compare. There are many projections in D3, explore them. Each projection may have different settings for scale and position, so be aware of that (in case the map is off the screen).
We’re going to use queue.js to load our data asynchronously and draw the map once we have all the data in place. Notice we use d3.json
for the .defer
call to load our topojson data. d3.json
will process this resource as JSON when it’s done loading.
While we can map and colorize directly from TopoJSON, in this example, I wanted to load the data separately from the geography, so we can join the geographies with ANY data (as long as it’s grouped by FIPS code). I am loading 2 datasets in this example. The first is our FIPS/HCC Risk score CSV, the second is a list of states with their own FIPS codes, with abbreviations and names. We’ll use the second dataset to make our tooltips more user friendly.
// load topojson + hcc risk score csv
queue()
.defer(d3.json, 'counties.topo.json')
.defer(d3.csv, 'hcc.csv', function (d) {
// load a look up table with fips / hcc codes (to use for choropleth)
hccByFips.set(d.fips, +d.hcc);
})
.defer(d3.csv, 'states.csv', function (d) {
// load a states table, so we can use FIPS to look up the state
statesByFips.set(d.statefp, {
abbr: d.abbrev,
name: d.name
});
})
.await(drawMap);
I am using d3.map to create two lookup tables (maps) – one for FIPS (County) <> HCC Risk Scores, and the second to FIPS (State) <> State details. See the Census ID conventions (above) for a little more color on this.
We’ll use the same color scale we created in the drawMap method to use to create the legend. First, we’ll use the range of the scale to feed data into the data method, and we’ll create an LI for each item.
A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per-capita income, in our case, HCC Risk Score.
We’ll use the EXCELLENT colorbrewer2 set of colors to create our map. You should explore the various types of color schemes that Cynthia Brewer put together for making great looking, readable maps. Since we have a sequential domain for our data, we’ll use the sequential set of colors to represent it as our range.
Sometimes you might have divergent data – for example, voting trends, Democrat vs. Republican. For these cases, a diverging color scheme works best.
// threshold scale, so we can fine-tune how the colors turn out.
var colors2 = d3.scale.threshold()
.domain([0.9, 1, 1.1, 1.2, 1.3, 1.5, 2, 4])
.range(colorbrewer.YlOrRd[9]);
// add paths to SVG, put the class "county" on them, and then hook them up to the event handler showCountyDetails
.attr("fill", function (d) {
var fillColor = colors2(hccByFips.get(d.properties["FIPS Code"]));
return fillColor ? fillColor : 'transparent';
})
You could pull in the d3 tooltip library, I decided to simply use the on('click')
handler to show a little message. You get the idea. Here, we’ll use the two lookup tables we created previously to show the County name, state name and the HCC Risk score.
function showCountyDetails (d) {
if (!d && !d.properties) { return; }
d3.select('#details').html(
d.properties['County Name'] + ', ' + statesByFips.get(d.properties["FIPS Code"].substring(0,2)).name
+ ' ' + ' average HCC Risk score is <b>' + (hccByFips.get(d.properties["FIPS Code"]) || 'Not available') + '</b>'
);
d3.selectAll("path").attr("class", "county");
d3.select(this).attr("class", "selected");
}