block by joyrexus 7439256

Explore crossfilter sans browser

A little intro and exploration of crossfilter.js based on the docs and this tutorial.

Intro

Crossfilter provides fast n-dimensional filtering and grouping of records, enabling live histograms. It’s designed for exploring large multivariate datasets, where interactions consist of grouping by a dimension followed by incremental filtering and reducing along another dimension.

In other words, it enables you to define dimensions on your dataset, by which you can then filter, sort, group and reduce the dataset. It’s internal indicing makes this all very fast and efficient.

While you can certainly use crossfilter in a browser, I think its handy to first experiment with it as a node module so you can focus on what it does without the overhead required to visualize its effects. It’s available as a node package and you can install with npm -g crossfilter.

For more on crossfilter, have a look at the …

Note that there’s a dimensional charting library based on D3 and Crossfilter called dc.js. Looks convenient if you’re working on a web-based dashboard and cool with the predefined chart types.

If you’re dealing with categorical data (e.g., surveys), try catcorr.js.

Requirements

To run this README file as literate coffeescript change the file extension from .md to either .litcoffee or .coffee.md and include …

… in the same directory.


A quick walkthrough

Below, we’ll be working through this tutorial.

Preliminaries

We’re going to use D3’s date-time methods for converting date strings into date objects. Since we’re only using this one piece of D3, we’ve pulled out and rolled our own time component:

d3 = require './d3-time-min.js'

Used below for testing:

{ok, eq, deepEqual} = require 'assert'
isEqual = deepEqual

Convenience method to convert an array of key/value maps to a map (i.e., associative array) for easier testing and reading:

toMap = (arr) ->
  obj = {}
  obj[d.key] = d.value for d in arr
  obj

sample = [ { key: "apple", value: 1 }, { key: "orange", value: 2 } ]

isEqual {apple: 1, orange: 2}, toMap(sample)

For our data records we’re going to be working with a list of US Presidents:

var presidents = [
  {
    "number":1,
    "president":"George Washington",
    "birth_year":1732,
    "death_year":1799,
    "took_office":"1789-04-30",
    "left_office":"1797-03-04",
    "party":"No Party"
  },{
  ...
  },{
    "number":44,
    "president":"Barack Obama",
    "birth_year":1961,
    "death_year":null,
    "took_office":"2009-01-20",
    "left_office":null,
    "party":"Democratic"
  }
];
presidents = require './presidents.json'

To do date comparisons we’re going to want to convert strings to date objects:

toDate = d3.time.format "%Y-%m-%d"
presidents.forEach (p) -> 
  p.took_office = toDate.parse(p.took_office)
  p.left_office = if p?.left_office \
    then toDate.parse(p.left_office) 
    else null

last = presidents.length - 1

obama = 
  number: 44
  president: 'Barack Obama'
  birth_year: 1961
  death_year: null
  took_office: toDate.parse("2009-01-20")
  left_office: null
  party: 'Democratic'

isEqual presidents[last], obama

First steps

Alrighty, let’s use the crossfilter force!

crossfilter = require 'crossfilter'

prez = crossfilter presidents

Check the size of our dataset:

ok prez.size() is 44

Facet presidents by political party:

byParty = prez.dimension (p) -> p.party

parties = byParty.group()

If we group by this dimension we should have six parties total:

ok parties.size() is 6

… viz.:

partyList = (p.key for p in parties.top(Infinity))

isEqual partyList, [ 
  'Republican',
  'Democratic',
  'Whig',
  'Democratic-Republican',
  'Federalist',
  'No Party' 
]

Note how we use the top method, which lists the top k parties in our group (sorted by count in descending order). When we pass Infinity as an argument, all items are returned.

Reducing

The parties grouping can be reduced in a variety of ways. Without an explicit reduction, you get a count of each group by default. That is, each entry in the grouping consists of a key-value pair, where the key is a group name (a party) and the value is the number of items in the group (the number of presidents in that party).

partyCount = toMap parties.all()

expected = 
  Republican: 18
  Democratic: 16
  Whig: 4
  'Democratic-Republican': 4
  Federalist: 1
  'No Party': 1

isEqual expected, partyCount

Now suppose we want to reduce the parties to total years in power. The presidential entries in the provided dataset do not contain an attribute for years in power, only the date they took and left office. These latter dates provide the information needed to calculate a president’s years in office:

yearsPrez = (begin, end) ->
  end ?= new Date()
  time = end - begin    # diff in milliseconds
  secs = time / 1000    # seconds in office
  mins = secs / 60
  hours = mins / 60
  days = hours / 24
  years = days / 365
  Math.round years

We can then use this function to reduce our grouping by parties to a sum of the years they held presidential office:

toYearsPrez = (d) -> yearsPrez(d.took_office, d.left_office)

totalYears = toMap parties.reduceSum(toYearsPrez).all()

expected = 
  Democratic: 89
  Republican: 88
  'Democratic-Republican': 28
  Whig: 8
  'No Party': 8
  Federalist: 4

isEqual expected, totalYears

Filtering

Let’s filter our presidents by those who ran in the Whig party and display the resulting list of presidents sorted by when they assumed office:

whigs = byParty.filter("Whig").top(Infinity)
sortByNum = crossfilter.quicksort.by (d) -> d.number

result = (p.president for p in sortByNum(whigs, lo=0, hi=whigs.length))

expected = [ 
  'William Henry Harrison',
  'John Tyler'
  'Zachary Taylor'
  'Millard Fillmore'
]

isEqual result, expected

Note that we have to sort the resulting whigs list above since the only order on this dimension is by party, whereas we want the final list to be sorted by presidential number.

OK, let’s get all parties back:

byParty.filterAll()

Coordinated Views

We can of course add a second dimension to work in conjunction with the first.

Let’s create a dimension for the year a president took office …

byDate = prez.dimension (p) -> p.took_office

… and filter out presidents starting before 1900:

dateRange = [new Date(1900, 1, 1), Infinity]

byDate.filter dateRange

We should find that there are 19 presidents that have taken office after 1900:

modernPrez = byDate.bottom(Infinity)
ok 19 is modernPrez.length

… the first being …

ok modernPrez[0].president is 'Theodore Roosevelt'

Note how parties (our byParty dimension) was also updated:

partyYears = toMap parties.all()

expected = 
  Republican: 59
  Democratic: 53
  Federalist: 0
  Whig: 0
  'No Party': 0
  'Democratic-Republican': 0

partyCount = toMap parties.reduceCount().all()

expected = 
  Republican: 11
  Democratic: 8
  Federalist: 0
  Whig: 0
  'No Party': 0
  'Democratic-Republican': 0

isEqual partyCount, expected

presidents.json