We need a simple vocabulary for talking about the manipulation of tabular data structures. That is, we want a clean and elegant way of describing how to munge tables. We want to slice and dice them with minimal pain. Implementation can follow once we settle on the right verbs for data manipulation operations and the right nouns for post-manipulated elements. Yes, we have relational algebra codified in SQL and its mungy extensions. So we’re just talking about a coherent distillation of those parts potentially applicable to any tabular data store.
What follows are extracts from Hadley Wickham’s dplyr project README file.
You need to split up a big data structure into homogeneous pieces, apply a function to each piece and then combine all the results back together. For example, you might want to:
select()
: focus on a subset of variables
filter()
: focus on a subset of rows
mutate()
: add new columns
summarise()
: reduce each group to a smaller number of summary statistics
arrange()
: re-order the rows

As well as verbs that work on a single table, we also want verbs describing how to operate on two tables at a time, viz. your standard joins:
inner_join(x, y)
: matching x + y
left_join(x, y)
: all x + matching y
semi_join(x, y)
: all x with match in y
anti_join(x, y)
: all x without match in y

We might also want to provide head() and print() methods for summary display of our tables.
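To make the vocabulary concrete, here is a rough sketch of the verbs over the simplest tabular store I can think of: a Python list of dicts, one dict per row. None of this is dplyr's API. The function signatures, the `by` argument that folds grouping into `summarise`, the `key` argument on the joins, and the sample `orders` and `customers` tables are all made up for illustration, and the joins assume the two tables share only the key column.

```python
from collections import defaultdict

# Assumption: a "table" is a list of dicts (one dict per row).
# All names and signatures below are hypothetical, not dplyr's API.

def select(rows, *cols):
    """Focus on a subset of variables (columns)."""
    return [{c: r[c] for c in cols} for r in rows]

def filter_rows(rows, pred):
    """Focus on a subset of rows: keep those where pred(row) is true."""
    return [r for r in rows if pred(r)]

def mutate(rows, **new_cols):
    """Add new columns; each keyword maps a column name to a function of the row."""
    return [{**r, **{name: fn(r) for name, fn in new_cols.items()}} for r in rows]

def summarise(rows, by, **stats):
    """Split rows on the `by` column, then reduce each group to summary values."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[by]].append(r)
    return [{by: k, **{name: fn(g) for name, fn in stats.items()}}
            for k, g in groups.items()]

def arrange(rows, col, descending=False):
    """Re-order the rows by one column."""
    return sorted(rows, key=lambda r: r[col], reverse=descending)

def inner_join(x, y, key):
    """Matching x + y."""
    return [{**a, **b} for a in x for b in y if a[key] == b[key]]

def left_join(x, y, key):
    """All x + matching y; y's columns are None where nothing matches."""
    y_cols = {c for b in y for c in b if c != key}
    out = []
    for a in x:
        matches = [b for b in y if b[key] == a[key]]
        if matches:
            out.extend({**a, **b} for b in matches)
        else:
            out.append({**a, **{c: None for c in y_cols}})
    return out

def semi_join(x, y, key):
    """All x with a match in y (no columns from y are added)."""
    keys = {b[key] for b in y}
    return [a for a in x if a[key] in keys]

def anti_join(x, y, key):
    """All x without a match in y."""
    keys = {b[key] for b in y}
    return [a for a in x if a[key] not in keys]

def head(rows, n=6):
    """First n rows, for summary display."""
    return rows[:n]

# Made-up sample data.
orders = [
    {"id": 1, "customer": "ann", "amount": 30},
    {"id": 2, "customer": "bob", "amount": 20},
    {"id": 3, "customer": "ann", "amount": 50},
    {"id": 4, "customer": "cat", "amount": 10},
]
customers = [
    {"customer": "ann", "city": "Oslo"},
    {"customer": "bob", "city": "Lima"},
    {"customer": "dan", "city": "Kyiv"},
]

# Split-apply-combine: total spend per customer, largest first.
totals = arrange(
    summarise(orders, "customer", total=lambda g: sum(r["amount"] for r in g)),
    "total",
    descending=True,
)
```

With the sample data, `totals` starts with ann at 80, `semi_join(orders, customers, "customer")` keeps orders 1 to 3, and `anti_join` keeps only order 4. A real implementation would want an index on the join key rather than the nested scans used here.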
crossfilter - fast n-dimensional filtering and grouping of records in JavaScript.