Finding indicator dependencies

I had been meaning to look into R's metaprogramming features for a while now. I finally had a chance today (thanks Hadley!), and I used it to experiment towards a problem that had been in the back of my mind for a while: finding depencies within indicator definitions.

Below, I implement a find dependencies function, which takes a set of indicators, and finds dependencies within it. Indicators are fields within a dataset, some of which are already there, and some of which are newly created. The dependency finding problem is investigating which new indicators derive from which existing ones. We think of these relationships as dependencies: for an indicator such as pupil-to-teacher-ratio (defined as the number-of-pupils divided by the number-of-teachers), pupil-to-teacher-ratio is dependent on number-of-pupils and number-of-teachers. Below, we'll see the dependency lists for a whole set of indicators.

This is useful in all sorts of contexts, from missing data analysis to many optimizations and streaming tasks. Creating a way to write indicators that allows for dependency analysis is usually tricky; you often end up introducing a lot of overhead for dependency analysis that detracts from the domain-specific indicator definitions. R's metaprogramming features show a pretty neat approach, at least at an experimental level. Note that most of the knowledge needed here is thanks to Hadley Wickam's amazing Advanced R programming book, particularly the chapters on environments, non-standard evaluation, and metaprogramming.

First, The Indicators

First, lets define the indicators. You can skim these, perhaps noticing that we aren't writing much more than the pure definitions of each indicator. I left the infinite_as_NA function in, but its possible that could be pulled out of here into another layer.

# Quick function for converting infinite values to NA
infinite_as_NA <- function(x) {
    ifelse(is.infinite(x), NA, x)
}
require(stringr)  # for string detect
## Loading required package: stringr
# Example indicators (these are from a real example)
line_by_line_indicators = quote({
    is_primary = str_detect(facility_type, "primary")
    is_junior_secondary = str_detect(facility_type, "junior")
    pj = is_primary | is_junior_secondary
    src = "mopup"
    date_of_survey = as.character(as.Date(start))

    # INFRASTRUCTURE
    improved_water_supply = improved_water_supply.tap | improved_water_supply.protected_well | 
        improved_water_supply.rainwater | improved_water_supply.handpump
    improved_water_supply = improved_water_supply.tap | improved_water_supply.protected_well | 
        improved_water_supply.rainwater | improved_water_supply.handpump
    improved_sanitation = improved_sanitation.vip_latrine | improved_sanitation.pit_latrine_with_slab | 
        improved_sanitation.flush

    # PARTICIPATION
    male_to_female_student_ratio = num_students_male/num_students_female

    # INFRASTRUCTURE:WATER&SAN
    pupil_toilet_ratio_facility = infinite_as_NA(num_students_total/num_toilets_total)

    # INFRASTRUCTURE:BUILDING STRUCTURE
    power_access = (power_sources.generator | power_sources.solar_system | power_sources.grid)

    # INFRASTRUCTURE:LEARNING ENVIRONMENT
    pupil_classrm_ratio = num_students_total/num_classrms_total

    # ADEQUACY OF STAFFING
    pupil_tchr_ratio = num_students_total/num_tchrs_total
})

Find Dependencies

Here is the function to find dependencies (note: built off of one day's learning about meta-programming, if you notice a bug, do point it out).

## Find 'dependencies' in lhs = rhs type equations, where all names in rhs
## are dependences of lhs. Outputs in a named vector, where names are from
## lhs, and the values from rhs.  Example, pupil_teacher_ratio = num_pupils /
## num_tchrs will produce (in json-represented named vector):
## {'pupil_teacher_ratio': 'num_pupils', 'pupil_teacher_ratio' : 'num_tchrs'}
find_deps <- function(x) {
    if (is.atomic(x)) {
        # these are evaluated values; we don't want these
        character()
    } else if (is.name(x)) {
        # these are our unevaluated values; this is what we want
        as.character(x)
    } else if (is.call(x)) {
        if (identical(x[[1]], quote(`=`))) {
            # x looks like = lhs rhs; we recurse first on rhs (x[-1][-1])
            rhs <- unlist(lapply(x[-1][-1], find_deps))
            # and take the second element (x[[2]]) as lhs
            lhs <- as.character(x[[2]])
            if (length(rhs) == 0) 
                setNames(NA, lhs)  # no dependencies
 else setNames(rhs, rep(lhs, length(rhs)))
        } else {
            # for other function calls, we recurse only on everything but function name
            unlist(lapply(x[-1], find_deps))
        }
    }
}

The magic

And finally, here is the magic. A dictionary of dependencies. The dependent indicator is listed as the key, once per indicator that it depends on (which is the value). Notice src, which doesn't have any dependencies.

dependencies <- find_deps(line_by_line_indicators)
cat(RJSONIO::toJSON(dependencies, pretty = TRUE))
## {
##  "is_primary" : "facility_type",
##  "is_junior_secondary" : "facility_type",
##  "pj" : "is_primary",
##  "pj" : "is_junior_secondary",
##  "src" : null,
##  "date_of_survey" : "start",
##  "improved_water_supply" : "improved_water_supply.tap",
##  "improved_water_supply" : "improved_water_supply.protected_well",
##  "improved_water_supply" : "improved_water_supply.rainwater",
##  "improved_water_supply" : "improved_water_supply.handpump",
##  "improved_water_supply" : "improved_water_supply.tap",
##  "improved_water_supply" : "improved_water_supply.protected_well",
##  "improved_water_supply" : "improved_water_supply.rainwater",
##  "improved_water_supply" : "improved_water_supply.handpump",
##  "improved_sanitation" : "improved_sanitation.vip_latrine",
##  "improved_sanitation" : "improved_sanitation.pit_latrine_with_slab",
##  "improved_sanitation" : "improved_sanitation.flush",
##  "male_to_female_student_ratio" : "num_students_male",
##  "male_to_female_student_ratio" : "num_students_female",
##  "pupil_toilet_ratio_facility" : "num_students_total",
##  "pupil_toilet_ratio_facility" : "num_toilets_total",
##  "power_access" : "power_sources.generator",
##  "power_access" : "power_sources.solar_system",
##  "power_access" : "power_sources.grid",
##  "pupil_classrm_ratio" : "num_students_total",
##  "pupil_classrm_ratio" : "num_classrms_total",
##  "pupil_tchr_ratio" : "num_students_total",
##  "pupil_tchr_ratio" : "num_tchrs_total"
## }

Evaluation

Evaluation isn't super tricky either. Here it is, being performed on a real dataset:

e <- readRDS("~/Code/mop_up/data/in_process_data/education_mopup_outliercleaned.rds")
e2 <- within(e, eval(line_by_line_indicators))

## Some examples to show you the output was right:
sample_columns <- unique(c(na.omit(dependencies)[1:4], names(dependencies)[1:5]))
e2[1:10, sample_columns]
##      facility_type is_primary is_junior_secondary   pj   src
## 1  junior_sec_only      FALSE                TRUE TRUE mopup
## 2     primary_only       TRUE               FALSE TRUE mopup
## 3             <NA>         NA                  NA   NA mopup
## 4     primary_only       TRUE               FALSE TRUE mopup
## 5     primary_only       TRUE               FALSE TRUE mopup
## 6     primary_only       TRUE               FALSE TRUE mopup
## 7  junior_sec_only      FALSE                TRUE TRUE mopup
## 8     primary_only       TRUE               FALSE TRUE mopup
## 9     primary_only       TRUE               FALSE TRUE mopup
## 10    primary_only       TRUE               FALSE TRUE mopup