I had been meaning to look into R's metaprogramming features for a while now. I finally had a chance today (thanks Hadley!), and I used it to experiment towards a problem that had been in the back of my mind for a while: finding depencies within indicator definitions.
Below, I implement a find dependencies function, which takes a set of indicators, and finds dependencies within it. Indicators are fields within a dataset, some of which are already there, and some of which are newly created. The dependency finding problem is investigating which new indicators derive from which existing ones. We think of these relationships as dependencies: for an indicator such as pupil-to-teacher-ratio (defined as the number-of-pupils divided by the number-of-teachers), pupil-to-teacher-ratio is dependent on number-of-pupils and number-of-teachers. Below, we'll see the dependency lists for a whole set of indicators.
This is useful in all sorts of contexts, from missing data analysis to many optimizations and streaming tasks. Creating a way to write indicators that allows for dependency analysis is usually tricky; you often end up introducing a lot of overhead for dependency analysis that detracts from the domain-specific indicator definitions. R's metaprogramming features show a pretty neat approach, at least at an experimental level. Note that most of the knowledge needed here is thanks to Hadley Wickam's amazing Advanced R programming book, particularly the chapters on environments, non-standard evaluation, and metaprogramming.
First, lets define the indicators. You can skim these, perhaps noticing that we aren't writing much more than the pure definitions of each indicator. I left the infinite_as_NA
function in, but its possible that could be pulled out of here into another layer.
# Quick function for converting infinite values to NA
infinite_as_NA <- function(x) {
ifelse(is.infinite(x), NA, x)
}
require(stringr) # for string detect
## Loading required package: stringr
# Example indicators (these are from a real example)
line_by_line_indicators = quote({
is_primary = str_detect(facility_type, "primary")
is_junior_secondary = str_detect(facility_type, "junior")
pj = is_primary | is_junior_secondary
src = "mopup"
date_of_survey = as.character(as.Date(start))
# INFRASTRUCTURE
improved_water_supply = improved_water_supply.tap | improved_water_supply.protected_well |
improved_water_supply.rainwater | improved_water_supply.handpump
improved_water_supply = improved_water_supply.tap | improved_water_supply.protected_well |
improved_water_supply.rainwater | improved_water_supply.handpump
improved_sanitation = improved_sanitation.vip_latrine | improved_sanitation.pit_latrine_with_slab |
improved_sanitation.flush
# PARTICIPATION
male_to_female_student_ratio = num_students_male/num_students_female
# INFRASTRUCTURE:WATER&SAN
pupil_toilet_ratio_facility = infinite_as_NA(num_students_total/num_toilets_total)
# INFRASTRUCTURE:BUILDING STRUCTURE
power_access = (power_sources.generator | power_sources.solar_system | power_sources.grid)
# INFRASTRUCTURE:LEARNING ENVIRONMENT
pupil_classrm_ratio = num_students_total/num_classrms_total
# ADEQUACY OF STAFFING
pupil_tchr_ratio = num_students_total/num_tchrs_total
})
Here is the function to find dependencies (note: built off of one day's learning about meta-programming, if you notice a bug, do point it out).
## Find 'dependencies' in lhs = rhs type equations, where all names in rhs
## are dependences of lhs. Outputs in a named vector, where names are from
## lhs, and the values from rhs. Example, pupil_teacher_ratio = num_pupils /
## num_tchrs will produce (in json-represented named vector):
## {'pupil_teacher_ratio': 'num_pupils', 'pupil_teacher_ratio' : 'num_tchrs'}
find_deps <- function(x) {
if (is.atomic(x)) {
# these are evaluated values; we don't want these
character()
} else if (is.name(x)) {
# these are our unevaluated values; this is what we want
as.character(x)
} else if (is.call(x)) {
if (identical(x[[1]], quote(`=`))) {
# x looks like = lhs rhs; we recurse first on rhs (x[-1][-1])
rhs <- unlist(lapply(x[-1][-1], find_deps))
# and take the second element (x[[2]]) as lhs
lhs <- as.character(x[[2]])
if (length(rhs) == 0)
setNames(NA, lhs) # no dependencies
else setNames(rhs, rep(lhs, length(rhs)))
} else {
# for other function calls, we recurse only on everything but function name
unlist(lapply(x[-1], find_deps))
}
}
}
And finally, here is the magic. A dictionary of dependencies. The dependent indicator is listed as the key, once per indicator that it depends on (which is the value). Notice src
, which doesn't have any dependencies.
dependencies <- find_deps(line_by_line_indicators)
cat(RJSONIO::toJSON(dependencies, pretty = TRUE))
## {
## "is_primary" : "facility_type",
## "is_junior_secondary" : "facility_type",
## "pj" : "is_primary",
## "pj" : "is_junior_secondary",
## "src" : null,
## "date_of_survey" : "start",
## "improved_water_supply" : "improved_water_supply.tap",
## "improved_water_supply" : "improved_water_supply.protected_well",
## "improved_water_supply" : "improved_water_supply.rainwater",
## "improved_water_supply" : "improved_water_supply.handpump",
## "improved_water_supply" : "improved_water_supply.tap",
## "improved_water_supply" : "improved_water_supply.protected_well",
## "improved_water_supply" : "improved_water_supply.rainwater",
## "improved_water_supply" : "improved_water_supply.handpump",
## "improved_sanitation" : "improved_sanitation.vip_latrine",
## "improved_sanitation" : "improved_sanitation.pit_latrine_with_slab",
## "improved_sanitation" : "improved_sanitation.flush",
## "male_to_female_student_ratio" : "num_students_male",
## "male_to_female_student_ratio" : "num_students_female",
## "pupil_toilet_ratio_facility" : "num_students_total",
## "pupil_toilet_ratio_facility" : "num_toilets_total",
## "power_access" : "power_sources.generator",
## "power_access" : "power_sources.solar_system",
## "power_access" : "power_sources.grid",
## "pupil_classrm_ratio" : "num_students_total",
## "pupil_classrm_ratio" : "num_classrms_total",
## "pupil_tchr_ratio" : "num_students_total",
## "pupil_tchr_ratio" : "num_tchrs_total"
## }
Evaluation isn't super tricky either. Here it is, being performed on a real dataset:
e <- readRDS("~/Code/mop_up/data/in_process_data/education_mopup_outliercleaned.rds")
e2 <- within(e, eval(line_by_line_indicators))
## Some examples to show you the output was right:
sample_columns <- unique(c(na.omit(dependencies)[1:4], names(dependencies)[1:5]))
e2[1:10, sample_columns]
## facility_type is_primary is_junior_secondary pj src
## 1 junior_sec_only FALSE TRUE TRUE mopup
## 2 primary_only TRUE FALSE TRUE mopup
## 3 <NA> NA NA NA mopup
## 4 primary_only TRUE FALSE TRUE mopup
## 5 primary_only TRUE FALSE TRUE mopup
## 6 primary_only TRUE FALSE TRUE mopup
## 7 junior_sec_only FALSE TRUE TRUE mopup
## 8 primary_only TRUE FALSE TRUE mopup
## 9 primary_only TRUE FALSE TRUE mopup
## 10 primary_only TRUE FALSE TRUE mopup