We'll check whether the submission counts of the top 100 formhub users follow a power-law distribution or a log-normal distribution, using guidance from a blog post by CMU statistics professor Cosma Rohilla Shalizi.

First, load the fitting code and the dataset; I have included the data here in case you want to reproduce the analysis. Each value is the number of submissions on formhub by one of the top 100 users.

source("~/Downloads/pli-R-v0.0.3-2007-07-25/pareto.R")
source("~/Downloads/pli-R-v0.0.3-2007-07-25/lnorm.R")
source("~/Downloads/pli-R-v0.0.3-2007-07-25/power-law-test.R")

users <- c(220346L, 31099L, 28568L, 16573L, 14862L, 7531L, 6510L, 6138L, 4609L, 
    4435L, 4279L, 3899L, 3825L, 3532L, 3069L, 2790L, 2770L, 2632L, 2622L, 2589L, 
    2393L, 2371L, 2343L, 2251L, 2239L, 2053L, 1981L, 1963L, 1910L, 1839L, 1680L, 
    1661L, 1579L, 1438L, 1416L, 1387L, 1271L, 1250L, 1238L, 1044L, 985L, 906L, 
    904L, 837L, 778L, 765L, 636L, 628L, 623L, 601L, 558L, 521L, 474L, 472L, 
    441L, 434L, 433L, 431L, 428L, 398L, 395L, 381L, 380L, 364L, 343L, 320L, 
    301L, 300L, 292L, 284L, 275L, 274L, 269L, 268L, 265L, 256L, 246L, 228L, 
    219L, 213L, 207L, 205L, 201L, 198L, 184L, 182L, 175L, 173L, 168L, 164L, 
    156L, 148L, 147L, 146L, 142L, 136L, 134L, 130L, 130L, 127L)

Then, make a quick plot of the empirical upper CDF on log-log axes, where a power law would show up as a straight line… is it roughly linear?

plot.eucdf.loglog(users, xlab = "Submissions", ylab = "Cumulative probability")

[Plot: empirical upper CDF of submissions on log-log axes]
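To make the "roughly linear" check concrete, here is a minimal base-R sketch of an empirical upper CDF drawn on log-log axes. It is not the implementation behind `plot.eucdf.loglog`: the helper name `upper_ecdf` is hypothetical, and the data are simulated from a Pareto distribution with arbitrary parameters. The point is only that a Pareto survival function, P(X >= x) = (x / xmin)^(-a), is a straight line with slope -a on these axes.

```r
# Hypothetical helper (not part of the sourced code): for each sorted value,
# estimate P(X >= value) as the fraction of observations at or above it.
upper_ecdf <- function(x) {
  xs <- sort(x)
  n <- length(xs)
  data.frame(x = xs, p = (n - rank(xs, ties.method = "min") + 1) / n)
}

set.seed(1)
xmin <- 100   # illustrative threshold
a <- 1.5      # illustrative exponent of the survival function

# Inverse-CDF sampling from a Pareto with survival (x / xmin)^(-a)
x <- xmin * runif(1000)^(-1 / a)

s <- upper_ecdf(x)
plot(s$x, s$p, log = "xy", xlab = "Value", ylab = "P(X >= x)")

# The theoretical survival function is a straight line on log-log axes
lines(s$x, (s$x / xmin)^(-a), col = "red")
```

If the points bend away from a straight line, especially in the body of the distribution, that is a first hint that a pure power law may not be the right model.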

Close enough. Let's fit Pareto and log-normal distributions to the dataset, and plot the fitted curves on top of the empirical CDF.

users.pareto <- pareto.fit(users, threshold = 3)
users.lnorm <- lnorm.fit(users)

plot.eucdf.loglog(users, xlab = "Submissions", ylab = "Cumulative probability")
curve(ppareto(x, exponent = users.pareto$exponent, threshold = 3, lower.tail = FALSE), 
    add = TRUE, col = "red")
curve(plnorm(x, meanlog = users.lnorm$meanlog, sdlog = users.lnorm$sdlog, lower.tail = FALSE)/plnorm(3, 
    meanlog = users.lnorm$meanlog, sdlog = users.lnorm$sdlog, lower.tail = FALSE), 
    add = TRUE, col = "blue")

[Plot: empirical upper CDF with the Pareto fit in red and the log-normal fit in blue]
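The log-normal curve in the code above is divided by the log-normal survival function evaluated at the threshold. That rescaling gives the survival function conditional on exceeding the threshold, P(X > q | X > thr) = S(q) / S(thr), so that it starts at 1 at the threshold just like the Pareto curve. A small simulation sketch of that identity, using made-up parameter values rather than the fitted ones:

```r
set.seed(42)
meanlog <- 1   # illustrative values only, not the fitted parameters
sdlog <- 1
thr <- 3       # threshold, as in the fit above

x <- rlnorm(1e5, meanlog = meanlog, sdlog = sdlog)
tail_x <- x[x > thr]   # keep only draws that exceed the threshold

# Conditional survival P(X > q | X > thr) = S(q) / S(thr)
cond_surv <- function(q) {
  plnorm(q, meanlog = meanlog, sdlog = sdlog, lower.tail = FALSE) /
    plnorm(thr, meanlog = meanlog, sdlog = sdlog, lower.tail = FALSE)
}

q <- c(4, 8, 16)
empirical <- sapply(q, function(v) mean(tail_x > v))
theoretical <- cond_surv(q)

# Side-by-side comparison of simulated and theoretical tail probabilities
round(cbind(q, empirical, theoretical), 3)
```

The two columns should agree up to simulation noise.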

Looks like a “textbook” log-normal fit to me. One more observation: about five of the top users look like “outliers” in this data, which makes sense if you know the users.
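One way to back up an eyeball judgement like this is to compare the maximized log-likelihoods of the two models. The sketch below does that in base R on simulated log-normal data, using the standard closed-form maximum-likelihood estimators. It makes several assumptions that differ from the analysis above: the data and parameter values are made up, the Pareto threshold is fixed at the sample minimum rather than at 3, and none of the variable names come from the sourced fitting code.

```r
set.seed(7)
# Simulated stand-in for the submission counts (illustrative parameters only)
x <- rlnorm(100, meanlog = 6.5, sdlog = 1.5)

# Log-normal MLE: mean and (1/n) standard deviation of log(x)
lx <- log(x)
meanlog_hat <- mean(lx)
sdlog_hat <- sqrt(mean((lx - meanlog_hat)^2))

# Continuous Pareto MLE for the density exponent, with the threshold
# fixed at the sample minimum (an assumption of this sketch)
xmin <- min(x)
alpha_hat <- 1 + length(x) / sum(log(x / xmin))

# Pointwise log-densities under each fitted model.
# Pareto density: (alpha - 1) / xmin * (x / xmin)^(-alpha) for x >= xmin
ll_lnorm <- dlnorm(x, meanlog = meanlog_hat, sdlog = sdlog_hat, log = TRUE)
ll_pareto <- log(alpha_hat - 1) - log(xmin) - alpha_hat * log(x / xmin)

# Total log-likelihoods; the larger value indicates the better-fitting model
c(lognormal = sum(ll_lnorm), pareto = sum(ll_pareto))
```

A higher total for the log-normal would support the visual impression. A formal comparison would also need a test of whether the difference is larger than chance alone would produce.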

For a little more insight, here is a typical “scenario” under the fitted model: the number of submissions from 100 simulated (“random”) top-100 formhub users would look like the graph below:

library(ggplot2)  # provides qplot()
qplot(rlnorm(100, meanlog = users.lnorm$meanlog, sdlog = users.lnorm$sdlog)) + xlab(paste("# of submissions for modeled top-100 formhub users.\nLognormal with meanlog ", 
    users.lnorm$meanlog, " sdlog ", users.lnorm$sdlog))

[Plot: histogram of 100 simulated submission counts drawn from the fitted log-normal]