We'll check whether the submission counts of the top 100 formhub users follow a power-law distribution or a log-normal distribution, using guidance from a blog post by CMU statistics professor Cosma Rohilla Shalizi.
First, load the dataset; I have included it here in case you want to reproduce the analysis. Each value is the number of submissions on formhub for one of the top 100 users. The three source() calls load the pli-R fitting scripts; adjust the path to wherever you unpacked them.
source("~/Downloads/pli-R-v0.0.3-2007-07-25/pareto.R")
source("~/Downloads/pli-R-v0.0.3-2007-07-25/lnorm.R")
source("~/Downloads/pli-R-v0.0.3-2007-07-25/power-law-test.R")
users <- c(220346L, 31099L, 28568L, 16573L, 14862L, 7531L, 6510L, 6138L, 4609L,
4435L, 4279L, 3899L, 3825L, 3532L, 3069L, 2790L, 2770L, 2632L, 2622L, 2589L,
2393L, 2371L, 2343L, 2251L, 2239L, 2053L, 1981L, 1963L, 1910L, 1839L, 1680L,
1661L, 1579L, 1438L, 1416L, 1387L, 1271L, 1250L, 1238L, 1044L, 985L, 906L,
904L, 837L, 778L, 765L, 636L, 628L, 623L, 601L, 558L, 521L, 474L, 472L,
441L, 434L, 433L, 431L, 428L, 398L, 395L, 381L, 380L, 364L, 343L, 320L,
301L, 300L, 292L, 284L, 275L, 274L, 269L, 268L, 265L, 256L, 246L, 228L,
219L, 213L, 207L, 205L, 201L, 198L, 184L, 182L, 175L, 173L, 168L, 164L,
156L, 148L, 147L, 146L, 142L, 136L, 134L, 130L, 130L, 127L)
Then make a quick log-log plot of the empirical upper CDF. A power law would appear as a straight line on these axes, so: is it roughly linear?
plot.eucdf.loglog(users, xlab = "Submissions", ylab = "Cumulative probability")
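If you don't have the pli-R scripts handy, a similar plot can be sketched in base R. This is my approximation of what such a plot shows, not the actual implementation of plot.eucdf.loglog; the vector `x` is just a subset of the counts above, used so the snippet runs on its own.

```r
# Subset of the submission counts above, for illustration only.
x <- c(220346, 31099, 16573, 7531, 4609, 2790, 1680, 985, 474, 268, 127)

# Empirical upper CDF: fraction of observations >= each sorted value.
x.sorted <- sort(x)
upper.cdf <- rev(seq_along(x.sorted)) / length(x.sorted)

# On log-log axes, a power law shows up as a roughly straight line.
plot(x.sorted, upper.cdf, log = "xy", type = "s",
     xlab = "Submissions", ylab = "Cumulative probability")
```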
Close enough. Let's fit Pareto and log-normal distributions to the dataset, and plot each fitted curve on top of the empirical upper CDF.
users.pareto <- pareto.fit(users, threshold = 3)
users.lnorm <- lnorm.fit(users)
plot.eucdf.loglog(users, xlab = "Submissions", ylab = "Cumulative probability")
curve(ppareto(x, exponent = users.pareto$exponent, threshold = 3, lower.tail = FALSE),
      add = TRUE, col = "red")
# Log-normal upper tail, rescaled to condition on x > 3 (same threshold as the Pareto fit).
lnorm.tail <- function(x) plnorm(x, meanlog = users.lnorm$meanlog,
                                 sdlog = users.lnorm$sdlog, lower.tail = FALSE)
curve(lnorm.tail(x) / lnorm.tail(3), add = TRUE, col = "blue")
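Eyeballing the curves can be backed up with a number. The sketch below is my own base-R stand-in, not what the pli-R scripts do: it fits both models above a common cutoff `xmin` (assumed here to be the smallest observation, unlike the threshold of 3 used above) and compares maximised log-likelihoods. The data are a subset of the counts above; swap in the full `users` vector for a real comparison.

```r
# Subset of the submission counts above, for illustration only.
x <- c(220346, 31099, 16573, 7531, 4609, 2790, 1680, 985, 474, 268, 127)
xmin <- min(x)  # assumption: lower cutoff at the smallest observation
n <- length(x)

# Pareto MLE (closed form) for density (a - 1) / xmin * (x / xmin)^(-a).
a.hat <- 1 + n / sum(log(x / xmin))
ll.pareto <- sum(log((a.hat - 1) / xmin) - a.hat * log(x / xmin))

# Log-normal truncated below at xmin, fitted numerically.
nll.lnorm <- function(p) {
  meanlog <- p[1]
  sdlog <- exp(p[2])  # optimise log(sdlog) so sdlog stays positive
  -sum(dlnorm(x, meanlog, sdlog, log = TRUE) -
       plnorm(xmin, meanlog, sdlog, lower.tail = FALSE, log.p = TRUE))
}
fit <- optim(c(mean(log(x)), log(sd(log(x)))), nll.lnorm)
ll.lnorm <- -fit$value

# A positive difference favours the log-normal on this sample.
c(pareto = ll.pareto, lnorm = ll.lnorm, difference = ll.lnorm - ll.pareto)
```

A raw log-likelihood difference is not a significance test, and the truncated log-normal can mimic a power law closely, so treat the result as a rough indication only.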
This looks like a “textbook” log-normal fit to me. One more observation: about five of the top users look like “outliers” in this data, which makes sense if you know the users.
For a little more insight, I plot below a typical “scenario” according to the model: a histogram of the number of submissions that 100 simulated (“random”) top-100 formhub users would make. The draws are random, so the picture changes a little on every run.
library(ggplot2)  # provides qplot()
qplot(rlnorm(100, meanlog = users.lnorm$meanlog, sdlog = users.lnorm$sdlog)) +
    xlab(paste("# of submissions for modeled top-100 formhub users.\nLognormal with meanlog",
               users.lnorm$meanlog, "sdlog", users.lnorm$sdlog))
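To make a simulated scenario repeatable and to compare it with the data numerically, something like the following base-R sketch could be used. It assumes the ordinary maximum-likelihood log-normal fit (mean and 1/n standard deviation of the logged values), which may or may not match lnorm.fit exactly; the seed is arbitrary, and `x` is again just a subset of the counts above.

```r
# Subset of the submission counts above, for illustration only.
x <- c(220346, 31099, 16573, 7531, 4609, 2790, 1680, 985, 474, 268, 127)

# Log-normal MLE: mean and (1/n) standard deviation of log(x).
meanlog.hat <- mean(log(x))
sdlog.hat <- sqrt(mean((log(x) - meanlog.hat)^2))

set.seed(42)  # arbitrary seed so the simulated scenario is repeatable
simulated <- rlnorm(100, meanlog = meanlog.hat, sdlog = sdlog.hat)

# Compare a few quantiles of one simulated scenario with the observed values.
probs <- c(0.1, 0.5, 0.9)
rbind(observed = quantile(x, probs), simulated = quantile(simulated, probs))
```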