facf5240152509c55e7e by mimno

Burstiness experiment

These files implement an experiment based on Madsen et al., “Modeling Word Burstiness Using the Dirichlet Distribution”. They observe that the probability of seeing many occurrences of a rare word in a single document is much larger in real natural-language text than we would expect if words were sampled i.i.d. from a multinomial distribution. The multinomial model is good at predicting how many times a low-frequency word like “camelid” will appear in a corpus overall, but it assigns far too little probability to the event that all instances of “camelid” occur in a single document.
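To get a feel for how small that probability is, here is a minimal sketch that is not part of the original scripts. It assumes a hypothetical corpus of num_docs documents that all have the same length doc_len, and a word that occurs k times overall. If tokens are i.i.d., then given the total count k, the word's positions form a uniformly random k-subset of all token slots, so the chance that they all land in one document can be computed exactly. The function name and the example numbers are illustrative only.

```python
from math import comb


def prob_all_in_one_doc(num_docs, doc_len, k):
    """P(all k occurrences of a word fall in the same document) under i.i.d. tokens.

    Assumes (for illustration) num_docs documents of equal length doc_len.
    Given k occurrences in total, their positions are a uniformly random
    k-subset of the num_docs * doc_len token slots.
    """
    total_slots = num_docs * doc_len
    # Ways to place all k occurrences inside one document, summed over
    # documents, divided by all ways to place k occurrences anywhere.
    return num_docs * comb(doc_len, k) / comb(total_slots, k)


# Hypothetical numbers: 100 documents of 500 tokens, a word seen 5 times.
p = prob_all_in_one_doc(num_docs=100, doc_len=500, k=5)
print(f"P(all 5 occurrences in one document) = {p:.3e}")
```

The result is bounded above by num_docs ** (1 - k), which is 1e-8 for these numbers, so under the multinomial model such a concentration of a rare word should almost never be observed.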

We test this by counting how often the event “word w occurs N times in a document” happens, first in a real corpus and then in a synthetic corpus whose words are sampled i.i.d. from the overall word proportions of the real corpus.
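The counting and resampling steps can be sketched as follows. This is an illustrative sketch, not the code in bursty.py or newbursty.py: the function names, the toy corpus, and the choice to keep each synthetic document the same length as its real counterpart are all assumptions made for the example.

```python
import random
from collections import Counter


def count_events(docs):
    """Map N -> number of (word, document) pairs where the word occurs N times."""
    events = Counter()
    for tokens in docs:
        for n in Counter(tokens).values():
            events[n] += 1
    return events


def resample(docs, rng):
    """Build a synthetic corpus by drawing every token i.i.d. from the
    corpus-wide word proportions (assumption: document lengths are kept)."""
    all_tokens = [t for tokens in docs for t in tokens]
    # Uniform sampling with replacement from the pooled tokens is the same
    # as sampling from the empirical word proportions.
    return [rng.choices(all_tokens, k=len(tokens)) for tokens in docs]


# Toy corpus (hypothetical): the rare word is concentrated in one document.
docs = [
    ["camelid"] * 4 + ["the"] * 6,
    ["the"] * 10,
    ["the"] * 10,
]

real = count_events(docs)
fake = count_events(resample(docs, random.Random(0)))

print("real:", sorted(real.items()))
print("fake:", sorted(fake.items()))
```

Comparing the two counters at larger N shows where the real corpus has more repeated occurrences of a word within a document than the i.i.d. model produces.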

There are two versions: one that creates count files in Python and then visualizes them in R, and another by Xanda Schofield and Adith Swaminathan that does everything in Python using matplotlib.

Usage for the Python+R version:

python bursty.py stories.txt > real.frequencies
python bursty.py stories.txt resample > fake.frequencies
R -f bursty.R

Usage for the Python-only version:

python newbursty.py stories.txt

stories.txt is a collection of translations of Danish folktales, used with the permission of the author.
