block by timelyportfolio a6916e88484f962d8f95

R | Norvig's English Letter ngrams with sunburstR (d3.js sequence sunburst htmlwidget)

In an attempt to find something useful to plug into my new htmlwidget sunburstR (see post), I rediscovered this insightful article by Peter Norvig.

English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU

The content is great, but even better, he has published the ngram data in Google Fusion Tables. As a simple example, let’s use sunburstR to look at the 2-letter ngrams that occur at the start of a word.

# devtools::install_github("timelyportfolio/sunburstR")

#  use sunburst to analyze ngram data from Peter Norvig
#    http://norvig.com/mayzner.html

library(sunburstR)
library(pipeR)

#  read the csv data downloaded from the Google Fusion Table linked in the article
ngrams2 <- read.csv("./inst/examples/ngrams2.csv", stringsAsFactors = FALSE)

ngrams2 %>>%
  #  let's look at ngrams at the start of a word, so columns 1 and 3
  (.[,c(1,3)]) %>>%
  #  split the ngrams into a sequence by splitting each letter and adding -
  (
    data.frame(
      sequence = strsplit(.[,1],"") %>>%
        lapply( function(ng){ paste0(ng,collapse = "-") } ) %>>%
        unlist
      ,freq = .[,2]
      ,stringsAsFactors = FALSE
    )
  ) %>>%
  sunburst %>>%
  htmlwidgets::saveWidget("example_ngrams.html")
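
To make the reshaping step easier to follow, here is a minimal base R sketch of the same idea on a toy data frame. The column names and counts are made up for illustration (they are not Norvig's figures), and `to_sequence` is a hypothetical helper name, not part of sunburstR.

```r
# Hypothetical counts, made up purely for illustration (not Norvig's data)
toy <- data.frame(
  ngram = c("TH", "HE", "IN"),
  start = c(100, 40, 60),
  stringsAsFactors = FALSE
)

# Split each ngram into single letters and rejoin them with "-",
# which is the separator used in the pipeline above
to_sequence <- function(ngrams) {
  vapply(strsplit(ngrams, ""), paste0, character(1), collapse = "-")
}

# First column is the sequence, second column is the count
seq_df <- data.frame(
  sequence = to_sequence(toy$ngram),
  freq = toy$start,
  stringsAsFactors = FALSE
)

print(seq_df)
```

The resulting two-column data frame has the same shape as the one piped into `sunburst` above, so the helper should work unchanged on longer ngrams as well.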

live example

Thanks


If you want to look at the 2-letter ngrams for all letter positions, this is all it takes:

library(htmltools)

ngrams2 %>>%
  (
    lapply(
      seq.int(3,ncol(.))
      ,function(letpos){
        (.[,c(1,letpos)]) %>>%
          #  split the ngrams into a sequence by splitting each letter and adding -
          (
            data.frame(
              sequence = strsplit(.[,1],"") %>>%
                lapply( function(ng){ paste0(ng,collapse = "-") } ) %>>%
                unlist
              ,freq = .[,2]
              ,stringsAsFactors = FALSE
            )
          ) %>>%
          sunburst
      }
    )
  ) %>>%
  tagList %>>%
  browsable
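
One thing the loop above does not keep is which chart belongs to which position column. A possible way around that, sketched here in base R on a toy stand-in for `ngrams2`, is to key each reshaped data frame by its column name. The column names and values below are assumptions for illustration only; the real csv will have its own headers.

```r
# Hypothetical stand-in for ngrams2: column 1 holds the ngram, column 2 is
# assumed to be an overall count, and columns 3 onward hold per-position
# counts. All names and values are made up.
toy <- data.frame(
  ngram = c("TH", "HE"),
  total = c(30, 20),
  pos_a = c(10, 5),
  pos_b = c(20, 15),
  stringsAsFactors = FALSE
)

# Names of the position columns (column 3 to the last column)
pos_cols <- names(toy)[seq.int(3, ncol(toy))]

# Build one sequence/freq data frame per position column, keyed by column name
frames <- lapply(
  setNames(pos_cols, pos_cols),
  function(col) {
    data.frame(
      sequence = vapply(strsplit(toy$ngram, ""), paste0, character(1), collapse = "-"),
      freq = toy[[col]],
      stringsAsFactors = FALSE
    )
  }
)

str(frames)
```

Each element of `frames` could then be passed to `sunburst`, with `names(frames)` supplying a heading (for example via `htmltools::tags$h3`) above each chart.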