block by nitaku 8579a28a78ddd3391d6b

Word cloud intersection

This experiment tries to define and show the concepts of intersection and difference between two word clouds.

Two distinct documents, A (about infovis and datavis) and B (about Human-Computer Interaction), are each used to compute word frequencies. Each word the two documents have in common goes into the intersection set, weighted with the minimum of its two original weights (reminiscent of a fuzzy intersection operation). The difference sets are built from what is left: for each common word, the document in which it had the greater weight keeps the word, with the intersection weight subtracted from its original weight, and each word found in only one document keeps its original weight. In pseudocode:

for each word in common between A and B
  let a_w be the weight of the word in A
  let b_w be the weight of the word in B
  put the word into the intersection set with weight = min(a_w, b_w)
  if a_w - min(a_w, b_w) > 0
    put the word into the A \ B set with weight = a_w - min(a_w, b_w)
  else if b_w - min(a_w, b_w) > 0
    put the word into the B \ A set with weight = b_w - min(a_w, b_w)

for each remaining word in A
  put the word into the A \ B set with weight = a_w (original weight)

for each remaining word in B
  put the word into the B \ A set with weight = b_w (original weight)
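
The pseudocode above can be sketched in JavaScript as follows. This is a minimal sketch, not the code in index.js: the function name splitWordClouds and the use of Map objects (word to weight) are assumptions made for illustration.

```javascript
// Hypothetical helper: a and b are Maps from word to weight.
// Returns the intersection set and the two difference sets as Maps.
function splitWordClouds(a, b) {
  const intersection = new Map();
  const aMinusB = new Map();
  const bMinusA = new Map();

  for (const [word, aW] of a) {
    if (b.has(word)) {
      // Word in common: the intersection gets the smaller weight.
      const bW = b.get(word);
      const shared = Math.min(aW, bW);
      intersection.set(word, shared);
      // Whichever side had the greater weight keeps the remainder.
      if (aW - shared > 0) {
        aMinusB.set(word, aW - shared);
      } else if (bW - shared > 0) {
        bMinusA.set(word, bW - shared);
      }
    } else {
      // Word found only in A keeps its original weight.
      aMinusB.set(word, aW);
    }
  }

  for (const [word, bW] of b) {
    // Word found only in B keeps its original weight.
    if (!a.has(word)) bMinusA.set(word, bW);
  }

  return { intersection, aMinusB, bMinusA };
}

// Usage with made-up weights:
const docA = new Map([['data', 5], ['design', 2], ['user', 1]]);
const docB = new Map([['user', 4], ['design', 2], ['computer', 3]]);
const clouds = splitWordClouds(docA, docB);
```

With these made-up weights, "design" has the same weight in both documents, so it appears only in the intersection set, while "user" is split between the intersection set and B \ A.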

The three resulting “clouds” show that some words can be interpreted as more specific to the infovis-datavis world (e.g. data, visualization, graphic, arts), while others are used more in HCI (e.g. user, computer, system). There’s also a fair amount of intersection between the two (e.g. information, design, research, field). Please note that the size of the intersection set is heavily influenced by whether stopwords are removed.
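
To see why stopwords matter, here is a rough sketch of the word-counting step with an optional stopword filter. It is only an illustration: it uses a plain regular expression instead of the nlp_compromise library, and the function name wordFrequencies and its parameters are assumptions.

```javascript
// Hypothetical helper: counts lowercase ASCII words in a text,
// skipping any word contained in the stopwords Set.
function wordFrequencies(text, stopwords = new Set()) {
  const counts = new Map();
  // Naive tokenization: runs of letters only (an assumption,
  // not what nlp_compromise does).
  const tokens = text.toLowerCase().match(/[a-z]+/g) || [];
  for (const token of tokens) {
    if (stopwords.has(token)) continue;
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return counts;
}

// Usage: the same sentence with and without stopword removal.
const sentence = 'The user and the system';
const withStopwords = wordFrequencies(sentence);
const withoutStopwords = wordFrequencies(sentence, new Set(['the', 'and']));
```

Without filtering, words such as "the" and "and" occur in nearly every English document, so they would end up in the intersection set with large weights.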

While not completely satisfactory in its meaningfulness (as is often the case with word clouds), this experiment could lead to an interesting research path. Many different formulas can be used to define a concept of intersection (one that normalizes by document length, or one inspired by TF-IDF, could be interesting), and many choices are available for representing the results (a round, Venn-like layout could be nice).
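
As one possible reading of "normalizing by document length", each raw count could be divided by the total number of counted words before the minimum rule is applied. The sketch below shows only that normalization step. It is not part of the original experiment, and the name normalizeWeights is an assumption.

```javascript
// Hypothetical helper: turns raw counts (Map from word to count)
// into relative frequencies that sum to 1 for a non-empty Map.
function normalizeWeights(counts) {
  let total = 0;
  for (const count of counts.values()) total += count;

  const normalized = new Map();
  for (const [word, count] of counts) {
    // Relative frequency: share of all counted words in the document.
    normalized.set(word, count / total);
  }
  return normalized;
}

// Usage with made-up counts:
const rawCounts = new Map([['data', 1], ['design', 3]]);
const relative = normalizeWeights(rawCounts);
```

With relative frequencies, a long document no longer dominates a short one simply because it contains more words, although the resulting weights would need rescaling before being drawn as font sizes.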

This example uses a treemap word cloud layout, the useful nlp_compromise JavaScript library, and a precomputed list of English stopwords.

index.js

index.html

english_stopwords.txt

english_stopwords_long.txt

index.coffee

index.css