Phrase graph and clusters:


Parts of the phrase graph:

Phrase clusters identified by our DAG partitioning algorithm:


Time lag of different media sites on reporting a story:



Number of documents, words and phrases over time:


Notice the daily and weekly periodicities. Apart from these, the total volume of new articles, and of the words and phrases in those articles, is roughly constant over time.


Proof that DAG partitioning is NP-hard:



Comparison to baseline techniques:


Here we compare our meme-tracking techniques to simple baseline approaches for topic tracking and information cascade identification.

  • Top 50 most frequent words and named entities (after heavy preprocessing, stopword removal, and thresholding on document frequency).
  • Top 5 most frequent words and named entities (after heavy preprocessing, stopword removal, and thresholding on document frequency).
    Now a few patterns can be observed: Palin and McCain diminish over time, while Obama peaks on election day, November 4.
  • Top 50 most in-linked documents. Most documents receive links for short periods of time.
  • 50 topics from LDA (Latent Dirichlet Allocation). Notice that the topics broadly correspond to the elections, social media and blogging, the war in Iraq and so on. The changes in vocabulary are not significant, and thus most clusters are quite stable over time. Since we had scalability issues, we took the top 10,000 most in-linked documents and applied aggressive stopword removal (minimum word frequency is 40). To find 100 topics LDA took 2 days to run.
  • Top 50 unclustered phrases: raw quoted phrases with no clustering. Notice how no phrase gains significant volume, as its appearances are scattered across many different variations.
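
To make the word-frequency baseline concrete, here is a minimal sketch of how such a baseline could be computed: tokenize, drop stopwords, threshold on document frequency, then count the top-k words per day. This is not the code used for the plots above. The function names, the tiny stopword list, the thresholds (min_df, max_df_ratio) and the toy documents are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "is"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_k_word_series(docs, k=5, min_df=2, max_df_ratio=1.0):
    """Return {word: {day: count}} for the k most frequent surviving words.

    docs is a list of (day, text) pairs. A word survives if it occurs in at
    least min_df documents and in at most max_df_ratio of all documents.
    """
    doc_tokens = [(day, tokenize(text)) for day, text in docs]

    # Document frequency: number of documents containing each word.
    df = Counter()
    for _, tokens in doc_tokens:
        df.update(set(tokens))

    n_docs = len(doc_tokens)
    keep = {w for w, c in df.items() if min_df <= c <= max_df_ratio * n_docs}

    total = Counter()                # overall frequency of each kept word
    per_day = defaultdict(Counter)   # word -> day -> frequency
    for day, tokens in doc_tokens:
        for w in tokens:
            if w in keep:
                total[w] += 1
                per_day[w][day] += 1

    return {w: dict(per_day[w]) for w, _ in total.most_common(k)}


# Toy documents, made up for illustration only.
docs = [
    ("2008-11-03", "Obama and McCain campaign in Ohio"),
    ("2008-11-04", "Obama wins the election"),
    ("2008-11-04", "Obama election night speech"),
    ("2008-11-05", "McCain concedes the election"),
]

series = top_k_word_series(docs, k=2, min_df=2)
for word, counts in series.items():
    print(word, sorted(counts.items()))
```

Each resulting per-day series corresponds to one curve in a volume-over-time plot. With a small k only the few dominant words remain, which is why the top-5 variant shows clearer patterns than the top-50 one.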