Web and Blog datasets

MemeTracker data

MemeTracker is an approach for extracting short textual phrases from web documents (news articles and blog posts) and then tracking how those phrases spread over the Web and how they change and evolve as they spread.
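
To make the idea of extracting phrases and tracking their spread concrete, here is a minimal sketch in Python. It is not the MemeTracker algorithm itself: it assumes that phrases are delimited by double quotes, it groups only exact (case-insensitive) matches and so does not model how phrases change, and the sample documents, the `DOCS` list, and the `track_phrases` function are hypothetical names introduced for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample documents as (date, text) pairs.
# These are made up for illustration and are not taken from the datasets.
DOCS = [
    ("2008-09-01", 'The candidate said "we need change" at the rally.'),
    ("2008-09-02", 'Bloggers repeated "We need change" all day.'),
    ("2008-09-03", 'Critics answered with "change is not a plan" instead.'),
]

# Assumption: a phrase is any text enclosed in straight double quotes.
QUOTE_RE = re.compile(r'"([^"]+)"')


def track_phrases(docs):
    """Map each lowercased quoted phrase to the sorted dates on which it appears."""
    seen = defaultdict(list)
    for date, text in docs:
        for phrase in QUOTE_RE.findall(text):
            seen[phrase.strip().lower()].append(date)
    return {phrase: sorted(dates) for phrase, dates in seen.items()}


timeline = track_phrases(DOCS)
for phrase, dates in timeline.items():
    print(phrase, dates)
```

Each entry in `timeline` lists the dates on which a phrase was seen, which is the simplest form of a record of how a phrase spreads over time.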

MemeTracker data contains two datasets:

ICWSM 2009 Spinn3r data

A collection of raw blog posts and news media articles collected by Spinn3r and released as part of the International Conference on Weblogs and Social Media 2009.

Stanford WebBase web crawls

A collection of web crawls from the Stanford InfoLab. The crawls go back almost 10 years.