The data contains the complete edit history (all revisions, all pages) of the English Wikipedia from its inception until January 2008.
There are two parts to the dataset:
File | Description |
---|---|
enwiki-20080103-pages-meta-history.xml.7z | Complete Wikipedia edit history (18GB!) |
Note that the file decompresses to several (>3) Terabytes of text. Use 7zip to decompress the data on the fly.
See "All revisions of Wikipedia" and "Latest complete dump" for more information about the different dumps of the Wikipedia dataset.
The data set contains processed metadata for all revisions of all articles extracted from the full Wikipedia XML dump as of 2008-01-03.
For each specified namespace, there is a bzipped file with pre-processed data and also a file with all redirects. The output data is in a tagged multi-line format (14 lines per revision, space-delimited). Each revision record contains the following lines:
For example:
Anonymous editors are listed by their IP address, e.g. ip:69.17.21.242.
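As a minimal sketch, the `ip:` prefix could be used to separate anonymous from registered editors. The function name `anonymous_ip` is hypothetical, and the sketch assumes the username field has already been extracted from a revision record; it is not part of the dataset tooling.

```python
import ipaddress
from typing import Optional


def anonymous_ip(username: str) -> Optional[str]:
    """Return the address if `username` has the form `ip:<address>`, else None.

    Hypothetical helper: assumes anonymous editors always carry the
    literal `ip:` prefix shown in the description.
    """
    prefix = "ip:"
    if not username.startswith(prefix):
        return None
    candidate = username[len(prefix):]
    try:
        # Reject strings that merely start with "ip:" but are not addresses.
        ipaddress.ip_address(candidate)
    except ValueError:
        return None
    return candidate
```

Validating the address with the standard `ipaddress` module guards against a registered username that happens to begin with `ip:`.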
The list of admins with simplified dates of adminship (disregarding demotions and reappointments of the same user) can be found at http://en.wikipedia.org/wiki/User:NoSeptember/List_of_Administrators and http://en.wikipedia.org/wiki/Wikipedia:Former_administrators
Bots can often (but neither necessarily nor exclusively) be identified by the string "bot" in the username. You can create a list of bots by using the bot status page at http://en.wikipedia.org/wiki/Wikipedia:Bots/Status
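The heuristic above can be sketched as follows. The function `looks_like_bot` and the `known_bots` parameter are hypothetical names; matching case-insensitively is an assumption, and the set of known bots would have to be compiled by hand from the bot status page.

```python
from typing import FrozenSet


def looks_like_bot(username: str, known_bots: FrozenSet[str] = frozenset()) -> bool:
    """Guess whether `username` belongs to a bot.

    Checks an optional, manually compiled set of known bot names first,
    then falls back to the substring heuristic. The substring test is
    neither necessary nor sufficient, so expect false positives and
    false negatives.
    """
    if username in known_bots:
        return True
    # Assumption: compare case-insensitively so "SmackBot" and "Cydebot" both match.
    return "bot" in username.lower()
```

Note that a human editor named "Abbott" would also match the substring test, which is why a curated list is preferable where accuracy matters.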
Wikipedia editors sometimes change their user names, which may lead to misattribution of edits (name changes do not seem to be applied retroactively to previously generated content). This issue may be especially important for prolific contributors. To handle name changes properly, use the logs at http://en.wikipedia.org/wiki/Special:Log/renameuser and/or http://en.wikipedia.org/wiki/Wikipedia:Changing_username
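One way to apply such rename information is to map every username to its most recent name before aggregating edits. This is only a sketch: the `renames` mapping (old name to new name) is a hypothetical structure that would have to be built from the rename logs, and `canonical_name` is an illustrative helper.

```python
from typing import Dict


def canonical_name(username: str, renames: Dict[str, str]) -> str:
    """Follow a chain of renames (old -> new) to the most recent name.

    `renames` is a hypothetical mapping compiled from the rename logs.
    A set of visited names guards against accidental cycles in the data.
    """
    seen = {username}
    current = username
    while current in renames:
        current = renames[current]
        if current in seen:
            # Cycle detected; stop rather than loop forever.
            break
        seen.add(current)
    return current
```

For example, with `renames = {"OldName": "MidName", "MidName": "NewName"}`, edits by `OldName` and `MidName` would both be attributed to `NewName`.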
The data and the description were prepared by Gueorgi Kossinets.
File | Description |
---|---|
enwiki-20080103.main.bz2 | Revisions in the main namespace (the Wikipedia articles) (8GB!) |
enwiki-20080103.talk.bz2 | Talk namespace (edits of discussion pages attached to each Wikipedia article) (<1GB) |
enwiki-20080103.user.bz2 | Revisions of user personal pages (<1GB) |
enwiki-20080103.user_talk.bz2 | Revisions of user talk pages (<1GB) |
enwiki-20080103.wikipedia.bz2 | Wikipedia Wiki namespace (administrative procedures and pages) (3GB) |
enwiki-20080103.wikipedia_talk.bz2 | Wikipedia Wiki namespace talk pages (<1GB) |
To examine a part of the data file, use bzcat and pipe its output to a combination of head, tail, grep, awk, sed, and so on. For example, the command
$ bzcat enwiki-20080103.talk.bz2 | head -n 1414 | tail -n 14
will print lines 1401 through 1414 from the Talk namespace data file.
Similarly,
$ 7z x -so enwiki-20080103-pages-meta-history.xml.7z | head -n 1414 | tail -n 14
will print lines 1401 through 1414 from the pages-meta-history file.
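The same kind of streaming inspection can be done from Python without decompressing the whole file to disk. The sketch below groups the bzipped metadata into fixed-size records, relying on the 14-lines-per-revision layout stated earlier. The function name `iter_records` is hypothetical, and UTF-8 text is an assumption, so undecodable bytes are replaced rather than raising an error.

```python
import bz2
from itertools import islice
from typing import Iterator, List

LINES_PER_RECORD = 14  # per the format description: 14 lines per revision


def iter_records(path: str, lines_per_record: int = LINES_PER_RECORD) -> Iterator[List[str]]:
    """Lazily yield consecutive groups of lines from a bzip2-compressed text file.

    The file is decompressed as a stream, so memory use stays small even
    for multi-gigabyte inputs. A trailing group shorter than
    `lines_per_record` is still yielded.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        while True:
            record = [line.rstrip("\n") for line in islice(handle, lines_per_record)]
            if not record:
                return
            yield record
```

Under these assumptions, `next(islice(iter_records("enwiki-20080103.talk.bz2"), 100, None))` would return the same lines 1401 through 1414 as the `bzcat` command above, since 100 complete records precede them.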