This is a web graph of Wikipedia hyperlinks collected in September 2011. The network was constructed by first taking the largest strongly connected component of Wikipedia, then restricting to pages in the top set of categories (those with at least 100 pages), and finally taking the largest strongly connected component of the restricted graph.
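The three construction steps can be sketched on a toy graph. This is a minimal illustration, not the code used to build the dataset: the helper names (`strongly_connected_components`, `largest_scc`, `build_topcat_graph`) and the `min_pages` parameter are hypothetical, and counting category sizes over all listed pages (rather than only pages inside the first component) is an assumption.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Return a list of SCCs (each a set of nodes) via iterative Kosaraju."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record nodes in order of DFS finishing time on the forward graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finishing time.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def largest_scc(nodes, edges):
    """Restrict the graph to `nodes`, then keep its largest SCC."""
    nodes = set(nodes)
    edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    core = max(strongly_connected_components(nodes, edges), key=len, default=set())
    return core, [(u, v) for u, v in edges if u in core and v in core]


def build_topcat_graph(nodes, edges, categories, min_pages=100):
    # Step 1: largest SCC of the full graph.
    core_nodes, core_edges = largest_scc(nodes, edges)
    # Step 2: keep only pages that belong to a sufficiently large category.
    top = [set(pages) for pages in categories.values() if len(pages) >= min_pages]
    kept = set().union(*top) & core_nodes
    # Step 3: largest SCC of the restricted graph.
    return largest_scc(kept, core_edges)


# Toy input: page 5 only links into the core, page 3 is in a small category.
toy_nodes = [1, 2, 3, 4, 5]
toy_edges = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 2), (5, 1)]
toy_categories = {"A": [1, 2, 4, 5], "B": [3]}
final_nodes, final_edges = build_topcat_graph(
    toy_nodes, toy_edges, toy_categories, min_pages=2
)
```

In the toy run, page 5 is dropped in the first step, page 3 in the second, and page 4 in the third because its only path back into the component went through page 3. This shows why the second strongly connected component pass is needed after the category restriction.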
In addition to the graph, we provide the page name and the categories of each article. The categories can serve as "ground-truth" communities. They overlap, since an article may be classified into several categories.
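Because the categories overlap, it is often convenient to invert them into a map from each article to all of its categories. The sketch below assumes each line of the categories file has the form `Category:Name; id id id ...`; that layout is an assumption not stated here, and the sample lines and helper names are made up for illustration.

```python
from collections import defaultdict


def parse_category_line(line):
    """Split 'Category:Name; id id id' into (name, [ids]). Format is assumed."""
    name, _, ids = line.partition(";")
    return name.strip(), [int(token) for token in ids.split()]


def invert_memberships(lines):
    """Map each article id to the set of categories that list it."""
    cats_of = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, ids = parse_category_line(line)
        for article in ids:
            cats_of[article].add(name)
    return cats_of


# Hypothetical sample lines, not taken from the real file.
sample = [
    "Category:Example_A; 0 1 2",
    "Category:Example_B; 2 3",
]
cats_of = invert_memberships(sample)

# Articles that sit in more than one ground-truth community.
overlapping = {article for article, cats in cats_of.items() if len(cats) > 1}
```

Here article 2 appears in both sample categories, so any community-detection evaluation against these labels has to allow a node to belong to more than one community.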
Dataset statistics | |
---|---|
Nodes | 1791489 |
Edges | 28511807 |
Nodes in largest WCC | 1791489 (1.000) |
Edges in largest WCC | 28511807 (1.000) |
Nodes in largest SCC | 1791489 (1.000) |
Edges in largest SCC | 28511807 (1.000) |
Average clustering coefficient | 0.2746 |
Number of triangles | 52106893 |
Fraction of closed triangles | 0.00165 |
Diameter (longest shortest path) | 9 |
90-percentile effective diameter | 3.8 |
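To make two of the statistics above concrete, the following sketch computes an average clustering coefficient and a triangle count on a toy graph, and derives the mean out-degree implied by the node and edge counts in the table. Treating edges as undirected for the clustering measures is an assumption about how such values are usually computed, and the function names are hypothetical.

```python
from itertools import combinations


def undirected_adjacency(edges):
    """Build neighbour sets, ignoring edge direction and self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_stats(edges):
    """Return (average local clustering coefficient, number of triangles)."""
    adj = undirected_adjacency(edges)
    if not adj:
        return 0.0, 0
    local, closed_pairs = [], 0
    for node, nbrs in adj.items():
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        pairs = len(nbrs) * (len(nbrs) - 1) // 2
        local.append(links / pairs if pairs else 0.0)
        closed_pairs += links
    # Each triangle is counted once at each of its three corners.
    return sum(local) / len(local), closed_pairs // 3


# Toy graph: one triangle (0, 1, 2) with a pendant node 3 attached to 2.
toy = [(0, 1), (1, 2), (0, 2), (2, 3)]
avg_clustering, triangles = clustering_stats(toy)

# Mean out-degree implied by the table: edges divided by nodes.
mean_out_degree = 28511807 / 1791489
```

On the real graph this brute-force neighbour-pair loop would be slow for high-degree pages; it is meant only to pin down the definitions.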

File | Description |
---|---|
wiki-topcats.txt.gz | Hyperlink network of Wikipedia |
wiki-topcats-categories.txt.gz | Which articles are in which of the top categories |
wiki-topcats-page-names.txt.gz | Names of the articles |
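The files can be read without decompressing them to disk first. The sketch below assumes the edge list holds one whitespace-separated `source target` pair of integer ids per line (with optional `#` comment lines) and that the page-names file holds `id title` per line; both layouts are assumptions, and `iter_edges` and `load_page_names` are hypothetical helper names.

```python
import gzip


def iter_edges(path):
    """Yield (source, target) integer pairs from a gzipped edge list."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comment lines
            src, dst = line.split()[:2]
            yield int(src), int(dst)


def load_page_names(path):
    """Return {node id: page title} from a gzipped 'id title' file."""
    names = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            node_id, _, title = line.rstrip("\r\n").partition(" ")
            if node_id:
                names[int(node_id)] = title
    return names
```

Streaming the edges with a generator keeps memory use low, which matters for a graph with tens of millions of edges. For example, `sum(1 for _ in iter_edges("wiki-topcats.txt.gz"))` would count the edges without holding them all in memory.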