Web data: Amazon reviews

Dataset information

This dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. Note: this dataset contains potential duplicates, because Amazon merges the reviews of some products. The file possible_dupes.txt.gz (listed under Files) helps identify products that are potentially duplicates of each other.
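
Since merged products can make the same review appear under more than one product ID, a rough way to skip such repeats while streaming is sketched here. It is only a heuristic and not the dataset authors' method: it assumes that two entries with the same review/userId, review/time and review/text are the same review. It does not use possible_dupes.txt.gz, whose layout is not described on this page. The function name drop_repeated_reviews is hypothetical.

```python
import hashlib


def drop_repeated_reviews(entries):
    """Yield each review once, skipping repeats by the same user.

    Assumption (heuristic only): entries that share userId, time and text
    are the same review, even when listed under different product IDs.
    """
    seen = set()
    for e in entries:
        key_src = "\x00".join(
            e.get(k, "") for k in ("review/userId", "review/time", "review/text")
        )
        # Store a 16-byte digest instead of the full text to limit memory use.
        key = hashlib.blake2b(key_src.encode("utf-8"), digest_size=16).digest()
        if key in seen:
            continue
        seen.add(key)
        yield e
```

The set of digests still grows with the number of distinct reviews, so on the full file this may need a few gigabytes of memory.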

Note: A new-and-improved Amazon dataset is available here, which corrects the above duplication issues, and also contains more complete data/metadata.


Dataset statistics
Number of reviews: 34,686,770
Number of users: 6,643,669
Number of products: 2,441,053
Users with > 50 reviews: 56,772
Median no. of words per review: 82
Timespan: Jun 1995 - Mar 2013
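
Counts like these can be recomputed from the parsed reviews. The following is a minimal sketch, assuming each review is a dict keyed by the field names shown under Data format. The function name summarize and its heavy_user_threshold parameter are hypothetical. Counting words by splitting on whitespace is an assumption, so the median may not match the published figure exactly.

```python
from collections import Counter
from statistics import median


def summarize(entries, heavy_user_threshold=50):
    """Compute basic statistics over an iterable of review dicts."""
    reviews_per_user = Counter()
    products = set()
    word_counts = []
    for e in entries:
        reviews_per_user[e.get("review/userId", "unknown")] += 1
        products.add(e.get("product/productId"))
        # Assumption: a "word" is any whitespace-separated token.
        word_counts.append(len(e.get("review/text", "").split()))
    return {
        "reviews": len(word_counts),
        "users": len(reviews_per_user),
        "products": len(products),
        "heavy_users": sum(
            1 for n in reviews_per_user.values() if n > heavy_user_threshold
        ),
        "median_words": median(word_counts) if word_counts else 0,
    }
```

The sketch keeps one integer per review in memory, which is acceptable for a single category file but may be heavy for all.txt.gz.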

Source (citation)


Files

File  Description  Size
all.txt.gz  All product reviews (34,686,770 reviews)  11G
possible_dupes.txt.gz  List of possible duplicate products  226M
Amazon_Instant_Video.txt.gz  Amazon Instant Video reviews (717,651 reviews)  252M
Arts.txt.gz  Arts product reviews (27,980 reviews)  5.3M
Automotive.txt.gz  Automotive product reviews (188,728 reviews)  36M
Baby.txt.gz  Baby product reviews (184,887 reviews)  42M
Beauty.txt.gz  Beauty product reviews (252,056 reviews)  46M
Books.txt.gz  Book reviews (12,886,488 reviews)  4.4G
Cell_Phones_&_Accessories.txt.gz  Cell Phone reviews (78,930 reviews)  20M
Clothing_&_Accessories.txt.gz  Clothing reviews (581,933 reviews)  78M
Electronics.txt.gz  Electronics product reviews (1,241,778 reviews)  325M
Gourmet_Foods.txt.gz  Gourmet Food reviews (154,635 reviews)  30M
Health.txt.gz  Health product reviews (428,781 reviews)  87M
Home_&_Kitchen.txt.gz  Home & Kitchen product reviews (991,794 reviews)  210M
Industrial_&_Scientific.txt.gz  Industrial & Scientific product reviews (137,042 reviews)  13M
Jewelry.txt.gz  Jewelry reviews (58,621 reviews)  7.8M
Kindle_Store.txt.gz  Kindle Store reviews (160,793 reviews)  59M
Movies_&_TV.txt.gz  Movie & TV reviews (7,850,072 reviews)  2.8G
Musical_Instruments.txt.gz  Musical Instrument reviews (85,405 reviews)  20M
Music.txt.gz  Music reviews (6,396,350 reviews)  2.1G
Office_Products.txt.gz  Office product reviews (138,084 reviews)  30M
Patio.txt.gz  Patio product reviews (206,250 reviews)  45M
Pet_Supplies.txt.gz  Pet Supply reviews (217,170 reviews)  47M
Shoes.txt.gz  Shoe reviews (389,877 reviews)  51M
Software.txt.gz  Software reviews (95,084 reviews)  30M
Sports_&_Outdoors.txt.gz  Sports & Outdoor product reviews (510,991 reviews)  100M
Tools_&_Home_Improvement.txt.gz  Tools & Home Improvement product reviews (409,499 reviews)  90M
Toys_&_Games.txt.gz  Toy & Game reviews (435,996 reviews)  89M
Video_Games.txt.gz  Video Game reviews (463,669 reviews)  152M
Watches.txt.gz  Watch reviews (68,356 reviews)  15M
descriptions.txt.gz  Descriptions of all products (where available)  740M
categories.txt.gz  Category information for all products  45M
titles.txt.gz  Titles for all products  61M
related.txt.gz  Related products ("users who purchased this also purchased")  34M
brands.txt.gz  Product brand info  539K

Data format

product/productId: B00006HAXW
product/title: Rock Rhythm & Doo Wop: Greatest Early Rock
product/price: unknown
review/userId: A1RSDE90N6RSZF
review/profileName: Joseph M. Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or better than the 1st ones. Remember once these performers are gone, we'll never get to see them again. Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE this DVD !!

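where

product/productId: Amazon product ID, e.g. B00006HAXW
product/title: title of the product
product/price: price of the product ("unknown" where not available)
review/userId: ID of the user, e.g. A1RSDE90N6RSZF
review/profileName: name of the user
review/helpfulness: fraction of users who found the review helpful
review/score: rating of the product
review/time: time of the review (Unix time)
review/summary: review summary
review/text: text of the review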
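
All values in a record are plain strings, so numeric fields need converting before analysis. The sketch below shows one possible conversion. The function name to_typed and the output key names are hypothetical. It assumes, based on the example record, that review/helpfulness has the form "helpful/total", that review/time is Unix time in seconds, and that a price of "unknown" means the price is missing.

```python
from datetime import datetime, timezone


def to_typed(entry):
    """Convert the string fields of one review dict to Python types."""
    # Assumption: "9/9" means 9 helpful votes out of 9 total votes.
    helpful, _, total = entry["review/helpfulness"].partition("/")
    try:
        price = float(entry.get("product/price", "unknown"))
    except ValueError:
        price = None  # "unknown" or any other non-numeric price
    return {
        "product_id": entry["product/productId"],
        "user_id": entry["review/userId"],
        "helpful_votes": int(helpful),
        "total_votes": int(total),
        "score": float(entry["review/score"]),
        # Assumption: review/time is seconds since the Unix epoch (UTC).
        "time": datetime.fromtimestamp(int(entry["review/time"]), tz=timezone.utc),
        "price": price,
    }
```

Applied to the example record, review/time 1042502400 corresponds to 14 January 2003 (UTC).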

How to parse (in Python)

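import gzip
import json

def parse(filename):
    # Open the gzipped file as text; the UTF-8 encoding is an assumption.
    with gzip.open(filename, 'rt', encoding='utf-8', errors='replace') as f:
        entry = {}
        for l in f:
            l = l.strip()
            colonPos = l.find(':')
            if colonPos == -1:
                # A line without a colon (normally a blank line) ends a review.
                yield entry
                entry = {}
                continue
            eName = l[:colonPos]
            rest = l[colonPos+2:]
            entry[eName] = rest
        # Yield the last entry (empty if the file ends with a blank line).
        yield entry

for e in parse("all.txt.gz"):
    print(json.dumps(e))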
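
Because parse yields one review at a time, simple aggregates can be computed in a single pass without loading the whole file. As an illustration, the sketch below computes the mean rating per product. The function name mean_score_by_product is hypothetical, and the input is assumed to be an iterable of dicts like those produced by parse.

```python
from collections import defaultdict


def mean_score_by_product(entries):
    """Return {productId: mean review/score} for an iterable of review dicts."""
    totals = defaultdict(lambda: [0.0, 0])  # productId -> [sum of scores, count]
    for e in entries:
        if "review/score" not in e:
            continue  # skip empty or incomplete entries
        t = totals[e.get("product/productId")]
        t[0] += float(e["review/score"])
        t[1] += 1
    return {pid: s / n for pid, (s, n) in totals.items()}
```

For example, mean_score_by_product(parse("Arts.txt.gz")) would process the Arts category in one pass. Memory use grows with the number of distinct products, not the number of reviews.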