This version of the dataset is obsolete. It contains duplicate ratings (same user_id,book_id), as reported by Philipp Spachtholz in his illustrious notebook.
The current version has duplicates removed, and more ratings (six million), sorted by time. Book and user IDs are the same.
**It is available at https://github.com/zygmuntz/goodbooks-10k.**
There have been good datasets for movies (Netflix, Movielens) and music (Million Songs) recommendation, but not for books. That is, until now.
This dataset contains ratings for ten thousand popular books. As to the source, let's say that these ratings were found on the internet. Generally, there are 100 ratings for each book, although some have fewer. Ratings go from one to five.
Both book IDs and user IDs are contiguous: book IDs run from 1 to 10000 and user IDs from 1 to 53424. All users have made at least two ratings. The median number of ratings per user is 8.
There are also books marked to read by the users, book metadata (author, year, etc.) and tags.
Contents
ratings.csv contains ratings and looks like this:
book_id,user_id,rating
1,314,5
1,439,3
1,588,5
1,1169,4
1,1185,4
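As a minimal sketch of working with this format, the following parses rows in the book_id,user_id,rating layout, looks for duplicate (user_id, book_id) pairs of the kind reported for this version, and computes the median number of ratings per user. The function names are hypothetical, and the input is just the five sample rows shown above rather than the full file.

```python
import csv
import io
from collections import Counter
from statistics import median


def read_ratings(lines):
    """Parse rows with a book_id,user_id,rating header into int tuples."""
    reader = csv.DictReader(lines)
    return [
        (int(row["book_id"]), int(row["user_id"]), int(row["rating"]))
        for row in reader
    ]


def duplicate_pairs(ratings):
    """Return the (user_id, book_id) pairs that occur more than once."""
    counts = Counter((user_id, book_id) for book_id, user_id, _ in ratings)
    return sorted(pair for pair, n in counts.items() if n > 1)


def median_ratings_per_user(ratings):
    """Median of the per-user rating counts."""
    per_user = Counter(user_id for _, user_id, _ in ratings)
    return median(per_user.values())


# The sample rows from this section, standing in for the full file.
sample = io.StringIO(
    "book_id,user_id,rating\n"
    "1,314,5\n"
    "1,439,3\n"
    "1,588,5\n"
    "1,1169,4\n"
    "1,1185,4\n"
)
ratings = read_ratings(sample)
print(len(ratings), duplicate_pairs(ratings))
```

To run this on the real file, something like `with open("ratings.csv", newline="") as f: ratings = read_ratings(f)` should work, assuming the file sits in the working directory.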
to_read.csv provides IDs of the books marked "to read" by each user, as user_id,book_id pairs.
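A small sketch of grouping these pairs into one "to read" list per user might look as follows. It assumes the file has a header row naming the columns user_id and book_id. The function name is hypothetical, and the sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def to_read_by_user(lines):
    """Group user_id,book_id pairs into {user_id: [book_id, ...]}."""
    wishlist = defaultdict(list)
    for row in csv.DictReader(lines):
        wishlist[int(row["user_id"])].append(int(row["book_id"]))
    return dict(wishlist)


# Made-up rows for illustration only; assumes a user_id,book_id header.
sample = io.StringIO("user_id,book_id\n1,112\n1,235\n2,4\n")
wishlist = to_read_by_user(sample)
print(wishlist)
```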
books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.).
The metadata have been extracted from goodreads XML files, available in the third version of this dataset as books_xml.tar.gz. The archive contains 10000 XML files. One of them is available as sample_book.xml. To make the download smaller, these files are absent from the current version. Download version 3 if you want them.
book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs.
tags.csv translates tag IDs to names.
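The lookup from tag IDs to names could be sketched like this. The column names (tag_id and tag_name in tags.csv, goodreads_book_id and tag_id in book_tags.csv) are assumptions not stated in this description, so check them against the actual headers. The function names are hypothetical and the sample rows are made up.

```python
import csv
import io


def load_tag_names(tags_lines):
    """Build {tag_id: tag_name} from tags.csv (column names assumed)."""
    return {
        int(row["tag_id"]): row["tag_name"]
        for row in csv.DictReader(tags_lines)
    }


def named_book_tags(book_tags_lines, tag_names):
    """Return (book ID, tag name) pairs, resolving each tag ID to its name."""
    return [
        (int(row["goodreads_book_id"]), tag_names[int(row["tag_id"])])
        for row in csv.DictReader(book_tags_lines)
    ]


# Made-up rows for illustration; the real files may have extra columns.
tags_csv = io.StringIO("tag_id,tag_name\n10,fantasy\n20,to-read\n")
book_tags_csv = io.StringIO("goodreads_book_id,tag_id\n1,10\n1,20\n2,10\n")

tag_names = load_tag_names(tags_csv)
result = named_book_tags(book_tags_csv, tag_names)
print(result)
```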
See the notebook for some basic stats of the dataset.
goodreads IDs
Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.
You can use the goodreads book and work IDs to create URLs as follows:
https://www.goodreads.com/book/show/2767052
https://www.goodreads.com/work/editions/2792775
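The URL patterns above can be wrapped in two small helpers. This is only a sketch: the function names are hypothetical, and the IDs passed in are the ones from the example URLs.

```python
BOOK_URL = "https://www.goodreads.com/book/show/{}"
WORK_URL = "https://www.goodreads.com/work/editions/{}"


def book_url(goodreads_book_id):
    """URL of a specific edition, given its goodreads book ID."""
    return BOOK_URL.format(goodreads_book_id)


def work_url(work_id):
    """URL listing the editions of a work, given its goodreads work ID."""
    return WORK_URL.format(work_id)


print(book_url(2767052))
print(work_url(2792775))
```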