7k Books

Dataset of books with title, author, description, rating, thumbnail, and more

@kaggle.dylanjcastillo_7k_books_with_metadata

About this Dataset

Do we really need another dataset of books?

My initial plan was to build a toy example for a recommender system article I was writing. After a bit of googling, I found a few datasets. Sadly, most of them had issues that made them unusable for me (e.g., missing book descriptions, a mix of languages with no column specifying the language of each row, or weird delimiters).

So I decided to make a dataset that would match my purposes.

First, I got ISBNs from Soumik's Goodreads-books dataset. Using those identifiers, I crawled the Google Books API to extract the books' information.

Then, I merged those results with some of the original columns from the dataset and, after some cleaning, I got the dataset you see here.
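
As a rough illustration of the lookup step, here is a minimal sketch of querying the Google Books API volumes endpoint by ISBN and pulling out a few fields. It is not the crawler's actual code: the function names, the choice of fields, and the ";" separator used to join multiple authors or categories are all assumptions made for this example.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Public volumes search endpoint of the Google Books API.
API_URL = "https://www.googleapis.com/books/v1/volumes"


def build_lookup_url(isbn: str) -> str:
    """Build a search URL restricted to a single ISBN (hypothetical helper)."""
    return f"{API_URL}?{urlencode({'q': f'isbn:{isbn}'})}"


def extract_fields(payload: dict) -> Optional[dict]:
    """Pick a few fields from a volumes response; return None if nothing matched."""
    items = payload.get("items") or []
    if not items:
        return None  # the ISBN did not return a valid result
    info = items[0].get("volumeInfo", {})
    return {
        "title": info.get("title"),
        "subtitle": info.get("subtitle"),
        # Joining with ";" is an assumption for this sketch.
        "authors": ";".join(info.get("authors", [])),
        "categories": ";".join(info.get("categories", [])),
        "thumbnail": info.get("imageLinks", {}).get("thumbnail"),
        "description": info.get("description"),
    }


def fetch_book(isbn: str) -> Optional[dict]:
    """Perform the HTTP request for one ISBN and extract its fields."""
    with urlopen(build_lookup_url(isbn), timeout=10) as response:
        return extract_fields(json.load(response))
```

Keeping the URL building and the field extraction separate from the network call makes it possible to check the parsing logic against a saved response without hitting the API.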

What can I do with this?

You can use it for exploratory data analysis, clustering books by topic or category, or building a content-based recommendation engine using different fields, such as the book's description.
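
To make the recommendation idea concrete, here is a minimal sketch of a content-based recommender that compares book descriptions with bag-of-words cosine similarity, using only the standard library. The titles and descriptions are made-up placeholders, not rows from the dataset, and a real system would more likely use TF-IDF or embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(title: str, descriptions: dict) -> str:
    """Return the other title whose description is closest to the given one."""
    query = Counter(tokenize(descriptions[title]))
    scores = {
        other: cosine(query, Counter(tokenize(text)))
        for other, text in descriptions.items()
        if other != title
    }
    return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
toy_descriptions = {
    "Book A": "A detective investigates a murder in London",
    "Book B": "A murder mystery with a London detective",
    "Book C": "A guide to growing vegetables in a small garden",
}

print(most_similar("Book A", toy_descriptions))
```

Raw word counts give common words like "a" and "in" too much weight, which is one reason to move to TF-IDF weighting once the toy example works.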

Why is this dataset smaller than Soumik's Goodreads-books?

Many of the ISBNs in that dataset did not return valid results from the Google Books API. I plan to update this in the future by using more fields (e.g., title, author) in the API requests, so as to get a bigger dataset.

What did you use to build this dataset?

Check out the repository here: Google Books Crawler

Acknowledgements

This dataset relied heavily on Soumik's Goodreads-books dataset.

Tables

Books

@kaggle.dylanjcastillo_7k_books_with_metadata.books
  • 2.24 MB
  • 6810 rows
  • 12 columns

CREATE TABLE books (
  "isbn13" BIGINT,
  "isbn10" VARCHAR,
  "title" VARCHAR,
  "subtitle" VARCHAR,
  "authors" VARCHAR,
  "categories" VARCHAR,
  "thumbnail" VARCHAR,
  "description" VARCHAR,
  "published_year" DOUBLE,
  "average_rating" DOUBLE,
  "num_pages" DOUBLE,
  "ratings_count" DOUBLE
);
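
For a quick way to experiment with this schema locally, the sketch below recreates the table in an in-memory SQLite database and runs a simple aggregate over it. The inserted rows are hypothetical placeholders, not real records from the dataset, and SQLite is just a convenient stand-in for whatever engine you actually query the table with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same column names and declared types as the schema above.
conn.execute(
    """
    CREATE TABLE books (
      "isbn13" BIGINT, "isbn10" VARCHAR, "title" VARCHAR, "subtitle" VARCHAR,
      "authors" VARCHAR, "categories" VARCHAR, "thumbnail" VARCHAR,
      "description" VARCHAR, "published_year" DOUBLE, "average_rating" DOUBLE,
      "num_pages" DOUBLE, "ratings_count" DOUBLE
    )
    """
)

# Hypothetical placeholder rows, including one with missing numeric values.
rows = [
    (9780000000001, "0000000001", "Placeholder One", None, "Author A",
     "Fiction", None, "Placeholder text.", 2001.0, 4.0, 300.0, 100.0),
    (9780000000002, "0000000002", "Placeholder Two", None, "Author B",
     "Fiction", None, "Placeholder text.", 1999.0, 3.0, 250.0, 40.0),
    (9780000000003, "0000000003", "Placeholder Three", None, "Author C",
     "Science", None, "Placeholder text.", 2010.0, 4.5, 410.0, 12.0),
    (9780000000004, "0000000004", "Placeholder Four", None, "Author D",
     "Science", None, "Placeholder text.", None, None, None, None),
]
conn.executemany("INSERT INTO books VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", rows)

# Mean rating per category, skipping rows without a rating.
# published_year is declared DOUBLE, so it is cast to an integer for display.
query = """
    SELECT categories,
           COUNT(*) AS n_books,
           ROUND(AVG(average_rating), 2) AS mean_rating,
           CAST(MIN(published_year) AS INTEGER) AS earliest_year
    FROM books
    WHERE average_rating IS NOT NULL
    GROUP BY categories
    ORDER BY mean_rating DESC
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The explicit `IS NOT NULL` filter and the integer cast are there because the year, rating, page, and count columns are all declared as DOUBLE, so any missing values and fractional representations need handling before display.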
