Dataset: Yahoo! Answers Topic Classification

About this Dataset

The Yahoo! Answers topic classification dataset is constructed from the 10 largest main categories of Yahoo! Answers. Each class contains 140,000 training samples and 6,000 test samples, for a total of 1,400,000 training samples and 60,000 test samples. From all the answers and other meta-information, we only used the best answer content and the main category information. The 10 categories are:

Society & Culture
Science & Mathematics
Health
Education & Reference
Computers & Internet
Sports
Business & Finance
Entertainment & Music
Family & Relationships
Politics & Government
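
The split sizes and the category list above can be checked with a short sketch. It assumes that class indices are 1-based and follow the order in which the categories are listed; this is an assumption, although it agrees with the table schemas further down, where a question about an optical mouse carries the value 5 and a question about friendship carries the value 9. The names `CATEGORIES` and `category_name` are illustrative and not part of the dataset.

```python
# Categories in the order listed above.
# Assumption: class indices are 1-based and follow this order.
CATEGORIES = [
    "Society & Culture",
    "Science & Mathematics",
    "Health",
    "Education & Reference",
    "Computers & Internet",
    "Sports",
    "Business & Finance",
    "Entertainment & Music",
    "Family & Relationships",
    "Politics & Government",
]

TRAIN_PER_CLASS = 140_000
TEST_PER_CLASS = 6_000


def category_name(class_index):
    """Map a 1-based class index to its category name (hypothetical helper)."""
    if not 1 <= class_index <= len(CATEGORIES):
        raise ValueError(f"class index out of range: {class_index}")
    return CATEGORIES[class_index - 1]


train_total = TRAIN_PER_CLASS * len(CATEGORIES)
test_total = TEST_PER_CLASS * len(CATEGORIES)

print(train_total, test_total)  # 1400000 60000
print(category_name(5))         # Computers & Internet
print(category_name(9))         # Family & Relationships
```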

The Yahoo! Answers topic classification dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the original Yahoo! Answers data. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

Tables

Test

@kaggle.bhavikardeshna_yahoo_email_classification.test

20.31 MB
59999 rows
4 columns


CREATE TABLE test (
  "n_9" BIGINT,
  "what_makes_friendship_click" VARCHAR,
  "how_does_the_spark_keep_going" VARCHAR,
  "good_communication_is_what_does_it_can_you_move_beyond_71f3920b" VARCHAR
);

Train

@kaggle.bhavikardeshna_yahoo_email_classification.train

472.67 MB
1399999 rows
4 columns


CREATE TABLE train (
  "n_5" BIGINT,
  "why_doesn_t_an_optical_mouse_work_on_a_glass_table" VARCHAR,
  "or_even_on_some_surfaces" VARCHAR,
  "optical_mice_use_an_led_and_a_camera_to_rapidly_captur_76243c37" VARCHAR
);
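
The column names in both schemas look like a data record (a class index followed by three text fields), and each table is one row short of the stated split size (59,999 vs 60,000 and 1,399,999 vs 1,400,000). This suggests, though the listing does not confirm it, that the source CSV files have no header row and that the first record was consumed as the header. Below is a minimal sketch of reading such a file so that every line is kept as data. The column names in `COLUMNS` are hypothetical labels chosen for illustration, and the sample rows are placeholders, not real records.

```python
import csv
import io

# Hypothetical column names; the catalog schema does not define them.
COLUMNS = ["class_index", "question_title", "question_content", "best_answer"]


def read_headerless_csv(text):
    """Parse CSV text that has no header row, keeping every line as data."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, record))
        row["class_index"] = int(row["class_index"])
        rows.append(row)
    return rows


# Placeholder records standing in for lines of the real file.
sample = (
    '5,"placeholder title A","placeholder question A","placeholder answer A"\n'
    '9,"placeholder title B","placeholder question B","placeholder answer B"\n'
)

rows = read_headerless_csv(sample)
print(len(rows))                # 2: the first line is kept as a record
print(rows[0]["class_index"])   # 5
```

When reading from disk, the same function can be applied to the contents of a file opened with `newline=""`, as the `csv` module documentation recommends.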