Baselight

Newsgroups (Text Classification)

Comprehensive Collection of Text Classification Datasets

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati


About this Dataset


Source

Huggingface Hub: link

About this dataset

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

The 18828 version does not include cross-posts and includes only the "From" and "Subject" headers.

Research Ideas

  • Text classification
  • Text clustering
  • Sentiment analysis
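
As a rough illustration of the text classification idea, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the Python standard library only. The function names `train_nb` and `predict_nb` are hypothetical, and the four toy documents are made up for the example rather than taken from the dataset. It is one possible baseline, not a method prescribed by the dataset authors.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (label, text). Returns word counts per label, doc counts, vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, text in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log posterior for the given text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy documents, not taken from the dataset.
toy = [
    ("sci.space", "the shuttle launch reached orbit"),
    ("sci.space", "orbit of the moon and the launch window"),
    ("rec.autos", "the engine and the brakes on my car"),
    ("rec.autos", "car dealer quoted a price for the engine"),
]
model = train_nb(toy)
print(predict_nb(model, "a new launch into orbit"))
```

With the real files, the `(label, text)` pairs would come from the per-newsgroup CSVs, with the label taken from the file name.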

Acknowledgements

License

> License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
> No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: bydate_sci.electronics_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_sci.med_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_sci.med_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_comp.sys.ibm.pc.hardware_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_talk.politics.guns_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_comp.windows.x_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_comp.graphics_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_rec.sport.hockey_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_rec.autos_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_comp.graphics_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_rec.motorcycles_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_comp.windows.x_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_alt.atheism_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_rec.sport.baseball_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_comp.sys.mac.hardware_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_soc.religion.christian_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_comp.sys.mac.hardware_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_rec.motorcycles_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_sci.space_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_talk.politics.misc_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_comp.os.ms-windows.misc_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_soc.religion.christian_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_comp.sys.ibm.pc.hardware_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_misc.forsale_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_talk.politics.mideast_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_sci.space_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_rec.motorcycles_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_talk.politics.misc_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_talk.politics.mideast_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_talk.politics.guns_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_sci.electronics_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_talk.religion.misc_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_comp.sys.ibm.pc.hardware_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_alt.atheism_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_talk.politics.mideast_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_soc.religion.christian_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_comp.sys.mac.hardware_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_sci.crypt_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_sci.crypt_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_misc.forsale_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_sci.electronics_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_rec.sport.hockey_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_talk.politics.misc_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_comp.os.ms-windows.misc_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_rec.sport.hockey_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_talk.religion.misc_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_sci.crypt_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_sci.electronics_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_rec.autos_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_rec.sport.baseball_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_sci.crypt_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_talk.religion.misc_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_misc.forsale_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_alt.atheism_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_rec.sport.baseball_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_sci.med_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_rec.sport.baseball_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_comp.sys.mac.hardware_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_rec.autos_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_comp.os.ms-windows.misc_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_comp.os.ms-windows.misc_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_misc.forsale_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_talk.politics.misc_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_soc.religion.christian_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_comp.windows.x_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_comp.graphics_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_rec.sport.hockey_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_talk.religion.misc_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_comp.windows.x_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_sci.space_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_talk.politics.mideast_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_rec.autos_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_comp.graphics_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 19997_talk.politics.guns_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_comp.sys.ibm.pc.hardware_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_alt.atheism_test.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_talk.politics.guns_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: 18828_rec.motorcycles_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_sci.space_train.csv

Column name Description
text The text of the newsgroup document. (string)

File: bydate_sci.med_train.csv

Column name Description
text The text of the newsgroup document. (string)
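
Each file listed above documents only a `text` column, so the class label has to be recovered from the file name. The names appear to follow a `<version>_<newsgroup>_<split>.csv` pattern, where the version is `bydate`, `18828`, or `19997`. The sketch below assumes that pattern holds for every file; `FileInfo` and `parse_filename` are hypothetical helper names.

```python
from typing import NamedTuple


class FileInfo(NamedTuple):
    version: str    # "bydate", "18828", or "19997"
    newsgroup: str  # e.g. "sci.electronics"
    split: str      # "train" or "test"


def parse_filename(name: str) -> FileInfo:
    """Split '<version>_<newsgroup>_<split>.csv' into its three parts."""
    stem = name[: -len(".csv")] if name.endswith(".csv") else name
    # Newsgroup names contain dots and hyphens but no underscores,
    # so the first and last underscore delimit the newsgroup.
    version, rest = stem.split("_", 1)
    newsgroup, split = rest.rsplit("_", 1)
    return FileInfo(version, newsgroup, split)


info = parse_filename("bydate_comp.os.ms-windows.misc_test.csv")
print(info.version, info.newsgroup, info.split)
```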

Tables

Bydate Alt Atheism Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_alt_atheism_test
  • 411.22 KB
  • 319 rows
  • 3 columns

CREATE TABLE bydate_alt_atheism_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Alt Atheism Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_alt_atheism_train
  • 605.85 KB
  • 480 rows
  • 3 columns

CREATE TABLE bydate_alt_atheism_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Comp Graphics Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_comp_graphics_test
  • 503.4 KB
  • 389 rows
  • 3 columns

CREATE TABLE bydate_comp_graphics_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Comp Graphics Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_comp_graphics_train
  • 525.7 KB
  • 584 rows
  • 3 columns

CREATE TABLE bydate_comp_graphics_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Comp Os Ms Windows Misc Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_comp_os_ms_windows_misc_train
  • 1 MB
  • 591 rows
  • 3 columns

CREATE TABLE bydate_comp_os_ms_windows_misc_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Comp Sys Ibm Pc Hardware Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_comp_sys_ibm_pc_hardware_test
  • 284.66 KB
  • 392 rows
  • 3 columns

CREATE TABLE bydate_comp_sys_ibm_pc_hardware_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Comp Sys Ibm Pc Hardware Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_comp_sys_ibm_pc_hardware_train
  • 460.92 KB
  • 590 rows
  • 3 columns

CREATE TABLE bydate_comp_sys_ibm_pc_hardware_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Comp Sys Mac Hardware Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_comp_sys_mac_hardware_test
  • 273.57 KB
  • 385 rows
  • 3 columns

CREATE TABLE bydate_comp_sys_mac_hardware_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Comp Sys Mac Hardware Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_comp_sys_mac_hardware_train
  • 407.78 KB
  • 578 rows
  • 3 columns

CREATE TABLE bydate_comp_sys_mac_hardware_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Comp Windows X Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_comp_windows_x_test
  • 485.13 KB
  • 395 rows
  • 3 columns

CREATE TABLE bydate_comp_windows_x_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Comp Windows X Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_comp_windows_x_train
  • 678.53 KB
  • 593 rows
  • 3 columns

CREATE TABLE bydate_comp_windows_x_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Misc Forsale Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_misc_forsale_test
  • 237.33 KB
  • 390 rows
  • 3 columns

CREATE TABLE bydate_misc_forsale_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Misc Forsale Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_misc_forsale_train
  • 364.14 KB
  • 585 rows
  • 3 columns

CREATE TABLE bydate_misc_forsale_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Rec Autos Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_rec_autos_test
  • 319.93 KB
  • 396 rows
  • 3 columns

CREATE TABLE bydate_rec_autos_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Rec Autos Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_rec_autos_train
  • 502.06 KB
  • 594 rows
  • 3 columns

CREATE TABLE bydate_rec_autos_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Rec Motorcycles Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_rec_motorcycles_test
  • 290.22 KB
  • 398 rows
  • 3 columns

CREATE TABLE bydate_rec_motorcycles_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Rec Motorcycles Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_rec_motorcycles_train
  • 475.84 KB
  • 598 rows
  • 3 columns

CREATE TABLE bydate_rec_motorcycles_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Rec Sport Baseball Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_rec_sport_baseball_test
  • 361.62 KB
  • 397 rows
  • 3 columns

CREATE TABLE bydate_rec_sport_baseball_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Rec Sport Baseball Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_rec_sport_baseball_train
  • 496.33 KB
  • 597 rows
  • 3 columns

CREATE TABLE bydate_rec_sport_baseball_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Rec Sport Hockey Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_rec_sport_hockey_test
  • 384.47 KB
  • 399 rows
  • 3 columns

CREATE TABLE bydate_rec_sport_hockey_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Rec Sport Hockey Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_rec_sport_hockey_train
  • 663.12 KB
  • 600 rows
  • 3 columns

CREATE TABLE bydate_rec_sport_hockey_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Sci Crypt Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_sci_crypt_test
  • 386.5 KB
  • 396 rows
  • 3 columns

CREATE TABLE bydate_sci_crypt_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Sci Crypt Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_sci_crypt_train
  • 841.2 KB
  • 595 rows
  • 3 columns

CREATE TABLE bydate_sci_crypt_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Sci Electronics Test

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_sci_electronics_test
  • 297.14 KB
  • 393 rows
  • 3 columns

CREATE TABLE bydate_sci_electronics_test (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);

Bydate Sci Electronics Train

@kaggle.thedevastator_the_20_newsgroups_data_set_for_text_classificati.bydate_sci_electronics_train
  • 476.17 KB
  • 591 rows
  • 3 columns

CREATE TABLE bydate_sci_electronics_train (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);
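
Because each newsgroup lives in its own table with the same three-column schema, a labelled training set can be assembled by tagging each table with its newsgroup and combining the results with `UNION ALL`. The sketch below tries this against an in-memory SQLite database using Python's `sqlite3` module. It assumes that the `"ex"` column holds the document text, which the schema listing does not confirm, and the inserted rows are placeholders rather than real dataset content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema copied from the listing above; SQLite accepts VARCHAR as a type name.
DDL = """
CREATE TABLE {name} (
  "unnamed_0" VARCHAR,
  "ex" VARCHAR,
  "unnamed_2" VARCHAR
);
"""
# Table name -> newsgroup label (two tables chosen for illustration).
tables = {
    "bydate_sci_crypt_train": "sci.crypt",
    "bydate_sci_electronics_train": "sci.electronics",
}
for name in tables:
    conn.execute(DDL.format(name=name))

# Hypothetical placeholder rows, not real dataset content.
conn.execute("INSERT INTO bydate_sci_crypt_train VALUES ('0', 'placeholder document A', NULL)")
conn.execute("INSERT INTO bydate_sci_electronics_train VALUES ('0', 'placeholder document B', NULL)")
conn.execute("INSERT INTO bydate_sci_electronics_train VALUES ('1', 'placeholder document C', NULL)")

# One SELECT per table, tagged with its newsgroup, combined with UNION ALL.
# Assumption: "ex" is the column that holds the document text.
SELECT_TMPL = "SELECT '{label}' AS newsgroup, \"ex\" AS body FROM {name}"
union_sql = " UNION ALL ".join(
    SELECT_TMPL.format(label=label, name=name) for name, label in tables.items()
)

rows = conn.execute(union_sql).fetchall()
counts = conn.execute(
    "SELECT newsgroup, COUNT(*) FROM (" + union_sql + ") "
    "GROUP BY newsgroup ORDER BY newsgroup"
).fetchall()
print(counts)
```

The same `UNION ALL` pattern should carry over to the hosted tables, though the fully qualified table names and the SQL dialect there may differ from SQLite.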
