Reuters-21578 (Text Categorization) by Kaggle | Finance and Economics

About this Dataset

Reuters-21578 (Text Categorization)

Reuters financial newswire service in 1987

By Huggingface Hub [source]

The Reuters-21578 dataset is a collection of newswire articles from the Reuters financial newswire service and one of the most widely used benchmarks for text categorization research. It covers the topics that financial news reported on in 1987 and is provided here in several train/test splits (ModApte, ModLewis, and ModHayes).

Every file has the same columns: text (the full body of the article), text_type (how the article body was formatted in the source collection, not whether it is a training or test article), topics (the topic categories assigned to the document), lewis_split (the article's assignment to training, test, or unused in the Lewis split), cgis_split (the train/test assignment from the Carnegie Group, Inc. (CGI) experiments, on which the ModHayes split is based), places, people, orgs, and exchanges (the places, people, organizations, and exchanges the article is tagged with), date, and title. In addition to the train and test files, there are separate files containing the Reuters-21578 articles that were not used in specific splits (ModApte_unused.csv and ModLewis_unused.csv). Together these files support training and evaluating financial news classifiers across many categories.
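As a minimal sketch of how these columns might be read, the snippet below parses CSV text with Python's standard csv module and counts articles per lewis_split value. The two sample rows are invented for illustration and use only a subset of the columns, and the exact spelling of the split values in the real files (for example, whether they carry extra quotation marks) is an assumption to check before relying on it.

```python
import csv
import io
from collections import Counter

# Invented two-row sample using a subset of the columns described above.
# Real files are far larger and may encode values differently.
SAMPLE_CSV = '''text,topics,lewis_split,title
"Shares rose after the merger was announced.","['acq']",TRAIN,MERGER NEWS
"Quarterly profit fell on weaker sales.","['earn']",TEST,PROFIT FALLS
'''

def load_articles(fileobj):
    """Read CSV rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(fileobj))

def count_by_column(rows, column):
    """Count rows per value of `column`, ignoring surrounding quotes and case."""
    return Counter(row[column].strip().strip('"').upper() for row in rows)

articles = load_articles(io.StringIO(SAMPLE_CSV))
split_counts = count_by_column(articles, "lewis_split")
print(split_counts)
```

To read one of the actual files, pass an open file handle instead of the in-memory sample, for example `open("ModApte_train.csv", newline="", encoding="utf-8")`.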

How to use the dataset

With its wide range of topics and predefined data splits, the Reuters-21578 dataset is well suited as a benchmark for text categorization research. Here are some tips for getting the most out of it:

Familiarize yourself with the columns: Before getting started, learn what each column means and identify which ones are essential for your research project.

Use an appropriate split: Depending on your research goals, choose among the training and test sets provided in this dataset (ModApte_train/test, ModLewis_train/test, or ModHayes_train/test). You can also create custom splits if desired, for example by drawing on the articles in the ModApte_unused set.

Explore other methods: While text categorization is often used with this type of data, you may also want to explore other methods that can help uncover useful information such as topic modelling or sentiment analysis.

Leverage related packages: If you're using Python or R, there are well-established packages for working with textual data like this. In Python, scikit-learn's vectorizers (such as CountVectorizer and TfidfVectorizer) transform words into feature vectors for ML models such as Naive Bayes or Random Forest classifiers, and NLTK includes a Reuters corpus reader. In R, the tm package offers comparable text-mining tools.

Tackle low-level preprocessing tasks: Before building models, clean the input text, in particular by removing invalid characters and any non-English symbols, which can hurt model accuracy. Stopword removal and stemming words to their root form may also improve overall performance.
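The preprocessing and vectorizing tips above can be sketched with the standard library alone. This is an illustrative outline, not a recommended pipeline: the stopword list is deliberately tiny, `crude_stem` is a simple suffix stripper standing in for a real stemmer such as Porter's, and all function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "was", "were"}

# Suffixes tried by the crude stemmer below, longest first.
SUFFIXES = ("ing", "ed", "es", "s")

def clean(text):
    """Drop non-ASCII and control characters, then collapse whitespace."""
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    printable = re.sub(r"[^\x20-\x7e]", " ", ascii_only)
    return re.sub(r"\s+", " ", printable).strip()

def crude_stem(token):
    """Strip one common suffix if at least four letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Clean, lowercase, tokenise, remove stopwords, and stem."""
    tokens = re.findall(r"[a-z]+", clean(text).lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def bag_of_words(tokens):
    """Map each token to its count, the simplest kind of feature vector."""
    return Counter(tokens)

tokens = preprocess("The banks\x03 raised lending rates in 1987.")
features = bag_of_words(tokens)
print(tokens)
```

Dropping every non-ASCII character is the bluntest possible reading of "removing non-English symbols"; it also mangles legitimate names with accents, so a gentler normalization may be preferable in practice.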

Research Ideas

Automated text classification - Using the data from the Reuters-21578 dataset, machine learning algorithms can be trained to automatically classify newswire articles into their appropriate topics. This saves time and reduces the amount of manual labeling required.

Sentiment analysis - By analyzing the sentiment of individual news articles in the Reuters-21578 dataset, one could gain insight into the tone of financial reporting and then use this information to inform investing decisions.

Stock market predictions - By applying data mining techniques to the content of news articles in this dataset, correlations between the topics or exchanges associated with an article and subsequent stock price movements can be investigated and, where they hold up, used in algorithmic trading strategies that try to predict short-term price movements.
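For the first idea above, automated text classification, a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing is shown below, written from scratch so it needs no third-party packages. The training sentences are invented, the labels only mimic Reuters topic names, and each document gets a single label here, which is a simplification because Reuters-21578 articles can carry several topics.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """Fit a multinomial Naive Bayes model.

    `documents` is a list of (tokens, label) pairs.
    """
    doc_counts = Counter()               # documents per label
    word_counts = defaultdict(Counter)   # token counts per label
    vocabulary = set()
    for tokens, label in documents:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return doc_counts, word_counts, vocabulary

def predict(model, tokens):
    """Return the label with the highest log-posterior for `tokens`."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        # Log prior plus Laplace-smoothed log likelihood of each known token.
        score = math.log(doc_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training set; labels mimic Reuters topic names.
training = [
    ("profit rose in the quarter".split(), "earn"),
    ("net profit and dividend declared".split(), "earn"),
    ("company agreed to acquire rival".split(), "acq"),
    ("merger talks with rival company".split(), "acq"),
]
model = train_naive_bayes(training)
print(predict(model, "quarter profit fell".split()))
```

Tokens never seen in training (such as "fell" here) are skipped rather than penalized, which is one of several reasonable conventions.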

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

Files: ModApte_train.csv, ModApte_test.csv, ModApte_unused.csv, ModLewis_train.csv, ModLewis_test.csv, ModLewis_unused.csv, ModHayes_train.csv, ModHayes_test.csv

All eight files share the same columns. The old_id and new_id columns appear in the table schemas in the Tables section.

Column name	Description
text	The text of the article. (String)
text_type	How the article body was formatted in the source collection (NORM, BRIEF, or UNPROC in the original Reuters-21578 markup), not whether the article is a training or test item. (String)
topics	The topics associated with the article. (String)
lewis_split	The article's assignment (training, test, or not used) in the Lewis split. (String)
cgis_split	The article's assignment (training or test) in the CGI split. (String)
old_id	The article's ID in the earlier Reuters-22173 collection. (String)
new_id	The article's ID in Reuters-21578. (String)
places	The place categories assigned to the article. (String)
people	The people categories assigned to the article. (String)
orgs	The organization categories assigned to the article. (String)
exchanges	The exchange categories assigned to the article. (String)
date	The date the article was published. (Date)
title	The title of the article. (String)
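The column tables list topics, places, people, orgs, and exchanges as strings, yet an article can carry several labels in each. How the lists are serialized in the CSV cells is not stated here, so the sketch below assumes two plausible encodings, a bracketed list of quoted names (with or without commas) and a plain comma-separated string, and the helper name `parse_labels` is hypothetical.

```python
import re

# Matches a run of characters enclosed in single or double quotes.
_QUOTED = re.compile(r"""["']([^"']+)["']""")

def parse_labels(cell):
    """Turn a cell such as "['earn', 'acq']" or "earn, acq" into a list of labels."""
    cell = cell.strip()
    quoted = _QUOTED.findall(cell)
    if quoted:
        return [label.strip() for label in quoted]
    # No quotes found: treat as a plain comma-separated list, ignoring brackets.
    parts = cell.strip("[]").split(",")
    return [part.strip() for part in parts if part.strip()]

print(parse_labels("['earn', 'acq']"))
print(parse_labels("['grain' 'wheat']"))
print(parse_labels("earn, acq"))
print(parse_labels("[]"))
```

Extracting quoted names with a regular expression avoids a pitfall of evaluating the cell as a Python literal: adjacent quoted strings without commas would be silently concatenated into one label.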

Acknowledgements

If you use this dataset in your research, please credit Huggingface Hub.

Tables

Modapte Test

@kaggle.thedevastator_uncovering_financial_insights_with_the_reuters_2.modapte_test

1.51 MB
3299 rows
13 columns


CREATE TABLE modapte_test (
  "text" VARCHAR,
  "text_type" VARCHAR,
  "topics" VARCHAR,
  "lewis_split" VARCHAR,
  "cgis_split" VARCHAR,
  "old_id" VARCHAR,
  "new_id" VARCHAR,
  "places" VARCHAR,
  "people" VARCHAR,
  "orgs" VARCHAR,
  "exchanges" VARCHAR,
  "date" VARCHAR,
  "title" VARCHAR
);

Modapte Train

@kaggle.thedevastator_uncovering_financial_insights_with_the_reuters_2.modapte_train

4.7 MB
9603 rows
13 columns


CREATE TABLE modapte_train (
  "text" VARCHAR,
  "text_type" VARCHAR,
  "topics" VARCHAR,
  "lewis_split" VARCHAR,
  "cgis_split" VARCHAR,
  "old_id" VARCHAR,
  "new_id" VARCHAR,
  "places" VARCHAR,
  "people" VARCHAR,
  "orgs" VARCHAR,
  "exchanges" VARCHAR,
  "date" VARCHAR,
  "title" VARCHAR
);

Modapte Unused

@kaggle.thedevastator_uncovering_financial_insights_with_the_reuters_2.modapte_unused

508.38 KB
722 rows
13 columns


CREATE TABLE modapte_unused (
  "text" VARCHAR,
  "text_type" VARCHAR,
  "topics" VARCHAR,
  "lewis_split" VARCHAR,
  "cgis_split" VARCHAR,
  "old_id" VARCHAR,
  "new_id" VARCHAR,
  "places" VARCHAR,
  "people" VARCHAR,
  "orgs" VARCHAR,
  "exchanges" VARCHAR,
  "date" VARCHAR,
  "title" VARCHAR
);

Modhayes Test

@kaggle.thedevastator_uncovering_financial_insights_with_the_reuters_2.modhayes_test

508.38 KB
722 rows
13 columns


CREATE TABLE modhayes_test (
  "text" VARCHAR,
  "text_type" VARCHAR,
  "topics" VARCHAR,
  "lewis_split" VARCHAR,
  "cgis_split" VARCHAR,
  "old_id" VARCHAR,
  "new_id" VARCHAR,
  "places" VARCHAR,
  "people" VARCHAR,
  "orgs" VARCHAR,
  "exchanges" VARCHAR,
  "date" VARCHAR,
  "title" VARCHAR
);

Modhayes Train

@kaggle.thedevastator_uncovering_financial_insights_with_the_reuters_2.modhayes_train

9.87 MB
20856 rows
13 columns


CREATE TABLE modhayes_train (
  "text" VARCHAR,
  "text_type" VARCHAR,
  "topics" VARCHAR,
  "lewis_split" VARCHAR,
  "cgis_split" VARCHAR,
  "old_id" VARCHAR,
  "new_id" VARCHAR,
  "places" VARCHAR,
  "people" VARCHAR,
  "orgs" VARCHAR,
  "exchanges" VARCHAR,
  "date" VARCHAR,
  "title" VARCHAR
);

Modlewis Test

@kaggle.thedevastator_uncovering_financial_insights_with_the_reuters_2.modlewis_test

2.77 MB
6188 rows
13 columns


CREATE TABLE modlewis_test (
  "text" VARCHAR,
  "text_type" VARCHAR,
  "topics" VARCHAR,
  "lewis_split" VARCHAR,
  "cgis_split" VARCHAR,
  "old_id" VARCHAR,
  "new_id" VARCHAR,
  "places" VARCHAR,
  "people" VARCHAR,
  "orgs" VARCHAR,
  "exchanges" VARCHAR,
  "date" VARCHAR,
  "title" VARCHAR
);

Modlewis Train

@kaggle.thedevastator_uncovering_financial_insights_with_the_reuters_2.modlewis_train

6.75 MB
13625 rows
13 columns


CREATE TABLE modlewis_train (
  "text" VARCHAR,
  "text_type" VARCHAR,
  "topics" VARCHAR,
  "lewis_split" VARCHAR,
  "cgis_split" VARCHAR,
  "old_id" VARCHAR,
  "new_id" VARCHAR,
  "places" VARCHAR,
  "people" VARCHAR,
  "orgs" VARCHAR,
  "exchanges" VARCHAR,
  "date" VARCHAR,
  "title" VARCHAR
);

Modlewis Unused

@kaggle.thedevastator_uncovering_financial_insights_with_the_reuters_2.modlewis_unused

508.38 KB
722 rows
13 columns


CREATE TABLE modlewis_unused (
  "text" VARCHAR,
  "text_type" VARCHAR,
  "topics" VARCHAR,
  "lewis_split" VARCHAR,
  "cgis_split" VARCHAR,
  "old_id" VARCHAR,
  "new_id" VARCHAR,
  "places" VARCHAR,
  "people" VARCHAR,
  "orgs" VARCHAR,
  "exchanges" VARCHAR,
  "date" VARCHAR,
  "title" VARCHAR
);
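To experiment locally with the same schema, one option is an in-memory SQLite database via Python's built-in sqlite3 module. The sketch below recreates the 13-column layout shown in the schemas above, inserts two invented rows, and runs a simple query; the helper names are hypothetical, and the LIKE filter assumes topics are stored as text containing the topic name.

```python
import sqlite3

# The 13 columns from the CREATE TABLE statements above.
COLUMNS = ["text", "text_type", "topics", "lewis_split", "cgis_split",
           "old_id", "new_id", "places", "people", "orgs", "exchanges",
           "date", "title"]

def create_table(connection, table_name):
    """Create a table with the 13 VARCHAR columns listed in the schemas."""
    column_sql = ", ".join(f'"{name}" VARCHAR' for name in COLUMNS)
    connection.execute(f'CREATE TABLE "{table_name}" ({column_sql})')

def insert_row(connection, table_name, row):
    """Insert one article given as a dict; missing columns become NULL."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    values = [row.get(name) for name in COLUMNS]
    connection.execute(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', values)

connection = sqlite3.connect(":memory:")
create_table(connection, "modapte_test")

# Invented rows for illustration only.
insert_row(connection, "modapte_test",
           {"title": "PROFIT FALLS", "topics": "['earn']", "new_id": "1"})
insert_row(connection, "modapte_test",
           {"title": "MERGER NEWS", "topics": "['acq']", "new_id": "2"})

# Select the titles of articles whose topics text mentions "earn".
titles = [row[0] for row in connection.execute(
    'SELECT "title" FROM modapte_test WHERE "topics" LIKE ? ORDER BY "new_id"',
    ("%earn%",))]
print(titles)
```

The table name is interpolated directly into the SQL, so it should only ever come from trusted code; the row values go through `?` placeholders.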