Reuters-21578 (Text Categorization)
Reuters financial newswire service in 1987
By Huggingface Hub [source]
About this dataset
The Reuters-21578 dataset is a collection of newswire articles from the Reuters financial newswire service and one of the most widely used benchmarks for text categorization research. It covers topics commonly reported in financial news and is provided in several predefined train/test splits.
Each file contains the following columns: text (the full body of the article), text_type (the type of the article text; in the original Reuters-21578 markup this is NORM, BRIEF, or UNPROC), topics (the topics assigned to the document), lewis_split (the document's assignment in the split used in Lewis's experiments), cgis_split (the document's training or test assignment in the Carnegie Group experiments reported by Hayes and colleagues), places, people, orgs, and exchanges (the corresponding category labels assigned to the article), date, and title. Articles that a given split does not use are provided in separate files (ModApte_unused.csv and ModLewis_unused.csv). Together, these fields support training and evaluating topic classifiers on financial news.
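As a first step, it can help to confirm that a file actually contains the columns described above and to see how its rows are distributed across split labels. The following is a minimal sketch using only the Python standard library. The in-memory sample, its values (such as `TRAIN`, `NORM`, and the list-style `topics` cell), and the helper names are assumptions made for illustration, not taken from the real files.

```python
import csv
import io
from collections import Counter

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "text", "text_type", "topics", "lewis_split", "cgis_split",
    "places", "people", "orgs", "exchanges", "date", "title",
]

def load_rows(handle):
    """Read every row of one split file into a list of dicts."""
    return list(csv.DictReader(handle))

def missing_columns(rows):
    """Return the described columns that are absent from the loaded rows."""
    present = set(rows[0]) if rows else set()
    return [name for name in EXPECTED_COLUMNS if name not in present]

def value_counts(rows, column):
    """Count how many rows carry each value of the given column."""
    return Counter(row[column] for row in rows)

def make_sample():
    """Build a tiny in-memory CSV; every value here is invented for illustration."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS)
    writer.writeheader()
    for split, title in [("TRAIN", "Example A"), ("TRAIN", "Example B"), ("TEST", "Example C")]:
        writer.writerow({
            "text": "Placeholder article body.", "text_type": "NORM",
            "topics": "['earn']", "lewis_split": split, "cgis_split": "TRAINING-SET",
            "places": "['usa']", "people": "[]", "orgs": "[]", "exchanges": "[]",
            "date": "26-FEB-1987 15:01:01.79", "title": title,
        })
    buffer.seek(0)
    return buffer

rows = load_rows(make_sample())
print(missing_columns(rows))              # described columns that are absent
print(value_counts(rows, "lewis_split"))  # number of rows per split label
```

To run the same checks on a real file, pass an open file handle instead of the sample, for example `open("ModApte_train.csv", newline="", encoding="utf-8")`. The path and encoding are assumptions about where and how the file is stored.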
How to use the dataset
The range of topics and the predefined splits make Reuters-21578 well suited as a benchmark for text categorization research. Here are some tips for getting the most out of the dataset:
- Familiarize yourself with the columns: Before getting started, make sure you understand what each column means and which columns are essential for your research project.
- Use an appropriate split: Depending on your research goals, choose among the predefined training and test sets (ModApte_train/test, ModHayes_train/test, or ModLewis_train/test). You can also build custom splits that draw on the ModApte_unused and ModLewis_unused files if desired.
- Explore other methods: Text categorization is the usual task for this data, but other methods such as topic modelling or sentiment analysis may also uncover useful information.
- Leverage related packages: In Python, scikit-learn's text feature extraction tools (such as CountVectorizer and TfidfVectorizer) turn documents into feature vectors for models such as Naive Bayes or Random Forest classifiers, and NLTK ships a version of the Reuters corpus (nltk.corpus.reuters). In R, general text-mining packages such as tm provide comparable tools.
- Tackle low-level preprocessing tasks: Before building models, clean the input data first, in particular by removing invalid characters and symbols from languages other than English, which can hurt model accuracy. Stopword removal and stemming words to their root form may also improve overall performance.
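The preprocessing tip above can be sketched in a few lines of standard-library Python. This is only an illustration: the stopword list is deliberately tiny, and `crude_stem` is a hypothetical stand-in for a real stemmer (such as a Porter stemmer), not an implementation of one.

```python
import re
from collections import Counter

# A deliberately small stopword list for illustration; real projects use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for", "on", "said"}

def clean(text):
    """Lowercase the text and replace every run of non-letter characters with a space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()

def crude_stem(token):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Clean, tokenize on whitespace, drop stopwords, and stem what remains."""
    return [crude_stem(tok) for tok in clean(text).split() if tok not in STOPWORDS]

def bag_of_words(text):
    """Turn a document into a token-count feature vector."""
    return Counter(preprocess(text))

print(preprocess("The company said profits rose in 1987!"))
print(bag_of_words("Oil prices and oil exports"))
```

The resulting counts play the same role as the feature vectors produced by a library vectorizer. The suffix stripping is intentionally rough (it turns "prices" into "pric", for example), which is one reason to prefer an established stemmer in practice.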
Research Ideas
- Automated text classification - Using the data from the Reuters-21578 dataset, machine learning algorithms can be trained to assign newswire articles to their appropriate topics automatically. This saves time and reduces the amount of manual labeling required.
- Sentiment analysis - Analyzing the sentiment of individual news articles in the Reuters-21578 dataset could show how the tone of financial reporting varies across topics and over time, which may in turn inform investment research.
- Stock market predictions - Applying data mining techniques to the content of the news articles could reveal correlations between the topics or exchanges mentioned in an article and subsequent stock price movements. Such correlations could feed into algorithmic trading strategies that attempt to predict short-term price changes.
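To make the automated text classification idea concrete, here is a minimal sketch of a multinomial Naive Bayes topic classifier with add-one smoothing, written in plain Python. It assumes that each `topics` cell holds a Python-style list literal such as `"['earn', 'acq']"`, which may not match the actual CSV encoding. The training rows are invented, and a document with several topics simply contributes one training example per topic.

```python
import ast
import math
from collections import Counter, defaultdict

def parse_topics(cell):
    """Parse a topics cell, assumed to be a list literal like "['earn', 'acq']"."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return [str(v) for v in value] if isinstance(value, (list, tuple)) else []

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # training examples per topic
        self.word_counts = defaultdict(Counter)   # token counts per topic
        self.vocab = set()

    def fit(self, rows):
        for row in rows:
            tokens = tokenize(row["text"])
            for topic in parse_topics(row["topics"]):
                self.doc_counts[topic] += 1
                self.word_counts[topic].update(tokens)
                self.vocab.update(tokens)
        return self

    def predict(self, text):
        """Return the topic with the highest log posterior, or None if untrained."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for topic, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[topic].values())
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                count = self.word_counts[topic][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best, best_score = topic, score
        return best

# Invented training rows, shaped like the text and topics columns.
TRAIN_ROWS = [
    {"text": "company reports quarterly profit and dividend", "topics": "['earn']"},
    {"text": "net profit rose as earnings beat forecasts", "topics": "['earn']"},
    {"text": "opec agrees to cut crude oil output", "topics": "['crude']"},
    {"text": "crude oil prices fall on higher output", "topics": "['crude']"},
]

model = NaiveBayes().fit(TRAIN_ROWS)
print(model.predict("oil output rises"))   # crude
print(model.predict("quarterly profit"))   # earn
```

This predicts a single best topic per article. Because Reuters-21578 articles can carry several topics, a fuller treatment would train one binary classifier per topic or threshold the per-topic scores.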
Acknowledgements
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Columns
File: ModHayes_train.csv
| Column name | Description |
| --- | --- |
| text | The text of the article. (String) |
| text_type | The type of the article text; NORM, BRIEF, or UNPROC in the original Reuters-21578 markup. (String) |
| topics | The topics associated with the article. (String) |
| lewis_split | The Lewis split of the article. (String) |
| cgis_split | The CGIS split of the article. (String) |
| places | The places mentioned in the article. (String) |
| people | The people mentioned in the article. (String) |
| orgs | The organizations mentioned in the article. (String) |
| exchanges | The exchanges mentioned in the article. (String) |
| date | The date the article was published. (Date) |
| title | The title of the article. (String) |
File: ModHayes_test.csv
| Column name | Description |
| --- | --- |
| text | The text of the article. (String) |
| text_type | The type of the article text; NORM, BRIEF, or UNPROC in the original Reuters-21578 markup. (String) |
| topics | The topics associated with the article. (String) |
| lewis_split | The Lewis split of the article. (String) |
| cgis_split | The CGIS split of the article. (String) |
| places | The places mentioned in the article. (String) |
| people | The people mentioned in the article. (String) |
| orgs | The organizations mentioned in the article. (String) |
| exchanges | The exchanges mentioned in the article. (String) |
| date | The date the article was published. (Date) |
| title | The title of the article. (String) |
File: ModApte_unused.csv
| Column name | Description |
| --- | --- |
| text | The text of the article. (String) |
| text_type | The type of the article text; NORM, BRIEF, or UNPROC in the original Reuters-21578 markup. (String) |
| topics | The topics associated with the article. (String) |
| lewis_split | The Lewis split of the article. (String) |
| cgis_split | The CGIS split of the article. (String) |
| places | The places mentioned in the article. (String) |
| people | The people mentioned in the article. (String) |
| orgs | The organizations mentioned in the article. (String) |
| exchanges | The exchanges mentioned in the article. (String) |
| date | The date the article was published. (Date) |
| title | The title of the article. (String) |
File: ModApte_test.csv
| Column name | Description |
| --- | --- |
| text | The text of the article. (String) |
| text_type | The type of the article text; NORM, BRIEF, or UNPROC in the original Reuters-21578 markup. (String) |
| topics | The topics associated with the article. (String) |
| lewis_split | The Lewis split of the article. (String) |
| cgis_split | The CGIS split of the article. (String) |
| places | The places mentioned in the article. (String) |
| people | The people mentioned in the article. (String) |
| orgs | The organizations mentioned in the article. (String) |
| exchanges | The exchanges mentioned in the article. (String) |
| date | The date the article was published. (Date) |
| title | The title of the article. (String) |
File: ModLewis_train.csv
| Column name | Description |
| --- | --- |
| text | The text of the article. (String) |
| text_type | The type of the article text; NORM, BRIEF, or UNPROC in the original Reuters-21578 markup. (String) |
| topics | The topics associated with the article. (String) |
| lewis_split | The Lewis split of the article. (String) |
| cgis_split | The CGIS split of the article. (String) |
| places | The places mentioned in the article. (String) |
| people | The people mentioned in the article. (String) |
| orgs | The organizations mentioned in the article. (String) |
| exchanges | The exchanges mentioned in the article. (String) |
| date | The date the article was published. (Date) |
| title | The title of the article. (String) |
File: ModLewis_test.csv
| Column name | Description |
| --- | --- |
| text | The text of the article. (String) |
| text_type | The type of the article text; NORM, BRIEF, or UNPROC in the original Reuters-21578 markup. (String) |
| topics | The topics associated with the article. (String) |
| lewis_split | The Lewis split of the article. (String) |
| cgis_split | The CGIS split of the article. (String) |
| places | The places mentioned in the article. (String) |
| people | The people mentioned in the article. (String) |
| orgs | The organizations mentioned in the article. (String) |
| exchanges | The exchanges mentioned in the article. (String) |
| date | The date the article was published. (Date) |
| title | The title of the article. (String) |
File: ModApte_train.csv
| Column name | Description |
| --- | --- |
| text | The text of the article. (String) |
| text_type | The type of the article text; NORM, BRIEF, or UNPROC in the original Reuters-21578 markup. (String) |
| topics | The topics associated with the article. (String) |
| lewis_split | The Lewis split of the article. (String) |
| cgis_split | The CGIS split of the article. (String) |
| places | The places mentioned in the article. (String) |
| people | The people mentioned in the article. (String) |
| orgs | The organizations mentioned in the article. (String) |
| exchanges | The exchanges mentioned in the article. (String) |
| date | The date the article was published. (Date) |
| title | The title of the article. (String) |
File: ModLewis_unused.csv
| Column name | Description |
| --- | --- |
| text | The text of the article. (String) |
| text_type | The type of the article text; NORM, BRIEF, or UNPROC in the original Reuters-21578 markup. (String) |
| topics | The topics associated with the article. (String) |
| lewis_split | The Lewis split of the article. (String) |
| cgis_split | The CGIS split of the article. (String) |
| places | The places mentioned in the article. (String) |
| people | The people mentioned in the article. (String) |
| orgs | The organizations mentioned in the article. (String) |
| exchanges | The exchanges mentioned in the article. (String) |
| date | The date the article was published. (Date) |
| title | The title of the article. (String) |
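Because the date column is typically read in as text, it usually needs to be parsed before any time-based analysis. The sketch below assumes the timestamp layout used in the original Reuters-21578 distribution (for example `26-FEB-1987 15:01:01.79`). The CSV files may store dates differently, so treat `DATE_FORMAT` as an assumption to verify against the actual data. Month-name matching also depends on an English locale.

```python
from datetime import datetime

# Assumed layout, based on the original Reuters-21578 markup; verify against the CSV.
DATE_FORMAT = "%d-%b-%Y %H:%M:%S.%f"

def parse_date(cell):
    """Parse one date cell, returning None when it does not match the assumed layout."""
    try:
        return datetime.strptime(cell.strip(), DATE_FORMAT)
    except ValueError:
        return None

print(parse_date(" 26-FEB-1987 15:01:01.79"))
print(parse_date("not a date"))
```

Returning `None` instead of raising makes it easy to count how many rows fail to parse, which is a quick way to check whether the assumed format fits the files.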
Acknowledgements
If you use this dataset in your research, please credit Huggingface Hub.