A Comprehensive Collection of more than 1 Million News Articles

News Dataset

Context

This dataset comprises news articles collected over the past few months using the NewsAPI. The primary motivation behind curating this dataset was to develop and experiment with various natural language processing (NLP) models. The dataset aims to support the creation of text summarization models, sentiment analysis models, and other NLP applications.

Sources

The data is sourced from the NewsAPI, a comprehensive and up-to-date news aggregation service. The API provides access to a wide range of news articles from various reputable sources, making it a valuable resource for constructing a diverse and informative dataset.

Data Fetching Script

The data for this dataset was collected using a custom Python script, dailyWorker.py. The script uses the NewsAPI to gather news articles over a specified period.

Feel free to explore and modify the script to suit your data collection needs. If you have any questions or suggestions for improvement, please don't hesitate to reach out.
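As a rough illustration of what such a collection step might look like, here is a minimal sketch that requests one page of results from the NewsAPI `/v2/everything` endpoint for a given date range. It is not the contents of dailyWorker.py. The function names `build_url` and `fetch_articles`, the query term, and the fixed `language` setting are assumptions made for the example, and you need your own API key.

```python
import json
import urllib.parse
import urllib.request

# Public NewsAPI endpoint for searching articles.
NEWSAPI_EVERYTHING = "https://newsapi.org/v2/everything"


def build_url(query, from_date, to_date, api_key, page=1):
    """Build the request URL for one page of results (hypothetical helper)."""
    params = {
        "q": query,
        "from": from_date,   # ISO 8601 date, e.g. "2024-01-01"
        "to": to_date,
        "language": "en",    # assumption: English-only collection
        "pageSize": 100,     # NewsAPI allows at most 100 results per page
        "page": page,
        "apiKey": api_key,
    }
    return NEWSAPI_EVERYTHING + "?" + urllib.parse.urlencode(params)


def fetch_articles(query, from_date, to_date, api_key, page=1):
    """Fetch one page and return the raw list of article dicts."""
    url = build_url(query, from_date, to_date, api_key, page)
    # HTTP-level failures (bad key, rate limit) raise urllib.error.HTTPError here.
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    if payload.get("status") != "ok":
        raise RuntimeError(payload.get("message", "NewsAPI request failed"))
    return payload.get("articles", [])
```

A daily job could call `fetch_articles` once per page for the previous day's date and append the results to a file. Paging limits and rate limits depend on your NewsAPI plan.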

Inspiration

The inspiration behind collecting this dataset stems from the growing interest in NLP applications and the need for high-quality, real-world data to train and evaluate these models effectively. By leveraging the NewsAPI, we aim to contribute to the development of robust text summarization and sentiment analysis models that can better understand and process news content.

Dataset Features

Text of news articles
Publication date and time
Source information
Any additional metadata available through the NewsAPI
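To make the feature list above concrete, the following sketch maps one raw NewsAPI article object onto a small record type. The keys `source`, `title`, `url`, `publishedAt`, `description`, and `content` come from the NewsAPI response format. The `NewsRecord` class, its field names, and the `to_record` helper are hypothetical and may not match the column names actually used in this dataset.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class NewsRecord:
    """Hypothetical in-memory shape for one article."""
    text: str
    published_at: Optional[datetime]
    source_name: str
    title: str
    url: str


def to_record(article: dict) -> NewsRecord:
    """Convert a raw NewsAPI article dict into a NewsRecord."""
    source = article.get("source") or {}
    raw_date = article.get("publishedAt")
    # NewsAPI timestamps are UTC and end in "Z"; older Python versions of
    # datetime.fromisoformat need an explicit "+00:00" offset instead.
    published = (
        datetime.fromisoformat(raw_date.replace("Z", "+00:00"))
        if raw_date
        else None
    )
    return NewsRecord(
        # Fall back to the description when no content field is present.
        text=article.get("content") or article.get("description") or "",
        published_at=published,
        source_name=source.get("name") or "",
        title=article.get("title") or "",
        url=article.get("url") or "",
    )
```

Parsing the timestamp into a timezone-aware `datetime` up front makes later filtering by publication date straightforward.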

Potential Use Cases

Text Summarization: Develop models to generate concise and informative summaries of news articles.
Sentiment Analysis: Analyze the sentiment expressed in news articles to understand public opinion.
Topic Modeling: Explore trends and topics within the news data.

Note:
Please refer to the NewsAPI documentation for terms of use and ensure compliance with their policies when using this dataset.

Related Datasets

Global Public Holidays And Calendar Events (@blt)
Comprehensive News Articles Dataset (@kaggle)
Country Mentions In GDELT 2.0 Events (@owid)
Ethnic Power Relations Dataset (ETH, 2021) (@owid)
Global Forest Resources Assessment (@owid)
Media Mentions Of Causes Of Death (@owid)
