
One Week Of Global News Feeds

Kaggle

@kaggle.therohk_global_news_week


7 days of tracking 20k news feeds worldwide

Dataset Description

Context

This dataset is a snapshot of most of the news content newly published online over one week. It covers the seven-day period of August 24 through August 30 in each of the years 2017 and 2018.

Articles per year: 1,398,431 in 2017; 1,912,872 in 2018.

In total it includes approximately 3.3 million articles from 20,000 news sources in 20+ languages.

This dataset has just four fields:

  • publish_time - earliest known time the URL appeared online, in yyyyMMddHHmm format, IST timezone
  • feed_code - unique identifier for the publisher or domain
  • source_url - URL of the article
  • headline_text - headline of the article (UTF-8, any language)
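
The publish_time format above can be turned into a proper timestamp with a few lines of Python. This is a minimal sketch, not part of the original dataset tooling: it assumes "IST" means India Standard Time (UTC+05:30), and the function name parse_publish_time and the sample value are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "IST" in the field description is India Standard Time (UTC+05:30).
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def parse_publish_time(raw):
    """Parse a yyyyMMddHHmm value (string or int) into a timezone-aware datetime."""
    text = str(raw).strip()
    if len(text) != 12 or not text.isdigit():
        raise ValueError(f"expected 12 digits in yyyyMMddHHmm form, got {raw!r}")
    # strptime validates the calendar fields; the result is then tagged as IST.
    return datetime.strptime(text, "%Y%m%d%H%M").replace(tzinfo=IST)


# Hypothetical sample value, not taken from the dataset.
example = parse_publish_time("201808240530")
as_utc = example.astimezone(timezone.utc)
print(example.isoformat(), "->", as_utc.isoformat())
```

Converting to UTC before comparing timestamps avoids off-by-a-day surprises at the edges of the August 24 to August 30 window, since the raw values are in local IST time.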

See the "Basic Feed-Code Exploration" notebook for a quick look at the dataset contents.

Inspiration

The sources include news feeds, news websites, government agencies, tech journals, company websites, blogs and Wikipedia updates. The data was collected by polling RSS feeds and by crawling other large news aggregators.

As of 2018, these seven-day slices were selected because there was no downtime or outage during the intervals. Publishers produce new news content at this rate every day, throughout the year.

Acknowledgements

This dataset is free to use with the following citation:

Rohit Kulkarni (2018), One Week of Global Feeds [News CSV Dataset], doi:10.7910/DVN/ILAT5B, Retrieved from: [this url]

Original paper by M. Trampus and B. Novak: Internals of an Aggregated Web News Feed

Hosted by: Jožef Stefan Institute, Slovenia (http://ailab.ijs.si/si/people)

Further Exploration and Live News: (eventregistry.org)

