Yet Another Chinese News Dataset

With Article Titles, Descriptions, Cover Images, and Links.

@kaggle.ceshine_yet_another_chinese_news_dataset

About this Dataset

A collection of news articles in Traditional and Simplified Chinese. It includes some Internet news outlets that are NOT Chinese state media (they deserve a separate dataset).

Complete coverage is not guaranteed, so this dataset is not suitable for analyzing event coverage. It is meant to be used as a corpus for NLP algorithms.

Data Collection Process

  1. The links to the news articles were collected from the RSS feeds or the Twitter accounts of the news outlets.
  2. The web pages were downloaded and parsed, and the meta tags were used to extract the title, description/summary, and cover image of each article. (These are the fields used in Twitter and Facebook summary cards.)

Note: Only minimal text cleaning has been performed on the meta tags.
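
The extraction step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the scraper that produced the dataset. The names `MetaTagParser` and `extract_card_fields` are hypothetical, and the fallback order between the Open Graph and Twitter tags is an assumption based on the order in which the tags are listed in the field descriptions.

```python
from html.parser import HTMLParser

# Assumed preference order, taken from the order in which the tags are
# listed in the field descriptions; the original scraper may differ.
FIELD_TAGS = {
    "title": ("og:title", "twitter:title"),
    "desc": ("twitter:description", "og:description"),
    "image": ("twitter:image", "og:image"),
}


class MetaTagParser(HTMLParser):
    """Collects the content of <meta property=...> and <meta name=...> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content:
            # Keep the first occurrence of each tag; strip whitespace only
            # (the dataset notes that text cleaning was minimal).
            self.meta.setdefault(key, content.strip())


def extract_card_fields(html):
    """Return title, desc, and image from a page's summary-card meta tags."""
    parser = MetaTagParser()
    parser.feed(html)
    record = {}
    for field, tags in FIELD_TAGS.items():
        # Use the first listed tag that is present; None if neither exists.
        record[field] = next(
            (parser.meta[t] for t in tags if t in parser.meta), None
        )
    return record
```

Calling `extract_card_fields(page_html)` on a downloaded page returns a dictionary with the three card fields, with `None` for any field whose tags are missing.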

Data Fields

  1. title: Article title from the og:title or twitter:title meta tag.
  2. desc: Article summary from the twitter:description or og:description meta tag.
  3. image: URL of the cover image from the twitter:image or og:image meta tag.
  4. url: URL of the article.
  5. source: The code of the news outlet.
  6. date: The publication date of the article on Twitter or in the RSS feed. Format: YYYYMMDD
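
As a rough illustration of working with these fields, the sketch below reads records and converts the YYYYMMDD date into a `datetime.date`. It assumes the data is available as CSV with a header row that uses the field names above; the storage format is not stated here, so that is an assumption. The helper names `parse_row` and `read_records` are hypothetical.

```python
import csv
import io
from datetime import date, datetime

# Field names as listed in the dataset description.
FIELDS = ("title", "desc", "image", "url", "source", "date")


def parse_row(row):
    """Keep the six documented fields and parse the YYYYMMDD date."""
    record = {name: row.get(name, "") for name in FIELDS}
    # str() guards against loaders that read the date column as an integer.
    record["date"] = datetime.strptime(
        str(record["date"]).strip(), "%Y%m%d"
    ).date()
    return record


def read_records(fileobj):
    """Read all rows from a CSV file object (CSV layout is an assumption)."""
    return [parse_row(row) for row in csv.DictReader(fileobj)]
```

For example, with a made-up row (not taken from the dataset):

```python
sample = io.StringIO(
    "title,desc,image,url,source,date\n"
    "標題,摘要,https://example.com/a.jpg,https://example.com/a,example,20190801\n"
)
records = read_records(sample)
print(records[0]["date"].isoformat())
```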

This dataset does not provide the full text of the articles. You'll need to scrape it yourself using the links provided.
