With Article Titles, Descriptions, Cover Images, and Links.

A collection of news articles in Traditional and Simplified Chinese. It includes some Internet news outlets that are NOT Chinese state media (state media deserve a separate dataset).

Complete coverage is not guaranteed, so this dataset is not suitable for analyzing event coverage. It is meant to be used as a corpus for NLP algorithms.

Data Collection Process

The links to the news articles were collected from the RSS feeds or the Twitter accounts of the news outlets.
The web pages were then downloaded and parsed, and the meta tags were used to extract the title, description/summary, and cover image of each article. (These are the same fields that Twitter and Facebook use in their summary cards.)

Note: Only minimal text cleaning has been performed on the meta tags.
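
The extraction step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the script that produced the dataset. The names MetaTagParser, first_present, and extract_card_fields are hypothetical, and treating the tag order listed under Data Fields as a fallback priority is an assumption.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta> tags keyed by their `property` or `name` attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content:
            # Keep the first occurrence of each key; strip surrounding
            # whitespace only (i.e. minimal cleaning).
            self.meta.setdefault(key, content.strip())


def first_present(meta, keys):
    """Return the value of the first key found in `meta`, else None."""
    for key in keys:
        if key in meta:
            return meta[key]
    return None


def extract_card_fields(html):
    """Extract title, desc, and image from summary-card meta tags.

    Assumption: the order of tags below is a fallback priority.
    """
    parser = MetaTagParser()
    parser.feed(html)
    meta = parser.meta
    return {
        "title": first_present(meta, ["og:title", "twitter:title"]),
        "desc": first_present(meta, ["twitter:description", "og:description"]),
        "image": first_present(meta, ["twitter:image", "og:image"]),
    }
```

Calling extract_card_fields on the HTML of an article page returns a dict with the three fields, using None for any field whose meta tags are absent.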

Data Fields

title: Article title from og:title or twitter:title meta tag.
desc: Article summary from twitter:description or og:description meta tag.
image: URL to the cover image from twitter:image or og:image meta tag.
url: URL of the article.
source: The code of the news outlet.
date: The publication date of the article as given on Twitter or in the RSS feed. Format: YYYYMMDD
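
As a rough sketch of how one record with these fields might be handled in code, the example below maps a row to a typed object and parses the YYYYMMDD date. The Article class and parse_record function are hypothetical, and the assumption that each row arrives as a dict of strings (for example from csv.DictReader) is not stated by the dataset. Treating empty strings as missing values is likewise an assumption.

```python
import datetime as dt
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    title: Optional[str]
    desc: Optional[str]
    image: Optional[str]
    url: str
    source: str
    date: dt.date


def parse_record(row):
    """Convert a dict of strings into an Article.

    Assumption: empty strings mean the meta tag was missing.
    """
    return Article(
        title=row.get("title") or None,
        desc=row.get("desc") or None,
        image=row.get("image") or None,
        url=row["url"],
        source=row["source"],
        # The date field uses the YYYYMMDD format, e.g. "20200131".
        date=dt.datetime.strptime(str(row["date"]), "%Y%m%d").date(),
    )
```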

This dataset does not provide the full text of the articles. You'll need to scrape it yourself using the links provided.
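
For anyone fetching the pages themselves, a minimal download helper might look like the sketch below. It only retrieves and decodes the HTML. Extracting the article body is site-specific and is not shown. The function names, the User-Agent string, and the charset fallback strategy are all assumptions made for illustration. The fallback is included because Chinese-language pages are sometimes served in encodings other than UTF-8, such as Big5 or GBK.

```python
import urllib.request
from typing import Optional


def decode_html(raw: bytes, header_charset: Optional[str] = None) -> str:
    """Decode page bytes, trying the declared charset first, then UTF-8."""
    for encoding in (header_charset, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name or bytes that do not fit: try the next one.
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one article page and return its HTML as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-corpus-fetcher"}  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset()
        return decode_html(response.read(), charset)
```

When scraping many links, it is worth adding a delay between requests and checking each outlet's terms of use.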

Yet Another Chinese News Dataset
