A collection of news articles in Traditional and Simplified Chinese. It includes some Internet news outlets that are NOT Chinese state media (they deserve a separate dataset).
Complete coverage is not guaranteed, so this dataset is not suitable for analyzing event coverage. It is meant to be used as a corpus for NLP algorithms.
Data Collection Process
- The links to the news articles were collected from the RSS feeds or the Twitter accounts of the news outlets.
- The web pages were downloaded and parsed, and the meta tags were used to extract the title, description/summary, and cover image of each article. (These are the fields used in Twitter and Facebook summary cards.)
Note: Only minimal text cleaning has been performed on the meta tags.
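The extraction step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual collection script. The names `FIELD_TAGS`, `MetaTagParser`, and `extract_card_fields` are hypothetical, and the fallback order for each field is an assumption based on the order in which the tags are listed under Data Fields.

```python
from html.parser import HTMLParser

# Assumption: for each field, the first tag listed is tried first and the
# second is used as a fallback. The real pipeline may differ.
FIELD_TAGS = {
    "title": ("og:title", "twitter:title"),
    "desc": ("twitter:description", "og:description"),
    "image": ("twitter:image", "og:image"),
}


class MetaTagParser(HTMLParser):
    """Collects the content of <meta property=...> and <meta name=...> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph tags usually use "property"; Twitter tags usually use "name".
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        # Keep the first occurrence of each tag and apply only minimal cleaning.
        if key and content and key not in self.meta:
            self.meta[key] = content.strip()


def extract_card_fields(html):
    """Return a dict with title, desc, and image (None when a tag is missing)."""
    parser = MetaTagParser()
    parser.feed(html)
    return {
        field: next((parser.meta[t] for t in tags if t in parser.meta), None)
        for field, tags in FIELD_TAGS.items()
    }
```

Calling `extract_card_fields(page_html)` on a downloaded page returns the three card fields; `url`, `source`, and `date` come from the feed or tweet rather than from the page itself.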
Data Fields
- title: Article title from the og:title or twitter:title meta tag.
- desc: Article summary from the twitter:description or og:description meta tag.
- image: URL of the cover image from the twitter:image or og:image meta tag.
- url: URL of the article.
- source: The code of the news outlet.
- date: The publication date of the article on Twitter or in the RSS feed. Format: YYYYMMDD
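Since the date field is a plain YYYYMMDD string, it usually needs to be parsed before filtering or sorting. A minimal sketch follows; `parse_publish_date` is a hypothetical helper, and the sample record contains made-up placeholder values rather than real rows from the dataset.

```python
from datetime import date, datetime


def parse_publish_date(value: str) -> date:
    """Parse a YYYYMMDD string into a datetime.date."""
    # Reject anything that is not exactly eight digits before parsing.
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {value!r}")
    return datetime.strptime(value, "%Y%m%d").date()


# Hypothetical record with the six fields listed above (placeholder values).
record = {
    "title": "Example headline",
    "desc": "Example summary",
    "image": "https://example.com/cover.jpg",
    "url": "https://example.com/news/12345",
    "source": "example",
    "date": "20210315",
}

published = parse_publish_date(record["date"])
```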
This dataset does not provide the full text of the articles. You'll need to scrape it yourself using the links provided.
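One possible starting point for scraping is sketched below, using only the standard library. It is a naive approach that fetches a page and keeps the text of every `<p>` element. The names `fetch_html`, `ParagraphCollector`, and `extract_paragraphs` are hypothetical, the User-Agent string is a placeholder, and each outlet will most likely need its own extraction rules.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphCollector(HTMLParser):
    """Collects the text inside <p> elements, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buf = []
        self.paragraphs = []

    def _flush(self):
        text = "".join(self._buf).strip()
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> also ends any paragraph that was left unclosed.
            self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def extract_paragraphs(html):
    """Return the non-empty paragraph texts found in an HTML string."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # keep a trailing paragraph that was never closed
    return collector.paragraphs


def fetch_html(url, timeout=10):
    """Download a page and decode it, falling back to UTF-8."""
    request = Request(url, headers={"User-Agent": "news-corpus-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For a given row, `extract_paragraphs(fetch_html(row["url"]))` returns a list of paragraph strings. Navigation text and captions will often slip through, so expect to filter the result per site.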