Baselight

COVID News Articles (2020 - 2022)

Global news articles covering the two-year peak of the COVID pandemic

@kaggle.timmayer_covid_news_articles_2020_2022

About this Dataset

The dataset contains approximately half a million news articles collected over a two-year period covering the onset and surge of the Coronavirus pandemic. It has three columns: title, content and category. title is the headline of the news article, content is the text of the article itself, and category denotes the high-level context of the article.
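As a minimal sketch of working with the three-column layout, the snippet below counts articles per category. It assumes the data is available as a CSV with a header row of title, content and category; the sample rows and the function name category_counts are hypothetical and only stand in for the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical rows standing in for the real file; only the column names
# (title, content, category) come from the dataset description.
SAMPLE_CSV = """title,content,category
"Vaccine rollout begins","Health officials started distributing doses today.",general
"Markets react to lockdown","Stocks fell as new restrictions were announced.",business
"Hospitals near capacity","Intensive care units report rising admissions.",general
"""


def category_counts(csv_file):
    """Count articles per category from a file-like object with a header row."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        counts[row["category"]] += 1
    return counts


counts = category_counts(io.StringIO(SAMPLE_CSV))
print(counts.most_common())
```

To run this against a real export, an opened file handle could be passed in place of the in-memory sample.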

This dataset can be used to pre-train large language models (LLMs) and to demonstrate downstream NLP tasks such as binary or multi-class text classification. It can also be used to study how the behavior of language models changes when there is a shift in data. For example, the classic Transformer-based BERT model was trained before the COVID era. By training a masked language model (MLM) on this dataset, we can compare the behavior of the original BERT model with that of the newly trained models.
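To illustrate what preparing article text for masked language modeling might look like, here is a simplified sketch of the masking step. It makes several assumptions: tokens are split on whitespace rather than with a subword tokenizer such as BERT's WordPiece, every selected token is replaced by a [MASK] placeholder (without the random-replacement and keep-unchanged variants some recipes use), and the 15% masking rate, the example sentence and the function name mask_tokens are all illustrative.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(text, mask_prob=0.15, seed=0):
    """Randomly mask whitespace-separated tokens.

    Returns (masked_tokens, labels). labels[i] holds the original token
    where position i was masked, and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same mask
    masked, labels = [], []
    for token in text.split():
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Hypothetical article text, not taken from the dataset.
sentence = "Health officials started distributing vaccine doses across the country today"
masked, labels = mask_tokens(sentence)
print(" ".join(masked))
```

A model trained on such pairs learns to predict the original token at each masked position, which is where vocabulary that emerged during the pandemic would be expected to show up.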
