190k+ Medium articles with text content, title, publication date, and tags

Data source

This data was collected by scraping published articles from the Medium website.

Data description

Each row in the data is a different article published on Medium. For each article, you have the following features:

title [string]: The title of the article.
text [string]: The text content of the article.
url [string]: The URL associated with the article.
authors [list of strings]: The article authors.
timestamp [string]: The publication datetime of the article.
tags [list of strings]: List of tags associated with the article.
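
As a rough illustration of the schema above, the sketch below reads rows and decodes the two list-valued columns. It assumes the data ships as a CSV in which authors and tags are serialized as Python-style list literals; the sample rows, the timestamp format, and the parse_rows helper are all hypothetical, so adjust them to the actual file.

```python
import ast
import csv
import io

# Hypothetical two-row sample that mimics the described columns; not real data.
SAMPLE_CSV = '''title,text,url,authors,timestamp,tags
"First post","Some text.","https://medium.com/a","['Alice']","2020-03-01 10:00:00+00:00","['Python', 'Data Science']"
"Second post","More text.","https://medium.com/b","['Bob', 'Carol']","2021-07-15 08:30:00+00:00","['Writing']"
'''

# Columns assumed to hold a stringified list of strings.
LIST_COLUMNS = ("authors", "tags")


def parse_rows(handle):
    """Yield one dict per article, turning list-valued columns into real lists."""
    for row in csv.DictReader(handle):
        for column in LIST_COLUMNS:
            # literal_eval only evaluates literals, so it is safe on untrusted text.
            row[column] = ast.literal_eval(row[column])
        yield row


rows = list(parse_rows(io.StringIO(SAMPLE_CSV)))
print(rows[0]["title"], rows[0]["tags"])
```

For the real file, replace io.StringIO(SAMPLE_CSV) with an open file handle (opened with newline="" and an explicit encoding).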

Data analysis

You can find a very quick data analysis in this notebook.

What can I do with this data?

Train a multilabel classification model that assigns tags to articles.
Train a seq2seq model that generates article titles from the article text.
Run text analysis on the article content.
Fine-tune text generation models on the general domain of Medium, or on specific domains by filtering articles by the appropriate tags.
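
Two of the ideas above, filtering by tag and preparing multilabel targets, can be sketched in a few lines. This is only a minimal example under stated assumptions: articles are dicts with a tags list, tag matching is case-insensitive, and the helper names filter_by_tags and multi_hot are hypothetical. The sample articles are made up.

```python
def filter_by_tags(articles, wanted):
    """Keep articles that carry at least one of the wanted tags (case-insensitive)."""
    wanted = {tag.lower() for tag in wanted}
    return [a for a in articles if wanted & {tag.lower() for tag in a["tags"]}]


def multi_hot(articles):
    """Return (vocabulary, vectors): one 0/1 vector per article over the sorted tag set."""
    vocab = sorted({tag for a in articles for tag in a["tags"]})
    index = {tag: i for i, tag in enumerate(vocab)}
    vectors = []
    for a in articles:
        vector = [0] * len(vocab)
        for tag in a["tags"]:
            vector[index[tag]] = 1
        vectors.append(vector)
    return vocab, vectors


# Made-up sample articles for illustration only.
articles = [
    {"title": "Intro to pandas", "tags": ["Python", "Data Science"]},
    {"title": "On writing daily", "tags": ["Writing"]},
    {"title": "Async Python", "tags": ["Python"]},
]

python_only = filter_by_tags(articles, ["python"])
vocab, vectors = multi_hot(articles)
print([a["title"] for a in python_only])
print(vocab, vectors)
```

With the full dataset the tag vocabulary is likely to be very large, so in practice you would probably restrict it to the most frequent tags before building the vectors.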

Collection methodology

Scraping has been done with Python and the requests library. Starting from a random article on Medium, the next articles to scrape are selected by visiting:

The author archive pages.
The publication archive pages (if present).
The tags archives (if present).
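
The traversal described above can be pictured as a graph walk over articles. The sketch below is not the original scraper: it assumes a breadth-first order and a visited set, neither of which is stated in the methodology, and it replaces all network access with a hypothetical in-memory link graph. The related_articles callable stands in for "fetch the author, publication, and tag archive pages of this article and return the article URLs found there".

```python
from collections import deque


def crawl(start_url, related_articles, limit):
    """Walk articles breadth-first, visiting each URL at most once, up to `limit`."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        # In a real scraper this call would hit the archive pages of `url`.
        for candidate in related_articles(url):
            if candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
    return visited


# Hypothetical link graph: article -> articles reachable through its archives.
GRAPH = {
    "a1": ["a2", "a3"],
    "a2": ["a1", "a4"],
    "a3": ["a4"],
    "a4": [],
}

order = crawl("a1", lambda url: GRAPH.get(url, []), limit=10)
print(order)
```

A walk like this only reaches articles connected to the starting point, which is one plausible reason the resulting sample is not uniform.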

The article HTML pages have been parsed with the newspaper Python library.

Only English-language articles have been kept; the language of each article was detected with the Python langdetect library.
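
A language filter of this kind might look like the sketch below. To keep the example runnable without extra dependencies, the detector is passed in as a function; with langdetect installed you would pass langdetect.detect, which returns a language code such as "en" and raises an exception on text it cannot analyse. The keep_english helper and the toy detector are hypothetical and only for illustration.

```python
def keep_english(articles, detect_language):
    """Return the articles whose text is detected as English ("en")."""
    kept = []
    for article in articles:
        try:
            language = detect_language(article["text"])
        except Exception:
            # Detectors can fail on empty or symbol-only text; drop those articles.
            continue
        if language == "en":
            kept.append(article)
    return kept


def toy_detect(text):
    """Hypothetical stand-in for a real detector such as langdetect.detect."""
    if not text.strip():
        raise ValueError("no text to analyse")
    return "it" if "ciao" in text.lower() else "en"


# Made-up sample articles for illustration only.
articles = [
    {"title": "Hello", "text": "This is an English article."},
    {"title": "Saluti", "text": "Ciao a tutti, questo è un articolo."},
    {"title": "Empty", "text": "   "},
]

english = keep_english(articles, toy_detect)
print([a["title"] for a in english])
```

Note that langdetect is non-deterministic by default on short or ambiguous text; its documentation suggests setting DetectorFactory.seed for reproducible results.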

As a consequence of the collection methodology, the publication dates of the scraped articles are not uniformly distributed. The dataset contains articles published in both 2016 and 2022, for example, but not in equal numbers. In particular, articles published in 2020 are strongly over-represented. Have a look at the accompanying notebook to see the distribution of the publication dates.
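
If you want to check the skew yourself, counting articles per publication year takes only a few lines. The sketch below assumes the timestamp strings are ISO 8601 formatted, which is not confirmed here, and uses made-up timestamps; the year_counts helper is hypothetical.

```python
from collections import Counter
from datetime import datetime


def year_counts(timestamps):
    """Count articles per publication year from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).year for ts in timestamps)


# Made-up timestamps for illustration only.
timestamps = [
    "2016-05-02 09:15:00+00:00",
    "2020-01-10 12:00:00+00:00",
    "2020-06-21 18:45:00+00:00",
    "2020-11-03 07:30:00+00:00",
    "2022-02-14 22:10:00+00:00",
]

counts = year_counts(timestamps)
for year in sorted(counts):
    print(year, counts[year])
```

If you need a balanced sample across years, you could use these counts to downsample the over-represented years.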
