190k+ Medium Articles
190k+ Medium articles with text content, title, publication date, and tags
@kaggle.fabiochiusano_medium_articles
This data has been collected through a standard scraping process from the Medium website, looking for published articles.
Each row in the data is a different article published on Medium. For each article, you have the following features: title, text, url, authors, timestamp, and tags.
You can find a very quick data analysis in this notebook.
Scraping has been done with Python and the requests library. Starting from a random article on Medium, the next articles to scrape are selected by visiting:
The article HTML pages have been parsed with the newspaper Python library.
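As a rough illustration of this parsing step, the sketch below feeds already-downloaded HTML to `newspaper.Article` and maps the result onto the six columns of this dataset. The helper names `parse_html` and `article_to_row`, and the choice of which `Article` attribute fills which column, are assumptions made for illustration; the original scraper may well differ.

```python
COLUMNS = ("title", "text", "url", "authors", "timestamp", "tags")


def article_to_row(article, url):
    """Map an Article-like object onto the dataset columns (assumed mapping)."""
    published = getattr(article, "publish_date", None)
    return {
        "title": article.title,
        "text": article.text,
        "url": url,
        "authors": list(article.authors),
        # publish_date may be missing when the page carries no usable metadata
        "timestamp": str(published) if published is not None else None,
        "tags": sorted(article.tags),
    }


def parse_html(url, html):
    """Parse raw HTML with newspaper; requires the newspaper3k package."""
    # Imported lazily so that article_to_row stays usable without the package.
    from newspaper import Article

    article = Article(url)
    article.download(input_html=html)  # reuse the supplied HTML instead of fetching
    article.parse()
    return article_to_row(article, url)
```

Keeping the column mapping in its own function makes it possible to check the row layout without installing newspaper or touching the network.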
Published articles have been filtered for English articles only, using the Python langdetect library.
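A minimal sketch of such a language filter is shown below. It assumes that texts which langdetect cannot classify are simply dropped, and it fixes the detector seed for repeatable results; both choices, and the helper names `keep_english` and `langdetect_detector`, are illustrative assumptions rather than the procedure actually used for this dataset.

```python
from typing import Callable, Iterable, Iterator


def keep_english(texts: Iterable[str], detect: Callable[[str], str]) -> Iterator[str]:
    """Yield only the texts that `detect` labels as English ("en")."""
    for text in texts:
        try:
            if detect(text) == "en":
                yield text
        except Exception:
            # langdetect raises LangDetectException on empty or symbol-only
            # input; such texts are skipped here (an assumed policy).
            continue


def langdetect_detector() -> Callable[[str], str]:
    """Return langdetect's detect function with a fixed seed."""
    # Imported lazily; requires the langdetect package.
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # langdetect is non-deterministic without a seed
    return detect
```

With langdetect installed, `list(keep_english(texts, langdetect_detector()))` would return the English subset of `texts`.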
As a consequence of the collection methodology, the publication dates of the scraped articles are not uniformly distributed. For example, the dataset contains articles published in 2016 and in 2022, but not in equal numbers. In particular, articles published in 2020 are strongly overrepresented. Have a look at the accompanying notebook to see the distribution of the publication dates.
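To check this skew yourself, the articles can be counted per publication year. The sketch below assumes the `timestamp` column holds ISO-8601-style strings that `datetime.fromisoformat` can read (the schema only declares it as VARCHAR), and the sample values are made up for illustration, not taken from the dataset.

```python
from collections import Counter
from datetime import datetime


def articles_per_year(timestamps):
    """Count articles per publication year; unparseable values go under None."""
    counts = Counter()
    for raw in timestamps:
        try:
            counts[datetime.fromisoformat(raw).year] += 1
        except (TypeError, ValueError):
            counts[None] += 1
    return counts


# Made-up timestamps in the assumed format, including one empty value.
sample = [
    "2016-03-01 10:00:00+00:00",
    "2020-05-17 08:30:00.123456+00:00",
    "2020-11-02 21:15:42+00:00",
    "2022-01-09 12:00:00+00:00",
    "",
]
counts = articles_per_year(sample)
# Print known years in ascending order, with the None bucket last.
for year, n in sorted(counts.items(), key=lambda kv: (kv[0] is None, kv[0] or 0)):
    print(year, n)
```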
CREATE TABLE medium_articles (
"title" VARCHAR,
"text" VARCHAR,
"url" VARCHAR,
"authors" VARCHAR,
"timestamp" VARCHAR,
"tags" VARCHAR
);
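Since every column is declared as VARCHAR, multi-valued fields such as `authors` and `tags` have to be decoded after loading. The sketch below assumes the data is available as a CSV file with the six columns above and that those two columns hold Python-style list literals such as `['Python', 'Data']`. Both points are assumptions, so inspect a few rows before relying on them. The helper names and the in-memory sample row are hypothetical.

```python
import ast
import csv
import io


def parse_list(cell):
    """Turn a cell such as "['a', 'b']" into a list; return [] if malformed."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []


def read_articles(handle):
    """Yield one dict per article, with authors and tags decoded into lists."""
    # For very long article texts, csv.field_size_limit may need to be raised.
    for row in csv.DictReader(handle):
        row["authors"] = parse_list(row["authors"])
        row["tags"] = parse_list(row["tags"])
        yield row


# Made-up one-row sample standing in for the real file.
sample = io.StringIO(
    "title,text,url,authors,timestamp,tags\n"
    "Hello,Some text,https://example.com/a,\"['Jane Doe']\","
    "2020-01-01 00:00:00+00:00,\"['Python', 'Data']\"\n"
)
rows = list(read_articles(sample))
print(rows[0]["title"], rows[0]["tags"])
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, which is the safer choice for scraped content.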