Dataset for Predicting Post Flair Categories on r/AskScience Subreddit

Context

Reddit is a large platform for news, content, and discussion, with millions of daily active users. Of its many subreddits, this dataset focuses on the r/AskScience community, where users ask and discuss science-related questions.

Content

This dataset is derived from the r/AskScience subreddit and covers posts collected between January 1, 2016, and May 20, 2022. It contains 612,668 data points across 22 columns, including the question text, submission description, associated flair, NSFW/SFW status, year of submission, and more. The data was extracted with Python via Pushshift's API and then cleaned with NumPy and pandas. Detailed column descriptions are available.
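
As a rough illustration of the kind of cleaning step described above, the sketch below keeps only rows inside the stated collection window and derives a year field. It uses the standard library rather than pandas to stay self-contained. The field names `created_utc`, `title`, and `year` are assumptions rather than confirmed column names, and the sample rows are made up.

```python
from datetime import datetime, timezone

# Collection window stated in the dataset description (inclusive).
START = datetime(2016, 1, 1, tzinfo=timezone.utc)
END = datetime(2022, 5, 20, 23, 59, 59, tzinfo=timezone.utc)


def clean_rows(rows):
    """Keep rows inside the collection window and add a 'year' field.

    Assumes each row is a dict with a hypothetical 'created_utc' key
    holding a Unix timestamp in seconds.
    """
    cleaned = []
    for row in rows:
        created = datetime.fromtimestamp(int(row["created_utc"]), tz=timezone.utc)
        if START <= created <= END:
            cleaned.append({**row, "year": created.year})
    return cleaned


# Made-up sample rows, for illustration only.
sample = [
    {"title": "Why is the sky blue?", "created_utc": 1451606400},  # 2016-01-01 UTC
    {"title": "Too early", "created_utc": 1420070400},             # 2015-01-01 UTC
]

cleaned_sample = clean_rows(sample)
```

Here `cleaned_sample` keeps only the first row, since the second falls before the collection window.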

Mendeley Data

Ideas for Usage

Flair Prediction: Train models to predict post flairs (e.g., 'Science', 'Ask', 'Discussion') to automate content categorization for platforms like Reddit.
NSFW Classification: Classify posts as SFW or NSFW based on textual content, enabling content moderation tools for online forums.
Text Mining / NLP Tasks: Apply NLP techniques such as sentiment analysis, topic modeling, and text classification to explore the content and themes of science-related discussions.
Community Engagement Analysis: Investigate which post types or flairs generate more engagement (e.g., upvotes or comments), offering insights into user interaction.
Trend Detection in Science Topics: Identify emerging science topics and analyze shifts in interest areas, which can help predict future trends in scientific discussions.
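
To make the flair prediction idea concrete, here is a minimal sketch of a bag-of-words multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. It is one possible baseline, not the method used to build or evaluate this dataset. The training titles and the flair labels `Physics` and `Biology` are made-up placeholders; with the real data you would substitute the question text and flair columns.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train(examples):
    """Count flairs and per-flair word frequencies from (title, flair) pairs."""
    flair_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for title, flair in examples:
        flair_counts[flair] += 1
        for tok in tokenize(title):
            word_counts[flair][tok] += 1
            vocab.add(tok)
    return flair_counts, word_counts, vocab


def predict(model, title):
    """Return the flair with the highest smoothed log-probability."""
    flair_counts, word_counts, vocab = model
    total = sum(flair_counts.values())
    best_flair, best_score = None, -math.inf
    for flair, count in flair_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[flair].values()) + len(vocab)
        for tok in tokenize(title):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[flair][tok] + 1) / denom)
        if score > best_score:
            best_flair, best_score = flair, score
    return best_flair


# Made-up training pairs with placeholder flair labels.
toy_examples = [
    ("how do black holes bend light", "Physics"),
    ("why does light slow down in glass", "Physics"),
    ("how do cells repair damaged dna", "Biology"),
    ("why do cells divide", "Biology"),
]

model = train(toy_examples)
physics_guess = predict(model, "what happens to light near black holes")
biology_guess = predict(model, "how do cells divide")
```

On the full dataset, a library implementation with TF-IDF features and a held-out test split would likely be a better starting point. This version only shows the mechanics.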
