Twitter And Reddit

Topic labelled online social network (OSN) data sets

@kaggle.saurabhshahane_twitter_and_reddit

About this Dataset

Context

Topic labelled online social network (OSN) data sets are useful for evaluating topic modelling and document clustering methods. We provide three data sets with topic labels from two online social networks: Twitter and Reddit. To comply with Twitter’s terms and conditions, we publish only the tweet identifiers along with the topic label; the Reddit data is supplied with the full text and the topic label.

The first Twitter data set was collected from the Twitter API by filtering for the hashtag #Auspol, which is used to tag tweets discussing Australian politics. The second Twitter data set was originally used in the RepLab 2013 competition and contains expert-annotated topics. The Reddit data set consists of 40,000 Reddit parent comments from May 2015 belonging to 5 subreddit pages, which are used as topic labels.
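
As an illustration of how the topic labels can act as ground truth, the sketch below scores a clustering against known topics using normalised mutual information (NMI). The choice of NMI, the arithmetic-mean normalisation, the function name, and the placeholder labels are all assumptions made for this example; they are not taken from the data set description.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """Return I(T;C) / ((H(T) + H(C)) / 2) for two label sequences.

    labels_true: ground-truth topic labels (e.g. the `topic` column).
    labels_pred: cluster assignments produced by some clustering method.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n == 0:
        raise ValueError("label sequences must not be empty")

    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical label distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information from the empirical joint distribution.
    mutual_info = sum(
        (c / n) * math.log((c * n) / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint_counts.items()
    )

    mean_entropy = (entropy(true_counts) + entropy(pred_counts)) / 2
    # If both partitions are a single group, treat them as identical.
    return 1.0 if mean_entropy == 0 else mutual_info / mean_entropy


# Placeholder labels, not real rows from the data set.
topics = ["topic_a", "topic_a", "topic_b", "topic_b"]
clusters = [1, 1, 0, 0]  # same grouping, different cluster names
score = normalized_mutual_information(topics, clusters)
```

Because NMI compares groupings rather than label names, a clustering that recovers the topics under different cluster ids still scores highly. Chance-adjusted measures are often reported alongside it, since unadjusted NMI tends to favour clusterings with many small clusters.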

Acknowledgements

Curiskis, Stephan; Kennedy, Paul; Osborn, Thomas; Drake, Barry (2019), “Data for: An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit”, Mendeley Data, V1, doi: 10.17632/85njyhj45m.1

Tables

Reddit Data

@kaggle.saurabhshahane_twitter_and_reddit.reddit_data
  • 10.62 MB
  • 40001 rows
  • 5 columns

CREATE TABLE reddit_data (
  "parent_id" VARCHAR,
  "text" VARCHAR,
  "topic" VARCHAR,
  "length" VARCHAR,
  "size_range" VARCHAR
);
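
Since every column in this table, including "length", is declared as VARCHAR, numeric values need to be parsed before use. The following is a minimal sketch of counting comments per topic and averaging "length" from a CSV export with these column names. The sample rows, the helper name `summarise_reddit_rows`, and the assumption that "length" holds a numeric value are all hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict


def summarise_reddit_rows(rows):
    """Count rows per topic and average the parsable `length` values.

    rows: iterable of dicts keyed by the reddit_data column names.
    Returns (counts_by_topic, mean_length_by_topic).
    """
    counts = Counter()
    length_sum = defaultdict(float)
    length_n = defaultdict(int)
    for row in rows:
        topic = row["topic"]
        counts[topic] += 1
        try:
            # `length` is stored as text, so convert it defensively.
            length = float(row["length"])
        except (TypeError, ValueError):
            continue  # skip values that are missing or not numeric
        length_sum[topic] += length
        length_n[topic] += 1
    mean_length = {t: length_sum[t] / length_n[t] for t in length_n}
    return dict(counts), mean_length


# Made-up sample standing in for an export of the table (not real data).
sample_csv = io.StringIO(
    "parent_id,text,topic,length,size_range\n"
    "p1,first comment,topic_a,13,placeholder\n"
    "p2,second one,topic_a,11,placeholder\n"
    "p3,third comment,topic_b,n/a,placeholder\n"
)
counts, mean_length = summarise_reddit_rows(csv.DictReader(sample_csv))
```

For a real export, the `io.StringIO` object would be replaced by an opened file handle; the path and export format depend on how the table was downloaded.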

Twitter Auspol Data

@kaggle.saurabhshahane_twitter_and_reddit.twitter_auspol_data
  • 287.2 KB
  • 29283 rows
  • 2 columns

CREATE TABLE twitter_auspol_data (
  "id" BIGINT,
  "topic" VARCHAR
);

Twitter Replab2013 Data

@kaggle.saurabhshahane_twitter_and_reddit.twitter_replab2013_data
  • 25.11 KB
  • 2657 rows
  • 2 columns

CREATE TABLE twitter_replab2013_data (
  "id" BIGINT,
  "topic" VARCHAR
);
