Hello there!
This is the data I used for my bachelor's diploma research at HSE University in 2024. I parsed all comments (you can also call them stock twits) from T-pulse threads from 01 JAN 2019 (launch of the platform) to 30 MARCH 2024. A total of 10 tickers were taken: SBER, GAZP, YNDX, TCSG, SGZH, PIKK, RTKM, MVID, KMAZ, BANE. The chosen period covers changes in the CCP of the Bank of Russia and the introduction of sanctions by Western countries against the Russian Federation.
Language: Russian (mostly) and English
Columns
- inserted - date the comment (or post) was posted;
- likesCount - number of likes under the comment (or post);
- commentsCount - number of comments under the comment (or post);
- text - raw text of the parsed comment (you should probably clean it of emoji etc.);
- reactions_counters - list of dicts with the type and number of reactions under the comment. These are emoji-like reactions such as "rocket", "like", "dislike", "not-convinced", "buy-up".
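A minimal sketch of how the columns above could be prepared with pandas. The sample rows are made up, and two things are assumptions rather than facts about the files: that reactions_counters is stored as a string, and that each dict uses the keys "type" and "count". Check a few real rows before relying on it. The emoji pattern only covers the most common Unicode ranges.

```python
import ast
import re

import pandas as pd

# Made-up rows that mimic the described columns (not real data).
sample = pd.DataFrame({
    "inserted": ["2023-05-01T10:15:00", "2023-05-01T18:40:00"],
    "likesCount": [3, 0],
    "commentsCount": [1, 0],
    "text": ["SBER to the moon 🚀🚀", "Продаю GAZP 😡"],
    # Assumption: the list of dicts is serialized as a string in the CSV,
    # with hypothetical keys "type" and "count".
    "reactions_counters": [
        "[{'type': 'rocket', 'count': 2}, {'type': 'like', 'count': 1}]",
        "[]",
    ],
})

# Common emoji and pictograph ranges only; this is not an exhaustive pattern.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]+")


def strip_emoji(text):
    """Remove emoji characters and trim surrounding whitespace."""
    return EMOJI_RE.sub("", text).strip()


def parse_reactions(raw):
    """Turn the list of dicts into a {reaction_type: count} mapping."""
    items = ast.literal_eval(raw) if isinstance(raw, str) else raw
    return {item["type"]: item["count"] for item in items}


sample["inserted"] = pd.to_datetime(sample["inserted"])
sample["clean_text"] = sample["text"].map(strip_emoji)
sample["reactions"] = sample["reactions_counters"].map(parse_reactions)
```

Keeping the raw text column next to clean_text makes it easy to compare models trained with and without emoji.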
Additionally
I have added the df_labelled_llm.csv dataset with labelled posts. It contains around 1000 posts for each ticker mentioned above, so the total is around 10K posts. About 90% of the labelling was done with an LLM and about 10% manually, for posts written in slang. You can use this as a starting point for your research.
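Before training on the labelled file, it may help to check how the labels are distributed per ticker. The sketch below uses a tiny made-up frame. The column names "ticker" and "label" and the label values are assumptions, so adjust them to whatever df_labelled_llm.csv actually contains.

```python
import pandas as pd

# With the real file this would be: labelled = pd.read_csv("df_labelled_llm.csv")
# The columns "ticker" and "label" and the label values below are hypothetical.
labelled = pd.DataFrame({
    "ticker": ["SBER", "SBER", "GAZP", "GAZP", "GAZP"],
    "text": ["post a", "post b", "post c", "post d", "post e"],
    "label": ["positive", "negative", "neutral", "positive", "positive"],
})

# Absolute number of posts per ticker and label.
balance = pd.crosstab(labelled["ticker"], labelled["label"])

# Share of each label within a ticker (every row sums to 1).
share = pd.crosstab(labelled["ticker"], labelled["label"], normalize="index")

print(balance)
print(share.round(2))
```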
Areas of application
- Sentiment analysis of stock twits;
- Fine-tuning BERT-based models;
- Testing algotrading strategies based on sentiment analysis;
- Research.
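For the strategy-testing use case, one possible first step is to aggregate post-level sentiment into a daily score per ticker. This is only a sketch on made-up rows: the "ticker" and "label" columns, the label values and the -1/0/1 mapping are all assumptions, and the score would still need to be joined with price data that is not part of this dataset.

```python
import pandas as pd

# Made-up labelled posts; "ticker" and "label" are hypothetical column names.
posts = pd.DataFrame({
    "inserted": pd.to_datetime([
        "2023-05-01 10:00", "2023-05-01 12:30", "2023-05-01 17:45",
        "2023-05-02 09:10", "2023-05-01 11:00",
    ]),
    "ticker": ["SBER", "SBER", "SBER", "SBER", "GAZP"],
    "label": ["positive", "negative", "positive", "negative", "neutral"],
})

# Assumed mapping from label to a numeric sentiment score.
SCORE = {"negative": -1, "neutral": 0, "positive": 1}
posts["score"] = posts["label"].map(SCORE)

# Calendar day of each post (time of day dropped).
day = posts["inserted"].dt.normalize().rename("date")

# Mean score and number of posts per ticker and day.
daily = posts.groupby(["ticker", day])["score"].agg(
    sentiment="mean", n_posts="count"
)
print(daily)
```

The n_posts column is kept so that days with very few posts can be filtered out before the score is used as a signal.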
This data was gathered for educational purposes only. No exact names, phone numbers or addresses of the authors of posts/comments are included in the dataset.