VTuber 1B is a dataset for large-scale academic research, collecting over a billion live chats, superchats, and moderation events (bans/deletions) from virtual YouTubers' live streams.
See GitHub and join the #vtuber-1b channel on the SIGVT Discord for discussions.
We also offer ❤️‍🩹 Sensai, a live chat dataset made specifically for building ML models for spam detection and toxic chat classification.
Provenance
- Source: YouTube live chat events collected by our Honeybee cluster. Holodex is a stream index provider for Honeybee which covers Hololive, Nijisanji, 774inc, etc.
- Temporal Coverage:
  - Chats: from 2021-01-15
  - Super chats: from 2021-03-16
- Update Frequency:
  - At least once every 6 months
Research Ideas
- Toxic Chat Classification
- Spam Detection
- Demographic Visualization
- Superchat Analysis
- Training neural language models
See public notebooks built on VTuber 1B and VTuber 1B Elements for ideas.
We employed the Honeybee cluster to collect real-time live chat events across major VTubers' live streams. All sensitive data, such as author names and author profile images, are omitted from the dataset, and author channel ids are anonymized with salted SHA-1 hashing.
Editions
VTuber 1B Elements
Kaggle Datasets (2 MB)
VTuber 1B Elements is most suitable for statistical visualizations and exploratory data analysis.
| filename | summary |
| --- | --- |
| channels.csv | Channel index |
| chat_stats.csv | Chat statistics |
| superchat_stats.csv | Super Chat statistics |
VTuber 1B
Kaggle Datasets
VTuber 1B is most suitable for frequency analysis. This edition includes only the essential columns in order to reduce the dataset size and make it faster for Kaggle Kernels to load the data.
| filename | summary |
| --- | --- |
| chats_%Y-%m.parquet | Live chat events (> 1,000,000,000) |
| superchats_%Y-%m.parquet | Super chat events (> 4,000,000) |
| deletion_events.parquet | Deletion events |
| ban_events.parquet | Ban events |
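The monthly files follow the chats_%Y-%m.parquet and superchats_%Y-%m.parquet naming pattern, so the list of files for a date range can be generated instead of typed by hand. The following is a minimal sketch, assuming the files sit in a local directory; the helper name `monthly_files` and the `data_dir` path are hypothetical and not part of the dataset.

```python
from datetime import date
from pathlib import PurePosixPath


def monthly_files(prefix, start, end, data_dir="data"):
    """Return one path per month from start to end (inclusive).

    `prefix` is "chats" or "superchats"; `data_dir` is a hypothetical
    local directory, not a path defined by the dataset.
    """
    paths = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # strftime expands %Y-%m the same way as the documented pattern.
        name = date(year, month, 1).strftime(f"{prefix}_%Y-%m.parquet")
        paths.append(str(PurePosixPath(data_dir) / name))
        # Advance one month, rolling over at December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths


files = monthly_files("chats", date(2021, 1, 15), date(2021, 3, 1))
print(files)
```

Each resulting path can then be handed to a Parquet reader of your choice, for example `pandas.read_parquet`.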
Dataset Breakdown
Ban and deletion are equivalent to markChatItemsByAuthorAsDeletedAction and markChatItemAsDeletedAction, respectively.
Channels (channels.csv)
| column | type | description |
| --- | --- | --- |
| channelId | string | channel id |
| name | string | channel name |
| englishName | nullable string | channel name (English) |
| affiliation | string | channel affiliation |
| group | nullable string | group |
| subscriptionCount | number | subscription count |
| videoCount | number | uploads count |
| photo | string | channel icon |
Inactive channels have INACTIVE in the group column.
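Because inactive channels are flagged through the group column, they can be filtered out before any per-channel analysis. The sketch below uses only the standard library; the sample rows are made up for illustration, and the column order in the header is an assumption.

```python
import csv
import io

# Made-up rows for illustration only; real rows come from channels.csv.
SAMPLE = """channelId,name,englishName,affiliation,group,subscriptionCount,videoCount,photo
UC_example_1,Example One,,Example Agency,INACTIVE,1000,10,https://example.invalid/1.png
UC_example_2,Example Two,Example Two,Example Agency,,2000,20,https://example.invalid/2.png
"""


def active_channels(rows):
    """Keep rows whose group column is not the INACTIVE marker."""
    return [row for row in rows if row["group"] != "INACTIVE"]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
active = active_channels(rows)
print([row["channelId"] for row in active])
```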
Chat Statistics (chat_stats.csv)
| column | type | description |
| --- | --- | --- |
| channelId | string | channel id |
| period | string | period of interest (%Y-%m) |
| chats | number | number of chats |
| memberChats | number | number of chats with membership status attached |
| uniqueChatters | number | number of unique chatters |
| uniqueMembers | number | number of unique members who appeared in live chat |
| bannedChatters | number | number of unique chatters marked as banned by mods |
| deletedChats | number | number of chats deleted by mods |
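The count columns in chat_stats.csv can be combined into per-period ratios, such as the share of chats sent by members or the share deleted by moderators. This is a minimal sketch with made-up numbers; the derived names `memberShare` and `deletionRate` are hypothetical and do not exist in the dataset.

```python
import csv
import io

# Made-up numbers for illustration only; real rows come from chat_stats.csv.
SAMPLE = """channelId,period,chats,memberChats,uniqueChatters,uniqueMembers,bannedChatters,deletedChats
UC_example_1,2021-03,1000,250,400,80,2,10
UC_example_1,2021-04,0,0,0,0,0,0
"""


def ratios(row):
    """Derive per-period shares from the documented count columns."""
    chats = int(row["chats"])
    if chats == 0:
        # Avoid dividing by zero for periods without any chat.
        return {"memberShare": None, "deletionRate": None}
    return {
        "memberShare": int(row["memberChats"]) / chats,
        "deletionRate": int(row["deletedChats"]) / chats,
    }


for row in csv.DictReader(io.StringIO(SAMPLE)):
    print(row["period"], ratios(row))
```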
Super Chat Statistics (superchat_stats.csv)
| column | type | description |
| --- | --- | --- |
| channelId | string | channel id |
| period | string | period of interest (%Y-%m) |
| superChats | number | number of super chats |
| uniqueSuperChatters | number | number of unique super chatters |
| totalSC | number | total amount of super chats (JPY) |
| averageSC | number | average amount of super chat (JPY) |
| totalMessageLength | number | total message length |
| averageMessageLength | number | average message length |
| mostFrequentCurrency | string | most frequent currency |
| mostFrequentColor | string | most frequent color |
Consideration
Anonymization
id and authorChannelId are anonymized with the SHA-1 hashing algorithm using an undisclosed salt.
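To illustrate what salted hashing means in practice, here is a minimal sketch using Python's `hashlib`. It is not the pipeline used to build the dataset: the salt value is a placeholder, and prepending the salt to the id is an assumption, since neither the salt nor the way it is combined with the id is disclosed.

```python
import hashlib


def anonymize(channel_id, salt):
    """Return the hex SHA-1 digest of salt + channel id.

    Prepending the salt is an assumption; the dataset does not disclose
    the salt or how it is combined with the id.
    """
    return hashlib.sha1(salt + channel_id.encode("utf-8")).hexdigest()


salt = b"hypothetical-salt"  # placeholder, not the real salt
a = anonymize("UC_example_1", salt)
b = anonymize("UC_example_1", salt)
c = anonymize("UC_example_2", salt)
print(len(a), a == b, a == c)
```

The same input always maps to the same digest, so anonymized ids can still be used to count unique chatters or to join tables, while the secret salt prevents matching digests against hashes of known channel ids.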
Handling Custom Emojis
All custom emojis are replaced with a Unicode replacement character � (U+FFFD).
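When preparing messages for text analysis, the placeholder can be counted as a rough signal of custom emoji usage or stripped out entirely. The sketch below assumes each custom emoji becomes exactly one U+FFFD character; the helper names and the sample message are hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, found where a custom emoji used to be


def count_custom_emojis(message):
    """Count placeholders left where custom emojis used to be."""
    return message.count(REPLACEMENT)


def strip_custom_emojis(message):
    """Drop the placeholders and collapse the whitespace left behind.

    Each placeholder is treated as a word break, so "a\ufffdb" becomes "a b".
    """
    return " ".join(message.replace(REPLACEMENT, " ").split())


text = "hello \ufffd\ufffd world \ufffd"  # made-up message
print(count_custom_emojis(text), repr(strip_custom_emojis(text)))
```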
Redundant Ban and Deletion Events
Bans and deletions issued by multiple moderators for the same person or chat are logged as separate events. For simplicity, you can safely keep only the earliest event for each person or chat and ignore the rest.
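Keeping only the first event per person or chat can be sketched as a sort followed by a single pass. The field names `timestamp` and `videoId` are hypothetical, the sample events are made up, and grouping bans per stream is an assumption; adjust the key to the columns actually present in ban_events.parquet and deletion_events.parquet.

```python
def first_events(events, key_fields, time_field="timestamp"):
    """Keep the earliest event for each distinct combination of key_fields."""
    seen = set()
    kept = []
    # Sort by time so the first event encountered per key is the earliest.
    for event in sorted(events, key=lambda e: e[time_field]):
        key = tuple(event[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept


# Made-up ban events; two moderators banned author "a1" in the same stream.
bans = [
    {"timestamp": "2021-03-01T10:00:05", "authorChannelId": "a1", "videoId": "v1"},
    {"timestamp": "2021-03-01T10:00:02", "authorChannelId": "a1", "videoId": "v1"},
    {"timestamp": "2021-03-01T10:00:09", "authorChannelId": "b2", "videoId": "v1"},
]

unique_bans = first_events(bans, key_fields=("videoId", "authorChannelId"))
print([(e["authorChannelId"], e["timestamp"]) for e in unique_bans])
```

For deletion events the same helper applies with the chat id as the key, for example `key_fields=("id",)`.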
Citation
@misc{vtuber-livechat-dataset,
author={Yasuaki Uechi},
title={VTuber 1B: Large-scale Live Chat and Moderation Events Dataset},
year={2022},
month={2},
version={37},
url={https://sigvt.org/vtuber-1b}
}
License