VTuber 1B Elements: Live Chat Statistics
Virtual YouTubers Live Chat Statistics
@kaggle.uetchy_vtuber_livechat_elements
VTuber 1B is a dataset for large-scale academic research, collecting over a billion live chats, superchats, and moderation events (bans/deletions) from virtual YouTubers' live streams.
See GitHub and join the #vtuber-1b channel on the SIGVT Discord for discussion.
We also offer ❤️‍🩹 Sensai, a live chat dataset made specifically for building ML models for spam detection and toxic chat classification.
See public notebooks built on VTuber 1B and VTuber 1B Elements for ideas.
We employed the Honeybee cluster to collect real-time live chat events across major VTubers' live streams. All sensitive data, such as author names and author profile images, are omitted from the dataset, and author channel IDs are anonymized with the SHA-1 hashing algorithm and a salt.
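
As a rough illustration of salted SHA-1 anonymization, the sketch below hashes a channel ID together with a salt. It is not the dataset's actual procedure: the salt value, the sample channel ID, the function name `anonymize_channel_id`, and the choice to prepend the salt are all assumptions, since the real salt and how it is combined with the ID are not published.

```python
import hashlib


def anonymize_channel_id(channel_id: str, salt: str) -> str:
    """Return the hex SHA-1 digest of a salted channel ID.

    Assumption: the salt is prepended to the ID before hashing. The
    dataset's real salt and combination rule are undisclosed.
    """
    return hashlib.sha1((salt + channel_id).encode("utf-8")).hexdigest()


# Hypothetical salt and channel ID, for illustration only.
SALT = "example-salt"
hashed = anonymize_channel_id("UC_example_channel", SALT)
print(len(hashed))  # a SHA-1 hex digest is always 40 characters long
```

Because the same input always maps to the same digest, a chatter can still be tracked across streams within the dataset, while the undisclosed salt makes it impractical to hash known channel IDs and match them against the data.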
Kaggle Datasets (2 MB)
VTuber 1B Elements is most suitable for statistical visualizations and exploratory data analysis.
| filename | summary |
|---|---|
| channels.csv | Channel index |
| chat_stats.csv | Chat statistics |
| superchat_stats.csv | Super Chat statistics |
VTuber 1B is most suitable for frequency analysis. This edition includes only the essential columns in order to reduce dataset size and make the data faster to load in Kaggle Kernels.
| filename | summary |
|---|---|
| chats_%Y-%m.parquet | Live chat events (> 1,000,000,000) |
| superchats_%Y-%m.parquet | Super chat events (> 4,000,000) |
| deletion_events.parquet | Deletion events |
| ban_events.parquet | Ban events |
Ban and deletion are equivalent to `markChatItemsByAuthorAsDeletedAction` and `markChatItemAsDeletedAction`, respectively.
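
The monthly files follow a `%Y-%m` naming pattern, so a list of file names for a range of months can be generated rather than typed by hand. The sketch below is one way to do that; the helper name `monthly_filenames` and the date range are hypothetical, and which months actually exist in the dataset is not stated here.

```python
from datetime import date


def monthly_filenames(prefix: str, start: date, end: date) -> list[str]:
    """List `<prefix>_%Y-%m.parquet` names for each month from start to end, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{date(year, month, 1).strftime('%Y-%m')}.parquet")
        # Advance by one calendar month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


# Hypothetical range; check which monthly files the dataset actually ships.
files = monthly_filenames("chats", date(2021, 11, 1), date(2022, 2, 1))
print(files)
```

The same helper works for Super Chat files by passing `"superchats"` as the prefix.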

Channel index (channels.csv)

| column | type | description |
|---|---|---|
| channelId | string | channel id |
| name | string | channel name |
| englishName | nullable string | channel name (English) |
| affiliation | string | channel affiliation |
| group | nullable string | group |
| subscriptionCount | number | subscription count |
| videoCount | number | uploads count |
| photo | string | channel icon |
Inactive channels have `INACTIVE` in the `group` column.
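
A minimal sketch of filtering out inactive channels is shown below. The rows are synthetic and only mimic the column layout above; the helper name `active_channels` is hypothetical, and treating an empty `group` as active is an assumption based on the column being nullable.

```python
import csv
import io

# Synthetic rows that follow the channels.csv column layout; not real data.
SAMPLE = """channelId,name,englishName,affiliation,group,subscriptionCount,videoCount,photo
c1,Alpha,Alpha,AgencyA,GroupX,1000,10,https://example.com/a.png
c2,Beta,,AgencyA,INACTIVE,500,5,https://example.com/b.png
c3,Gamma,,AgencyB,,200,2,https://example.com/c.png
"""


def active_channels(rows):
    """Keep channels whose group is not the INACTIVE marker.

    Assumption: a channel with an empty group is still active.
    """
    return [row for row in rows if row["group"] != "INACTIVE"]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
active = active_channels(rows)
print([row["channelId"] for row in active])
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an opened `channels.csv` handle.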

Chat statistics (chat_stats.csv)

| column | type | description |
|---|---|---|
| channelId | string | channel id |
| period | string | period of interest (%Y-%m) |
| chats | number | number of chats |
| memberChats | number | number of chats with membership status attached |
| uniqueChatters | number | number of unique chatters |
| uniqueMembers | number | number of unique members appeared on live chat |
| bannedChatters | number | number of unique chatters marked as banned by mods |
| deletedChats | number | number of chats deleted by mods |
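
The counts above lend themselves to simple per-period rates. The sketch below derives the share of chats sent by members and the share deleted by moderators. The sample rows are made up, the `period` values assume a `2021-03` style string, and the output names `memberChatRatio` and `deletionRate` are hypothetical, not columns in the dataset.

```python
import csv
import io

# Synthetic rows in the chat_stats.csv layout; the numbers are made up.
SAMPLE = """channelId,period,chats,memberChats,uniqueChatters,uniqueMembers,bannedChatters,deletedChats
c1,2021-03,1000,250,400,80,2,10
c1,2021-04,0,0,0,0,0,0
"""


def add_rates(row):
    """Return a copy of row with member and deletion shares of all chats.

    Both rates are None when the period has no chats, to avoid dividing by zero.
    """
    chats = int(row["chats"])
    out = dict(row)
    out["memberChatRatio"] = int(row["memberChats"]) / chats if chats else None
    out["deletionRate"] = int(row["deletedChats"]) / chats if chats else None
    return out


stats = [add_rates(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
print(stats[0]["memberChatRatio"], stats[0]["deletionRate"])
```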

Super Chat statistics (superchat_stats.csv)

| column | type | description |
|---|---|---|
| channelId | string | channel id |
| period | string | period of interest (%Y-%m) |
| superChats | number | number of super chats |
| uniqueSuperChatters | number | number of unique super chatters |
| totalSC | number | total amount of super chats (JPY) |
| averageSC | number | average amount of super chat (JPY) |
| totalMessageLength | number | total message length |
| averageMessageLength | number | average message length |
| mostFrequentCurrency | string | most frequent currency |
| mostFrequentColor | string | most frequent color |
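
When combining several periods for one channel, one possible approach is to sum `superChats` and `totalSC` and recompute the average from those sums. The sketch below does this on synthetic rows; the helper name `totals_by_channel` and all sample values are hypothetical, and it assumes `totalSC` is already expressed in JPY as the column description says.

```python
import csv
import io
from collections import defaultdict

# Synthetic rows in the superchat_stats.csv layout; all values are made up.
SAMPLE = """channelId,period,superChats,uniqueSuperChatters,totalSC,averageSC,totalMessageLength,averageMessageLength,mostFrequentCurrency,mostFrequentColor
c1,2021-03,10,8,50000,5000,300,30,JPY,green
c1,2021-04,30,20,90000,3000,600,20,JPY,blue
c2,2021-03,5,5,10000,2000,100,20,USD,green
"""


def totals_by_channel(rows):
    """Sum superChats and totalSC per channel, then derive an overall average amount."""
    acc = defaultdict(lambda: {"superChats": 0, "totalSC": 0.0})
    for row in rows:
        acc[row["channelId"]]["superChats"] += int(row["superChats"])
        acc[row["channelId"]]["totalSC"] += float(row["totalSC"])
    return {
        cid: {**v, "averageSC": v["totalSC"] / v["superChats"] if v["superChats"] else None}
        for cid, v in acc.items()
    }


summary = totals_by_channel(csv.DictReader(io.StringIO(SAMPLE)))
print(summary["c1"])
```

Recomputing the average from the sums weights each month by its number of Super Chats. Averaging the two monthly `averageSC` values for `c1` directly would give 4000, while the count-weighted figure is 3500.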
`id` and `authorChannelId` are anonymized with the SHA-1 hashing algorithm and an undisclosed salt.
All custom emojis are replaced with a Unicode replacement character � (U+FFFD).
Bans and deletions issued by multiple moderators for the same person or chat are logged separately. For simplicity, you can safely ignore all but the first record in time order.
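
Keeping only the first record in time order can be sketched as follows. The column names `timestamp`, `authorChannelId`, and `originVideoId` are assumptions about the event files, the rows are synthetic, and the helper `first_events` is hypothetical. The sketch also assumes timestamps are ISO 8601 strings in a single time zone, so that text order matches time order.

```python
def first_events(events, key_columns):
    """Keep only the earliest event for each combination of key columns."""
    earliest = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = tuple(event[column] for column in key_columns)
        # setdefault stores the first event seen for a key and ignores later ones.
        earliest.setdefault(key, event)
    return list(earliest.values())


# Synthetic ban events; the column names are assumed, not taken from the dataset.
bans = [
    {"timestamp": "2021-03-01T12:00:05+00:00", "authorChannelId": "a1", "originVideoId": "v1"},
    {"timestamp": "2021-03-01T12:00:01+00:00", "authorChannelId": "a1", "originVideoId": "v1"},
    {"timestamp": "2021-03-01T12:00:03+00:00", "authorChannelId": "a2", "originVideoId": "v1"},
]
unique_bans = first_events(bans, key_columns=("authorChannelId", "originVideoId"))
print(len(unique_bans))
```

Including the stream identifier in the key treats a ban of the same person in a different stream as a separate event. For deletions, the key would be the chat's `id` instead.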
@misc{vtuber-livechat-dataset,
author={Yasuaki Uechi},
title={VTuber 1B: Large-scale Live Chat and Moderation Events Dataset},
year={2022},
month={2},
version={37},
url={https://sigvt.org/vtuber-1b}
}
CREATE TABLE channels (
"channelid" VARCHAR,
"name" VARCHAR,
"englishname" VARCHAR,
"affiliation" VARCHAR,
"group" VARCHAR,
"subscriptioncount" BIGINT,
"videocount" BIGINT,
"photo" VARCHAR
);

CREATE TABLE chat_stats (
"channelid" VARCHAR,
"period" TIMESTAMP,
"chats" BIGINT,
"memberchats" BIGINT,
"uniquechatters" BIGINT,
"uniquemembers" BIGINT,
"bannedchatters" BIGINT,
"deletedchats" BIGINT
);

CREATE TABLE superchat_stats (
"channelid" VARCHAR,
"period" TIMESTAMP,
"superchats" BIGINT,
"uniquesuperchatters" BIGINT,
"totalsc" BIGINT,
"averagesc" BIGINT,
"totalmessagelength" BIGINT,
"averagemessagelength" BIGINT,
"mostfrequentcurrency" VARCHAR,
"mostfrequentcolor" VARCHAR
);