Baselight

VTuber 1B Elements: Live Chat Statistics

Virtual YouTubers Live Chat Statistics

@kaggle.uetchy_vtuber_livechat_elements

About this Dataset

VTuber 1B Elements: Live Chat Statistics

VTuber 1B is a dataset for large-scale academic research, collecting over a billion live chats, superchats, and moderation events (bans/deletions) from virtual YouTubers' live streams.

See GitHub and join the #vtuber-1b channel on the SIGVT Discord for discussions.

We also offer ❤️‍🩹 Sensai, a live chat dataset specifically made for building ML models for spam detection / toxic chat classification.

Provenance

  • Source: YouTube live chat events collected by our Honeybee cluster. Holodex is a stream index provider for Honeybee which covers Hololive, Nijisanji, 774inc, etc.
  • Temporal Coverage:
    • Chats: from 2021-01-15
    • Super chats: from 2021-03-16
  • Update Frequency:
    • At least once every 6 months

Research Ideas

  • Toxic Chat Classification
  • Spam Detection
  • Demographic Visualization
  • Superchat Analysis
  • Training neural language models

See public notebooks built on VTuber 1B and VTuber 1B Elements for ideas.

We used the Honeybee cluster to collect real-time live chat events across major VTubers' live streams. All sensitive data, such as author names and author profile images, are omitted from the dataset, and author channel ids are anonymized with the SHA-1 hashing algorithm and an undisclosed salt.

Editions

VTuber 1B Elements

Kaggle Datasets (2 MB)

VTuber 1B Elements is most suitable for statistical visualizations and exploratory data analysis.

filename             summary
channels.csv         Channel index
chat_stats.csv       Chat statistics
superchat_stats.csv  Super Chat statistics

VTuber 1B

Kaggle Datasets

VTuber 1B is most suitable for frequency analysis. This edition includes only the essential columns in order to reduce dataset size and make the data faster to load in Kaggle Kernels.

filename                  summary
chats_%Y-%m.parquet       Live chat events (> 1,000,000,000)
superchats_%Y-%m.parquet  Super chat events (> 4,000,000)
deletion_events.parquet   Deletion events
ban_events.parquet        Ban events
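
The %Y-%m part of the chat and super chat file names is a year-month pattern. As a minimal sketch, assuming one file exists per calendar month (the helper name monthly_filenames and the date range are illustrative and not part of the dataset), the names for a range of months could be generated like this:

```python
from datetime import date


def monthly_filenames(prefix: str, start: date, end: date) -> list[str]:
    """Return one parquet file name per month from start to end, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # %Y-%m expands to a zero-padded year-month such as 2021-03.
        names.append(date(year, month, 1).strftime(f"{prefix}_%Y-%m.parquet"))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


# Hypothetical range; check which months are actually present in the edition you downloaded.
chat_files = monthly_filenames("chats", date(2021, 1, 15), date(2021, 4, 1))
print(chat_files)
```

The list can then be passed to whichever parquet reader you prefer.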

Dataset Breakdown

Ban and deletion events correspond to the markChatItemsByAuthorAsDeletedAction and markChatItemAsDeletedAction actions, respectively.

Channels (channels.csv)

column             type             description
channelId          string           channel id
name               string           channel name
englishName        nullable string  channel name (English)
affiliation        string           channel affiliation
group              nullable string  group
subscriptionCount  number           subscription count
videoCount         number           uploads count
photo              string           channel icon

Inactive channels have INACTIVE in the group column.
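
A minimal sketch of dropping inactive channels with Python's csv module. The sample rows are made up for illustration; only the column names and the INACTIVE marker come from the description above.

```python
import csv
import io

# Made-up sample that mimics the channels.csv header; real rows will differ.
SAMPLE = """channelId,name,englishName,affiliation,group,subscriptionCount,videoCount,photo
c1,Alpha,Alpha,ExampleCorp,1st Gen,1000,10,https://example.com/a.png
c2,Beta,,ExampleCorp,INACTIVE,500,5,https://example.com/b.png
c3,Gamma,Gamma,Independent,,200,2,https://example.com/c.png
"""


def active_channels(csv_file):
    """Yield rows whose group column is not the INACTIVE marker."""
    for row in csv.DictReader(csv_file):
        if row["group"] != "INACTIVE":
            yield row


active = list(active_channels(io.StringIO(SAMPLE)))
print([row["channelId"] for row in active])
```

To run this against the real file, replace io.StringIO(SAMPLE) with an open handle to channels.csv.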

Chat Statistics (chat_stats.csv)

column          type    description
channelId       string  channel id
period          string  period of interest (%Y-%m)
chats           number  number of chats
memberChats     number  number of chats with membership status attached
uniqueChatters  number  number of unique chatters
uniqueMembers   number  number of unique members who appeared in live chat
bannedChatters  number  number of unique chatters marked as banned by mods
deletedChats    number  number of chats deleted by mods
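
These counts are convenient for deriving per-period shares. The sketch below assumes that period values look like 2021-03 and that values arrive as strings, as they would from a CSV reader; the function name chat_ratios and the sample row are illustrative only.

```python
from datetime import datetime


def chat_ratios(row):
    """Derive member and deletion shares from one chat_stats.csv row."""
    chats = int(row["chats"])
    return {
        # Assumes a year-month string; adjust the format if your copy stores full timestamps.
        "period": datetime.strptime(row["period"], "%Y-%m"),
        "memberShare": int(row["memberChats"]) / chats if chats else 0.0,
        "deletedShare": int(row["deletedChats"]) / chats if chats else 0.0,
    }


# Made-up row for illustration.
sample_row = {
    "channelId": "c1",
    "period": "2021-03",
    "chats": "2000",
    "memberChats": "500",
    "deletedChats": "10",
}
ratios = chat_ratios(sample_row)
print(ratios["period"].year, ratios["period"].month, ratios["memberShare"], ratios["deletedShare"])
```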

Super Chat Statistics (superchat_stats.csv)

column                type    description
channelId             string  channel id
period                string  period of interest (%Y-%m)
superChats            number  number of super chats
uniqueSuperChatters   number  number of unique super chatters
totalSC               number  total amount of super chats (JPY)
averageSC             number  average amount of super chat (JPY)
totalMessageLength    number  total message length
averageMessageLength  number  average message length
mostFrequentCurrency  string  most frequent currency
mostFrequentColor     string  most frequent color
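
As one possible starting point for super chat analysis, the sketch below sums totalSC per affiliation by joining on channelId. All rows are made up, and the function name and the UNKNOWN fallback label are assumptions for illustration.

```python
from collections import defaultdict

# Made-up rows using the column names from channels.csv and superchat_stats.csv.
channels = [
    {"channelId": "c1", "affiliation": "ExampleCorp"},
    {"channelId": "c2", "affiliation": "ExampleCorp"},
    {"channelId": "c3", "affiliation": "Independent"},
]
superchat_stats = [
    {"channelId": "c1", "period": "2021-03", "totalSC": "120000"},
    {"channelId": "c2", "period": "2021-03", "totalSC": "30000"},
    {"channelId": "c3", "period": "2021-03", "totalSC": "5000"},
    {"channelId": "c9", "period": "2021-03", "totalSC": "700"},
]


def total_sc_by_affiliation(channel_rows, stat_rows):
    """Sum totalSC (JPY) per affiliation, joining the two tables on channelId."""
    affiliation_of = {c["channelId"]: c["affiliation"] for c in channel_rows}
    totals = defaultdict(int)
    for row in stat_rows:
        # Channels missing from the index are grouped under a placeholder label.
        affiliation = affiliation_of.get(row["channelId"], "UNKNOWN")
        totals[affiliation] += int(row["totalSC"])
    return dict(totals)


totals = total_sc_by_affiliation(channels, superchat_stats)
print(totals)
```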

Considerations

Anonymization

id and authorChannelId are anonymized with the SHA-1 hashing algorithm and an undisclosed salt.
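
To illustrate the general idea of salted SHA-1 hashing, here is a minimal sketch. The real salt is undisclosed, and how the salt is combined with the id is not documented, so the salt value and the prefix concatenation below are assumptions rather than the dataset's actual procedure.

```python
import hashlib


def anonymize(identifier: str, salt: bytes) -> str:
    """Return the hex SHA-1 digest of salt + identifier (concatenation order is an assumption)."""
    return hashlib.sha1(salt + identifier.encode("utf-8")).hexdigest()


# Placeholder salt and channel id, for illustration only.
demo_salt = b"not-the-real-salt"
hashed = anonymize("UC_example_channel_id", demo_salt)
print(len(hashed))
```

The same input and salt always yield the same 40-character digest, so one author maps to one anonymized id, while the original id cannot be recovered by hashing candidate ids without knowing the salt.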

Handling Custom Emojis

All custom emojis are replaced with a Unicode replacement character � (U+FFFD).
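
When preparing text for analysis, it can help to separate these placeholders from the remaining message. A minimal sketch, with a made-up message and an illustrative function name:

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character used in place of custom emojis


def split_emoji_placeholders(message: str):
    """Return the message without placeholders and the number of placeholders removed."""
    count = message.count(REPLACEMENT)
    text = message.replace(REPLACEMENT, "").strip()
    return text, count


text, emoji_count = split_emoji_placeholders("nice stream \ufffd\ufffd")
print(repr(text), emoji_count)
```

Note that a message consisting only of custom emojis becomes an empty string after stripping.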

Redundant Ban and Deletion Events

Bans and deletions issued by multiple moderators for the same person or chat are logged separately. For simplicity, you can safely ignore all but the first event in chronological order.
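
A minimal sketch of keeping only the first ban per person. The column names timestamp and videoId are hypothetical (the event file schemas are not listed here), as is the choice to treat one author in one stream as the unit of deduplication. Timestamps are assumed to be ISO 8601 strings in a single time zone, so they sort correctly as text.

```python
def first_event_per_target(events):
    """Keep the earliest event for each (videoId, authorChannelId) pair."""
    first = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["videoId"], event["authorChannelId"])
        # setdefault stores only the first (earliest) event seen for each key.
        first.setdefault(key, event)
    return list(first.values())


# Made-up ban events: two moderators ban author a1 in the same stream.
ban_events = [
    {"timestamp": "2021-03-01T12:00:05Z", "videoId": "v1", "authorChannelId": "a1"},
    {"timestamp": "2021-03-01T12:00:02Z", "videoId": "v1", "authorChannelId": "a1"},
    {"timestamp": "2021-03-01T12:10:00Z", "videoId": "v1", "authorChannelId": "a2"},
]
deduplicated = first_event_per_target(ban_events)
print([(e["authorChannelId"], e["timestamp"]) for e in deduplicated])
```

For deletion events, the same approach applies with the chat id as the key.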

Citation

@misc{vtuber-livechat-dataset,
 author={Yasuaki Uechi},
 title={VTuber 1B: Large-scale Live Chat and Moderation Events Dataset},
 year={2022},
 month={2},
 version={37},
 url={https://sigvt.org/vtuber-1b}
}

License

Tables

Channels

@kaggle.uetchy_vtuber_livechat_elements.channels
  • 195.71 KB
  • 1358 rows
  • 8 columns

CREATE TABLE channels (
  "channelid" VARCHAR,
  "name" VARCHAR,
  "englishname" VARCHAR,
  "affiliation" VARCHAR,
  "group" VARCHAR,
  "subscriptioncount" BIGINT,
  "videocount" BIGINT,
  "photo" VARCHAR
);

Chat Stats

@kaggle.uetchy_vtuber_livechat_elements.chat_stats
  • 286.54 KB
  • 12468 rows
  • 8 columns

CREATE TABLE chat_stats (
  "channelid" VARCHAR,
  "period" TIMESTAMP,
  "chats" BIGINT,
  "memberchats" BIGINT,
  "uniquechatters" BIGINT,
  "uniquemembers" BIGINT,
  "bannedchatters" BIGINT,
  "deletedchats" BIGINT
);

Superchat Stats

@kaggle.uetchy_vtuber_livechat_elements.superchat_stats
  • 253.18 KB
  • 10942 rows
  • 10 columns

CREATE TABLE superchat_stats (
  "channelid" VARCHAR,
  "period" TIMESTAMP,
  "superchats" BIGINT,
  "uniquesuperchatters" BIGINT,
  "totalsc" BIGINT,
  "averagesc" BIGINT,
  "totalmessagelength" BIGINT,
  "averagemessagelength" BIGINT,
  "mostfrequentcurrency" VARCHAR,
  "mostfrequentcolor" VARCHAR
);
