A dataset containing video subtitles from popular YouTubers across different categories.

Intro

Founded in 2005, YouTube is one of the internet's biggest platforms, with more than 1 billion videos watched per day. Any user can usually tell genres apart just by glancing at a video's thumbnail and title. My inspiration for making this dataset was to try to answer the question of whether it is equally easy for a computer to do.

The Transcript column in the dataset contains the subtitles for the respective videos. However, the reliability of the subtitles may vary. Auto-generated subtitles work well most of the time, but they sometimes drop the ball when faced with thick accents. Please consult the CC attribute to check whether a subtitle is auto-generated or not. Of the video subtitles, 1381 are auto-generated and the remaining 1134 are manual.
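To act on the CC attribute, one option is to count or filter rows by its value before using the transcripts. The sketch below is only an illustration: the helper names `count_by_cc` and `filter_by_cc` are hypothetical, the sample rows are invented, and the encoding of CC as the strings "0" and "1" is an assumption. Check the column descriptors for the actual values before relying on it.

```python
import csv
import io
from collections import Counter


def count_by_cc(rows, cc_column="CC"):
    """Count how many rows carry each distinct value of the CC column."""
    return Counter(row[cc_column] for row in rows)


def filter_by_cc(rows, wanted, cc_column="CC"):
    """Keep only the rows whose CC value equals `wanted`."""
    return [row for row in rows if row[cc_column] == wanted]


# Tiny in-memory stand-in for the real CSV file.
# The rows and the "0"/"1" CC values are invented for illustration only.
SAMPLE_CSV = """Title,CC,Transcript
Video A,0,hello and welcome
Video B,1,today we look at
Video C,0,so in this video
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = count_by_cc(rows)

# Assumption: "1" marks one kind of subtitle (for example, manual ones).
selected = filter_by_cc(rows, "1")

print(counts)
print([row["Title"] for row in selected])
```

With the real file, replacing `io.StringIO(SAMPLE_CSV)` with an opened file handle should work the same way, and the counts can be compared against the 1381 and 1134 figures quoted above.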

The values of Subscribers and Views reflect the time at which the dataset was generated, so take that into account when using them. The most recent version of this dataset was generated on 05-Feb-2022.

Description

This dataset contains subtitles from over 91 different YouTubers, spanning many different categories. The data were collected and cleaned (as much as necessary) by me. Currently, the dataset contains 2515 unique videos and their subtitles. There are 11 columns in the dataset; their purposes are listed in the column descriptors.

Improvements

I am open to suggestions. Please feel free to let me know of any major Categories or Channels that I've missed or that you'd like to see included, and I'll try my best to add them to the dataset. You can find the dataset page on my GitHub.

![drawing](https://github.githubassets.com/images/modules/logos_page/GitHub-Logo.png =100x20)
