Intro
Founded in 2005 and running ever since, YouTube is one of the internet's biggest platforms, with more than 1 billion videos watched per day. Any user can easily tell genres apart just by glancing at a video's thumbnail and title. My inspiration for making this dataset was to try to answer the question of whether it is equally easy for a computer to do.
The Transcript column in the dataset contains the subtitles for the respective videos. However, the reliability of the subtitles may vary: the auto-generated subtitles work great most of the time, but under the pressure of thick accents they sometimes drop the ball. Please consult the CC attribute to check whether a subtitle is auto-generated or not. Of the video subtitles, 1381 are auto-generated and the remaining 1134 are manual.
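As a rough illustration of using the CC attribute to separate auto-generated subtitles from manual ones, here is a minimal sketch using only the Python standard library. The column names Transcript and CC come from the dataset; everything else is an assumption made for the example: the inline sample data, the Id column, the helper name split_by_caption_type, and in particular the "auto" / "manual" encoding of CC, which may differ in the actual file.

```python
import csv
import io

# Hypothetical sample standing in for the real file. Only the Transcript and
# CC column names come from the dataset; the Id column and the "auto"/"manual"
# encoding of CC are assumptions for illustration.
SAMPLE_CSV = """Id,CC,Transcript
a1,auto,hello and welcome back to the channel
b2,manual,"Today we look at three recipes."
c3,auto,so today were going to talk about
"""


def split_by_caption_type(rows, cc_column="CC", auto_value="auto"):
    """Partition rows into (auto_generated, manual) based on the CC column."""
    auto, manual = [], []
    for row in rows:
        if row[cc_column] == auto_value:
            auto.append(row)
        else:
            manual.append(row)
    return auto, manual


# Parse the sample into one dict per video, then split by caption type.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
auto_rows, manual_rows = split_by_caption_type(rows)
print(len(auto_rows), "auto-generated,", len(manual_rows), "manual")
```

To try this on the real data, replace the sample with the dataset file and adjust auto_value to whatever the CC column actually contains. Training or evaluating only on the manual subset is one way to limit the effect of transcription errors.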
The values of Subscribers and Views reflect the time at which the dataset was generated, so take that into account when using them. The most recent version of this dataset was generated on 05-Feb-2022.
Description
This dataset contains subtitles from over 91 different YouTubers across a wide range of categories. I collected and cleaned the data (as much as necessary) myself. Currently, the dataset contains 2515 unique videos and their subtitles. There are 11 columns in the dataset; you can find their purpose in the column descriptors.
Improvements
I am open to suggestions. Please feel free to let me know of any major categories or channels that I've missed or that you'd like to see included, and I'll try my best to add them to the dataset. Find the dataset page on my GitHub.
