Context 📃
I wanted to practice text classification with NLP techniques, so I decided to generate the data myself!
This way, I brushed up on my scraping techniques using Selenium, collected the data, cleaned it, and then started working on it.
You can take a peek at my work in the GitHub repository for this dataset and the trained models/results.
Content 📰
I scraped 3,600 videos in total and collected the following fields from each one:
| link | title | description | category |
| --- | --- | --- | --- |
| Video ID | Title of the video | Description of the video | Category for which the video was scraped |
I queried the videos for 4 categories:
Travel Vlogs 🧳
Food 🥑
Art and Music 🎨 🎻
History 📜
Acknowledgements 🙏
I could have used a ready-made API, but just for the fun of it, I scraped the data from YouTube using Selenium.
Inspiration 🦋
The data is not clean (for your enjoyment of cleaning the data!), has some missing values, and is imbalanced.
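A good first step is to check how many values are missing and how uneven the categories are. Here is a minimal sketch using only the standard library. The sample rows and the category label strings are made up for illustration (only the four column names come from this dataset), so treat it as a starting point rather than the actual cleaning code.

```python
from collections import Counter

# Hypothetical sample rows using the dataset's four columns; real values will differ.
rows = [
    {"link": "a1", "title": "Street food tour", "description": "Trying local snacks", "category": "food"},
    {"link": "b2", "title": "Backpacking Europe", "description": "", "category": "travel"},
    {"link": "c3", "title": None, "description": "A short history of Rome", "category": "history"},
    {"link": "d4", "title": "Best pasta recipe", "description": "Easy dinner idea", "category": "food"},
]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or not str(value).strip()


def missing_per_column(rows):
    """Count missing values in each column (columns with none are omitted)."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if is_missing(value):
                counts[column] += 1
    return dict(counts)


def class_distribution(rows):
    """Count how many rows fall into each category."""
    return dict(Counter(row["category"] for row in rows))


missing = missing_per_column(rows)
distribution = class_distribution(rows)
print("Missing values per column:", missing)
print("Rows per category:", distribution)
```

On the real data you would load the rows from the dataset file first, for example with `csv.DictReader`, and then decide whether to drop or fill the rows that have missing text.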
Practice text classification on this dataset. Along the way you will have to learn different techniques, for example how to handle imbalanced classes.
While working on this dataset, you will learn a lot of different things and get a chance to apply them right away.
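One common way to handle imbalanced classes is to give rarer classes a larger weight during training. The sketch below computes such weights with the formula `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn uses for `class_weight="balanced"`. The label list is a toy example I made up, not the real class counts of this dataset, and class weighting is only one option (resampling is another).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy, made-up label distribution with four categories (not the real counts).
labels = ["food"] * 6 + ["travel"] * 3 + ["history"] * 2 + ["art_music"] * 1

weights = balanced_class_weights(labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.2f}")
```

The resulting dictionary can be passed to any classifier that accepts per-class weights, so mistakes on the rare categories cost more than mistakes on the common ones.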