Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/mukhopadhyay/youtubers-saying-things

Dataset containing popular YouTuber channel's video subtitles
https://github.com/mukhopadhyay/youtubers-saying-things

classification dataset kaggle kaggle-dataset nlp subtitles youtube

Last synced: 23 days ago
JSON representation

Dataset containing popular YouTuber channel's video subtitles

Awesome Lists containing this project

README

        

![background](documentation/background.jpg)

# Dataset with popular YouTuber's video subtitles

### Find it on:

[kaggle](https://www.kaggle.com/praneshmukhopadhyay/youtubers-saying-things)

## Intro

Founded and maintained since 2005, YouTube is one of the internet's biggest platforms. With their number of videos watched per day exceeding **1 Billion**, it's easy for any user to differentiate genres just by glancing at the thumbnail and the Title. My inspiration to make this dataset was to try and answer the question of _whether it is equally easy for a computer to do_.

The `Transcript` column in the dataset contains the subtitles for the respective videos. However, the reliability of the subtitles may vary. Even though the auto-generated subtitles work great (most of the time). Sometimes under heavy pressure of thick accents, it lets go of the ball. Please consult the `CC` attribute to check whether the subtitle is auto-generated or not. `1381` of these video subtitles are auto-generated, the rest of the `1134` are manual ones.

Since the values of `Subscribers` and `Views` are based on the time when the dataset was generated. That's to be taken into account. The most recent version of this dataset was generated on `05-Feb-2022`.

## Description

This dataset contains subtitles from over `91` different YouTubers, ranging from all different kinds of categories. The data were collected and cleaned (as much as necessary) by me. Currently, the dataset contains `2515` unique videos and their subtitles. There are `11` columns in the dataset.
The purpose of each column is as follows:

| **No** | **Column** | **Dtype** | **Description** |
| :----- | :------------ | :-------- | :------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1 | `Id` | `str` | Unique ID for the video. (e.g., `dQw4w9WgXcQ`) |
| 2 | `Channel` | `str` | Name of the YouTube channel. |
| 3 | `Subscribers` | `str` | How many subscribers did the channel have while collecting the dataset |
| 4 | `Title` | `str` | Title of the video |
| 5 | `CC` | `int` | Did the video have manual subtitles? (Possible values `0` or `1`, where `0` means that the Transcript is auto-generated, and may be less reliable) |
| 6 | `URL` | `str` | URL of the video (e.g., `https://www.youtube.com/watch?v=dQw4w9WgXcQ`) |
| 7 | `Released` | `str` | When the video was released |
| 8 | `Views` | `str` | How many views did the video have during the collection of this dataset |
| 9 | `Category` | `str` | Category of the channel (e.g., **Science**, **Comedy**, etc) |
| 10 | `Transcript` | `str` | Subtitle for the video |
| 11 | `Length` | `str` | Duration of the video |

## Genres and channels

| **Genre** | **Channels** | **#** |
| :-------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---- |
| `Comedy` | **Brooklyn Nine-Nine**, **DRIVETRIBE**, **Incognito Mode**, **Internet Historian**, **Joma Tech**, **Key & Peele**, **OverSimplified**, **Parks and Recreation**, **penguinz0**, **Screen Junkies**, **Team Coco**, **The Graham Norton Show**, **The Grand Tour**, **The Office**, **Top Gear** | 15 |
| `Science` | **3Blue1Brown**, **Dr. Becky**, **ElectroBOOM**, **Kurzgesagt – In a Nutshell**, **Lex Clips**, **Mark Rober**, **NileRed**, **PBS Space Time**, **SEA**, **SmarterEveryDay**, **Veritasium**, **Vsauce** | 12 |
| `Automobile` | **Car Throttle**, **carwow**, **ChrisFix**, **Donut Media**, **DRIVETRIBE**, **The Grand Tour**, **The Stig**, **TheStraightPipes**, **Throttle House**, **Top Gear** | 10 |
| `VideoGames` | **EpicNameBro**, **gameranx**, **jacksepticeye**, **Let's Game It Out**, **LevelCapGaming**, **LGR**, **Markiplier**, **Shroud**, **TheWarOwl**, **videogamedunkey** | 10 |
| `Food` | **About To Eat**, **Bon Appétit**, **Eater**, **Epicurious**, **First We Feast**, **FoodTribe**, **Gordon Ramsay**, **Hell's Kitchen**, **Munchies**, **Mythical Kitchen**, **The F Word** | 11 |
| `Entertainment` | **Brooklyn Nine-Nine**, **BuzzFeedVideo**, **Doctor Who**, **First We Feast**, **MrBeast**, **Parks and Recreation**, **Screen Junkies**, **Team Coco**, **The Graham Norton Show**, **The Office**, **The Try Guys** | 11 |
| `Informative` | **Barely Sociable**, **Captain Disillusion**, **CNET**, **Coder Coder**, **Internet Historian**, **JCS - Criminal Psychology**, **LEMMiNO**, **OverSimplified**, **Sam O'Nella Academy**, **The Infographics Show**, **Vox** | 11 |
| `Blog` | **Abroad in Japan**, **CDawgVA**, **DramaAlert**, **Drew Gooden**, **Incognito Mode**, **Joma in NYC**, **JRE Clips**, **Lex Clips**, **MrBeast**, **penguinz0**, **The Try Guys** | 11 |
| `News` | **A&E**, **BBC News**, **Insider News**, **NBC News**, **NowThis News**, **Sky News**, **SomeGoodNews**, **TechLinked**, **The Daily Show with Trevor Noah**, **VICE** | 10 |
| `Tech` | **Austin Evans**, **Coder Coder**, **Fireship**, **Hardware Canucks**, **Joma Tech**, **Linus Tech Tips**, **Marques Brownlee**, **TechLinked**, **Techquickie**, **Web Dev Simplified** | 10 |

## Note

I am open to suggestions please feel free to let me know of any major Categories or channels that I've missed or you'll like to be included. I'll try my best to include them to the dataset.

Thanks!

## Todo

- Data extraction script
- Seperate list of channels