https://github.com/vdutts7/ai-mreflow

YouTubeGPT • AI Chat with 100+ videos ft. YouTuber Matt Wolfe (@mreflow) 🐺🟣🤖💬
ai chatbot langchain pinecone vector-database vector-embeddings youtube-api-v3

Host: GitHub
URL: https://github.com/vdutts7/ai-mreflow
Owner: vdutts7
Created: 2023-06-21T19:08:47.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-11-30T21:39:01.000Z (almost 2 years ago)
Last Synced: 2025-04-27T00:32:23.633Z (6 months ago)
Topics: ai, chatbot, langchain, pinecone, vector-database, vector-embeddings, youtube-api-v3
Language: TypeScript
Homepage: https://mreflow-ai.vercel.app/
Size: 70.5 MB
Stars: 32
Watchers: 1
Forks: 3
Open Issues: 17
Metadata Files:
- Readme: README.md

YouTubeGPT ft. Matt Wolfe (@mreflow) 
  AI Chatbot with 100+ videos from YouTuber Matt Wolfe  @mreflow  
 
   
 

## Table of Contents

  


- 📝 About
- 💻 How to build
  - Initial setup
  - Handle massive data
  - Embeddings and database backend
  - Frontend UI with chat
  - Run app
- 🚀 Next steps
  - Deploy
  - Customizations
- 🔧 Built With
- 👤 Contact

  





## 📝 About

Chat with 100+ YouTube videos from any creator in less than 10 minutes. This project combines basic Python scripting, vector embeddings, OpenAI, Pinecone, and Langchain into a modern chat interface, so you can quickly reference any content your favorite YouTuber covers. Ask questions in natural language and get detailed answers (1) in the style and tone of your YouTuber, and (2) with the top 2-3 referenced videos hyperlinked.

(back to top)
 

## 💻 How to build 

_Note: macOS version, adjust accordingly for Windows / Linux_

### Initial setup

Clone and install dependencies:

```
git clone https://github.com/vdutts7/ai-mreflow
cd ai-mreflow
npm i
```

Copy `.env.example` to `.env` in the root directory and fill in the API keys:

```
ASSEMBLY_AI_API_TOKEN=""
OPENAI_API_KEY=""
PINECONE_API_KEY=""
PINECONE_ENVIRONMENT=""
PINECONE_INDEX=""
```
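Before running any of the scripts, it can help to confirm that none of these values were left blank. The following is a minimal sketch, not part of the repository: `missing_env_vars` is a hypothetical helper, and it assumes the variables have already been loaded into the process environment (for example by exporting them in your shell).

```python
import os

# Variable names taken from the .env listing above.
REQUIRED_VARS = [
    "ASSEMBLY_AI_API_TOKEN",
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX",
]


def missing_env_vars(env=None):
    """Return the required names that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to try out without touching your real keys.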

Get API keys:

- [AssemblyAI](https://www.assemblyai.com/docs) - ~ $3.50 per 100 vids

- [OpenAI](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key)

- [Pinecone](https://docs.pinecone.io/docs/quickstart)

      

_**IMPORTANT: Verify that `.gitignore` contains `.env` in it.**_

### Handle massive data

Outline: 

- Export metadata (.csv) of YouTube videos ⬇️

- Download the audio files

- Transcribe audio files

Navigate to the `scripts` folder, which holds all of the data from the YouTube videos.

```
cd scripts
```

Set up the Python environment:

```
conda env list
conda activate youtube-chat
pip install -r requirements.txt
```

  

Scrape the YouTube channel. Replace `@mreflow` with the handle of your choice and pass the number of videos you want included (the script works backwards from the most recent upload). A new `.csv` file will be created at the path given as the last argument:

```

python scripts/scrape_vids.py https://www.youtube.com/@ `` scripts/vid_list/.csv

```

Refer to `example.csv` inside the folder and verify that your output matches its format.



    

Download audio files:

```
python scripts/download_yt_audios.py scripts/vid_list/.csv scripts/audio_files/
```



Transcribe the audio files. The script uses AssemblyAI's API for speech-to-text instead of calling OpenAI's Whisper API directly, which I found too slow and more expensive. Transcribing the 112 Matt Wolfe videos cost me about $3.50.

```
python scripts/transcribe_audios.py scripts/audio_files/ scripts/transcripts
```
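To budget for a different channel, the figure above can be turned into a rough per-video rate. This is a back-of-the-envelope sketch only: `estimate_cost` is a hypothetical helper, and it assumes cost scales linearly with the number of videos and that your videos are about as long as the ones transcribed here. Actual charges depend on total audio duration and current pricing.

```python
def cost_per_video(total_cost_usd=3.50, num_videos=112):
    """Average transcription cost per video, from the figures quoted above."""
    return total_cost_usd / num_videos


def estimate_cost(num_videos, total_cost_usd=3.50, reference_videos=112):
    """Rough linear estimate for a channel with `num_videos` videos (assumption)."""
    return num_videos * cost_per_video(total_cost_usd, reference_videos)


if __name__ == "__main__":
    print(f"Per video: ${cost_per_video():.4f}")
    print(f"Estimate for 200 videos: ${estimate_cost(200):.2f}")
```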



Upsert to Pinecone database:

```
python scripts/pinecone_helper.py scripts/vid_list/.csv scripts/transcripts/
```

For the Pinecone index, I used the P1 pod type since it is optimized for speed. The dimension is 1536 because that is the size of the vectors produced by OpenAI's `text-embedding-ada-002` model, so the index has to match it for upserts and queries to work.
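A dimension mismatch is an easy mistake to make when switching embedding models, so a quick local check before upserting can save a failed run. The sketch below is illustrative and not taken from `pinecone_helper.py`; `check_dimensions` is a hypothetical helper that works on plain Python lists.

```python
EMBEDDING_DIM = 1536  # index dimension described above


def check_dimensions(vectors, expected_dim=EMBEDDING_DIM):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


if __name__ == "__main__":
    sample = [[0.0] * 1536, [0.0] * 768]
    bad = check_dimensions(sample)
    print("Vectors with the wrong dimension:", bad)
```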



### Embeddings and database backend

Breaking down `scripts/pinecone_helper.py` :

- Chunk size of 1000 characters with a 500-character overlap. This worked for me, but experiment and adjust according to your content library's size, complexity, etc.

- Metadata: (1) video url and (2) video title
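The chunking and metadata scheme described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `pinecone_helper.py` (which may rely on a library text splitter): `chunk_text` and `build_records` are hypothetical names, and splitting is done purely by character count.

```python
def chunk_text(text, chunk_size=1000, overlap=500):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def build_records(transcript, url, title):
    """Attach the video url and title to every chunk of one transcript."""
    return [
        {"text": chunk, "metadata": {"url": url, "title": title}}
        for chunk in chunk_text(transcript)
    ]


if __name__ == "__main__":
    records = build_records("a" * 2000, "https://example.com/watch", "Example video")
    print(len(records), "chunks")
```

A larger overlap keeps more context shared between neighbouring chunks at the cost of storing more vectors, which is the trade-off to tune for your own content.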

With the Pinecone vectorstore loaded, we use Langchain's Conversational Retrieval QA to answer questions: it retrieves the most relevant chunks and their metadata from our embeddings and returns a packaged answer to the user.
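The retrieval step boils down to finding the stored vectors closest to the embedded question. In the app this search happens inside Pinecone and the chain; the toy version below only illustrates the idea, assuming cosine similarity as the metric and tiny hand-written vectors. `top_k` and the record layout are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, records, k=3):
    """Return the k records whose vectors are most similar to query_vec."""
    ranked = sorted(
        records,
        key=lambda rec: cosine_similarity(query_vec, rec["vector"]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    records = [
        {"id": "a", "vector": [1.0, 0.0]},
        {"id": "b", "vector": [0.0, 1.0]},
        {"id": "c", "vector": [1.0, 1.0]},
    ]
    print([rec["id"] for rec in top_k([1.0, 0.0], records, k=2)])
```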

The relevant video titles are cited via hyperlinks directly to the video url.
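Turning the retrieved metadata into citations is a small formatting step. The sketch below is one way to do it, assuming each match carries the `url` and `title` metadata described earlier; `format_sources` is a hypothetical helper and the example URLs are placeholders. Several chunks often come from the same video, so duplicates are dropped.

```python
def format_sources(matches, limit=3):
    """Build Markdown links for up to `limit` distinct videos, in match order."""
    seen = set()
    links = []
    for match in matches:
        url = match["metadata"]["url"]
        if url in seen:
            continue  # another chunk from a video we already cited
        seen.add(url)
        links.append(f"[{match['metadata']['title']}]({url})")
        if len(links) == limit:
            break
    return links


if __name__ == "__main__":
    matches = [
        {"metadata": {"url": "https://example.com/v1", "title": "Video one"}},
        {"metadata": {"url": "https://example.com/v1", "title": "Video one"}},
        {"metadata": {"url": "https://example.com/v2", "title": "Video two"}},
    ]
    for link in format_sources(matches):
        print(link)
```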

### Frontend UI with chat

Next.js styled with Tailwind CSS. `src/pages/index.tsx` contains the base skeleton. `src/pages/api/chat-chain.ts` is the heart of the code, where the Langchain connections are outlined.

### Run app

```
npm run dev
```

Go to `http://localhost:3000`. You should be able to type and ask questions now. Done ✅ 





## 🚀 Next steps

### Deploy

I used [Vercel](https://vercel.com/dashboard) as this was a relatively small project.

_Alternatives: Heroku, Firebase, AWS Elastic Beanstalk, DigitalOcean, etc._

### Customizations

**UI/UX:** change to your liking. 

**Bot personality:** edit the prompt template in `/src/pages/api/chat-chain.ts` to fine-tune the bot and gain greater control over its outputs.

(back to top)


## 🔧 Built With

[![Next][Next]][Next-url]

[![Typescript][Typescript]][Typescript-url]

[![Python][Python]][Python-url]

[![Langchain][Langchain]][Langchain-url]

[![OpenAI][OpenAI]][OpenAI-url]

[![AssemblyAI][AssemblyAI]][AssemblyAI-url]

[![Pinecone][Pinecone]][Pinecone-url]

[![Tailwind CSS][TailwindCSS]][TailwindCSS-url]

[![Vercel][Vercel]][Vercel-url]

(back to top)


## 👤 Contact

`me@vdutts7.com` 

🔗 Project Link: `https://github.com/vdutts7/ai-mreflow`

(back to top)


[Python]: https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54

[Python-url]: https://www.python.org/

[Next]: https://img.shields.io/badge/next.js-000000?style=for-the-badge&logo=nextdotjs&logoColor=white

[Next-url]: https://nextjs.org/

[Langchain]: https://img.shields.io/badge/🦜🔗Langchain-DD0031?style=for-the-badge&color=

[Langchain-url]: https://langchain.com/

[TailwindCSS]: https://img.shields.io/badge/Tailwind_CSS-38B2AC?style=for-the-badge&logo=tailwind-css&logoColor=skyblue&color=0A192F

[TailwindCSS-url]: https://tailwindcss.com/

[OpenAI]: https://img.shields.io/badge/OpenAI%20ada--002%20GPT--3.5%20Whisper-0058A0?style=for-the-badge&logo=openai&logoColor=white&color=4aa481

[OpenAI-url]: https://openai.com/

[AssemblyAI]: https://img.shields.io/badge/Assembly_AI-DD0031?style=for-the-badge&logo=https://github.com/vdutts7/yt-ai-chat/public/assemblyai.png&color=blue

[AssemblyAI-url]: https://www.assemblyai.com/

[TypeScript]: https://img.shields.io/badge/TypeScript-007ACC?style=for-the-badge&logo=typescript&logoColor=white

[Typescript-url]: https://www.typescriptlang.org/

[Pinecone]: https://img.shields.io/badge/Pinecone-FFCA28?style=for-the-badge&https://github.com/vdutts7/yt-ai-chat/public/pinecone.png&logoColor=black&color=white

[Pinecone-url]: https://www.pinecone.io/

[Vercel]: https://img.shields.io/badge/Vercel-FFFFFF?style=for-the-badge&logo=Vercel&logoColor=white&color=black

[Vercel-url]: https://Vercel.com/
