https://github.com/eddaoust/chatwithstarterstory
A basic Retrieval-Augmented Generation (RAG) implementation for testing purposes, built to enable conversational interactions with Starter Story YouTube videos.
daisyui docker-compose elasticsearch embedings llphant openai php8 postgresql rag rag-chatbot symfony tailwindcss
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/eddaoust/chatwithstarterstory
- Owner: Eddaoust
- Created: 2025-06-25T07:40:55.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-07-09T13:53:24.000Z (12 months ago)
- Last Synced: 2025-07-09T14:51:54.473Z (12 months ago)
- Topics: daisyui, docker-compose, elasticsearch, embedings, llphant, openai, php8, postgresql, rag, rag-chatbot, symfony, tailwindcss
- Language: PHP
- Homepage:
- Size: 110 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Chat with Starter Story - RAG Implementation
A basic **Retrieval-Augmented Generation (RAG)** implementation for testing purposes, built to enable conversational interactions with [Starter Story YouTube videos](https://www.youtube.com/@starterstory).
## 🚀 Technology Stack
- **Backend**: Symfony 7.3 (PHP 8.4)
- **Database**: PostgreSQL 17
- **Search Engine**: Elasticsearch 8.13.2
- **AI/ML**: OpenAI API with [LLPhant library](https://github.com/LLPhant/LLPhant)
- **Frontend**: Tailwind (4.1) & DaisyUI
- **Web Server**: Caddy
## 📋 Prerequisites
- Docker and Docker Compose
- [OpenAI](https://platform.openai.com/) API key
- [Supadata](https://supadata.ai/) API access (for YouTube data fetching)
- [YouTube](https://console.cloud.google.com/) API key
## 🖥️ Demo
A live demo is available at [chat.eddaoust.com](https://chat.eddaoust.com/).
## 🛠️ Installation Setup
### 1. Clone the Repository
```bash
git clone https://github.com/eddaoust/chatwithstarterstory.git ChatWithStarterStory
cd ChatWithStarterStory
```
### 2. Environment Configuration
Copy the environment file and configure it:
```bash
cp .env .env.local
```
Add your API keys to `.env.local`:
```env
OPENAI_API_KEY=your_openai_api_key_here
SUPADATA_API_KEY=your_supadata_api_key_here
YOUTUBE_API_KEY=your_youtube_api_key_here
```
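Before starting the containers, it can help to confirm that all three keys are actually set. The following is a minimal sketch in Python (not part of the project, which is written in PHP); the helper names `parse_env` and `missing_keys` are hypothetical, and the parser assumes simple `KEY=value` lines only.

```python
from pathlib import Path

# The three keys this README asks for in .env.local.
REQUIRED_KEYS = ("OPENAI_API_KEY", "SUPADATA_API_KEY", "YOUTUBE_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str) -> list[str]:
    """Return required keys that are absent, empty, or still placeholders."""
    values = parse_env(text)
    return [
        key for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith("your_")
    ]


if __name__ == "__main__":
    env_file = Path(".env.local")
    text = env_file.read_text() if env_file.exists() else ""
    missing = missing_keys(text)
    print("Missing keys:", ", ".join(missing) if missing else "none")
```

Run from the project root, this prints any key that is missing or still holds the `your_..._here` placeholder.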
### 3. Start the Application
```bash
# Build and start all services
docker compose --env-file .env.docker up -d --build
# Access the PHP container
docker exec -ti php /bin/bash
```
### 4. Install Dependencies & Setup Database
Inside the PHP container:
```bash
# Install Composer dependencies
composer install
# Create database and run migrations
bin/console doctrine:database:create
bin/console doctrine:migrations:migrate
# Build Tailwind CSS and rebuild on changes (run in a separate terminal, since --watch keeps running)
bin/console tailwind:build --watch
```
### 5. Access the Application
- **Web Interface**: http://localhost:8080 (Caddy will proxy to the Symfony app)
- **Elasticsearch**: http://localhost:9200
- **Database**: PostgreSQL on its default port (5432), with credentials from `.env.docker`
## 📊 Data Generation for Embeddings
The RAG system requires a three-step data preparation process:
### Step 1: Import YouTube Videos
```bash
bin/console app:import-youtube-videos
```
This command:
- Fetches videos from the Starter Story YouTube channel
- Retrieves video metadata (title, description, thumbnail, etc.)
- Stores video information in the PostgreSQL database
- Processes up to 100 videos in batches
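The "up to 100 videos in batches" behaviour can be pictured as a capped pagination loop. This is an illustrative Python sketch, not the project's PHP command: `fetch_page` is a hypothetical callable standing in for the real API client, and the page size of 50 is an assumption.

```python
from typing import Callable, Iterator, Optional

# Hypothetical page shape: (list of video dicts, token for the next page or None).
Page = tuple[list[dict], Optional[str]]


def iter_videos(
    fetch_page: Callable[[Optional[str], int], Page],
    limit: int = 100,
    page_size: int = 50,
) -> Iterator[dict]:
    """Yield at most `limit` videos, requesting them one page at a time."""
    token: Optional[str] = None
    remaining = limit
    while remaining > 0:
        # Never ask for more than is still needed to reach the cap.
        videos, token = fetch_page(token, min(page_size, remaining))
        accepted = videos[:remaining]
        for video in accepted:
            yield video
        remaining -= len(accepted)
        # Stop when the source has no further pages.
        if token is None or not videos:
            break
```

Injecting `fetch_page` keeps the loop independent of any particular HTTP client, so the cap and stop conditions can be checked with a fake data source.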
### Step 2: Create Transcription Chunks
```bash
bin/console app:create-transcription-chunks
```
This command:
- Fetches transcriptions for each imported video using Supadata API
- Breaks transcriptions into manageable chunks with timestamps
- Creates `TranscriptionChunk` entities with content, offset, and duration
- Respects API rate limits with built-in delays
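As a rough illustration of the chunking step, the Python sketch below merges consecutive transcript segments into larger chunks while tracking where each chunk starts and how long it lasts. It is not the project's PHP code. The segment shape (`text`, `offset`, `duration`), the millisecond unit, and the `max_chars` limit are assumptions; only the `content`, `offset`, and `duration` fields come from the description above.

```python
from dataclasses import dataclass


@dataclass
class TranscriptionChunk:
    content: str
    offset: int    # start of the chunk, same unit as the input (assumed ms)
    duration: int  # time span covered by the chunk


def build_chunks(segments: list[dict], max_chars: int = 1000) -> list[TranscriptionChunk]:
    """Merge consecutive transcript segments into chunks of at most ~max_chars."""
    chunks: list[TranscriptionChunk] = []
    texts: list[str] = []
    start = end = 0
    for seg in segments:
        # Flush the current chunk if adding this segment would overflow it.
        if texts and len(" ".join(texts)) + 1 + len(seg["text"]) > max_chars:
            chunks.append(TranscriptionChunk(" ".join(texts), start, end - start))
            texts = []
        if not texts:
            start = seg["offset"]  # a new chunk starts at this segment
        texts.append(seg["text"])
        end = seg["offset"] + seg["duration"]
    if texts:
        chunks.append(TranscriptionChunk(" ".join(texts), start, end - start))
    return chunks
```

Keeping the offset on each chunk is what later allows an answer to link to the exact moment in a video.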
### Step 3: Generate Embeddings
```bash
bin/console app:generate-embeddings
```
This command:
- Processes transcription chunks that don't have embeddings
- Generates vector embeddings using OpenAI's embedding model
- Stores embeddings for semantic search capabilities
- Processes chunks in batches of 25 for optimal performance
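The batching logic can be sketched as follows, in Python for illustration only (the project itself uses PHP with LLPhant). `embed_texts` is a hypothetical callable standing in for the embedding API call, and chunks are modelled as plain dicts with assumed `content` and `embedding` keys.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[dict], size: int = 25) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_pending(
    chunks: list[dict],
    embed_texts: Callable[[list[str]], list[list[float]]],
    batch_size: int = 25,
) -> int:
    """Attach embeddings to chunks that lack one; return how many were embedded."""
    pending = [c for c in chunks if c.get("embedding") is None]
    for batch in batched(pending, batch_size):
        # One call per batch instead of one call per chunk.
        vectors = embed_texts([c["content"] for c in batch])
        for chunk, vector in zip(batch, vectors):
            chunk["embedding"] = vector
    return len(pending)
```

Because only chunks without an embedding are selected, running the step a second time does no extra work, which matches the "chunks that don't have embeddings" behaviour described above.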
## 🧠 How It Works
### RAG Architecture Overview
1. **Data Ingestion**: YouTube videos are imported and transcribed into searchable chunks
2. **Vector Storage**: Text chunks are converted to embeddings and stored in Elasticsearch
3. **Query Processing**: User questions are converted to embeddings for similarity search
4. **Context Retrieval**: Most relevant video chunks are retrieved based on semantic similarity
5. **Response Generation**: OpenAI LLM generates answers using retrieved context
6. **Result Presentation**: Responses include relevant video links with timestamps
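To make the retrieval step concrete, here is a brute-force similarity search in plain Python. In the project this search is delegated to Elasticsearch, so the code below is only a conceptual sketch; the use of cosine similarity, the `embedding` key, and the default `k` are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

A vector index avoids scoring every chunk on each query, which is why a search engine takes over this role once the collection grows.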
### Data Flow
```
User Question → Embedding → Vector Search → Context Building → LLM Query → Response + Video Links
```
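The context-building and video-link stages of this flow could look roughly like the Python sketch below. It is illustrative and not taken from the project: the chunk keys (`title`, `video_id`, `offset`, `content`), the millisecond offset, and the prompt wording are all assumptions.

```python
def video_link(video_id: str, offset_ms: int) -> str:
    """Build a YouTube URL that starts playback at the chunk's offset."""
    return f"https://www.youtube.com/watch?v={video_id}&t={offset_ms // 1000}s"


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble retrieved chunks and the user question into one LLM prompt."""
    context = "\n\n".join(
        f"[{i}] {c['title']} ({video_link(c['video_id'], c['offset'])})\n{c['content']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering each context entry and attaching its link gives the model something to cite, so the final response can point back to the source video and timestamp.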
## 🔧 Development Commands
### Docker Management
```bash
# Stop all services
docker compose down --remove-orphans
# View logs
docker compose logs -f
# Rebuild specific service
docker compose up -d --build php
```
### Asset Management
```bash
# Build Tailwind CSS
bin/console tailwind:build
# Watch for changes
bin/console tailwind:build --watch
```
## 📄 License
This project is for testing and educational purposes. Please ensure compliance with YouTube's Terms of Service and OpenAI's usage policies when using this application.