https://github.com/eddaoust/chatwithstarterstory

A basic Retrieval-Augmented Generation (RAG) implementation for testing purposes, built to enable conversational interactions with Starter Story YouTube videos.

daisyui docker-compose elasticsearch embedings llphant openai php8 postgresql rag rag-chatbot symfony tailwindcss

Host: GitHub
URL: https://github.com/eddaoust/chatwithstarterstory
Owner: Eddaoust
Created: 2025-06-25T07:40:55.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-09T13:53:24.000Z (about 1 year ago)
Last Synced: 2025-07-09T14:51:54.473Z (about 1 year ago)
Topics: daisyui, docker-compose, elasticsearch, embedings, llphant, openai, php8, postgresql, rag, rag-chatbot, symfony, tailwindcss
Language: PHP
Homepage:
Size: 110 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Chat with Starter Story - RAG Implementation

A basic **Retrieval-Augmented Generation (RAG)** implementation for testing purposes, built to enable conversational interactions with [Starter Story YouTube videos](https://www.youtube.com/@starterstory).

## 🚀 Technology Stack

- **Backend**: Symfony 7.3 (PHP 8.4)
- **Database**: PostgreSQL 17
- **Search Engine**: Elasticsearch 8.13.2
- **AI/ML**: OpenAI API with [LLPhant library](https://github.com/LLPhant/LLPhant)
- **Frontend**: Tailwind (4.1) & DaisyUI
- **Web Server**: Caddy

## 📋 Prerequisites

- Docker and Docker Compose
- [OpenAI](https://platform.openai.com/) API key
- [Supadata](https://supadata.ai/) API access (for YouTube data fetching)
- [YouTube](https://console.cloud.google.com/) API key

## 🖥️ Demo

You can try a demo [here](https://chat.eddaoust.com/).

## 🛠️ Installation and Setup

### 1. Clone the Repository
```bash
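# Clone into a directory named ChatWithStarterStory
git clone https://github.com/eddaoust/chatwithstarterstory.git ChatWithStarterStory
cd ChatWithStarterStory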
```

### 2. Environment Configuration
Copy the environment file and configure it:
```bash
cp .env .env.local
```

Add your API keys to `.env.local`:
```env
OPENAI_API_KEY=your_openai_api_key_here
SUPADATA_API_KEY=your_supadata_api_key_here
YOUTUBE_API_KEY=your_youtube_api_key_here
```
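
Before starting the containers, it can help to confirm that all three keys are actually set. The following is a small Python sketch, not part of the project (which is PHP), that parses dotenv-style text and reports keys that are missing or still hold a placeholder. The function names and the `your_..._here` placeholder convention are assumptions taken from the example above.

```python
from pathlib import Path

# Keys listed in the README's .env.local example.
REQUIRED_KEYS = ("OPENAI_API_KEY", "SUPADATA_API_KEY", "YOUTUBE_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str) -> list[str]:
    """Return required keys that are absent, empty, or still placeholders."""
    env = parse_env(text)
    missing = []
    for key in REQUIRED_KEYS:
        value = env.get(key, "")
        is_placeholder = value.startswith("your_") and value.endswith("_here")
        if not value or is_placeholder:
            missing.append(key)
    return missing


if __name__ == "__main__":
    path = Path(".env.local")
    if path.exists():
        print(missing_keys(path.read_text()) or "all keys set")
    else:
        print(".env.local not found")
```

Run it from the project root; an empty result means every required key has a non-placeholder value.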

### 3. Start the Application
```bash
# Build and start all services
docker compose --env-file .env.docker up -d --build

# Access the PHP container
docker exec -ti php /bin/bash
```

### 4. Install Dependencies & Setup Database
Inside the PHP container:
```bash
# Install Composer dependencies
composer install

# Create database and run migrations
bin/console doctrine:database:create
bin/console doctrine:migrations:migrate

# Build Tailwind CSS and watch for changes (run in a separate terminal, since --watch keeps running)
bin/console tailwind:build --watch
```

### 5. Access the Application
- **Web Interface**: http://localhost:8080 (Caddy will proxy to the Symfony app)
- **Elasticsearch**: http://localhost:9200
- **Database**: PostgreSQL on its default port (5432), using the credentials from `.env.docker`

## 📊 Data Generation for Embeddings

The RAG system requires a three-step data preparation process:

### Step 1: Import YouTube Videos
```bash
bin/console app:import-youtube-videos
```
This command:
- Fetches videos from the Starter Story YouTube channel
- Retrieves video metadata (title, description, thumbnail, etc.)
- Stores video information in the PostgreSQL database
- Processes up to 100 videos in batches

### Step 2: Create Transcription Chunks
```bash
bin/console app:create-transcription-chunks
```
This command:
- Fetches transcriptions for each imported video using the Supadata API
- Breaks transcriptions into manageable chunks with timestamps
- Creates `TranscriptionChunk` entities with content, offset, and duration
- Respects API rate limits with built-in delays
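
To illustrate the chunking idea, here is a minimal Python sketch (the project itself is PHP, and its actual chunking strategy is not documented here). It assumes the transcript arrives as timed segments and groups them into chunks of at most `max_chars` characters, keeping the offset of the first segment and the total covered duration. The `Segment` and `Chunk` types, the millisecond units, and the 500-character limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    offset_ms: int    # start time of the segment in the video
    duration_ms: int


@dataclass
class Chunk:
    content: str
    offset_ms: int    # start time of the first segment in the chunk
    duration_ms: int  # time span covered by all segments in the chunk


def _merge(segments: list[Segment]) -> Chunk:
    """Join segment texts and compute the span from first start to last end."""
    first, last = segments[0], segments[-1]
    return Chunk(
        content=" ".join(s.text for s in segments),
        offset_ms=first.offset_ms,
        duration_ms=last.offset_ms + last.duration_ms - first.offset_ms,
    )


def build_chunks(segments: list[Segment], max_chars: int = 500) -> list[Chunk]:
    """Group consecutive segments into chunks of at most max_chars characters.

    A single segment longer than max_chars still becomes its own chunk.
    """
    chunks: list[Chunk] = []
    current: list[Segment] = []
    size = 0
    for seg in segments:
        # Count one joining space when the chunk already has content.
        added = len(seg.text) + (1 if current else 0)
        if current and size + added > max_chars:
            chunks.append(_merge(current))
            current, size = [], 0
            added = len(seg.text)
        current.append(seg)
        size += added
    if current:
        chunks.append(_merge(current))
    return chunks
```

Keeping the offset on each chunk is what later allows a response to link to the exact moment in a video.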

### Step 3: Generate Embeddings
```bash
bin/console app:generate-embeddings
```
This command:
- Processes transcription chunks that don't have embeddings
- Generates vector embeddings using OpenAI's embedding model
- Stores embeddings for semantic search capabilities
- Processes chunks in batches of 25 for optimal performance
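
The batching behaviour can be sketched as follows. This is an illustrative Python version, not the project's PHP code: chunks are plain dictionaries, and `embed_batch` is a hypothetical callable standing in for whatever client sends a list of texts to the embedding model and returns one vector per text. Only the batch size of 25 comes from the description above.

```python
from typing import Callable, Iterator

BATCH_SIZE = 25  # batch size stated in the README


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_missing(
    chunks: list[dict],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = BATCH_SIZE,
) -> int:
    """Fill in embeddings for chunks that lack one; return how many were processed."""
    pending = [c for c in chunks if c.get("embedding") is None]
    for batch in batched(pending, batch_size):
        vectors = embed_batch([c["content"] for c in batch])
        if len(vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors")
        for chunk, vector in zip(batch, vectors):
            chunk["embedding"] = vector
    return len(pending)
```

Because only chunks without an embedding are selected, re-running the step after an interruption skips work that has already been done.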

## 🧠 How It Works

### RAG Architecture Overview

1. **Data Ingestion**: YouTube videos are imported and transcribed into searchable chunks
2. **Vector Storage**: Text chunks are converted to embeddings and stored in Elasticsearch
3. **Query Processing**: User questions are converted to embeddings for similarity search
4. **Context Retrieval**: Most relevant video chunks are retrieved based on semantic similarity
5. **Response Generation**: An OpenAI LLM generates answers using the retrieved context
6. **Result Presentation**: Responses include relevant video links with timestamps
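
Steps 3 and 4 can be illustrated with a small in-memory Python sketch. In the project the similarity search is delegated to Elasticsearch; the code below only shows the underlying idea and does not use Elasticsearch's API. It assumes cosine similarity as the metric, which this README does not specify, and the chunk dictionary layout is hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunks: list[dict], k: int = 3) -> list[tuple[float, dict]]:
    """Return the k chunks most similar to the query as (score, chunk) pairs."""
    scored = [(cosine_similarity(query_vec, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A linear scan like this is fine for a handful of vectors; a vector-capable search engine exists precisely to avoid scanning every chunk on each query.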

### Data Flow
```
User Question → Embedding → Vector Search → Context Building → LLM Query → Response + Video Links
```
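
The last stages of this flow (context building and video links) can be sketched in Python as below. This is an assumption-laden illustration rather than the project's PHP code: the chunk fields `video_id`, `offset_ms`, and `content`, as well as the prompt wording, are hypothetical. The link format relies on the `t` query parameter of YouTube watch URLs, which takes a start time in seconds.

```python
def video_link(video_id: str, offset_ms: int) -> str:
    """Build a YouTube watch URL that starts at the chunk's offset."""
    seconds = offset_ms // 1000
    return f"https://www.youtube.com/watch?v={video_id}&t={seconds}s"


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Combine the retrieved chunks and the user question into a single prompt."""
    context = "\n\n".join(
        f"[{i}] {c['content']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def source_links(chunks: list[dict]) -> list[str]:
    """Return one timestamped link per chunk, without duplicates, in order."""
    links: list[str] = []
    for c in chunks:
        link = video_link(c["video_id"], c["offset_ms"])
        if link not in links:
            links.append(link)
    return links
```

Numbering the context entries makes it possible to ask the model to cite which chunk supports each part of its answer, and the same ordering can be reused for the list of links shown to the user.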

## 🔧 Development Commands

### Docker Management
```bash
# Stop all services
docker compose down --remove-orphans

# View logs
docker compose logs -f

# Rebuild specific service
docker compose up -d --build php
```

### Asset Management
```bash
# Build Tailwind CSS
bin/console tailwind:build

# Watch for changes
bin/console tailwind:build --watch
```

## 📄 License

This project is for testing and educational purposes. Please ensure compliance with YouTube's Terms of Service and OpenAI's usage policies when using this application.
