{"id":31037032,"url":"https://github.com/abrahamkoloboe27/youtube-transcript-rag-project","last_synced_at":"2026-04-02T03:11:49.329Z","repository":{"id":310678444,"uuid":"1040721216","full_name":"abrahamkoloboe27/youtube-transcript-rag-project","owner":"abrahamkoloboe27","description":"Chat with your YouTube Video","archived":false,"fork":false,"pushed_at":"2025-12-22T09:54:28.000Z","size":638,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-23T20:54:21.960Z","etag":null,"topics":["ci-cd","docker","embeddings","huggingface","langchain","python","rag","rag-chatbot","streamlit","uv"],"latest_commit_sha":null,"homepage":"https://youtube-transcript-rag.streamlit.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abrahamkoloboe27.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-19T12:07:41.000Z","updated_at":"2025-12-22T09:54:27.000Z","dependencies_parsed_at":"2025-08-25T05:02:12.299Z","dependency_job_id":null,"html_url":"https://github.com/abrahamkoloboe27/youtube-transcript-rag-project","commit_stats":null,"previous_names":["abrahamkoloboe27/youtube-transcript-rag-project"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/abrahamkoloboe27/youtube-transcript-rag-project","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abrahamkoloboe27%2Fyoutube-transcript-rag-project","tags_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abrahamkoloboe27%2Fyoutube-transcript-rag-project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abrahamkoloboe27%2Fyoutube-transcript-rag-project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abrahamkoloboe27%2Fyoutube-transcript-rag-project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abrahamkoloboe27","download_url":"https://codeload.github.com/abrahamkoloboe27/youtube-transcript-rag-project/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abrahamkoloboe27%2Fyoutube-transcript-rag-project/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31294925,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T01:43:37.129Z","status":"online","status_checked_at":"2026-04-02T02:00:08.535Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ci-cd","docker","embeddings","huggingface","langchain","python","rag","rag-chatbot","streamlit","uv"],"created_at":"2025-09-14T04:46:52.238Z","updated_at":"2026-04-02T03:11:49.321Z","avatar_url":"https://github.com/abrahamkoloboe27.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Naive RAG YouTube\n\nA Retrieval-Augmented Generation (RAG) application that allows you to chat with the 
content of YouTube videos using AI.\n\n## 🎯 Overview\n\nThis project enables you to:\n1. Enter a YouTube video URL\n2. Automatically transcribe and index the video content\n3. Ask questions about the video in natural language\n4. Get AI-generated answers based on the actual video content\n\nIt combines several technologies:\n- YouTube Transcript API for video transcription\n- Sentence Transformers for text embedding\n- Qdrant for vector storage and similarity search\n- Groq for fast LLM inference\n- Streamlit for the web interface\n\n## 🏗️ Architecture\n\n```\n┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐\n│   YouTube URL   │───▶│  Transcript API  │───▶│  Text Chunks     │\n└─────────────────┘    └──────────────────┘    └──────────────────┘\n                                                              │\n                                                              ▼\n┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐\n│  User Question  │───▶│  Embedding Model │───▶│  Similarity      │\n└─────────────────┘    └──────────────────┘    │    Search        │\n                                                └──────────────────┘\n                                                              │\n                                                              ▼\n┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐\n│   Qdrant DB     │───▶│ Relevant Chunks  │───▶│  Groq LLM        │\n└─────────────────┘    └──────────────────┘    └──────────────────┘\n                                                              │\n                                                              ▼\n                                                ┌──────────────────┐\n                                                │   AI Answer      │\n                                                └──────────────────┘\n```\n\n## 📁 Project Structure\n\n```\nnaive-rag/\n├── main.py                 # Main CLI entry point\n├── streamlit_app.py        # Streamlit 
web interface\n├── src/\n│   ├── youtube.py          # YouTube URL handling and transcription\n│   ├── embedding.py        # Text chunking and embedding\n│   ├── qdrant.py           # Vector database operations\n│   ├── retrieve.py         # Similarity search in Qdrant\n│   ├── query.py            # LLM query generation\n│   ├── grok.py             # Groq API client\n│   ├── prompt.py           # Prompt templates\n│   └── loggings.py         # Logging configuration\n├── downloads/              # Temporary storage for transcripts\n└── requirements.txt        # Python dependencies\n```\n\n## 🚀 Getting Started\n\n### Prerequisites\n\n- Python 3.10+\n- A Groq API key (free at [groq.com](https://groq.com))\n- A Qdrant Cloud account (free tier available at [qdrant.tech](https://qdrant.tech))\n\n### Installation\n\n1. **Clone the repository:**\n   ```bash\n   git clone https://github.com/yourusername/naive-rag-youtube.git\n   cd naive-rag-youtube\n   ```\n\n2. **Create a virtual environment:**\n   ```bash\n   python -m venv venv\n   source venv/bin/activate  # On Windows: venv\\Scripts\\activate\n   ```\n\n3. **Install dependencies:**\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n4. **Set up environment variables:**\n   Create a `.env` file in the project root:\n   ```env\n   # Required\n   GROQ_API_KEY=your_groq_api_key_here\n   QDRANT_URL=your_qdrant_cluster_url\n   QDRANT_API_KEY=your_qdrant_api_key\n   \n   # Optional (for MongoDB logging and conversation storage)\n   MONGO_DB_URI_RAG=your_mongodb_uri\n   MONGO_DB_NAME_RAG=your_mongodb_database_name\n   ```\n\n### Running the Application\n\n#### CLI Version\n```bash\npython main.py\n```\n\n#### Web Interface (Streamlit)\n```bash\nstreamlit run streamlit_app.py\n```\n\n## 🧠 How It Works\n\n### 1. Video Ingestion\n1. User provides a YouTube URL\n2. System extracts the video ID\n3. Transcript is fetched using `youtube-transcript-api`\n4. Text is split into chunks (700 chars with 100 overlap)\n5. 
Each chunk is embedded using `sentence-transformers/all-mpnet-base-v2`\n6. Embeddings + metadata are stored in Qdrant\n\n### 2. Question Answering\n1. User asks a question in the chat interface\n2. Question is embedded using the same model\n3. Similarity search finds top 5 relevant chunks in Qdrant\n4. Chunks + conversation history are sent to Groq LLM\n5. LLM generates a contextualized answer\n6. Answer is displayed to the user\n\n### 3. Key Features\n- **Automatic Ingestion**: Videos are processed on first query\n- **Conversation Context**: Maintains chat history for coherent responses\n- **Model Selection**: Choose between different Groq models\n- **Generation Parameters**: Adjustable temperature and max tokens\n- **Multi-language Support**: Handles videos in different languages\n\n## ⚙️ Configuration\n\n### Environment Variables\n\n**Required:**\n- `GROQ_API_KEY`: Your Groq API key for LLM access\n- `QDRANT_URL`: Your Qdrant cluster URL\n- `QDRANT_API_KEY`: Your Qdrant API key\n\n**Optional (for MongoDB logging and conversation storage):**\n- `MONGO_DB_URI_RAG`: MongoDB connection URI (e.g., `mongodb+srv://user:pass@cluster.mongodb.net/`)\n- `MONGO_DB_NAME_RAG`: MongoDB database name for logs and conversations\n\n### Available Models\n- `openai/gpt-oss-120b`\n- `openai/gpt-oss-20b`\n- `qwen/qwen3-32b`\n\n### Adjustable Parameters\n- **Temperature**: Controls randomness (0.0 = deterministic, 1.0 = creative)\n- **Max Tokens**: Maximum length of generated response\n\n## 🛠️ Development\n\n### Main Modules\n\n#### `src/youtube.py`\n- Extracts video ID from YouTube URLs\n- Saves transcripts to text files\n\n#### `src/embedding.py`\n- Splits text into manageable chunks\n- Generates embeddings using Sentence Transformers\n- Stores embeddings in Qdrant\n\n#### `src/qdrant.py`\n- Manages connection to Qdrant vector database\n- Creates collections and indexes\n- Handles upsert and search operations\n\n#### `src/retrieve.py`\n- Performs similarity search in Qdrant\n- 
Filters results by video ID\n\n#### `src/query.py`\n- Formats prompts for the LLM\n- Calls Groq API to generate responses\n\n#### `src/grok.py`\n- Wrapper for Groq API client\n- Handles LLM inference\n\n#### `src/prompt.py`\n- Centralized prompt templates\n- Structured prompts for better responses\n\n### Logging\n\nAll modules use structured logging with support for:\n- **File logging**: Logs are written to `./logs/*.log` files\n- **Console logging**: Logs are displayed in the console\n- **MongoDB logging** (optional): Logs can be automatically sent to MongoDB when configured\n\nTo enable MongoDB logging, set the `MONGO_DB_URI_RAG` and `MONGO_DB_NAME_RAG` environment variables.\n\nFor detailed information on MongoDB logging configuration, see [MONGODB_LOGGING.md](MONGODB_LOGGING.md).\n\n## 🤝 Contributing\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/AmazingFeature`)\n3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)\n4. Push to the branch (`git push origin feature/AmazingFeature`)\n5. 
Open a Pull Request\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 🙏 Acknowledgments\n\n- [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api) for easy transcript retrieval\n- [Sentence Transformers](https://www.sbert.net/) for powerful embeddings\n- [Qdrant](https://qdrant.tech/) for the excellent vector database\n- [Groq](https://groq.com/) for blazing-fast LLM inference\n- [Streamlit](https://streamlit.io/) for the simple web framework\n\n## 🚨 Limitations\n\n- Only works with videos that have transcripts available\n- Performance depends on the quality of the original transcript\n- Free tiers of Qdrant and Groq have usage limits\n- Large videos may take time to process initially\n\n## 🔒 Privacy\n\n- Video content is processed locally for transcription\n- Only text chunks and embeddings are stored in Qdrant\n- No personal data is collected or stored by the application\n- API keys are stored locally in a `.env` file\n\n## ⚠️ Deployment Limitations\n\n### YouTube IP Blocking\nWhen the application is deployed on cloud platforms (Streamlit Cloud, Render, etc.), YouTube often blocks transcript requests because it restricts cloud server IPs.\n\n**This is not a bug in the application but a limitation imposed by YouTube.**\n\n**Workarounds:**\n- Use videos that have manually added subtitles (more likely to be accessible)\n- Run the application locally on your machine\n- Consider alternative data sources for production deployments","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabrahamkoloboe27%2Fyoutube-transcript-rag-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabrahamkoloboe27%2Fyoutube-transcript-rag-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabrahamkoloboe27%2Fyoutube-transcript-rag-project/lists"}