https://github.com/antoniakras/semantic-video-search
GPU-optimized semantic search on video transcripts, with benchmarking of FAISS, Pinecone, and PostgreSQL vector databases. Deployed via Docker on FORTH’s GPU infrastructure.
- Host: GitHub
- URL: https://github.com/antoniakras/semantic-video-search
- Owner: antoniakras
- Created: 2025-09-20T19:24:37.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-09-30T10:17:08.000Z (9 months ago)
- Last Synced: 2025-09-30T12:18:21.794Z (9 months ago)
- Topics: bert-embeddings, bert-fine-tuning, cuda, dokcer, embedding-models, embeddings-word2vec, faiss-vector-database, gpu-computing, huggingface-transformers, nlp-machine-learning, pgvector, pineconedb, postgresql, python, pytorch, retrieval-augmented-generation, similarity-search, vector-database, whisper-ai
- Size: 11.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
# 🎓 Bachelor Thesis: GPU-Optimized Semantic Search on Video Transcripts
## 📌 Overview
This repository documents my **Bachelor of Science in Computer Science thesis project** at **FORTH (Foundation for Research and Technology – Hellas)**.
The project implemented an **end-to-end Retrieval-Augmented Generation (RAG) pipeline**, enabling **semantic search over video transcripts** with results enriched by **timestamps and video identifiers**.
✨ Key features include:
- **RAG Architecture**: Combined video transcript retrieval with embedding-based semantic search.
- **End-to-End Workflow**: Built a **Django REST Framework** API for embedding generation, vector storage, and similarity search.
- **Semantic Video Transcript Search**: Extracted audio from videos, transcribed with **OpenAI Whisper**, and embedded with **BERT**.
- **GPU Optimization**: Leveraged **PyTorch + FAISS GPU** for high-performance similarity search and efficient memory use.
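- **Vector Database Benchmarking**: Conducted a comparative study of **FAISS, Pinecone, and PostgreSQL (pgvector)**.
- **Embedding Optimization**: Applied **normalization and quantization** to embeddings for space and time efficiency.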
- **Deployment**: Containerized the system with **Docker** for reproducibility and portability.
- **Environment**: Developed in **Python on Ubuntu Linux**, running exclusively on **FORTH’s GPU infrastructure via VPN**.
⚠️ **Note**: The source code is not publicly available due to infrastructure and legal restrictions. This repository serves as a **portfolio showcase**. Supervised access for review is available on request.
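Since the thesis code is not public, the snippet below is only an illustrative sketch of the transcription step described above, not the project's implementation. It assumes the dictionary layout that the open-source `whisper` package returns from `model.transcribe(...)`, namely a `"segments"` list whose items carry `"start"`, `"end"`, and `"text"`. The `TranscriptSegment` class, the `segments_from_whisper` helper, and the `video_id` value are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptSegment:
    """One timestamped piece of a transcript (hypothetical record type)."""
    video_id: str  # identifier of the source video (assumed naming scheme)
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds
    text: str      # transcribed text for this time span


def segments_from_whisper(result: dict, video_id: str) -> list[TranscriptSegment]:
    """Flatten a Whisper-style transcription result into searchable records."""
    records = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()
        if text:  # skip segments that contain only whitespace
            records.append(
                TranscriptSegment(video_id, float(seg["start"]), float(seg["end"]), text)
            )
    return records


# Hand-written stand-in for a real transcription result, so the sketch runs
# without a GPU, a model download, or a video file.
fake_result = {
    "text": "Welcome to the lecture. Today we cover vector search.",
    "segments": [
        {"start": 0.0, "end": 4.2, "text": " Welcome to the lecture."},
        {"start": 4.2, "end": 9.8, "text": " Today we cover vector search."},
        {"start": 9.8, "end": 10.1, "text": "   "},
    ],
}

records = segments_from_whisper(fake_result, video_id="lecture-01")
for r in records:
    print(f"{r.video_id} [{r.start:.1f}s to {r.end:.1f}s] {r.text}")
```

Keeping the timestamps and video identifier next to each text segment is what would later allow a search hit to point back to a specific moment in a specific video.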
---
## 🛠️ Tech Stack
- **Languages**: Python
- **Frameworks/Models**: PyTorch, Hugging Face Transformers (BERT), OpenAI Whisper, Django REST Framework
- **Vector Databases**: FAISS (GPU), Pinecone, PostgreSQL (pgvector)
- **Benchmarking**: FAISS GPU benchmarking tools + custom evaluation scripts
- **Deployment**: Docker, Linux Ubuntu
- **Other Tools**: NLTK (text preprocessing), MoviePy (audio extraction)
## 🚀 Features
- End-to-end **RAG pipeline** for semantic retrieval.
- Transcribes video to text with **Whisper**.
- Encodes transcripts into embeddings with **BERT**.
- Optimizes embeddings with **normalization and quantization**.
- Stores and queries embeddings in **vector databases**.
- Retrieves **most semantically similar transcript segments** with their **timeframes and source video IDs**.
- Fully **GPU-accelerated** for optimized performance.
- Exposed as a **REST API** for seamless integration.
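The retrieval step listed above can be sketched in a few lines. This is a dependency-free, brute-force illustration under stated assumptions, not the thesis code: the toy 3-dimensional vectors stand in for real BERT embeddings, and the `top_k` helper and the metadata keys (`video_id`, `start`, `end`) are hypothetical. A vector database would replace the linear scan with an index.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=3):
    """Return the k (score, metadata) pairs most similar to the query.

    `index` is a list of (embedding, metadata) pairs; this exhaustive scan
    is the baseline that an indexed vector store is meant to speed up.
    """
    scored = ((cosine(query_vec, emb), meta) for emb, meta in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy index: each entry pairs an embedding with its timeframe and video ID.
index = [
    ([1.0, 0.0, 0.0], {"video_id": "vid-a", "start": 0.0, "end": 4.5}),
    ([0.0, 1.0, 0.0], {"video_id": "vid-a", "start": 4.5, "end": 9.0}),
    ([0.9, 0.1, 0.0], {"video_id": "vid-b", "start": 12.0, "end": 17.5}),
]

hits = top_k([1.0, 0.0, 0.0], index, k=2)
for score, meta in hits:
    print(f"{meta['video_id']} {meta['start']}s to {meta['end']}s score={score:.3f}")
```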
---
## 📊 Vector Database Benchmarks & Embedding Optimization
**Embedding Optimization**:
- Applied embedding optimizations such as **L2 normalization and quantization** for space and time efficiency.
- Matched **embedding normalization strategies** with appropriate similarity search methods (inner product, cosine similarity, Euclidean distance).
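Why the normalization choice and the similarity metric need to be matched can be checked numerically. The sketch below (plain Python, with helper names chosen for this example) shows that once vectors are L2-normalized, the inner product equals cosine similarity, and squared Euclidean distance becomes `2 - 2 * cosine`, so all three metrics produce the same ranking.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [3.0, 4.0]
b = [1.0, 2.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)
cos = cosine_similarity(a, b)

# On unit vectors, inner product and cosine similarity coincide.
ip_matches = math.isclose(inner_product(a_hat, b_hat), cos, rel_tol=1e-9)

# On unit vectors, squared Euclidean distance is a monotone function of cosine.
l2_matches = math.isclose(
    squared_euclidean(a_hat, b_hat), 2 - 2 * cos, rel_tol=1e-9, abs_tol=1e-12
)

print(ip_matches, l2_matches)
```

Without normalization the raw inner product also rewards vector length, so it can rank results differently from cosine similarity. That is one reason to fix the metric together with the normalization step rather than independently.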
**Vector Database Benchmarks**:
I tested and compared FAISS, Pinecone, and PostgreSQL on:
- **Accuracy and integrity** of top-k retrieved results
- **Query speed / latency** and API response time (time cost)
- **Storage efficiency** under different embedding optimization strategies (space cost)
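As a rough illustration of how the first two criteria could be measured, the sketch below computes recall@k against an exact top-k list and the median per-query latency of a search callable. It is not the thesis's evaluation script: `recall_at_k`, `median_latency_ms`, and the `fake_search` stub are hypothetical, and the stub stands in for a real call to any of the three backends.

```python
import statistics
import time


def recall_at_k(retrieved_ids, exact_ids, k):
    """Fraction of the exact top-k results that the backend also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    overlap = set(retrieved_ids[:k]) & set(exact_ids[:k])
    return len(overlap) / k


def median_latency_ms(search_fn, queries, repeats=5):
    """Median wall-clock time per query, in milliseconds."""
    samples = []
    for query in queries:
        for _ in range(repeats):
            started = time.perf_counter()
            search_fn(query)
            samples.append((time.perf_counter() - started) * 1000.0)
    return statistics.median(samples)


def fake_search(query):
    # Stand-in for a backend query; returns ranked segment IDs.
    return [1, 2, 3, 4]


exact_top4 = [1, 2, 5, 6]  # e.g. taken from an exhaustive (exact) search
recall = recall_at_k(fake_search("any query"), exact_top4, k=4)
latency = median_latency_ms(fake_search, ["q1", "q2"], repeats=3)
print(f"recall@4={recall:.2f} median latency={latency:.4f} ms")
```

The median is used instead of the mean so that a single slow call, such as a cold start or a network hiccup on a hosted service, does not dominate the reported figure.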
**Each benchmark combined**:
- Embedding normalization choice.
- Appropriate similarity search metric.
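Quantization is one of the embedding choices that affects the storage side of such a comparison. The sketch below shows symmetric int8 scalar quantization in plain Python. It is one possible scheme picked for illustration, since the source does not say which quantization method was used. The 768-dimension figure assumes a BERT-base-sized embedding, which the source also does not specify.

```python
def quantize_int8(vec):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


def bytes_per_vector(dim, bytes_per_component):
    """Raw payload size, ignoring index overhead and the stored scale."""
    return dim * bytes_per_component


vec = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(vec)
restored = dequantize_int8(codes, scale)
max_error = max(abs(x - y) for x, y in zip(vec, restored))

dim = 768  # assumption: BERT-base hidden size
float32_bytes = bytes_per_vector(dim, 4)
int8_bytes = bytes_per_vector(dim, 1)
print(f"float32: {float32_bytes} B, int8: {int8_bytes} B, max error: {max_error:.5f}")
```

The reconstruction error per component is bounded by half the scale, so the 4x reduction in payload size comes at the cost of a small, measurable loss in precision. That trade-off is the kind of thing an accuracy criterion in a benchmark would capture.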
---
## 📸 Screenshots
**Workflow**

---
## About This Project
- **Type**: Bachelor Thesis Project
- **Institution**: University of Crete, Computer Science Department
- **Research Infrastructure**: The Foundation for Research and Technology – Hellas (FORTH)