https://github.com/saivarun2611/rag_student
I built a RAG chatbot that helps students find Northeastern University Data Science graduate courses that match their interests. The tech stack includes FastAPI for the backend, FAISS for vector search, SentenceTransformers for embeddings, and Gemini 2.0 Flash for generating responses. The frontend is clean and responsive.
beautifulsoup faiss fastapi gemini html javascript llm rag rag-chatbot sentence-transformers vector-embeddings webscraping
- Host: GitHub
- URL: https://github.com/saivarun2611/rag_student
- Owner: Saivarun2611
- License: mit
- Created: 2025-08-25T03:42:15.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-08-25T04:29:06.000Z (10 months ago)
- Last Synced: 2025-09-25T19:55:04.810Z (9 months ago)
- Topics: beautifulsoup, faiss, fastapi, gemini, html, javascript, llm, rag, rag-chatbot, sentence-transformers, vector-embeddings, webscraping
- Language: Python
- Homepage:
- Size: 10 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# NEU DS Course Matcher
A Retrieval-Augmented Generation (RAG) chatbot that helps students discover **Northeastern University Data Science** graduate courses that fit their interests.
Built with **FastAPI**, **FAISS**, **SentenceTransformers**, **Gemini 2.0 Flash**, and a polished **HTML/CSS/JS** frontend.
---
## Demo
> Fallback download link: [demo.mp4](./demo.mp4)
---
## Screenshots


---
## Features
- **Semantic Retrieval** with FAISS (cosine similarity on normalized embeddings)
- **RAG Generation** using Gemini 2.0 Flash (grounded by retrieved context)
- **Zero-Hallucination Prompting** (answers constrained to catalog context)
- **FastAPI** endpoints: `/retrieve` and `/ask`
- **Clean Frontend** (`frontend.html`) with loader, cards, and helpful layout
---
## Project Structure
```
RAG_Student/
├── scraping.py                 # Scrapes catalog course links & descriptions
├── preprocessing.py            # Cleans text, builds processed_courses2.json
├── embeddingvectordb.py        # Builds FAISS index from processed data
├── query.py                    # CLI test of retrieval (top-k)
├── rag.py                      # LLM RAG (Gemini) using retrieved context
├── api.py                      # FastAPI server exposing /retrieve and /ask
├── frontend.html               # Standalone UI (no build tools needed)
├── data/
│   ├── processed_courses2.json # Cleaned metadata (title, number, desc, url)
│   └── course_index.faiss      # FAISS index (IP on normalized vectors)
├── .env                        # GEMINI_API_KEY=...
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```
---
## Prerequisites
- Python 3.9+ (recommended)
- A Google **Gemini API key**
- macOS/Linux/Windows
---
## Setup
### 1) Clone & enter the project
```bash
git clone https://github.com/your-username/RAG_Student.git
cd RAG_Student
```
### 2) Create & activate a virtual environment
```bash
python3 -m venv venv
# macOS/Linux
source venv/bin/activate
# Windows (PowerShell)
venv\Scripts\Activate.ps1
```
### 3) Install dependencies
```bash
pip install -r requirements.txt
```
### 4) Add your Gemini API key
Create a `.env` file in the project root:
```ini
GEMINI_API_KEY=your_api_key_here
```
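How the project reads this key is not shown here; `python-dotenv` is a common choice, but that is an assumption. As a dependency-free illustration, the sketch below parses a simple `.env` file with the standard library and falls back to it when `GEMINI_API_KEY` is not already exported. The function names `load_env_file` and `get_gemini_api_key` are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_gemini_api_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or export it")
    return key
```

Failing fast with a clear error at startup is usually easier to debug than a rejected request to the LLM later on.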
### 5) Prepare data (scrape → preprocess → index)
Run the pipeline in order (these produce files in `data/`):
```bash
python scraping.py
python preprocessing.py
python embeddingvectordb.py
```
You should now have:
- `data/processed_courses2.json`
- `data/course_index.faiss`
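
To confirm the pipeline produced both files before starting the server, a small check like the one below may help. It is a sketch, not part of the repository: `missing_artifacts` and `count_courses` are hypothetical helpers, and `count_courses` assumes `processed_courses2.json` holds a top-level JSON list.

```python
import json
from pathlib import Path

EXPECTED_FILES = ("processed_courses2.json", "course_index.faiss")


def missing_artifacts(data_dir: str = "data") -> list:
    """Return names of expected pipeline outputs that are absent or empty."""
    root = Path(data_dir)
    return [
        name for name in EXPECTED_FILES
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


def count_courses(data_dir: str = "data") -> int:
    """Count records, assuming the JSON file holds a top-level list."""
    with open(Path(data_dir) / "processed_courses2.json", encoding="utf-8") as f:
        records = json.load(f)
    return len(records)


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("Artifacts present; courses:", count_courses())
```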
---
## Run the App
### Backend (FastAPI)
```bash
uvicorn api:app --reload --port 8000
```
API base: [http://127.0.0.1:8000](http://127.0.0.1:8000)
Swagger docs: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
### Frontend (static HTML)
Open `frontend.html` directly in your browser (double-click or drag into a tab).
The page calls `http://127.0.0.1:8000/ask`. Make sure the backend is running.
---
## API Endpoints
### `GET /health`
Health check.
**Response**
```json
{ "status": "ok" }
```
### `POST /retrieve`
Retrieve top-k relevant courses (no LLM).
**Request**
```json
{
"question": "I want courses in machine learning",
"top_k": 5
}
```
**Response**
```json
{
"courses": [
{
"rank": 1,
"course_number": "CS 6140",
"title": "Machine Learning",
"description": "Provides a broad look at ...",
"url": "https://catalog.northeastern.edu/...",
"score": 0.76
}
]
}
```
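A client consuming this response could map it onto typed records. The sketch below assumes the body matches the shape shown above exactly; the `Course` dataclass and `parse_courses` helper are illustrative names, not part of the project.

```python
import json
from dataclasses import dataclass


@dataclass
class Course:
    rank: int
    course_number: str
    title: str
    description: str
    url: str
    score: float


def parse_courses(body: str) -> list:
    """Turn a /retrieve JSON body into Course objects ordered by rank."""
    payload = json.loads(body)
    # Course(**item) raises TypeError if the server adds or drops a field.
    courses = [Course(**item) for item in payload.get("courses", [])]
    return sorted(courses, key=lambda c: c.rank)


sample = """
{"courses": [{"rank": 1, "course_number": "CS 6140", "title": "Machine Learning",
  "description": "Provides a broad look at ...",
  "url": "https://catalog.northeastern.edu/...", "score": 0.76}]}
"""
for course in parse_courses(sample):
    print(f"{course.rank}. {course.course_number} {course.title} ({course.score:.2f})")
```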
### `POST /ask`
RAG: retrieve + generate a grounded answer.
**Request**
```json
{
"question": "I want to learn about machine learning and AI",
"top_k": 5,
"temperature": 0.2
}
```
**Response**
```json
{
"model": "gemini-2.0-flash",
"answer": "Here are courses that cover ML and AI...",
"courses": [ /* same shape as /retrieve */ ]
}
```
`temperature` is optional (defaults to 0.2). Lower = more deterministic.
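
The grounding step can be pictured as folding the retrieved courses into the prompt and telling the model to stay within them. The sketch below is one way this might look; the prompt wording and the `build_grounded_prompt` name are assumptions, not the contents of `rag.py`.

```python
def build_grounded_prompt(question: str, courses: list) -> str:
    """Assemble a prompt that restricts the model to the retrieved courses."""
    context_blocks = []
    for course in courses:
        context_blocks.append(
            f"[{course['course_number']}] {course['title']}\n"
            f"{course['description']}\n"
            f"URL: {course['url']}"
        )
    context = "\n\n".join(context_blocks) if context_blocks else "(no courses retrieved)"
    return (
        "You are a course advisor. Answer using ONLY the catalog entries below. "
        "If they do not cover the question, say so instead of guessing.\n\n"
        f"Catalog entries:\n{context}\n\n"
        f"Student question: {question}\n"
        "Answer:"
    )
```

Each `course` dict here uses the same keys as the `/retrieve` response, so the output of retrieval could be passed in directly.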
---
## Quick cURL Test
```bash
curl -X POST http://127.0.0.1:8000/ask \
-H "Content-Type: application/json" \
-d '{"question":"Which courses cover NLP?","top_k":5,"temperature":0.2}'
```
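The same request can be sent from Python with only the standard library. This is a minimal sketch assuming the backend is running on the default address above; `build_ask_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8000"


def build_ask_request(question: str, top_k: int = 5, temperature: float = 0.2):
    """Build the POST /ask request without sending it."""
    body = json.dumps(
        {"question": question, "top_k": top_k, "temperature": temperature}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, top_k: int = 5, temperature: float = 0.2) -> dict:
    """Send the request and decode the JSON response; needs the backend running."""
    request = build_ask_request(question, top_k, temperature)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (with the server started via uvicorn):
#   result = ask("Which courses cover NLP?")
#   print(result["answer"])
```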
---
## Notes on Retrieval
- Embeddings model: `sentence-transformers/all-MiniLM-L6-v2`
- We normalize embeddings and use FAISS `IndexFlatIP`
- Inner Product (IP) on normalized vectors = cosine similarity
---
## License
This project is licensed under the MIT License. See LICENSE for details.
---
## Author
Built by Saivarun Garimella Narasimha · Data Scientist