https://github.com/wei/mlh-gemini-rag-workshop
Build a RAG-powered doc assistant with Gemini File Search
https://github.com/wei/mlh-gemini-rag-workshop
Last synced: 4 months ago
JSON representation
Build a RAG-powered doc assistant with Gemini File Search
- Host: GitHub
- URL: https://github.com/wei/mlh-gemini-rag-workshop
- Owner: wei
- Created: 2026-02-16T19:02:41.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2026-02-17T00:22:27.000Z (4 months ago)
- Last Synced: 2026-02-17T06:47:43.468Z (4 months ago)
- Language: Python
- Size: 1.19 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# MLH Gemini RAG Workshop
Build a RAG-powered doc assistant with [Gemini File Search](https://ai.google.dev/gemini-api/docs/file-search) for [MLH Global Hack Week](https://ghw.mlh.io/).
## What You'll Build
A chatbot that can answer questions about MLH's documentation using **Retrieval-Augmented Generation (RAG)**. Upload docs, ask questions, get accurate answers with citations — no hallucinations.
The same technique works with **any** documentation. Swap in your own docs after the workshop!
## Workshop: Python + Streamlit
| | Details |
|---|---|
| **Folder** | [`python/`](./python/) |
| **Stack** | Python 3.10+, Streamlit, `google-genai` SDK |
| **UI** | Streamlit chat components (chat interface in 10 lines!) |
| **Best For** | Learning RAG concepts deeply, rapid prototyping |
| **Deploy** | `streamlit run app.py` (or Streamlit Cloud for free hosting) |
## Demo Corpus: MLH Resources
The workshop uses these MLH docs as the knowledge base:
- [MLH Hackathon Organizer Guide](https://github.com/MLH/mlh-hackathon-organizer-guide) (organizer guide chapters, timeline, sponsorship, judging, community values)
- [MLH Policies](https://github.com/MLH/mlh-policies/tree/main) (code of conduct, community values)
- [MLH Hack Days Organizer Guide](https://github.com/MLH/mlh-hack-days-organizer-guide)
## What is RAG?
**Retrieval-Augmented Generation** solves a core LLM problem: models don't know about your private docs and can hallucinate when asked about them.
RAG works in 3 steps:
1. **Index** — Split your docs into chunks, convert to embeddings (numerical vectors), store in a vector database
2. **Retrieve** — When a user asks a question, find the most relevant chunks via semantic search
3. **Generate** — Feed the retrieved chunks to the LLM as context, generating an accurate, grounded answer
**Gemini File Search** handles all of this automatically — no external vector database, no chunking logic, no embedding pipeline. Upload files, query, done.
## Prerequisites
- A [Google AI Studio](https://aistudio.google.com/) account and API key (free)
- Python 3.10+, pip
- No AI/ML experience required!
## Getting Started
```bash
# Clone the repo
git clone https://github.com/wei/mlh-gemini-rag-workshop.git
cd mlh-gemini-rag-workshop/python
# Follow the STREAM_PLAN.md or docs/WORKSHOP.md
```
## Quick Demo
```bash
cd python/finished/
pip install -r requirements.txt
cp .env.example .env
# Add your GEMINI_API_KEY to .env
python setup_store.py
streamlit run app.py
```
## License
MIT