https://github.com/anquetos/gcp-professional-data-engineer-rag
Build a local RAG (Retrieval Augmented Generation) to generate exam questions for the Google Cloud Platform professional Data Engineer certification.
embeddings huggingface langchain pdf-extraction pdfplumber rag sentence-transformers tokenizer vector-search
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/anquetos/gcp-professional-data-engineer-rag
- Owner: anquetos
- Created: 2024-10-30T13:33:24.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-02-06T14:21:27.000Z (8 months ago)
- Last Synced: 2025-02-06T14:27:26.466Z (8 months ago)
- Topics: embeddings, huggingface, langchain, pdf-extraction, pdfplumber, rag, sentence-transformers, tokenizer, vector-search
- Language: Jupyter Notebook
- Homepage:
- Size: 285 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Local RAG for GCP Professional Data Engineer certification
This project aims to build a local RAG (Retrieval Augmented Generation) pipeline that helps with preparing for the Google Cloud Professional Data Engineer certification by generating exam questions.
Several topics are covered on the way to building this RAG:
* extracting content from a PDF file;
* embedding text;
* generating output based on retrieved context;
* creating custom prompt templates;
* building a user interface.

> **Note**
> 🙏 This project wouldn't have been possible without the great video tutorial ([Local Retrieval Augmented Generation (RAG) from Scratch](https://youtu.be/qN_2fnOPY-M?si=9dsfcNGMjgQhF8Bs)) from [Daniel Bourke](https://www.mrdbourke.com/).

## Project structure
```
.
├── .gitignore
├── README.md
├── notebooks
│   └── rag-building-discovery.ipynb
├── notes.md
│   └── source.pdf
├── requirements.txt
├── src
│   ├── __init__.py
│   ├── generation
│   │   ├── __init__.py
│   │   ├── augment_prompt.py
│   │   ├── generation_pipeline.py
│   │   ├── load_model.py
│   │   └── text_retriever.py
│   ├── helpers
│   │   └── timing_functions.py
│   └── preprocessing
│       ├── pdf_extractor.py
│       ├── preprocessing_pipeline.py
│       └── text_embedder.py
└── templates
    ├── generate_exam_question.yaml
    └── question_answer.yaml
```
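
To give a rough idea of how the retrieval and prompt-augmentation steps fit together (the parts suggested by `text_retriever.py` and `augment_prompt.py`), here is a minimal, dependency-free sketch. It is not the code from this repository: the function names `embed`, `retrieve` and `augment_prompt`, the prompt wording and the sample chunks are all hypothetical, and the bag-of-words "embedding" is only a toy stand-in for a real sentence-transformers model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def augment_prompt(topic: str, context: list[str]) -> str:
    """Build a prompt that places the retrieved context before the request."""
    context_block = "\n".join(f"- {chunk}" for chunk in context)
    return (
        "Use the context below to write one exam question.\n"
        f"Context:\n{context_block}\n"
        f"Topic: {topic}"
    )


# Made-up sample chunks; in the real project these would come from the PDF.
chunks = [
    "BigQuery is a serverless data warehouse.",
    "Pub/Sub delivers messages between services.",
    "Dataflow runs Apache Beam pipelines.",
]
top = retrieve("What is the BigQuery data warehouse?", chunks, top_k=1)
prompt = augment_prompt("BigQuery", top)
print(prompt)
```

In a real pipeline, the prompt built this way would then be passed to the locally loaded language model to generate the exam question.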