https://github.com/anquetos/gcp-professional-data-engineer-rag

Build a local RAG (Retrieval Augmented Generation) to generate exam questions for the Google Cloud Platform Professional Data Engineer certification.

embeddings huggingface langchain pdf-extraction pdfplumber rag sentence-transformers tokenizer vector-search


Host: GitHub
URL: https://github.com/anquetos/gcp-professional-data-engineer-rag
Owner: anquetos
Created: 2024-10-30T13:33:24.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-02-06T14:21:27.000Z (8 months ago)
Last Synced: 2025-02-06T14:27:26.466Z (8 months ago)
Topics: embeddings, huggingface, langchain, pdf-extraction, pdfplumber, rag, sentence-transformers, tokenizer, vector-search
Language: Jupyter Notebook
Homepage:
Size: 285 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# Local RAG for GCP Professional Data Engineer certification

This project aims to build a local RAG that helps with preparing for the Google Cloud Professional Data Engineer certification by generating exam questions.

Several topics are covered on the way to building this RAG:
* extracting content from a PDF file;
* embedding text;
* generating output based on retrieved context;
* creating custom prompt templates;
* building a user interface.
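
The retrieval and prompt-augmentation steps listed above can be sketched as follows. This is only an illustration, not the code in `src/`: the function names `retrieve` and `augment_prompt`, the prompt template text, and the three-dimensional toy vectors are all assumptions made for the example. In the real project the vectors would come from an embedding model rather than being written by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunks, k=2):
    """Return the k chunks whose embeddings are closest to the query (hypothetical helper)."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical template; the project keeps its own templates in YAML files.
PROMPT_TEMPLATE = "Context:\n{context}\n\nTask: {task}"


def augment_prompt(task, retrieved):
    """Insert the retrieved chunks into the prompt as context (hypothetical helper)."""
    context = "\n".join(f"- {c['text']}" for c in retrieved)
    return PROMPT_TEMPLATE.format(context=context, task=task)


# Toy data: hand-written vectors stand in for real embeddings.
chunks = [
    {"text": "BigQuery is a serverless data warehouse.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Pub/Sub is a messaging service.", "embedding": [0.1, 0.9, 0.0]},
    {"text": "Dataflow runs Apache Beam pipelines.", "embedding": [0.2, 0.3, 0.9]},
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a question about BigQuery

top = retrieve(query_vec, chunks, k=2)
prompt = augment_prompt("Write one exam question.", top)
print(prompt)
```

The resulting prompt would then be passed to a language model, which generates the exam question from the retrieved context only.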

> **Note**
> 🙏 This project would not have been possible without the great video tutorial [Local Retrieval Augmented Generation (RAG) from Scratch](https://youtu.be/qN_2fnOPY-M?si=9dsfcNGMjgQhF8Bs) by [Daniel Bourke](https://www.mrdbourke.com/).

## Project structure

```
.
├── .gitignore
├── README.md
├── notebooks
│   └── rag-building-discovery.ipynb
├── notes.md
├── pdf
│   └── source.pdf
├── requirements.txt
├── src
│   ├── __init__.py
│   ├── generation
│   │   ├── __init__.py
│   │   ├── augment_prompt.py
│   │   ├── generation_pipeline.py
│   │   ├── load_model.py
│   │   └── text_retriever.py
│   ├── helpers
│   │   └── timing_functions.py
│   └── preprocessing
│       ├── pdf_extractor.py
│       ├── preprocessing_pipeline.py
│       └── text_embedder.py
└── templates
    ├── generate_exam_question.yaml
    └── question_answer.yaml
```
