https://github.com/shruthimohan03/content-engine

[EdTech] Content Engine - Query and Compare PDFs (Single agent RAG system)

content-engine langchain retrieval-augmented-generation


Host: GitHub
URL: https://github.com/shruthimohan03/content-engine
Owner: shruthimohan03
Created: 2024-11-24T12:01:10.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-12-07T08:51:37.000Z (over 1 year ago)
Last Synced: 2025-06-03T22:04:15.330Z (12 months ago)
Topics: content-engine, langchain, retrieval-augmented-generation
Language: Python
Homepage:
Size: 21.7 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

https://github.com/user-attachments/assets/3948a27e-4efb-498a-89e1-99aca771bc3b

**Content Engine - Query and Compare PDFs**
[ML, NLP, Information Retrieval, Document Search]
- This Streamlit application enables users to upload PDF documents and query their contents in a chatbot-like interface.
- It uses FAISS for efficient similarity search and a pre-trained embedding model to generate vector embeddings for document content.

**Features**
- Upload multiple PDF files to create a searchable document index.
- Ask questions about the content of the uploaded PDFs.
- Get relevant results ranked by their similarity to the query.
- Easily clear chat history and start fresh.
- Intuitive chatbot interface with session persistence.

**How It Works**
- Loading PDFs: PDF files are preprocessed, and their contents are converted into embeddings using a pre-trained model.
- Creating FAISS Index: The embeddings are stored in a FAISS index for fast similarity search.
- Querying: User queries are embedded and compared against the FAISS index to retrieve the most relevant content.
- Interactive Interface: Results are displayed in a chatbot-like interface, and users can reset the conversation with the "Clear Chat" button.
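
The embed, index, and query steps above can be sketched in plain Python. This is only an illustration, not the code in this repository: a word-count vector stands in for the pre-trained embedding model, and a brute-force cosine-similarity scan stands in for FAISS. The names `embed`, `cosine`, and `ToyIndex` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a pre-trained model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Brute-force vector index (stand-in for a FAISS index)."""

    def __init__(self) -> None:
        self.chunks: list[str] = []
        self.vectors: list[Counter] = []

    def add(self, chunks: list[str]) -> None:
        # Embed each chunk once at indexing time.
        for chunk in chunks:
            self.chunks.append(chunk)
            self.vectors.append(embed(chunk))

    def search(self, query: str, k: int = 2) -> list[tuple[float, str]]:
        # Embed the query, score every stored chunk, return the top k.
        q = embed(query)
        scored = [(cosine(q, v), c) for v, c in zip(self.vectors, self.chunks)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = ToyIndex()
index.add([
    "FAISS stores vectors for similarity search.",
    "Streamlit renders the chat interface.",
    "PDF pages are split into chunks before embedding.",
])
results = index.search("How are PDF chunks created?", k=2)
for score, chunk in results:
    print(f"{score:.2f}  {chunk}")
```

A real embedding model captures meaning rather than exact word overlap, and FAISS avoids scanning every vector on large collections, but the flow is the same: embed once at upload time, embed the query, and rank by similarity.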

**Retrieval-Augmented Generation (RAG)**
In this project, RAG principles are applied to allow efficient PDF querying by leveraging:
- Document Retrieval: Relevant content is retrieved using FAISS (vector similarity search) based on semantic embeddings.
- Content Display: Instead of generating new content, the system retrieves the closest matches and presents them as answers, simulating RAG-like behavior without generative modeling.
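
As a rough sketch of the retrieval-only answer step, the function below turns scored matches into a chat reply without calling any generative model. The `(score, text, source)` tuple layout, the `format_answer` name, and the `min_score` cutoff are assumptions made for illustration; the repository may shape its results differently.

```python
def format_answer(hits, min_score=0.2):
    """Build a chat reply from retrieved passages only.

    hits: list of (score, text, source) tuples in any order,
          where a higher score means a closer match (assumed layout).
    min_score: hypothetical cutoff below which a match is dropped.
    """
    kept = sorted(
        (h for h in hits if h[0] >= min_score),
        key=lambda h: h[0],
        reverse=True,
    )
    if not kept:
        return "No relevant passages found in the uploaded PDFs."
    lines = [
        f"{rank}. {text} (source: {source}, score: {score:.2f})"
        for rank, (score, text, source) in enumerate(kept, start=1)
    ]
    return "\n".join(lines)


answer = format_answer([
    (0.31, "Chunk B", "b.pdf"),
    (0.82, "Chunk A", "a.pdf"),
    (0.05, "Chunk C", "c.pdf"),
])
print(answer)
```

Because the reply is assembled from retrieved text, every line of the answer can be traced back to a specific PDF.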

**LangChain simplifies document preprocessing for RAG by:**
- Extracting content from PDFs using PyPDFLoader.
- Splitting documents into smaller, structured chunks for easier embedding and retrieval.
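
To illustrate what chunking does, here is a minimal fixed-size splitter with overlap, written in plain Python. It is a simplified stand-in, not LangChain's splitter: the README does not say which splitter or sizes the project uses, so `split_text`, `chunk_size`, and `overlap` are hypothetical names and values.

```python
def split_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 10  # 100 characters of stand-in page text
chunks = split_text(sample, chunk_size=40, overlap=10)
for i, chunk in enumerate(chunks):
    print(i, len(chunk), chunk)
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbours, which tends to help retrieval. Splitting on paragraph or sentence boundaries instead of raw character counts usually gives cleaner chunks.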

**Customization**
- Embedding Model: You can replace the default embedding model with another model supported by Hugging Face by modifying preprocess.py.
- FAISS Parameters: Tune FAISS settings for better indexing and search performance in query_engine.py.
