https://github.com/sayamalt/content-engine
A system that analyzes and compares multiple PDF documents, identifying and highlighting their differences.
- Host: GitHub
- URL: https://github.com/sayamalt/content-engine
- Owner: SayamAlt
- Created: 2024-06-30T15:12:58.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-30T15:25:11.000Z (almost 2 years ago)
- Last Synced: 2025-09-10T12:21:47.130Z (9 months ago)
- Topics: chromadb, content-engineering, langchain-python, llm, rag
- Language: Jupyter Notebook
- Homepage:
- Size: 11.8 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## Content Engine Documentation
### Overview
The Content Engine analyzes and compares multiple PDF documents using Retrieval Augmented Generation (RAG). It combines a backend framework (LangChain), a vector store (ChromaDB), an embedding model (Sentence Transformers), and a local large language model (LLM) run through Hugging Face Transformers, with a Streamlit frontend for user interaction.
### 1. Setup
#### Backend Framework
LangChain: a toolkit for building LLM applications with a focus on retrieval-augmented generation.
Installation: `pip install langchain`
#### Frontend Framework
Streamlit: an open-source app framework for creating interactive web applications.
Installation: `pip install streamlit`
#### Vector Store
ChromaDB: chosen for its efficient management and querying of embeddings.
Installation: `pip install chromadb`
#### Embedding Model
Sentence Transformers: a local embedding model that generates embeddings from the PDF content.
Installation: `pip install sentence-transformers`
#### Local Language Model (LLM)
Hugging Face Transformers: runs a local model instance that processes retrieved content and generates insights.
Installation: `pip install transformers`
### 2. Initialization
#### Data Preparation
Download and preprocess the three provided PDF documents (Alphabet Inc., Tesla Inc., Uber Technologies Inc.).
#### Parsing Documents
Use PyMuPDF or PyPDF2 to extract text and structure from PDFs.
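As a rough illustration of this step, the sketch below reads page text with PyMuPDF and splits it into overlapping chunks. The function names, the chunk size of 500 characters, and the overlap of 50 are assumptions made for this example, not values taken from the project.

```python
def extract_pdf_text(pdf_path):
    """Return the text of each page of a PDF as a list of strings."""
    import fitz  # PyMuPDF, imported lazily so chunk_text works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows (sizes are assumptions)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    text = " ".join(text.split())  # collapse whitespace left over from PDF layout
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

Calling `chunk_text(" ".join(extract_pdf_text("tesla.pdf")))` (the file name is hypothetical) would give the list of passages to embed. Character windows are only one option; splitting on sentences or sections may suit financial filings better.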
#### Generating Vectors
Use Sentence Transformers to create embeddings for the document content.
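A minimal sketch of the embedding step might look like the following. The model name `all-MiniLM-L6-v2` and both function names are assumptions; any Sentence Transformers model could be substituted.

```python
def load_encoder(model_name="all-MiniLM-L6-v2"):
    """Load a Sentence Transformers model (the model name is an assumption)."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def embed_chunks(chunks, encoder, batch_size=32):
    """Encode text chunks and return one plain list of floats per chunk."""
    if not chunks:
        return []
    vectors = encoder.encode(list(chunks), batch_size=batch_size)
    # Convert to plain Python floats so the vectors are easy to store or serialize.
    return [[float(x) for x in vector] for vector in vectors]
```

Typical use would be `embed_chunks(chunks, load_encoder())`. Passing the encoder in as an argument keeps the model loaded once and reused across all three documents.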
#### Storing in Vector Store
Implement functions that persist the embeddings in the ChromaDB vector store.
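One way to persist the vectors is sketched below, using a ChromaDB persistent client. The storage path, collection name, ID scheme, and the `company` metadata field are all assumptions chosen for illustration.

```python
def build_records(company, chunks):
    """Create stable IDs and metadata for each chunk of one company's document."""
    ids = [f"{company}-{i}" for i in range(len(chunks))]
    metadatas = [{"company": company, "chunk": i} for i in range(len(chunks))]
    return ids, metadatas


def store_chunks(company, chunks, embeddings,
                 path="chroma_store", collection_name="content_engine"):
    """Write chunks and their embeddings to an on-disk ChromaDB collection."""
    import chromadb  # imported lazily so build_records works without it

    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name=collection_name)
    ids, metadatas = build_records(company, chunks)
    # upsert rather than add, so re-running ingestion does not duplicate entries
    collection.upsert(
        ids=ids,
        documents=list(chunks),
        embeddings=embeddings,
        metadatas=metadatas,
    )
    return collection
```

Tagging every chunk with its company makes it possible to restrict a later query to a single document, which is useful when comparing the three filings side by side.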
### 3. Development
#### Configuring Query Engine
Define retrieval tasks that query ChromaDB for the stored document embeddings closest to a user question.
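A retrieval helper along these lines could serve as the query engine. It assumes each stored chunk carries a `company` metadata field, and the function names and the default of four results are illustrative choices.

```python
def flatten_hits(result):
    """Turn ChromaDB's nested query result into a flat list of hits."""
    # Each field holds one inner list per query; only one query is sent here.
    docs = result["documents"][0]
    metas = result["metadatas"][0]
    dists = result["distances"][0]
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(docs, metas, dists)
    ]


def retrieve(collection, query_embedding, company=None, k=4):
    """Return the k nearest chunks, optionally restricted to one company."""
    where = {"company": company} if company else None
    result = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        where=where,
    )
    return flatten_hits(result)
```

For a comparative question, calling `retrieve` once per company with the same query embedding gives a balanced set of passages, so no single document dominates the context.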
#### Integrating LLM
Set up a local instance of a Large Language Model (LLM) for contextual insights.
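The sketch below shows one possible way to wire a local model into the flow with the Transformers `pipeline` API. The prompt wording and function names are assumptions, and `model_name` is left to the caller because the README does not say which model is used.

```python
def build_prompt(question, contexts):
    """Build a prompt from a question and a {company: [passages]} mapping."""
    parts = ["Answer the question using only the context below.", ""]
    for company, passages in contexts.items():
        parts.append(f"[{company}]")
        parts.extend(passages)
        parts.append("")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


def load_generator(model_name):
    """Load a local text-generation pipeline for the given model."""
    from transformers import pipeline  # imported lazily; heavy dependency

    return pipeline("text-generation", model=model_name)


def answer(question, contexts, generator, max_new_tokens=256):
    """Generate an answer grounded in the retrieved passages."""
    prompt = build_prompt(question, contexts)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        return_full_text=False,  # return only the completion, not the prompt
    )
    return outputs[0]["generated_text"].strip()
```

Grouping passages under a company label in the prompt is a design choice meant to help the model attribute facts to the right document when asked to compare them.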
#### Developing Chatbot Interface
Use Streamlit to create a user-friendly interface for querying and displaying comparative insights from documents.
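A chat page could be laid out roughly as follows, using Streamlit's chat elements and session state. The function name, the page title, and the `answer_fn` callback are hypothetical; the Streamlit module is passed in as `st` so the function can be exercised without a running app.

```python
def render_chat(st, answer_fn):
    """Render a simple chat page; answer_fn maps a question to a reply string."""
    st.title("Content Engine")

    # Streamlit reruns the script on every interaction, so keep turns in session state.
    if "history" not in st.session_state:
        st.session_state["history"] = []
    history = st.session_state["history"]

    # Replay earlier turns so the conversation stays visible after a rerun.
    for role, text in history:
        with st.chat_message(role):
            st.markdown(text)

    question = st.chat_input("Ask about the documents")
    if question:
        with st.chat_message("user"):
            st.markdown(question)
        reply = answer_fn(question)
        with st.chat_message("assistant"):
            st.markdown(reply)
        history.append(("user", question))
        history.append(("assistant", reply))
```

In the app script this would be called as `render_chat(st, my_answer_fn)` after `import streamlit as st`, where `my_answer_fn` stands for whatever function embeds the question, retrieves passages, and calls the LLM.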
### 4. Usage
- Clone the repository:
  `git clone https://github.com/sayamalt/content-engine.git`
  `cd content-engine`
- Install dependencies:
  `pip install -r requirements.txt`
- Run the Streamlit app:
  `streamlit run content_engine.py`