https://github.com/sayamalt/content-engine

Successfully designed and developed a system which analyzes and compares multiple PDF documents, specifically identifying and highlighting their differences.

chromadb content-engineering langchain-python llm rag

Host: GitHub
URL: https://github.com/sayamalt/content-engine
Owner: SayamAlt
Created: 2024-06-30T15:12:58.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-06-30T15:25:11.000Z (almost 2 years ago)
Last Synced: 2025-09-10T12:21:47.130Z (9 months ago)
Topics: chromadb, content-engineering, langchain-python, llm, rag
Language: Jupyter Notebook
Homepage:
Size: 11.8 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

## Content Engine Documentation

### Overview

The Content Engine analyzes and compares multiple PDF documents using Retrieval-Augmented Generation (RAG). It combines a backend framework, a vector store, an embedding model, and a locally hosted large language model (LLM) with a Streamlit frontend for user interaction.

### 1. Setup
#### a) Backend Framework

LangChain
A toolkit for building LLM applications, with a focus on retrieval-augmented generation.
Installation:
pip install langchain

#### b) Frontend Framework

Streamlit
An open-source app framework for creating interactive web applications.
Installation:
pip install streamlit

#### c) Vector Store

ChromaDB
Chosen for its efficient management and querying of embeddings.
Installation:
pip install chromadb

#### d) Embedding Model

Sentence Transformer
A local embedding model that generates embeddings from the PDF content.
Installation:
pip install sentence-transformers

#### e) Local Large Language Model (LLM)

Hugging Face Transformers
Runs a local model instance that processes retrieved content and generates insights.
Installation:
pip install transformers
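
After installing the packages above, a quick check that each one is importable can save debugging time later. The sketch below is not part of the repository; the mapping from pip names to module names is written out by hand, and `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec

# pip package name -> importable module name for the tools listed in Setup.
REQUIRED_MODULES = {
    "langchain": "langchain",
    "streamlit": "streamlit",
    "chromadb": "chromadb",
    "sentence-transformers": "sentence_transformers",
    "transformers": "transformers",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be located."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All Content Engine dependencies are importable.")
```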

### 2. Initialization

#### Data Preparation
Download and preprocess the three provided PDF documents (Alphabet Inc., Tesla Inc., Uber Technologies Inc.).

#### Parsing Documents
Use PyMuPDF or PyPDF2 to extract text and structure from PDFs.
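
As a rough illustration of this step, the sketch below reads page text with PyMuPDF and splits it into overlapping chunks. The chunk size, the overlap, and the helper names `extract_pages` and `chunk_text` are assumptions for the example, not values taken from the repository.

```python
def extract_pages(pdf_path):
    """Return the plain text of each page in the PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so chunk_text works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    text = " ".join(text.split())  # collapse whitespace left by PDF extraction
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

Character windows are the simplest option; splitting on sentences or section headings may give cleaner chunks for financial filings.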

#### Generating Vectors
Utilize Sentence Transformer to create embeddings for document content.
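
A minimal sketch of the embedding step follows. The model name `all-MiniLM-L6-v2` is an assumption (the README does not say which Sentence Transformer model is used), and `load_embedder` and `embed_chunks` are hypothetical helper names.

```python
def load_embedder(model_name="all-MiniLM-L6-v2"):
    """Load a Sentence Transformer model (model name is an assumed default)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    return SentenceTransformer(model_name)


def embed_chunks(model, chunks, batch_size=32):
    """Encode text chunks and return the vectors as plain lists of floats."""
    if not chunks:
        return []
    vectors = model.encode(
        chunks,
        batch_size=batch_size,
        normalize_embeddings=True,  # unit-length vectors for cosine similarity
    )
    return [[float(x) for x in vector] for vector in vectors]
```

Converting to plain lists keeps the output independent of NumPy, which makes it easy to pass to a vector store.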

#### Storing in Vector Store
Implement functions to persist embeddings into ChromaDB vector store.
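
One way such a persistence function could look is sketched below. The storage path, the collection name, the ID scheme, and the `source` metadata field are all assumptions made for illustration; the repository may organize its records differently.

```python
def build_records(source, chunks):
    """Create stable IDs and metadata for the chunks of one document."""
    ids = [f"{source}-{i}" for i in range(len(chunks))]
    metadatas = [{"source": source, "chunk": i} for i in range(len(chunks))]
    return ids, metadatas


def open_collection(path="chroma_db", name="content_engine"):
    """Open (or create) an on-disk ChromaDB collection; names are assumed."""
    import chromadb  # lazy import

    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(name=name)


def store_document(collection, source, chunks, embeddings):
    """Write one document's chunks and embeddings; return how many were stored."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    if not chunks:
        return 0
    ids, metadatas = build_records(source, chunks)
    # upsert makes re-running the ingestion overwrite rather than duplicate
    collection.upsert(
        ids=ids,
        documents=chunks,
        embeddings=embeddings,
        metadatas=metadatas,
    )
    return len(ids)
```

Tagging each chunk with its source document is what later allows retrieval to be filtered per company.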

### 3. Development
#### Configuring Query Engine
Define retrieval tasks based on document embeddings using ChromaDB.
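
For a comparison task it can help to retrieve the top passages from each document separately, so that no single document dominates the results. The sketch below assumes every stored chunk carries a `source` metadata field naming its document; that field and the helper name `retrieve_by_source` are illustrative, not taken from the repository.

```python
def retrieve_by_source(collection, query_embedding, sources, k=3):
    """Return the top-k passages per document as {source: [passage, ...]}."""
    hits = {}
    for source in sources:
        result = collection.query(
            query_embeddings=[query_embedding],
            n_results=k,
            where={"source": source},  # restrict the search to one document
        )
        # Chroma returns one list of documents per query embedding.
        hits[source] = result["documents"][0]
    return hits
```

A call such as `retrieve_by_source(collection, vector, ["Alphabet", "Tesla", "Uber"])` would then yield one passage list per company, assuming those are the labels used at ingestion time.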

#### Integrating LLM
Set up a local instance of a Large Language Model (LLM) for contextual insights.

#### Developing Chatbot Interface
Use Streamlit to create a user-friendly interface for querying and displaying comparative insights from documents.
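
A chat-style interface along these lines could be sketched as follows. Here `answer_fn` is a hypothetical callable that takes a question and returns a tuple of the answer text and a `{source: [passage, ...]}` dictionary; the page title and widget labels are likewise placeholders.

```python
def format_hits(hits):
    """Render retrieved passages as Markdown, one group per document."""
    lines = []
    for source, passages in hits.items():
        lines.append(f"**{source}**")
        lines.extend(f"- {passage}" for passage in passages)
    return "\n".join(lines)


def main(answer_fn):
    """Streamlit chat page; answer_fn(question) -> (answer, hits)."""
    import streamlit as st  # lazy import so format_hits works without it

    st.title("Content Engine")
    if "history" not in st.session_state:
        st.session_state.history = []

    # Replay earlier turns, since Streamlit reruns the script on each input.
    for role, text in st.session_state.history:
        with st.chat_message(role):
            st.markdown(text)

    question = st.chat_input("Ask a question about the documents")
    if question:
        with st.chat_message("user"):
            st.markdown(question)
        reply, hits = answer_fn(question)
        with st.chat_message("assistant"):
            st.markdown(reply)
            with st.expander("Retrieved passages"):
                st.markdown(format_hits(hits))
        st.session_state.history.append(("user", question))
        st.session_state.history.append(("assistant", reply))
```

To use it, the script launched with `streamlit run` would call `main(...)` at the bottom with a function that wires retrieval and generation together.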

### 4. Usage

Clone the repository:

git clone https://github.com/SayamAlt/content-engine.git
cd content-engine

Install dependencies:
pip install -r requirements.txt

Run the Streamlit app:
streamlit run content_engine.py
