https://github.com/sayamalt/content-engine

A system that analyzes and compares multiple PDF documents, identifying and highlighting their differences.

chromadb content-engineering langchain-python llm rag


## Content Engine Documentation

### Overview

The Content Engine analyzes and compares multiple PDF documents using Retrieval-Augmented Generation (RAG). It combines a backend framework (LangChain), a vector store (ChromaDB), a local embedding model, and a local large language model (LLM), with a Streamlit frontend for user interaction.

### 1. Setup
#### Components

  1. Backend Framework
    LangChain
    A toolkit for building LLM applications, with a focus on retrieval-augmented generation.
    Installation:
    pip install langchain

  2. Frontend Framework
    Streamlit
    An open-source app framework for creating interactive web applications.
    Installation:
    pip install streamlit

  3. Vector Store
    ChromaDB
    Chosen for its efficient management and querying of embeddings.
    Installation:
    pip install chromadb

  4. Embedding Model
    Sentence Transformers
    A local embedding model that generates embeddings from the PDF content.
    Installation:
    pip install sentence-transformers

  5. Local Large Language Model (LLM)
    Hugging Face Transformers
    Runs a local model instance to process retrieved content and generate insights.
    Installation:
    pip install transformers
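
After installing, it can help to confirm that every component is importable before going further. The following is a minimal sketch, not part of the original project: the `REQUIRED_MODULES` mapping and the `missing_packages` helper are illustrative names, and the mapping only covers the packages listed above.

```python
from importlib.util import find_spec

# pip name -> import name. The two can differ
# (sentence-transformers is imported as sentence_transformers).
REQUIRED_MODULES = {
    "langchain": "langchain",
    "streamlit": "streamlit",
    "chromadb": "chromadb",
    "sentence-transformers": "sentence_transformers",
    "transformers": "transformers",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be located."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All Content Engine dependencies are importable.")
```

The check only looks the modules up and does not import them, so it stays fast even though some of these libraries are slow to load.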

### 2. Initialization

#### Data Preparation
Download and preprocess the three provided PDF documents (Alphabet Inc., Tesla Inc., Uber Technologies Inc.).

#### Parsing Documents
Use PyMuPDF or PyPDF2 to extract text and structure from PDFs.
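
As an illustration of the parsing step, here is a minimal sketch that uses PyMuPDF (imported as `fitz`) to pull plain text from each page. The function names `normalize_whitespace` and `extract_pages` are hypothetical, and the whitespace cleanup is an assumed preprocessing choice rather than something the project prescribes.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs, and reduce blank-line runs to one."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()


def extract_pages(pdf_path: str) -> list[str]:
    """Return the cleaned text of each page of a PDF, in page order."""
    # Imported lazily so normalize_whitespace works without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [normalize_whitespace(page.get_text()) for page in doc]
```

Keeping the text per page, instead of joining it into one string, makes it possible to record a page number alongside each chunk later on.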

#### Generating Vectors
Use a Sentence Transformers model to create embeddings for the document content.
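
Embedding models work on short passages, so the extracted text is usually split into chunks first. The sketch below assumes word-based windows with overlap and the `all-MiniLM-L6-v2` checkpoint. The README does not specify the chunk size, the overlap, or the model name, so treat all three as placeholders.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def embed_chunks(chunks: list[str], model_name: str = "all-MiniLM-L6-v2"):
    """Encode chunks into dense vectors, returned as plain lists of floats."""
    # Imported lazily so chunk_text can be used without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # model name is an assumption
    return model.encode(chunks).tolist()
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one of the two neighbouring chunks.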

#### Storing in Vector Store
Implement functions to persist the embeddings in the ChromaDB vector store.
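
One possible shape for the persistence step is sketched below. The directory `chroma_store`, the collection name `content_engine`, and the `company` metadata key are assumptions made for the example. The ChromaDB calls used are `PersistentClient`, `get_or_create_collection`, and `upsert`.

```python
def build_records(company: str, chunks: list[str]):
    """Create stable ids and metadata so each chunk can be traced to its source."""
    ids = [f"{company}-{i}" for i in range(len(chunks))]
    metadatas = [{"company": company, "chunk": i} for i in range(len(chunks))]
    return ids, metadatas


def store_chunks(company, chunks, embeddings,
                 persist_dir="chroma_store", collection_name="content_engine"):
    """Persist chunks and their embeddings to an on-disk ChromaDB collection."""
    import chromadb  # imported lazily; only this function needs it

    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection(name=collection_name)
    ids, metadatas = build_records(company, chunks)
    # upsert rather than add, so re-ingesting a document overwrites its chunks
    collection.upsert(ids=ids, documents=chunks,
                      embeddings=embeddings, metadatas=metadatas)
    return collection
```

Storing the company name as metadata is what later allows a query to be restricted to a single document.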

### 3. Development
#### Configuring Query Engine
Define retrieval tasks based on document embeddings using ChromaDB.
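
A retrieval task here can be as simple as a nearest-neighbour query against the collection, optionally filtered by document. The sketch below assumes chunks were stored with a `company` metadata key. `flatten_hits` and `retrieve` are hypothetical helper names.

```python
def flatten_hits(result: dict) -> list[dict]:
    """Turn a single-query ChromaDB result into a list of hit dictionaries."""
    # ChromaDB returns one inner list per query; this sketch sends one query.
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]
    distances = result["distances"][0]
    return [{"text": doc, "company": meta.get("company"), "distance": dist}
            for doc, meta, dist in zip(documents, metadatas, distances)]


def retrieve(collection, query_embedding, company=None, k=4):
    """Return the k nearest chunks, optionally restricted to one company."""
    where = {"company": company} if company else None
    result = collection.query(query_embeddings=[query_embedding],
                              n_results=k, where=where)
    return flatten_hits(result)
```

For comparative questions, calling `retrieve` once per company gives each document an equal share of the context, whereas a single unfiltered query may return chunks from only one of them.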

#### Integrating LLM
Set up a local instance of a Large Language Model (LLM) for contextual insights.
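
A rough sketch of the generation step follows, using the Hugging Face `pipeline` API with the `text-generation` task. The prompt wording is an assumption. `model_name` is deliberately left as a required argument because the README does not say which local model is used.

```python
def build_prompt(question: str, contexts: dict[str, list[str]]) -> str:
    """Assemble a comparison prompt from retrieved passages grouped by company."""
    sections = []
    for company, passages in contexts.items():
        sections.append(f"[{company}]\n" + "\n".join(passages))
    context_block = "\n\n".join(sections)
    return ("Answer the question using only the context below.\n\n"
            f"{context_block}\n\nQuestion: {question}\nAnswer:")


def generate_answer(prompt: str, model_name: str, max_new_tokens: int = 256) -> str:
    """Run a locally available causal language model on the prompt."""
    from transformers import pipeline  # imported lazily; heavy dependency

    generator = pipeline("text-generation", model=model_name)
    # return_full_text=False drops the prompt from the returned string
    output = generator(prompt, max_new_tokens=max_new_tokens,
                       return_full_text=False)
    return output[0]["generated_text"].strip()
```

In a real app the pipeline would be created once and reused, since loading a model on every question is slow.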

#### Developing Chatbot Interface
Use Streamlit to create a user-friendly interface for querying and displaying comparative insights from documents.
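
The interface could look roughly like the sketch below, built on Streamlit's chat elements (`st.chat_input`, `st.chat_message`) and `st.session_state`. The `answer_fn` callback is a hypothetical stand-in for the retrieval and generation steps. It is assumed to return an `(answer, sources)` pair.

```python
def format_answer(answer: str, sources: list[str]) -> str:
    """Append a de-duplicated, sorted list of source documents to the answer."""
    if not sources:
        return answer
    return answer + "\n\nSources: " + ", ".join(sorted(set(sources)))


def main(answer_fn):
    """Chat UI; answer_fn(question) is assumed to return (answer, sources)."""
    import streamlit as st  # imported lazily so format_answer stays standalone

    st.title("Content Engine")
    if "history" not in st.session_state:
        st.session_state.history = []  # list of (role, text) pairs

    # Streamlit reruns the script on every interaction, so replay the chat.
    for role, text in st.session_state.history:
        with st.chat_message(role):
            st.markdown(text)

    question = st.chat_input("Ask about Alphabet, Tesla, or Uber")
    if question:
        answer, sources = answer_fn(question)
        reply = format_answer(answer, sources)
        new_turns = [("user", question), ("assistant", reply)]
        st.session_state.history.extend(new_turns)
        for role, text in new_turns:
            with st.chat_message(role):
                st.markdown(text)
```

To use it, the app script would end with a call such as `main(my_answer_fn)`, where `my_answer_fn` is whatever function wires retrieval and the LLM together.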

### 4. Usage


  • Clone the repository:

    git clone https://github.com/sayamalt/content-engine.git
    cd content-engine


  • Install dependencies:
    pip install -r requirements.txt

  • Run the Streamlit app:
    streamlit run content_engine.py