https://github.com/runtime-error786/context-compression-retriever
- Host: GitHub
- URL: https://github.com/runtime-error786/context-compression-retriever
- Owner: runtime-error786
- Created: 2024-08-23T20:34:24.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-08-23T20:44:48.000Z (5 months ago)
- Last Synced: 2024-08-23T22:02:47.876Z (5 months ago)
- Topics: huggingface-transformers, langchain, llama3-meta-ai
- Language: Jupyter Notebook
- Homepage:
- Size: 6.2 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Document Compressor Pipeline
## Description
The LLM-Enhanced PDF Knowledge Extractor processes and analyzes large collections of PDF documents with Large Language Models (LLMs). It splits the text of each PDF into smaller chunks, embeds the chunks with HuggingFace's sentence-transformers, and stores the embeddings in a Chroma vector database. The stored embeddings are then queried through several retrieval mechanisms, including contextual compression and filtering, to return responses relevant to the user's query. The tool is suited to document indexing, information retrieval, and knowledge management across large document sets.
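The chunk, embed, store, and query flow described above can be sketched without any third-party dependency. This is an illustration only, not the repository's code: `split_text`, `embed`, and `InMemoryVectorStore` are hypothetical names, the bag-of-words vector stands in for a sentence-transformers embedding, and the in-memory list stands in for Chroma.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size=200, overlap=50):
    """Split text into character chunks of chunk_size that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a neural model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database: a list of (chunk, vector) rows."""

    def __init__(self):
        self._rows = []

    def add(self, chunks):
        for chunk in chunks:
            self._rows.append((chunk, embed(chunk)))

    def query(self, question, k=2):
        """Return the k chunks most similar to the question."""
        q = embed(question)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


# Example input: two short "pages" standing in for parsed PDF text.
pages = [
    "Chroma stores embeddings and supports similarity search over them.",
    "The quarterly report covers revenue, costs, and hiring plans.",
]
store = InMemoryVectorStore()
for page in pages:
    store.add(split_text(page, chunk_size=80, overlap=20))

top = store.query("Where are embeddings stored?", k=1)
print(top[0])
```

The overlap between neighbouring chunks is a common design choice: a sentence cut at a chunk boundary still appears whole in one of the two chunks.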
## Features
- PDF Parsing: Automatically loads and parses multiple PDF files from a specified directory.
- Text Chunking: Splits parsed text into manageable chunks for efficient processing.
- Embeddings Generation: Uses sentence-transformers from HuggingFace to generate an embedding for each text chunk.
- Vector Store Management: Stores and retrieves embeddings using Chroma, a vector database.
- Contextual Compression: Uses contextual compression to return only the passages most relevant to a query from large document sets.
- Custom Retrieval Chains: Includes multiple retrieval components, such as LLMChainExtractor, LLMChainFilter, EmbeddingsFilter, and DocumentCompressorPipeline, for flexible query processing.
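To show what contextual compression does to retrieved text, here is a minimal sketch in plain Python. It is not the repository's code and does not call LangChain. `relevance_filter` plays the role of a filter such as EmbeddingsFilter or LLMChainFilter (drop whole documents), `sentence_extractor` plays the role of LLMChainExtractor (keep only the relevant part of each document), and `compress` chains them in the way DocumentCompressorPipeline chains its steps. All three names are hypothetical, and simple word overlap is an assumption standing in for embeddings or LLM calls.

```python
import re


def tokens(text):
    """Lowercased set of words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance_filter(docs, query, threshold=0.2):
    """Drop documents that share too few query words (stand-in for a filter step)."""
    q = tokens(query)
    kept = []
    for doc in docs:
        score = len(q & tokens(doc)) / len(q) if q else 0.0
        if score >= threshold:
            kept.append(doc)
    return kept


def sentence_extractor(docs, query):
    """Keep only sentences that share a word with the query (stand-in for an extractor)."""
    q = tokens(query)
    compressed = []
    for doc in docs:
        sentences = re.split(r"(?<=[.!?])\s+", doc)
        hits = [s for s in sentences if q & tokens(s)]
        if hits:
            compressed.append(" ".join(hits))
    return compressed


def compress(docs, query, steps):
    """Apply each compression step in order, feeding its output to the next."""
    for step in steps:
        docs = step(docs, query)
    return docs


# Example input: two chunks as a base retriever might return them.
retrieved = [
    "Chroma persists vectors on disk. The office closes at five.",
    "Lunch is served at noon. Parking is free on weekends.",
]
query = "How does Chroma store vectors?"
result = compress(retrieved, query, [relevance_filter, sentence_extractor])
print(result)
```

Running the cheap filter before the extractor mirrors a common ordering: discard irrelevant documents first so that the more expensive step (an LLM call in a real pipeline) sees fewer inputs.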