Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/centralfloridaattorney/zmongo_retriever
Use data from MongoDB in LangChain, Llama and OpenAI
- Host: GitHub
- URL: https://github.com/centralfloridaattorney/zmongo_retriever
- Owner: CentralFloridaAttorney
- License: mit
- Created: 2024-03-06T23:04:49.000Z (9 months ago)
- Default Branch: master
- Last Pushed: 2024-03-31T16:19:34.000Z (8 months ago)
- Last Synced: 2024-10-10T18:41:58.783Z (about 1 month ago)
- Topics: data-chunking, data-retrieval, database, document-processing, langchain, llamacpp, machine-learning, mongo, mongodb, openai, python
- Language: Python
- Homepage: https://CentralFloridaAttorney.net
- Size: 352 KB
- Stars: 4
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# ZMongoRetriever
`ZMongoRetriever` is a Python library for retrieving, processing, and encoding documents from MongoDB collections. It is especially suited to large documents that must be chunked and embedded before they are passed to machine learning applications. It supports document splitting, direct integration with MongoDB databases, and encoding with OpenAI embedding models (the encoding path is marked as not fully implemented under Advanced Usage).
## Features
- **Document Retrieval:** Seamlessly fetch documents from MongoDB collections.
- **Dynamic Chunking:** Split documents into manageable chunks based on character count or embedding size.
- **Embedding Support:** Encode document chunks using OpenAI's embedding models for deep learning tasks.
- **Flexible Configuration:** Customize chunk sizes, token overlaps, and database connections to fit your project needs.
- **Metadata Conversion:** Convert JSON to structured metadata for enhanced document handling.

## Installation
Before you begin, ensure you have MongoDB and Python 3.6+ installed on your system. Clone this repository or download the `ZMongoRetriever` module directly. Dependencies can be installed via pip:
```bash
pip install -r requirements.txt
```

## Environment Variable File
You must have a file named '.env' with the appropriate values for the following:
```text
OPENAI_API_KEY=___
```

## Quick Start
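Before running the steps in this section, the `OPENAI_API_KEY` value from `.env` has to reach the process environment. How `ZMongoRetriever` loads the file is not shown in this README (the `python-dotenv` package is a common choice), so the following is only a standard-library sketch of the idea. `parse_env_text` and `load_env_file` are hypothetical helper names, and `sk-example` is a placeholder, not a real key.

```python
import os
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments (hypothetical helper)."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Read a .env file and copy its values into os.environ without overwriting existing ones."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env_text(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


# Placeholder content standing in for a real .env file.
example = parse_env_text("# local secrets\nOPENAI_API_KEY=sk-example\n")
```

Calling `load_env_file()` once at startup, before any OpenAI client is created, would be enough in this sketch.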
To get started with `ZMongoRetriever`, follow these steps:
1. **Initialize MongoDB Connection:**
```python
from pymongo import MongoClient
from zconstants import MONGO_URI

client = MongoClient(MONGO_URI)
```

2. **Create an Instance of ZMongoRetriever:**
```python
from zmongo_retriever import ZMongoRetriever

retriever = ZMongoRetriever(mongo_uri=MONGO_URI, db_name='your_database', collection_name='your_collection')
```

3. **Retrieve and Process Documents:**
```python
object_ids = ["65f28c8103fc21342e2dc04d", "65f28c8403fc21342e2dc064"]
documents = retriever.invoke(object_ids=object_ids, page_content_key='report.details.content')
```

## Advanced Usage
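The `page_content_key` passed to `invoke` in the Quick Start (`'report.details.content'`) reads as a dot-separated path into a nested document. The following is a minimal sketch of how such a path could be resolved against a plain dictionary. `get_by_dotted_key` and `sample_document` are hypothetical names used for illustration, and the code is not taken from the `ZMongoRetriever` source.

```python
from typing import Any


def get_by_dotted_key(document: dict, dotted_key: str, default: Any = None) -> Any:
    """Walk a nested dict along a dot-separated path (hypothetical helper)."""
    current: Any = document
    for part in dotted_key.split("."):
        # Stop early if the path leaves the nested-dict structure.
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


# Made-up document shaped like the key used in the Quick Start.
sample_document = {"report": {"details": {"content": "Full text of the report."}}}

content = get_by_dotted_key(sample_document, "report.details.content")
missing = get_by_dotted_key(sample_document, "report.summary.content")
```

Returning a default for a missing path, rather than raising, lets a caller skip documents that lack the content field.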
### Encoding Document Chunks - Not Fully Implemented
Enable encoding to process document chunks with OpenAI's embeddings:
```python
retriever.use_encoding = True
encoded_chunks = retriever.invoke(object_ids=object_ids, page_content_key='report.details.content')
```

### Custom Chunking and Overlaps
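As a rough illustration of what the two settings in this section control, the sketch below splits text into fixed-size character chunks and then groups the chunks into lists, where each list after the first repeats the last few chunks of the previous one. This is one possible reading of the comments on `chunk_size` and `overlap_prior_chunks`, not the library's implementation. `split_by_characters`, `group_with_overlap`, and the `group_size` parameter are hypothetical.

```python
from typing import List


def split_by_characters(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def group_with_overlap(
    chunks: List[str], group_size: int, overlap_prior_chunks: int
) -> List[List[str]]:
    """Group chunks into lists; each later list starts with the last
    overlap_prior_chunks chunks of the list before it (assumed semantics)."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if not 0 <= overlap_prior_chunks < group_size:
        raise ValueError("overlap_prior_chunks must be in [0, group_size)")
    step = group_size - overlap_prior_chunks
    groups: List[List[str]] = []
    start = 0
    while start < len(chunks):
        groups.append(chunks[start:start + group_size])
        if start + group_size >= len(chunks):
            break  # the last chunk has been emitted
        start += step
    return groups


chunks = split_by_characters("abcdefghij", chunk_size=4)
groups = group_with_overlap(chunks, group_size=2, overlap_prior_chunks=1)
```

Repeating chunks across lists trades some duplicated text for context that carries over from one list to the next.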
Customize the chunk size (in characters) and the number of prior chunks repeated in each subsequent list for finer control over document processing:
```python
retriever.chunk_size = 1024 # Characters
retriever.overlap_prior_chunks = 2 # Number of chunks repeated in a subsequent Document list
```

## License
Distributed under the MIT License. See `LICENSE.md` for more information.