https://github.com/endevsols/long-trainer

Introducing LongTrainer, a sophisticated extension of the LangChain framework designed specifically for managing multiple bots and providing isolated, context-aware chat sessions. Ideal for developers and businesses looking to integrate complex conversational AI into their systems, LongTrainer simplifies the deployment and customization of LLMs.


Host: GitHub
URL: https://github.com/endevsols/long-trainer
Owner: ENDEVSOLS
License: mit
Created: 2023-12-07T16:37:26.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2024-12-17T13:17:34.000Z (over 1 year ago)
Last Synced: 2025-06-24T04:04:54.920Z (about 1 year ago)
Topics: gpt, langchain, langchain-python, llm-training, longtrainer, openai, rag
Language: Python
Homepage: https://endevsols.github.io/Long-Trainer/
Size: 842 KB
Stars: 9
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md

LongTrainer 1.2.0 — Production-Ready RAG Framework




Multi-tenant bots, streaming, tools, and persistent memory, all batteries included.

Documentation • Quick Start • Features • Migration from 0.3.4 • Sponsor



---

## What is LongTrainer?

LongTrainer is a **production-ready RAG framework** that turns your documents into intelligent, multi-tenant chatbots — with **5 lines of code**.

Built on top of LangChain, LongTrainer handles the hard parts that every production RAG system needs: **multi-bot isolation, persistent MongoDB memory, FAISS vector search, streaming responses, custom tool calling, chat encryption, and vision support** — so you don't have to wire them together yourself.

### Why LongTrainer over raw LangChain / LlamaIndex?

| Problem | LangChain / LlamaIndex | LongTrainer |
|---|---|---|
| Multi-bot management | DIY: manage state per bot | Built-in: `initialize_bot_id()` → isolated bots |
| Persistent chat memory | Wire MongoDB/Redis yourself | Built-in: MongoDB-backed, encrypted, restorable |
| Document ingestion | Assemble loaders + splitters | One-liner: `add_document_from_path(path, bot_id)` |
| Streaming responses | Implement `astream` yourself | `get_response(stream=True)` yields chunks |
| Custom tool calling | Define tools, build agent | `add_tool(my_tool)`, plug and play |
| Web search augmentation | Find and integrate search | Built-in toggle: `web_search=True` |
| Vision chat | Complex multi-modal setup | `get_vision_response()`: pass images |
| Self-improving from chats | Not a concept | `train_chats()` feeds Q&A back into KB |
| Encryption at rest | DIY | `encrypt_chats=True`: Fernet out of the box |

---

## Installation

```bash
pip install longtrainer
```

**With agent/tool-calling support (optional):**

```bash
# Quotes keep shells such as zsh from expanding the square brackets
pip install "longtrainer[agent]"
```

### System Dependencies

**Linux (Ubuntu/Debian)**

```bash
sudo apt install libmagic-dev poppler-utils tesseract-ocr qpdf libreoffice pandoc
```

**macOS**

```bash
brew install libmagic poppler tesseract qpdf libreoffice pandoc
```

---

## Quick Start 🚀

### 1. Zero-Code CLI & API Server (New in 1.2.0!)

Manage bots, chat, and run a production API directly from your terminal—no Python required.

#### A. Interactive Terminal Chat

```bash
# 1. Initialize a new project and generate longtrainer.yaml
longtrainer init

# 2. Create a new bot
longtrainer bot create --prompt "You are a helpful assistant."

# 3. Add a document (PDF, link, etc.)
longtrainer add-doc <bot_id> /path/to/document.pdf

# 4. Start chatting!
longtrainer chat <bot_id>
```

#### B. FastAPI REST Server

Start a production-ready API server backed by your LongTrainer bots:

```bash
longtrainer serve
```

This starts a FastAPI server running on `http://localhost:8000` with **16 REST endpoints**, including:

- `/health`

- `/bots` (CRUD)

- `/bots/{id}/documents/path` (Ingest files)

- `/bots/{id}/chats` (Create sessions)

- `/bots/{id}/chats/{chat_id}` (Chat and Streaming)

Visit `http://localhost:8000/docs` to see the auto-generated Swagger UI and test the API directly!
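
As a rough sketch of talking to the server from Python, the helper below builds URLs for the paths listed above and probes `/health` with the standard library only. The base URL comes from the text; the helper names `endpoint` and `is_healthy` are illustrative, and because the response body of `/health` is not documented here, only the HTTP status is checked.

```python
import urllib.error
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # address reported for `longtrainer serve`


def endpoint(base: str, *parts: str) -> str:
    """Join path segments onto the base URL, percent-escaping each segment."""
    path = "/".join(quote(str(p), safe="") for p in parts)
    return f"{base.rstrip('/')}/{path}"


def is_healthy(base: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(endpoint(base, "health"), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Server not running, connection refused, timed out, or non-2xx status.
        return False
```

For example, `endpoint(BASE_URL, "bots", bot_id, "chats")` yields the session-creation path for a given bot; request and response payloads should be taken from the Swagger UI rather than guessed.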

### 2. Python SDK — Default RAG Mode

```python
import os

from longtrainer.trainer import LongTrainer

os.environ["OPENAI_API_KEY"] = "sk-..."

# Initialize
trainer = LongTrainer(mongo_endpoint="mongodb://localhost:27017/")
bot_id = trainer.initialize_bot_id()

# Add documents (PDF, DOCX, CSV, HTML, MD, TXT, URLs, YouTube, Wikipedia)
trainer.add_document_from_path("path/to/your/data.pdf", bot_id)

# Create bot and start chatting
trainer.create_bot(bot_id)
chat_id = trainer.new_chat(bot_id)

# Get response
answer, sources = trainer.get_response("What is this document about?", bot_id, chat_id)
print(answer)
```

### Streaming Responses

```python
# Stream tokens in real-time
for chunk in trainer.get_response("Summarize the key points", bot_id, chat_id, stream=True):
    print(chunk, end="", flush=True)
```
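
If you also need the full answer once streaming finishes (for logging or caching, say), one option is to collect the chunks while printing them. The sketch below assumes each chunk is a plain string, which the `print` call above suggests; `print_and_collect` is a hypothetical helper, and a hard-coded iterator stands in for the real stream.

```python
from typing import Iterable


def print_and_collect(chunks: Iterable[str]) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in for trainer.get_response(..., stream=True); a real stream needs a live bot.
fake_stream = iter(["Key ", "points: ", "A, B, C."])
full_text = print_and_collect(fake_stream)
```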

### Async Streaming

```python
# Must run inside a coroutine (for example, one started with asyncio.run)
async for chunk in trainer.aget_response("Explain the methodology", bot_id, chat_id):
    print(chunk, end="", flush=True)
```
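
Because `async for` is only valid inside a coroutine, a script needs an entry point such as `asyncio.run()`. The sketch below shows that wiring with a hypothetical `fake_aget_response` generator standing in for `trainer.aget_response(...)`, so it runs without a bot or API key.

```python
import asyncio
from typing import AsyncIterator


async def fake_aget_response(query: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for trainer.aget_response(query, bot_id, chat_id)."""
    for chunk in ("The ", "methodology ", "is..."):
        await asyncio.sleep(0)  # yield control, as a real network stream would
        yield chunk


async def consume(stream: AsyncIterator[str]) -> str:
    """Drain an async stream and return the concatenated text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


# Drive the coroutine from synchronous code.
result = asyncio.run(consume(fake_aget_response("Explain the methodology")))
print(result)
```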

### 🌟 NEW: Dynamic ZERO CODE Tools

In agent mode, AgentBot automatically routes questions to tools such as web search when necessary.

LongTrainer V2 now integrates LangChain's dynamic tool ecosystem **natively**. Pass tool names as strings when creating a bot:

```python
trainer.create_bot(
    "agent-id",
    agent_mode=True,
    tools=["tavily_search_results_json", "wikipedia", "arxiv", "PythonREPLTool", "yahoo_finance_news"],
)
```

LongTrainer imports and initializes any string-named tool available through `langchain.agents.load_tools` on the backend.

You may still register custom tools globally or per-bot explicitly:

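```python
from langchain.tools import tool

@tool
def get_weather(location: str) -> str:
    """Return a weather summary for a location."""
    # Placeholder body: the original snippet ends at the signature.
    return f"Weather lookup for {location} is not implemented."
```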

### Agent Mode — With Custom Tools

```python
from langchain_core.tools import tool

from longtrainer.tools import web_search

# Add built-in web search tool
trainer.add_tool(web_search, bot_id)

# Add your own custom tool
@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    # Caution: eval() executes arbitrary code; restrict or replace it before production use.
    return str(eval(expression))

trainer.add_tool(calculate, bot_id)

# Create bot in agent mode
trainer.create_bot(bot_id, agent_mode=True)
chat_id = trainer.new_chat(bot_id)

response, _ = trainer.get_response("What is 42 * 17?", bot_id, chat_id)
print(response)
```
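
The `calculate` tool above passes model-generated text to `eval()`, which will run any Python it is given. A safer variant, sketched here under the assumption that only basic arithmetic is needed, walks the parsed expression tree and allows a short list of operators. `safe_eval` is a hypothetical helper, not part of LongTrainer.

```python
import ast
import operator
from typing import Union

Number = Union[int, float]

# Whitelisted operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _is_number(node: ast.AST) -> bool:
    """True for int/float literals (booleans are excluded)."""
    return (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    )


def safe_eval(expression: str) -> Number:
    """Evaluate a basic arithmetic expression without calling eval()."""

    def walk(node: ast.AST) -> Number:
        if _is_number(node):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")

    return walk(ast.parse(expression, mode="eval").body)


print(safe_eval("42 * 17"))
```

Inside the tool, `return str(safe_eval(expression))` would replace the `eval` call; names, function calls, and attribute access all raise `ValueError`.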

### Vision Chat

```python
vision_id = trainer.new_vision_chat(bot_id)

response, sources = trainer.get_vision_response(
    "Describe what you see in this image",
    image_paths=["photo.jpg"],
    bot_id=bot_id,
    vision_chat_id=vision_id,
)
print(response)
```

### Per-Bot Customization

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from longtrainer.tools import web_search

# Each bot can have its own LLM, embeddings, and retrieval config
trainer.create_bot(
    bot_id,
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.2),
    embedding_model=OpenAIEmbeddings(model="text-embedding-3-small"),
    num_k=5,                    # retrieve 5 docs per query
    prompt_template="You are a helpful legal assistant. {context}",
    agent_mode=True,            # enable tool calling
    tools=[web_search],
)
```

---

## Features ✨

### Core

- ✅ **Dual Mode:** RAG (LCEL chain) for simple Q&A, Agent (LangGraph) for tool calling

- ✅ **Streaming Responses:** Sync and async streaming out of the box

- ✅ **Custom Tool Calling:** Add any LangChain `@tool` — web search, document reader, or your own

- ✅ **Multi-Bot Management:** Isolated bots with independent sessions, data, and configs

- ✅ **Persistent Memory:** MongoDB-backed chat history, fully restorable

- ✅ **Chat Encryption:** Fernet encryption for stored conversations
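
If you prefer to supply your own key rather than rely on an auto-generated one, the sketch below creates a key in Fernet's documented format (32 random bytes, URL-safe base64-encoded) and keeps it in a file so the same key is available after a restart. The helper names and the file-based storage are assumptions for illustration; whether `encryption_key` expects `bytes` or `str` is not stated here, so check before passing it in.

```python
import base64
import os


def generate_fernet_key() -> bytes:
    """Return a new key in Fernet's format: 32 random bytes, URL-safe base64."""
    return base64.urlsafe_b64encode(os.urandom(32))


def load_or_create_key(path: str) -> bytes:
    """Read the key from `path`, creating the file (owner-only) if it is missing."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return fh.read().strip()
    key = generate_fernet_key()
    # O_EXCL avoids overwriting a key written by another process in the meantime.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as fh:
        fh.write(key)
    return key
```

Losing the key makes previously encrypted chats unreadable, so in production a secrets manager is usually a better home for it than a local file.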

### Document Ingestion

- ✅ **Standard Formats:** PDF, DOCX, CSV, HTML, Markdown, TXT

- ✅ **Web & Crawling:** `add_document_from_link()`, `add_document_from_query()`, `add_document_from_crawl()`

- ✅ **Cloud & Enterprise:** S3 (`add_document_from_aws_s3`), Google Drive (`add_document_from_google_drive`), Confluence (`add_document_from_confluence`)

- ✅ **Structured Data:** Local Directory (`add_document_from_directory`), JSON & JQ (`add_document_from_json`), GitHub Repo (`add_document_from_github`)

- ✅ **Dynamic Integrations:** Inject ANY LangChain document loader class dynamically via `add_document_from_dynamic_loader()`

### RAG Pipeline & Vector DBs

- ✅ **Vector Databases:** FAISS, Pinecone, Chroma, Qdrant, **PGVector, MongoDB Atlas, Milvus, Elasticsearch, Weaviate**

- ✅ **Multi-Query Ensemble Retrieval:** Generates alternative queries for better recall

- ✅ **Self-Improving Memory:** `train_chats()` feeds past Q&A back into the knowledge base
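
To make the multi-query idea concrete, here is a minimal sketch: run several phrasings of the same question through a retriever and merge the hits, dropping duplicates. It is not LongTrainer's implementation; how the library generates the alternative queries and ranks the merged results is not described here. The toy keyword index and the `multi_query_retrieve` helper are invented for illustration.

```python
from typing import Callable, Iterable, List


def multi_query_retrieve(
    queries: Iterable[str],
    retrieve: Callable[[str], List[str]],
    limit: int,
) -> List[str]:
    """Run every query variant and merge the hits, dropping duplicates."""
    seen = set()
    merged: List[str] = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged[:limit]


# Toy keyword index standing in for a vector store.
INDEX = {
    "refund": ["policy.pdf#3", "faq.md#1"],
    "money back": ["faq.md#1", "terms.pdf#7"],
}

hits = multi_query_retrieve(["refund", "money back"], lambda q: INDEX.get(q, []), limit=5)
print(hits)  # ['policy.pdf#3', 'faq.md#1', 'terms.pdf#7']
```

The second phrasing surfaces `terms.pdf#7`, which the first query alone would have missed; that gain in recall is the point of the technique.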

### Customization

- ✅ **Per-bot LLM** — use different models for different bots

- ✅ **Per-bot Embeddings** — custom embedding models per bot

- ✅ **Per-bot Retrieval Config** — custom `num_k`, `chunk_size`, `chunk_overlap`

- ✅ **Custom Prompt Templates** — full control over system prompts

- ✅ **Vision Chat** — GPT-4 Vision support with image understanding

### Works with All LangChain-Compatible LLMs

- ✅ OpenAI (default)

- ✅ Anthropic

- ✅ Google VertexAI / Gemini

- ✅ AWS Bedrock

- ✅ HuggingFace

- ✅ Groq

- ✅ Together AI

- ✅ Ollama (local models)

- ✅ Any `BaseChatModel` implementation

---

## API Reference

### `LongTrainer` — Main Class

```python
trainer = LongTrainer(
    mongo_endpoint="mongodb://localhost:27017/",
    llm=None,                # default: ChatOpenAI(model="gpt-4o-2024-08-06")
    embedding_model=None,    # default: OpenAIEmbeddings()
    prompt_template=None,    # custom system prompt
    max_token_limit=32000,   # conversation memory limit
    num_k=3,                 # docs to retrieve per query
    chunk_size=2048,         # text splitter chunk size
    chunk_overlap=200,       # text splitter overlap
    ensemble=False,          # enable multi-query ensemble retrieval
    encrypt_chats=False,     # enable Fernet encryption
    encryption_key=None,     # custom encryption key (auto-generated if None)
)
```
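
To illustrate how `chunk_size` and `chunk_overlap` interact, the sketch below splits text into fixed-size windows in which each window repeats the tail of the previous one. It is a simplified, character-based illustration: real text splitters usually prefer to break on separators, and the unit LongTrainer uses (characters or tokens) is not stated here. `split_with_overlap` is a hypothetical helper.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into chunk_size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# -> ['abcd', 'defg', 'ghij']
```

With the defaults shown above (2048 and 200), each new chunk starts 1848 units after the previous one, so a 5000-unit document produces three chunks.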

### Key Methods

| Method | Description |
|---|---|
| `initialize_bot_id()` | Create a new bot, returns `bot_id` |
| `create_bot(bot_id, ...)` | Build the bot from loaded documents |
| `load_bot(bot_id)` | Restore an existing bot from MongoDB + FAISS |
| `new_chat(bot_id)` | Start a new chat session, returns `chat_id` |
| `get_response(query, bot_id, chat_id, stream=False)` | Get response (or stream) |
| `aget_response(query, bot_id, chat_id)` | Async streaming response |
| `add_document_from_path(path, bot_id)` | Ingest a file |
| `add_document_from_link(links, bot_id)` | Ingest URLs / YouTube links |
| `add_tool(tool, bot_id)` | Register a tool for a bot |
| `remove_tool(tool_name, bot_id)` | Remove a tool |
| `list_tools(bot_id)` | List registered tools |
| `train_chats(bot_id)` | Self-improve from chat history |
| `new_vision_chat(bot_id)` | Start a vision chat session |
| `get_vision_response(query, image_paths, bot_id, vision_chat_id)` | Vision response |

---

## Migration from 0.3.4

LongTrainer 1.0.0 is a major upgrade with breaking changes:

| 0.3.4 | 1.0.0 |
|---|---|
| `ConversationalRetrievalChain` | LCEL chain (`RAGBot`) or LangGraph agent (`AgentBot`) |
| `requirements.txt` + `setup.py` | `pyproject.toml` (UV/pip compatible) |
| No streaming | `stream=True` or `aget_response()` |
| No tool calling | `add_tool()` + `agent_mode=True` |
| `langchain.memory` | `langchain_core.chat_history` |
| Fixed LLM for all bots | Per-bot LLM, embeddings, and config |

**Upgrade path:**

```bash
pip install --upgrade longtrainer
```

The core API (`initialize_bot_id`, `create_bot`, `new_chat`, `get_response`) remains the same — existing code should work with minimal changes. The main difference is `get_response()` now returns `(answer, sources)` instead of `(answer, sources, web_sources)`.
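
While migrating, code that must tolerate both return shapes could route results through a small adapter. The sketch below assumes only what the paragraph above states, namely that the old call returned three values and the new one returns two; `unpack_response` is a hypothetical helper and the tuples are mocked.

```python
from typing import Any, Sequence, Tuple


def unpack_response(result: Sequence[Any]) -> Tuple[Any, Any]:
    """Return (answer, sources) from either the old 3-tuple or the new 2-tuple."""
    if len(result) not in (2, 3):
        raise ValueError(f"Expected 2 or 3 values, got {len(result)}")
    # Any third element (the old web_sources) is ignored.
    return result[0], result[1]


# Old-style and new-style return values, mocked for illustration.
old_style = ("answer text", ["doc.pdf"], ["https://example.com"])
new_style = ("answer text", ["doc.pdf"])

print(unpack_response(old_style) == unpack_response(new_style))
```

Call sites then read `answer, sources = unpack_response(trainer.get_response(...))` on either version, and the adapter can be deleted once the upgrade is complete.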

---

## Support the Project 💖

LongTrainer is free and open-source. If it's useful to you, consider sponsoring its development.

Your sponsorship helps fund:

- 🚀 New features (CLI, API server, evaluation tools)

- 🐛 Bug fixes and maintenance

- 📖 Documentation and tutorials

- 🧪 CI/CD infrastructure

---

## Citation

```
@misc{longtrainer,
  author = {Endevsols},
  title = {LongTrainer: Production-Ready RAG Framework},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ENDEVSOLS/Long-Trainer}},
}
```

## License

[MIT License](LICENSE)

## Contributing

We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
