https://github.com/vincentkoc/datahub-langchain
Seamless LLM Lineage for DataHub with LangChain and LangSmith
https://github.com/vincentkoc/datahub-langchain
datahub langchain langsmith lineage observability
Last synced: 5 months ago
JSON representation
Seamless LLM Lineage for DataHub with LangChain and LangSmith
- Host: GitHub
- URL: https://github.com/vincentkoc/datahub-langchain
- Owner: vincentkoc
- License: gpl-3.0
- Created: 2024-11-23T04:24:16.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-25T13:42:04.000Z (over 1 year ago)
- Last Synced: 2025-10-14T16:52:51.357Z (9 months ago)
- Topics: datahub, langchain, langsmith, lineage, observability
- Language: Python
- Homepage:
- Size: 672 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
> [!CAUTION]
> This is an experimental project and not ready for production use. Use at your own risk.
# Datahub LLM Lineage ๐
Seamless LLM Lineage for DataHub with LangChain and LangSmith
Features โข
Installation โข
Quick Start โข
Usage โข
Architecture โข
Contributing โข
License
A comprehensive observability solution that integrates LangChain and LangSmith workflows into DataHub's metadata platform, providing deep visibility into your LLM operations.
## Features
- ๐ **Real-Time Observation**: Live monitoring of LangChain operations
- ๐ **Rich Metadata**: Detailed tracking of models, prompts, and chains
- ๐ **Deep Insights**: Comprehensive metrics and lineage tracking
- ๐ **Multiple Platforms**: Support for LangChain, LangSmith, and more
- ๐ **Extensible**: Easy to add new platforms and emitters
- ๐งช **Debug Mode**: Built-in debugging and dry run capabilities
## Installation
```bash
# Clone the repository
git clone
cd langchain-datahub-integration
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Copy and configure environment
cp .env.example .env
```
## Quick Start
1. **Configure Environment**
```bash
# Required environment variables
LANGSMITH_API_KEY=ls-...
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=default
OPENAI_API_KEY=sk-...
DATAHUB_GMS_URL=http://localhost:8080
DATAHUB_TOKEN=your_token_here
```
2. **Run Basic Example**
```python
from langchain_openai import ChatOpenAI
from src.platforms.langchain import LangChainObserver
from src.emitters.datahub import DataHubEmitter
from src.config import ObservabilityConfig
# Setup observation
config = ObservabilityConfig(langchain_verbose=True)
emitter = DataHubEmitter(gms_server="http://localhost:8080")
observer = LangChainObserver(config=config, emitter=emitter)
# Initialize LLM with observer
llm = ChatOpenAI(callbacks=[observer])
# Run with automatic observation
response = llm.invoke("Tell me a joke")
```
## Architecture
The integration consists of three main components:
1. **Observers** (`src/platforms/`)
- Real-time monitoring of LLM operations
- Metric collection and event tracking
- Platform-specific adapters
2. **Emitters** (`src/emitters/`)
- DataHub metadata emission
- Console debugging output
- JSON file export
3. **Collectors** (`src/collectors/`)
- Historical data collection
- Batch processing
- Aggregated metrics
## Usage Examples
### Basic LangChain Integration
```python
# examples/langchain_basic.py
from langchain_openai import ChatOpenAI
from src.platforms.langchain import LangChainObserver
observer = LangChainObserver(config=config, emitter=emitter)
llm = ChatOpenAI(callbacks=[observer])
```
### RAG Pipeline Integration
```python
# examples/langchain_rag.py
from langchain.chains import RetrievalQA
from src.utils.metrics import MetricsAggregator
chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(),
callbacks=[observer]
)
```
### Historical Data Ingestion
```python
# examples/langsmith_ingest.py
from src.cli.ingest import ingest_logic
ingest_logic(
days=7,
platform='langsmith',
debug=True,
save_debug_data=True
)
```
## Customization
The integration is highly customizable through:
- **Configuration** (`src/config.py`): Environment and platform settings
- **Custom Emitters**: Implement `LLMMetadataEmitter` for new destinations
- **Platform Extensions**: Add new platforms by implementing `LLMPlatformConnector`
- **Metrics Collection**: Extend `MetricsAggregator` for custom metrics
## Contributing
1. Fork the repository
2. Create a feature branch
3. Run tests and linting:
```bash
make test
make lint
```
4. Submit a pull request
## License
This project is licensed under the GNU General Public License v3.0 - see the [LICENSE](LICENSE) file for details.
---
Made with โค๏ธ by Vincent Koc