https://github.com/mocksi/json-rag
Reference implementation for chunking nested JSON into RAG-friendly document structures
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/mocksi/json-rag
- Owner: Mocksi
- License: MIT
- Created: 2024-12-16T21:19:51.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-04T00:03:16.000Z (about 1 year ago)
- Last Synced: 2025-06-13T23:23:47.361Z (about 1 year ago)
- Language: Python
- Size: 1000 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
# JSON RAG Integration
A tool for efficiently loading and integrating nested JSON data structures into RAG (Retrieval-Augmented Generation) systems, with enhanced entity tracking, relationship detection, and context preservation.
## Key Features
* **Advanced Query Understanding**:
  - Temporal patterns (exact dates, relative ranges, named periods)
  - Metric aggregations (average, maximum, minimum, sum, count)
  - Entity relationships (direct, semantic, and cross-file connections)
  - State transitions and system conditions
  - Hybrid search combining vector similarity, relationships, and filters
* **Smart Data Processing**:
  - Automatic entity detection and relationship mapping
  - Cross-file relationship detection and validation
  - Key-value pair extraction for filtered searches
  - Embedded metadata tracking
  - Batch processing with change detection
* **Archetype-Aware Processing**:
  - Pattern detection (entities, events, metrics, collections)
  - Archetype-based scoring and ranking
  - Relationship validation by archetype
  - Context-aware embedding generation
  - Archetype-specific traversal strategies
* **Hierarchical Data Management**:
  - Full JSON structure preservation
  - Parent-child relationship tracking
  - Cross-file relationship mapping
  - Contextual embedding with ancestry
  - Path-based chunk identification
* **Enhanced Retrieval**:
  - Vector similarity search using PGVector
  - Relationship-aware context assembly
  - Entity-aware result filtering
  - Cross-file context expansion
  - Confidence-based scoring and ranking
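To make "path-based chunk identification", "parent-child relationship tracking", and "contextual embedding with ancestry" concrete, here is a minimal sketch of one way such chunking could work. It is not the project's actual implementation: the `Chunk` class, `iter_chunks`, `embedding_text`, and the `$.key[index]` path syntax are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class Chunk:
    """Hypothetical chunk record; the real data model may differ."""
    path: str                   # path-based identifier, e.g. "$.user.tags[0]"
    parent_path: Optional[str]  # path of the enclosing container, None at the root
    value: Any                  # scalar value stored at this path


def iter_chunks(node: Any, path: str = "$",
                parent: Optional[str] = None) -> Iterator[Chunk]:
    """Walk nested JSON depth-first and yield one chunk per scalar leaf."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_chunks(child, f"{path}.{key}", path)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_chunks(child, f"{path}[{index}]", path)
    else:
        yield Chunk(path=path, parent_path=parent, value=node)


def embedding_text(chunk: Chunk) -> str:
    """Prefix the value with its path so the embedded text carries its ancestry."""
    return f"{chunk.path} = {json.dumps(chunk.value)}"


doc = json.loads('{"user": {"id": 7, "tags": ["a", "b"]}}')
chunks = list(iter_chunks(doc))
for chunk in chunks:
    print(embedding_text(chunk))
```

Because every chunk records its own path and its parent's path, a retriever can rebuild the surrounding structure from a single hit by looking up chunks that share the same `parent_path`.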
## Quick Start
1. Clone and install:
```bash
git clone https://github.com/Mocksi/json-rag.git
cd json-rag
uv venv rag_env
source rag_env/bin/activate # Windows: .\rag_env\Scripts\activate
uv pip install -r requirements.txt
```
2. Set up environment:
```bash
# Create .env file with:
OPENAI_API_KEY=your-key-here
POSTGRES_DB=crowllector
POSTGRES_USER=crowllector
POSTGRES_PASSWORD=yourpassword
POSTGRES_HOST=localhost
POSTGRES_DB_PORT=5432
```
3. Initialize and run:
```bash
python -m app.main --new # Truncates all tables and starts fresh
python -m app.main # Normal operation
```
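For reference, the variables from step 2 could be turned into database connection settings along these lines. This is a sketch only: `load_db_settings` is a hypothetical helper, the project's real loader may differ, and the code assumes the `.env` values have already been exported into the process environment. The default host and port are also assumptions.

```python
import os
from typing import Any, Dict, Mapping

# Variables without a fallback in this sketch.
REQUIRED = ("POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASSWORD")


def load_db_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Collect PostgreSQL settings from the environment (hypothetical helper)."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "dbname": env["POSTGRES_DB"],
        "user": env["POSTGRES_USER"],
        "password": env["POSTGRES_PASSWORD"],
        "host": env.get("POSTGRES_HOST", "localhost"),    # assumed default
        "port": int(env.get("POSTGRES_DB_PORT", "5432")),  # assumed default
    }


# Example with an explicit mapping instead of the real environment.
example = load_db_settings({
    "POSTGRES_DB": "crowllector",
    "POSTGRES_USER": "crowllector",
    "POSTGRES_PASSWORD": "yourpassword",
})
print(example["host"], example["port"])
```

The returned keys (`dbname`, `user`, `password`, `host`, `port`) match the keyword arguments accepted by libpq-based drivers such as psycopg2, so the dictionary can be unpacked directly into a connect call.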
## Architecture
```
app/
├── analysis/ # Analysis and pattern detection
│ ├── archetype.py # Pattern and archetype detection
│ └── relationships.py # Cross-file relationship analysis
├── core/ # Core system components
│ ├── config.py # Configuration settings
│ └── models.py # Data models
├── processing/ # Data processing modules
│ ├── json_parser.py # JSON structure parsing
│ ├── parsing.py # Document parsing and chunking
│ └── processor.py # Data processing pipeline
├── retrieval/ # Query processing and retrieval
│ ├── embedding.py # Vector embedding generation
│ └── retrieval.py # Query pipeline and execution
├── storage/ # Data persistence
│ └── database.py # PostgreSQL and vector storage
├── utils/ # Utility modules
│ └── logging_config.py # Logging configuration
├── __init__.py # Package initialization
├── chat.py # Chat interface and interactions
└── main.py # Application entry point
```
The codebase is organized into logical modules:
- **analysis/**: Modules for analyzing data patterns, cross-file relationships, and user intent
- **core/**: Core system configuration and shared components
- **processing/**: Data processing and relationship detection modules
- **retrieval/**: Relationship-aware search and context assembly
- **storage/**: Database interaction and relationship persistence
- **utils/**: Shared utility functions and helpers
Each module is designed to be independent with clear responsibilities, while working together through well-defined interfaces.
## Installation Requirements
- Python 3.8 or higher
- PostgreSQL 12 or higher with PGVector extension
- OpenAI API key
- Required Python packages (see requirements.txt)
## Documentation
The codebase features comprehensive inline documentation:
- Detailed module-level docstrings explaining key concepts
- Function and class documentation with examples
- Type hints and parameter descriptions
- Usage examples and implementation notes
## Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details on:
- Setting up your development environment
- Code style guidelines
- Pull request process
- Development workflow
## Code of Conduct
This project follows the [Contributor Covenant Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior.
## License
MIT License - see LICENSE file for details.
## Roadmap
- [x] Cross-file relationship detection
- [x] Archetype-aware retrieval
- [x] Relationship-based context expansion
- [x] Confidence scoring algorithm refinement
- [ ] State transition handling improvements
- [ ] Batch processing optimization
- [ ] Metric aggregation capabilities
- [ ] Entity filtering rules improvement
- [ ] Context assembly performance optimization
- [ ] Advanced archetype pattern detection
## Query Pipeline
The system implements a structured reasoning pipeline:
1. **Query Analysis**:
   - Determines required data types
   - Identifies needed operations (filtering, aggregation)
   - Detects relationships and constraints
2. **Plan Creation**:
   - Builds retrieval strategy
   - Plans processing operations
   - Determines result formatting
3. **Execution**:
   - Retrieves relevant chunks
   - Processes according to plan
   - Assembles coherent response
Separating analysis, planning, and execution is intended to keep query handling consistent while preserving context and relationships between chunks.
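The three stages above can be illustrated with a toy, in-memory sketch. Everything in it is an assumption made for illustration: `QueryPlan`, `analyze`, `execute`, and the keyword heuristic are hypothetical, and a plain list of dictionaries stands in for chunks retrieved from the vector store.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Aggregation names mirror the ones listed under Key Features.
AGGREGATIONS = {
    "average": lambda xs: sum(xs) / len(xs),
    "maximum": max,
    "minimum": min,
    "sum": sum,
    "count": len,
}


@dataclass
class QueryPlan:
    """Hypothetical plan: which rows to keep and how to combine them."""
    filters: Dict[str, Any] = field(default_factory=dict)
    aggregation: Optional[str] = None
    metric: Optional[str] = None


def analyze(query: str, known_fields: List[str]) -> QueryPlan:
    """Stages 1 and 2: spot an aggregation and a metric by keyword, build a plan."""
    words = query.lower().split()
    return QueryPlan(
        aggregation=next((w for w in words if w in AGGREGATIONS), None),
        metric=next((w for w in words if w in known_fields), None),
    )


def execute(plan: QueryPlan, chunks: List[Dict[str, Any]]) -> Any:
    """Stage 3: filter the chunks, then aggregate or return the matching rows."""
    rows = [c for c in chunks
            if all(c.get(k) == v for k, v in plan.filters.items())]
    if plan.aggregation and plan.metric:
        values = [c[plan.metric] for c in rows if plan.metric in c]
        return AGGREGATIONS[plan.aggregation](values) if values else None
    return rows


chunks = [
    {"region": "eu", "latency": 120},
    {"region": "us", "latency": 80},
    {"region": "eu", "latency": 100},
]
plan = analyze("average latency", known_fields=["latency"])
plan.filters = {"region": "eu"}  # a constraint detected during analysis
result = execute(plan, chunks)
print(result)  # 110.0
```

Keeping the plan as plain data between analysis and execution means it can be logged or inspected before any retrieval runs, which is one reason to separate the stages.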