https://github.com/samowolabi/chatrag-knowledge-graph
Intelligent RAG system with knowledge graphs. Processes documents → extracts entities & relationships → stores in Neo4j → enables semantic search & AI-powered Q&A with citations. Where documents become knowledge graphs. Advanced RAG system that turns your files into an intelligent, queryable knowledge base.
- Host: GitHub
- URL: https://github.com/samowolabi/chatrag-knowledge-graph
- Owner: samowolabi
- Created: 2025-09-21T14:27:01.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-09-21T22:52:52.000Z (9 months ago)
- Last Synced: 2025-10-05T01:54:44.193Z (9 months ago)
- Topics: graphrag, knowledge-graph, llm-agent, neo4j
- Language: TypeScript
- Homepage:
- Size: 1.52 MB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
# ChatRAG Knowledge Graph
**Intelligent RAG system with knowledge graphs.** Processes documents → extracts entities & relationships → stores in Neo4j → enables semantic search & AI-powered Q&A with citations.
A comprehensive **Retrieval-Augmented Generation (RAG)** system that ingests documents, creates knowledge graphs, and provides intelligent querying capabilities using OpenAI embeddings and Neo4j graph database.
## Architecture Overview
```
Documents → Ingestion Pipeline → Neo4j Graph → Query Pipeline → AI Responses
    ↓               ↓                 ↓              ↓               ↓
 PDF/TXT     Text Extraction      Chunks +       Semantic        Generated
  Files        + Chunking        Entities +      Search +         Answers
                                  Relations      Graph RAG     with Sources
```
## Table of Contents
- [Features](#features)
- [Repository Structure](#repository-structure)
- [System Architecture](#system-architecture)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Configuration](#configuration)
- [Data Ingestion Pipeline](#data-ingestion-pipeline)
- [Query Pipeline](#query-pipeline)
- [API Endpoints](#api-endpoints)
- [Usage Examples](#usage-examples)
- [Testing with Postman](#testing-with-postman)
- [Architecture Details](#architecture-details)
- [Troubleshooting](#troubleshooting)
## Features
### Data Ingestion
- **Multi-format document parsing** (PDF, TXT, DOC)
- **Intelligent text chunking** using LangChain
- **OpenAI embeddings generation** for semantic search
- **Entity and relationship extraction** using AI
- **Neo4j graph storage** for complex relationships
### Query System
- **Semantic search** using vector similarity
- **RAG (Retrieval-Augmented Generation)** with source citation
- **Graph traversal** for entity relationships
- **Hybrid search** combining semantic and keyword matching
## Repository Structure
```
chatrag-knowledge-graph/
├── src/
│   ├── config/
│   │   └── environment.ts              # Environment configuration
│   ├── controllers/
│   │   ├── ingestDataController.ts     # Document ingestion endpoints
│   │   └── queryDataController.ts      # Query and search endpoints
│   ├── routes/
│   │   ├── graphRagRoutes.ts           # Ingestion API routes
│   │   └── queryRoutes.ts              # Query API routes
│   ├── services/
│   │   ├── graphService.ts             # Neo4j graph operations
│   │   ├── langchainService.ts         # Text chunking with LangChain
│   │   ├── neo4jService.ts             # Neo4j database connection
│   │   ├── openaiService.ts            # OpenAI API integration
│   │   └── textExtractor.ts            # Document parsing (PDF/TXT)
│   ├── utils/
│   │   ├── chunkText.ts                # Text processing utilities
│   │   └── jsonToObjectParser.ts       # JSON parsing utilities
│   └── index.ts                        # Express server entry point
├── ChatRAG API.postman_collection.json # Postman test collection
├── package.json                        # Dependencies and scripts
├── tsconfig.json                       # TypeScript configuration
├── nodemon.json                        # Development server config
└── README.md                           # This file
```
## System Architecture
### High-Level Architecture
```mermaid
graph TB
A[Documents] --> B[Text Extractor]
B --> C[LangChain Chunker]
C --> D[OpenAI Embeddings]
D --> E[Neo4j Graph DB]
F[User Query] --> G[OpenAI Embedding]
G --> H[Neo4j Vector Search]
H --> I[Graph Service]
I --> J[OpenAI RAG]
J --> K[AI Response with Citations]
E --> H
```
### Component Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                        Express.js Server                        │
├─────────────────────────────────────────────────────────────────┤
│                          Routes Layer                           │
│  ┌───────────────────────┐  ┌────────────────────────────────┐  │
│  │ graphRagRoutes        │  │ queryRoutes                    │  │
│  │ (Ingestion)           │  │ (Search & RAG)                 │  │
│  └───────────────────────┘  └────────────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│                        Controllers Layer                        │
│  ┌───────────────────────┐  ┌────────────────────────────────┐  │
│  │ ingestDataController  │  │ queryDataController            │  │
│  │ • extractText         │  │ • semanticSearch               │  │
│  │ • breakTextIntoChunks │  │ • ragQuery                     │  │
│  │ • generateEmbeddings  │  │ • (future: hybridSearch)       │  │
│  │ • storeInGraph        │  │                                │  │
│  │ • fullPipeline        │  │                                │  │
│  └───────────────────────┘  └────────────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│                         Services Layer                          │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ textExtractor│ │ langchainSvc │ │ openaiService│ │ graphSvc │ │
│ │ • parsePDF   │ │ • splitText  │ │ • embedText  │ │ • search │ │
│ │ • parseTXT   │ │ • chunkText  │ │ • chatCompl  │ │ • store  │ │
│ │ • parseDoc   │ │              │ │ • embedBatch │ │ • entity │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────┘ │
│ ┌────────────────┐                                              │
│ │ neo4jService   │                                              │
│ │ • connection   │                                              │
│ │ • queries      │                                              │
│ │ • transactions │                                              │
│ └────────────────┘                                              │
├─────────────────────────────────────────────────────────────────┤
│                           Data Layer                            │
│            ┌───────────────────────────────────────┐            │
│            │         Neo4j Graph Database          │            │
│            │ ┌─────────────┐ ┌───────────────────┐ │            │
│            │ │ Chunks      │ │ Entities          │ │            │
│            │ │ • content   │ │ • name/type       │ │            │
│            │ │ • embedding │ │ • description     │ │            │
│            │ │ • metadata  │ │ • properties      │ │            │
│            │ └─────────────┘ └───────────────────┘ │            │
│            │             Relationships             │            │
│            │ • CONTAINS                            │            │
│            │ • RELATES_TO                          │            │
│            │ • Custom types                        │            │
│            └───────────────────────────────────────┘            │
└─────────────────────────────────────────────────────────────────┘
```
### Service Responsibilities
| Service | Purpose | Key Methods |
|---------|---------|-------------|
| **textExtractor** | Document parsing | `parseDocument()`, `extractMetadata()` |
| **langchainService** | Text processing | `splitText()`, `createChunks()` |
| **openaiService** | AI operations | `embedText()`, `chatCompletion()`, `embedTextsBatch()` |
| **neo4jService** | Database layer | `executeQuery()`, `initialize()`, `testConnection()` |
| **graphService** | Graph operations | `semanticSearchChunks()`, `storeExtractedNodesAndRelationships()` |
## Prerequisites
- **Node.js** 18+
- **Neo4j Database** 4.0+ (with optional GDS library)
- **OpenAI API Key**
- **TypeScript** knowledge
## Installation
```bash
# Clone the repository
git clone https://github.com/samowolabi/chatrag-knowledge-graph.git
cd chatrag-knowledge-graph
# Install dependencies
npm install
# Set up environment variables
cp .env.example .env
```
## Configuration
Create a `.env` file with the following variables:
```env
# Server Configuration
PORT=3000
# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here
# Neo4j Configuration
NEO4J_URI=neo4j://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_neo4j_password
# Optional: Neo4j AuraDB (cloud)
# NEO4J_URI=neo4j+s://your-instance.databases.neo4j.io
```
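A minimal sketch of how these variables might be validated at startup. The `loadConfig` function and the `AppConfig` shape are assumptions for illustration and are not taken from `src/config/environment.ts`; only the variable names come from the `.env` sample above.

```typescript
// Hypothetical config shape; field names mirror the .env sample.
interface AppConfig {
  port: number;
  openaiApiKey: string;
  neo4jUri: string;
  neo4jUsername: string;
  neo4jPassword: string;
}

type EnvSource = Record<string, string | undefined>;

// Read one required variable, failing fast with a clear message if it is unset.
function requireEnv(env: EnvSource, key: string): string {
  const value = env[key];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
}

// Build a typed config from an env-like object (pass process.env in a real server).
function loadConfig(env: EnvSource): AppConfig {
  const port = Number(env["PORT"] ?? "3000");
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`PORT must be a positive integer, got: ${env["PORT"]}`);
  }
  return {
    port,
    openaiApiKey: requireEnv(env, "OPENAI_API_KEY"),
    neo4jUri: requireEnv(env, "NEO4J_URI"),
    neo4jUsername: requireEnv(env, "NEO4J_USERNAME"),
    neo4jPassword: requireEnv(env, "NEO4J_PASSWORD"),
  };
}
```

Failing at startup rather than on the first request makes a missing key or password visible immediately, which is usually easier to diagnose than a later connection error.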
## Data Ingestion Pipeline
The ingestion system processes documents through multiple stages:
### Pipeline Stages
1. **Text Extraction** → Parse documents into structured text
2. **Text Chunking** → Split content into manageable pieces
3. **Embedding Generation** → Create vector representations
4. **Chunk Storage** → Save chunks with embeddings to Neo4j
5. **Entity Extraction** → Identify entities and relationships using AI
6. **Graph Construction** → Store entities and relationships in Neo4j
### Available Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/graphrag/extract-text` | POST | Parse documents (PDF/TXT/DOC) |
| `/graphrag/break-text` | POST | Split text into chunks |
| `/graphrag/embed-chunks-with-openai` | POST | Generate embeddings |
| `/graphrag/store-chunks-embeddings-graph` | POST | Store chunks in Neo4j |
| `/graphrag/extract-document-nodes-relationships-graph` | POST | Extract entities/relationships |
| `/graphrag/store-extracted-nodes-relationships-graph` | POST | Store graph data |
| `/graphrag/process-document-pipeline` | POST | **Full end-to-end processing** |
### Individual Steps Usage
#### 1. Extract Text from Document
```bash
curl -X POST http://localhost:3000/graphrag/extract-text \
-H "Content-Type: application/json" \
-d '{
"filePath": "/path/to/document.pdf",
"type": "pdf"
}'
```
#### 2. Break Text into Chunks
```bash
curl -X POST http://localhost:3000/graphrag/break-text \
-H "Content-Type: application/json" \
-d '{
"text": "Your document content here..."
}'
```
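To make the chunking step concrete, here is a minimal character-based splitter with overlap. It is a simplified stand-in, not the LangChain splitter the project uses; the `splitWithOverlap` name, the `chunkSize` and `overlap` defaults, and the `chunk_N` ID scheme are illustrative assumptions.

```typescript
interface TextChunk {
  id: string;
  content: string;
}

// Split text into fixed-size windows; each window shares `overlap`
// characters with the previous one so context is not cut mid-thought.
function splitWithOverlap(text: string, chunkSize = 1000, overlap = 200): TextChunk[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in the range [0, chunkSize)");
  }
  const chunks: TextChunk[] = [];
  const step = chunkSize - overlap; // how far the window advances each iteration
  for (let start = 0; start < text.length; start += step) {
    const content = text.slice(start, start + chunkSize);
    chunks.push({ id: `chunk_${chunks.length + 1}`, content });
    if (start + chunkSize >= text.length) {
      break; // this window already reached the end of the text
    }
  }
  return chunks;
}
```

Overlap trades a little extra storage and embedding cost for better recall when a relevant sentence straddles a chunk boundary.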
#### 3. Generate Embeddings
```bash
curl -X POST http://localhost:3000/graphrag/embed-chunks-with-openai \
-H "Content-Type: application/json" \
-d '{
"chunks": [
{"id": "chunk_1", "content": "First chunk content"},
{"id": "chunk_2", "content": "Second chunk content"}
]
}'
```
#### 4. Store Chunks with Embeddings
```bash
curl -X POST http://localhost:3000/graphrag/store-chunks-embeddings-graph \
-H "Content-Type: application/json" \
-d '{
"chunks": [
{
"id": "chunk_1",
"content": "Content here",
"embedding": [0.1, 0.2, 0.3, ...]
}
]
}'
```
#### 5. Full Pipeline (Recommended)
```bash
curl -X POST http://localhost:3000/graphrag/process-document-pipeline \
-H "Content-Type: application/json" \
-d '{
"filePath": "/path/to/document.pdf",
"type": "pdf"
}'
```
## Query Pipeline
The query system provides multiple ways to retrieve and generate responses:
### Available Query Types
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/query/semantic` | POST | Vector similarity search |
| `/query/rag` | POST | AI-generated answers with sources |
### Query Examples
#### Semantic Search
Find similar content using vector embeddings:
```bash
curl -X POST http://localhost:3000/query/semantic \
-H "Content-Type: application/json" \
-d '{
"query": "artificial intelligence applications",
"limit": 5
}'
```
**Response:**
```json
{
  "success": true,
  "data": {
    "query": "artificial intelligence applications",
    "results": [
      {
        "id": "chunk_1",
        "content": "AI applications in healthcare include...",
        "similarity": 0.89,
        "score": 0.89
      }
    ],
    "count": 5
  }
}
```
#### RAG Query
Get AI-generated answers with source citations:
```bash
curl -X POST http://localhost:3000/query/rag \
-H "Content-Type: application/json" \
-d '{
"query": "What are the benefits of machine learning?",
"limit": 3,
"includeContext": true
}'
```
**Response:**
```json
{
  "success": true,
  "data": {
    "query": "What are the benefits of machine learning?",
    "answer": "Based on the provided context, machine learning offers several key benefits [1][2]: automation of complex tasks, pattern recognition in large datasets...",
    "sources": [
      {
        "id": "chunk_1",
        "content": "Machine learning enables automated...",
        "similarity": 0.92,
        "sourceNumber": 1
      }
    ],
    "metadata": {
      "chunksRetrieved": 3,
      "avgSimilarity": 0.87
    }
  }
}
```
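A client consuming this response could type it and render the answer with its sources roughly as follows. The interfaces are inferred from the sample response above, and `formatRagAnswer` is a hypothetical helper, not part of the project.

```typescript
// Types inferred from the sample RAG response; the real API may return more fields.
interface RagSource {
  id: string;
  content: string;
  similarity: number;
  sourceNumber: number;
}

interface RagResponse {
  success: boolean;
  data: {
    query: string;
    answer: string;
    sources: RagSource[];
    metadata: { chunksRetrieved: number; avgSimilarity: number };
  };
}

// Render the answer followed by a numbered source list that matches the [n] citations.
function formatRagAnswer(res: RagResponse, previewLength = 60): string {
  if (!res.success) {
    throw new Error("RAG request was not successful");
  }
  const lines = res.data.sources.map((s) => {
    const preview =
      s.content.length > previewLength ? `${s.content.slice(0, previewLength)}...` : s.content;
    return `[${s.sourceNumber}] ${s.id} (similarity ${s.similarity.toFixed(2)}): ${preview}`;
  });
  return [res.data.answer, "", "Sources:", ...lines].join("\n");
}
```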
## Quick Start
### 1. Start the Server
```bash
npm run dev
```
### 2. Ingest Your First Document
```bash
curl -X POST http://localhost:3000/graphrag/process-document-pipeline \
-H "Content-Type: application/json" \
-d '{
"filePath": "/path/to/your/document.pdf",
"type": "pdf"
}'
```
### 3. Query Your Data
```bash
curl -X POST http://localhost:3000/query/rag \
-H "Content-Type: application/json" \
-d '{
"query": "What is this document about?",
"limit": 5
}'
```
## Neo4j Graph Schema
### Node Types
- **Chunk**: Document fragments with embeddings
- **Entity**: Extracted entities (PERSON, ORGANIZATION, LOCATION, CONCEPT)
### Relationship Types
- **CONTAINS**: Document contains entities
- **RELATES_TO**: Entity relationships
- **Custom types**: Based on extracted relationships
### Example Cypher Queries
```cypher
// View all chunks
MATCH (c:Chunk) RETURN c LIMIT 10
// Find entities of a specific type
MATCH (e:PERSON) RETURN e.name, e.description
// Explore relationships
MATCH (a)-[r]->(b) RETURN a.name, type(r), b.name LIMIT 10
```
## Testing with Postman
### Import Collection
1. **Download**: [`ChatRAG API.postman_collection.json`](./ChatRAG%20API.postman_collection.json)
2. **Import to Postman**: File → Import → Upload the JSON file
3. **Set Environment**: Update `base_url` variable to `http://localhost:3000`
### Available Test Collections
The Postman collection includes comprehensive tests for:
#### **Ingestion Endpoints**
- Health Check
- Extract Text from Document
- Break Text into Chunks
- Generate Embeddings
- Store Chunks & Embeddings
- Extract Entities & Relationships
- Store Graph Data
- **Full Pipeline Test** (End-to-end)
#### **Query Endpoints**
- Semantic Search
- RAG Query with Citations
- (Future: Hybrid Search, Entity Search)
### Test Sequence
1. **Health Check** → Verify server is running
2. **Full Pipeline** → Process a sample document
3. **Semantic Search** → Test vector search
4. **RAG Query** → Test AI response generation
### Sample Test Data Included
- Realistic document content examples
- Pre-configured embeddings for testing
- Entity/relationship samples
- Various query examples
## Architecture Details
### Request/Response Flow
#### Ingestion Flow
```
1. POST /graphrag/process-document-pipeline
   ↓
2. ingestDataController.processDocumentPipeline()
   ↓
3. textExtractor.parseDocument() → Document object
   ↓
4. langchainService.splitText() → Text chunks
   ↓
5. openaiService.embedTextsBatch() → Vector embeddings
   ↓
6. graphService.storeBatchChunksWithEmbeddings() → Neo4j storage
   ↓
7. graphService.extractEntitiesAndRelationshipsFromText() → AI extraction
   ↓
8. graphService.storeExtractedNodesAndRelationships() → Graph relationships
```
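The ingestion flow above can be sketched as a single orchestration function. This is an illustration under assumptions, not the controller's actual code: the stages are injected as plain functions (so the sketch runs without OpenAI or Neo4j), `parseDocument` is simplified to return a string rather than a document object, and `IngestStages`, `runIngestPipeline`, and the `chunk_N` IDs are hypothetical names.

```typescript
interface EmbeddedChunk {
  id: string;
  content: string;
  embedding: number[];
}

interface ExtractedGraph {
  entities: unknown[];
  relationships: unknown[];
}

// Each stage is injected, so real services or test doubles can be plugged in.
interface IngestStages {
  parseDocument(filePath: string, type: string): Promise<string>;
  splitText(text: string): Promise<string[]>;
  embedTextsBatch(texts: string[]): Promise<number[][]>;
  storeChunks(chunks: EmbeddedChunk[]): Promise<void>;
  extractGraph(text: string): Promise<ExtractedGraph>;
  storeGraph(graph: ExtractedGraph): Promise<void>;
}

interface IngestSummary {
  chunkCount: number;
  entityCount: number;
  relationshipCount: number;
}

// Run the stages strictly in order; any stage that throws aborts the pipeline.
async function runIngestPipeline(
  stages: IngestStages,
  filePath: string,
  type: string
): Promise<IngestSummary> {
  const text = await stages.parseDocument(filePath, type);
  const pieces = await stages.splitText(text);
  const vectors = await stages.embedTextsBatch(pieces);
  if (vectors.length !== pieces.length) {
    throw new Error("Embedding count does not match chunk count");
  }
  const chunks: EmbeddedChunk[] = pieces.map((content, i) => ({
    id: `chunk_${i + 1}`,
    content,
    embedding: vectors[i] ?? [],
  }));
  await stages.storeChunks(chunks);
  const graph = await stages.extractGraph(text);
  await stages.storeGraph(graph);
  return {
    chunkCount: chunks.length,
    entityCount: graph.entities.length,
    relationshipCount: graph.relationships.length,
  };
}
```

Chunks are stored before entity extraction starts, so a failure in the AI extraction step still leaves the document searchable by vector similarity.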
#### Query Flow
```
1. POST /query/rag
   ↓
2. queryDataController.ragQuery()
   ↓
3. openaiService.embedText() → Query embedding
   ↓
4. graphService.semanticSearchChunks() → Similar chunks
   ↓
5. Build context from retrieved chunks
   ↓
6. openaiService.chatCompletion() → AI response
   ↓
7. Return answer with source citations
```
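Steps 5 to 7 of the query flow might look like the following. The prompt wording and the helper names (`buildContext`, `buildRagMessages`, `buildSources`) are assumptions for illustration; the `sourceNumber`, `chunksRetrieved`, and `avgSimilarity` fields follow the sample RAG response shown earlier.

```typescript
interface RetrievedChunk {
  id: string;
  content: string;
  similarity: number;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Number each chunk so the model can cite it as [1], [2], ...
function buildContext(chunks: RetrievedChunk[]): string {
  return chunks.map((c, i) => `[${i + 1}] ${c.content}`).join("\n\n");
}

// Assemble chat messages that ask for an answer grounded in the numbered context.
function buildRagMessages(query: string, chunks: RetrievedChunk[]): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "Answer using only the provided context. Cite sources as [n]. " +
        "If the context is insufficient, say so.",
    },
    { role: "user", content: `Context:\n${buildContext(chunks)}\n\nQuestion: ${query}` },
  ];
}

// Shape the `sources` and `metadata` fields of the response.
function buildSources(chunks: RetrievedChunk[]) {
  const total = chunks.reduce((sum, c) => sum + c.similarity, 0);
  return {
    sources: chunks.map((c, i) => ({ ...c, sourceNumber: i + 1 })),
    metadata: {
      chunksRetrieved: chunks.length,
      avgSimilarity: chunks.length === 0 ? 0 : total / chunks.length,
    },
  };
}
```

Using the same index for the context label and `sourceNumber` is what lets a citation such as `[2]` in the answer be traced back to a specific chunk.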
### Core Services
#### TextExtractor Service
- **Purpose**: Multi-format document parsing
- **Formats**: PDF, TXT, DOC files
- **Methods**:
- `parseDocument(filePath, type)` - Main parsing method
- `extractMetadata()` - Document metadata extraction
- **Dependencies**: pdf-parse, fs
#### LangChain Service
- **Purpose**: Intelligent text chunking
- **Features**: Configurable chunk size, overlap handling
- **Methods**:
- `splitText(text)` - Split text into chunks
- `createChunks()` - Create chunk objects with IDs
- **Dependencies**: LangChain TextSplitter
#### OpenAI Service
- **Purpose**: AI operations and embeddings
- **Models**: text-embedding-ada-002, gpt-4
- **Methods**:
- `embedText(text)` - Single text embedding
- `embedTextsBatch(texts[])` - Batch embedding processing
- `chatCompletion(messages, options)` - AI response generation
- **Features**: Automatic retry, error handling, batch optimization
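The "automatic retry" feature could be implemented with exponential backoff along these lines. This is a generic sketch, not the project's code; `withRetry`, `RetryOptions`, and the default values are hypothetical.

```typescript
interface RetryOptions {
  retries: number; // number of retries after the first attempt
  baseDelayMs: number; // delay before the first retry; doubles on each further retry
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Call `fn`, retrying on any thrown error with delays of base, 2*base, 4*base, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: RetryOptions = { retries: 3, baseDelayMs: 500 }
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === opts.retries) {
        break; // out of retries; rethrow below
      }
      await sleep(opts.baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

A production version would usually retry only on transient failures such as rate limits or timeouts, and rethrow authentication or validation errors immediately.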
#### Neo4j Service
- **Purpose**: Database connection and query execution
- **Features**: Connection pooling, transaction management
- **Methods**:
- `initialize()` - Database setup
- `executeQuery(query, params)` - Query execution
- `testConnection()` - Health check
- **Configuration**: Supports local and AuraDB cloud instances
#### Graph Service
- **Purpose**: High-level graph operations
- **Features**: Vector search, entity management, relationship handling
- **Methods**:
- `semanticSearchChunks(embedding, limit)` - Vector similarity search
- `storeExtractedNodesAndRelationships()` - Graph data storage
- `extractEntitiesAndRelationshipsFromText()` - AI-powered extraction
- **Fallbacks**: Manual cosine similarity if GDS not available
### Data Models
#### Chunk Model
```typescript
interface Chunk {
  id: string;
  content: string;
  embedding: number[];
  metadata: Record<string, any>;
}
```
#### Entity Model
```typescript
interface GraphEntity {
  id: string;
  name: string;
  type: string; // PERSON, ORGANIZATION, LOCATION, CONCEPT
  description: string;
  properties: Record<string, any>;
}
```
#### Relationship Model
```typescript
interface GraphRelationship {
  id: string;
  source: string;
  target: string;
  type: string; // RELATES_TO, WORKS_FOR, etc.
  description: string;
  properties: Record<string, any>;
}
```
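Because entities and relationships are produced by a language model, the raw output is worth validating before it is written to the graph. The sketch below assumes the model returns a JSON object with `entities` and `relationships` arrays, possibly wrapped in a Markdown code fence; that output format and the `parseExtraction` helper are assumptions, not the behavior of `jsonToObjectParser.ts`.

```typescript
interface ExtractedEntity {
  id: string;
  name: string;
  type: string;
  description: string;
}

interface ExtractedRelationship {
  id: string;
  source: string;
  target: string;
  type: string;
  description: string;
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// True when `v` is an object whose listed keys all hold strings.
function hasStrings(v: unknown, keys: string[]): boolean {
  if (!isRecord(v)) return false;
  const rec = v;
  return keys.every((k) => typeof rec[k] === "string");
}

function parseExtraction(raw: string): {
  entities: ExtractedEntity[];
  relationships: ExtractedRelationship[];
} {
  const fence = "`".repeat(3);
  let body = raw.trim();
  if (body.startsWith(fence)) {
    // Drop the opening fence line (which may carry a language tag) and the closing fence.
    body = body.slice(body.indexOf("\n") + 1);
    if (body.endsWith(fence)) body = body.slice(0, -fence.length);
  }
  const parsed: unknown = JSON.parse(body);
  if (!isRecord(parsed)) throw new Error("Expected a JSON object");
  const rawEntities = Array.isArray(parsed["entities"]) ? parsed["entities"] : [];
  const rawRels = Array.isArray(parsed["relationships"]) ? parsed["relationships"] : [];
  const entities = rawEntities.filter((e): e is ExtractedEntity =>
    hasStrings(e, ["id", "name", "type", "description"])
  );
  const knownIds = new Set(entities.map((e) => e.id));
  // Keep only well-formed relationships whose endpoints both exist.
  const relationships = rawRels
    .filter((r): r is ExtractedRelationship =>
      hasStrings(r, ["id", "source", "target", "type", "description"])
    )
    .filter((r) => knownIds.has(r.source) && knownIds.has(r.target));
  return { entities, relationships };
}
```

Dropping relationships with unknown endpoints avoids creating dangling edges when the model mentions an entity it never defined.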
## Troubleshooting
### Common Issues
#### Neo4j Connection Failed
```bash
# Check Neo4j status
neo4j status
# Start Neo4j
neo4j start
# Verify credentials in .env file
```
#### OpenAI API Errors
```bash
# Verify API key
echo $OPENAI_API_KEY
# Check API quota and billing
```
#### Vector Search Issues
If GDS functions fail, the system automatically falls back to manual cosine similarity calculation.
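For reference, a manual cosine similarity ranking can be written in a few lines. This sketch computes it in application code over chunks already loaded in memory; whether the project's fallback runs in TypeScript or in Cypher is not stated here, so treat `cosineSimilarity` and `rankBySimilarity` as illustrative names only.

```typescript
// cos(a, b) = (a . b) / (|a| * |b|); returns 0 if either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface StoredChunk {
  id: string;
  content: string;
  embedding: number[];
}

// Score every chunk against the query embedding and keep the top `limit`.
function rankBySimilarity(query: number[], chunks: StoredChunk[], limit = 5) {
  return chunks
    .map((c) => ({
      id: c.id,
      content: c.content,
      similarity: cosineSimilarity(query, c.embedding),
    }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, limit);
}
```

This brute-force scan is linear in the number of chunks, so it is fine for small collections but much slower than an indexed vector search on large ones.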
#### Memory Issues with Large Documents
- Reduce chunk size in LangChain configuration
- Process documents in smaller batches
- Increase Node.js memory limit: `node --max-old-space-size=4096`
### Debug Mode
```bash
# Enable detailed logging
DEBUG=* npm run dev
```
### Health Check
```bash
curl http://localhost:3000/
```
## Performance Optimization
### Neo4j Optimization
```cypher
// Create a vector index for faster similarity search
// (needs a Neo4j 5.x release with vector index support; 1536 matches text-embedding-ada-002)
CREATE VECTOR INDEX chunk_embeddings FOR (c:Chunk) ON (c.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};
// Create a text index for keyword search
CREATE TEXT INDEX chunk_content FOR (c:Chunk) ON (c.content);
```
### Batch Processing
- Process documents in batches of 10-50 chunks
- Use Promise.all for parallel embedding generation
- Implement pagination for large query results
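The batching advice above can be sketched as follows. The embedding call is injected as a function so the example runs without an API key; `toBatches` and `embedInBatches` are hypothetical helpers, not the project's `embedTextsBatch`.

```typescript
// Split a list into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embed all batches concurrently; Promise.all preserves input order,
// so the flattened result lines up with `texts`.
async function embedInBatches(
  texts: string[],
  batchSize: number,
  embedBatch: (batch: string[]) => Promise<number[][]>
): Promise<number[][]> {
  const perBatch = await Promise.all(
    toBatches(texts, batchSize).map((batch) => embedBatch(batch))
  );
  return perBatch.reduce<number[][]>((all, part) => all.concat(part), []);
}
```

`Promise.all` starts every batch at once, so for very large documents it may be worth limiting how many batches are in flight to stay within API rate limits.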
## Contributing
### Development Setup
```bash
# Clone the repository
git clone https://github.com/samowolabi/chatrag-knowledge-graph.git
cd chatrag-knowledge-graph
# Install dependencies
npm install
# Set up environment
cp .env.example .env
# Start development server
npm run dev
```
### Adding New Features
#### Adding New Query Types
1. Implement controller in `src/controllers/queryDataController.ts`
2. Add route in `src/routes/queryRoutes.ts`
3. Update Postman collection
4. Update this README
#### Adding New Document Types
1. Extend `textExtractor` service in `src/services/textExtractor.ts`
2. Update type definitions
3. Test with sample documents
4. Add tests to Postman collection
#### Adding New Services
1. Create service file in `src/services/`
2. Implement proper error handling
3. Add to dependency injection in controllers
4. Document in README architecture section
### Code Style
- TypeScript strict mode
- Async/await pattern
- Proper error handling with try/catch
- Descriptive variable and function names
- JSDoc comments for public methods
### Pull Request Process
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Implement changes with proper testing
4. Update documentation (README, Postman collection)
5. Commit changes (`git commit -m 'Add amazing feature'`)
6. Push to branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
## License
This project is licensed under the ISC License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- **OpenAI** for powerful embeddings and language models
- **Neo4j** for graph database capabilities
- **LangChain** for text processing utilities
- **TypeScript** community for excellent tooling
---
**Repository**: [https://github.com/samowolabi/chatrag-knowledge-graph](https://github.com/samowolabi/chatrag-knowledge-graph)
**Happy querying!**
For support, please check the [troubleshooting section](#troubleshooting) or create an issue in the repository.