
# ChatRAG Knowledge Graph

[![GitHub Repository](https://img.shields.io/badge/GitHub-chatrag--knowledge--graph-blue?logo=github)](https://github.com/samowolabi/chatrag-knowledge-graph)
[![TypeScript](https://img.shields.io/badge/TypeScript-007ACC?logo=typescript&logoColor=white)](https://www.typescriptlang.org/)
[![Neo4j](https://img.shields.io/badge/Neo4j-008CC1?logo=neo4j&logoColor=white)](https://neo4j.com/)
[![OpenAI](https://img.shields.io/badge/OpenAI-412991?logo=openai&logoColor=white)](https://openai.com/)

🧠 **Intelligent RAG system with knowledge graphs.** Processes documents → extracts entities & relationships → stores in Neo4j → enables semantic search & AI-powered Q&A with citations.

A **Retrieval-Augmented Generation (RAG)** system that ingests documents, builds a knowledge graph from them, and answers queries using OpenAI embeddings and a Neo4j graph database.

## 🏗️ Architecture Overview

```
Documents → Ingestion Pipeline → Neo4j Graph → Query Pipeline → AI Responses
    ↓              ↓                  ↓              ↓               ↓
 PDF/TXT     Text Extraction      Chunks +       Semantic        Generated
  Files        + Chunking        Entities +      Search +         Answers
                                 Relations      Graph RAG      with Sources
```

## 📋 Table of Contents

- [Features](#-features)
- [Repository Structure](#-repository-structure)
- [System Architecture](#-system-architecture)
- [Prerequisites](#-prerequisites)
- [Installation](#-installation)
- [Configuration](#-configuration)
- [Data Ingestion Pipeline](#-data-ingestion-pipeline)
- [Query Pipeline](#-query-pipeline)
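- [Quick Start](#-quick-start)
- [Neo4j Graph Schema](#-neo4j-graph-schema)
- [Testing with Postman](#-testing-with-postman)
- [Architecture Details](#-architecture-details)
- [Troubleshooting](#-troubleshooting)
- [Performance Optimization](#-performance-optimization)
- [Contributing](#-contributing)
- [License](#-license)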

## ✨ Features

### 📊 Data Ingestion
- **Multi-format document parsing** (PDF, TXT, DOC)
- **Intelligent text chunking** using LangChain
- **OpenAI embeddings generation** for semantic search
- **Entity and relationship extraction** using AI
- **Neo4j graph storage** for complex relationships

### 🔍 Query System
- **Semantic search** using vector similarity
- **RAG (Retrieval-Augmented Generation)** with source citation
- **Graph traversal** for entity relationships
- **Hybrid search** combining semantic and keyword matching

## 📁 Repository Structure

```
chatrag-knowledge-graph/
├── src/
│   ├── config/
│   │   └── environment.ts              # Environment configuration
│   ├── controllers/
│   │   ├── ingestDataController.ts     # Document ingestion endpoints
│   │   └── queryDataController.ts      # Query and search endpoints
│   ├── routes/
│   │   ├── graphRagRoutes.ts           # Ingestion API routes
│   │   └── queryRoutes.ts              # Query API routes
│   ├── services/
│   │   ├── graphService.ts             # Neo4j graph operations
│   │   ├── langchainService.ts         # Text chunking with LangChain
│   │   ├── neo4jService.ts             # Neo4j database connection
│   │   ├── openaiService.ts            # OpenAI API integration
│   │   └── textExtractor.ts            # Document parsing (PDF/TXT)
│   ├── utils/
│   │   ├── chunkText.ts                # Text processing utilities
│   │   └── jsonToObjectParser.ts       # JSON parsing utilities
│   └── index.ts                        # Express server entry point
├── ChatRAG API.postman_collection.json # Postman test collection
├── package.json                        # Dependencies and scripts
├── tsconfig.json                       # TypeScript configuration
├── nodemon.json                        # Development server config
└── README.md                           # This file
```

## 🏗️ System Architecture

### High-Level Architecture

```mermaid
graph TB
A[Documents] --> B[Text Extractor]
B --> C[LangChain Chunker]
C --> D[OpenAI Embeddings]
D --> E[Neo4j Graph DB]

F[User Query] --> G[OpenAI Embedding]
G --> H[Neo4j Vector Search]
H --> I[Graph Service]
I --> J[OpenAI RAG]
J --> K[AI Response with Citations]

E --> H
```

### Component Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                        Express.js Server                         │
├──────────────────────────────────────────────────────────────────┤
│ Routes Layer                                                     │
│ ┌───────────────────────┐ ┌────────────────────────────────────┐ │
│ │ graphRagRoutes        │ │ queryRoutes                        │ │
│ │ (Ingestion)           │ │ (Search & RAG)                     │ │
│ └───────────────────────┘ └────────────────────────────────────┘ │
├──────────────────────────────────────────────────────────────────┤
│ Controllers Layer                                                │
│ ┌───────────────────────┐ ┌────────────────────────────────────┐ │
│ │ ingestDataController  │ │ queryDataController                │ │
│ │ • extractText         │ │ • semanticSearch                   │ │
│ │ • breakTextIntoChunks │ │ • ragQuery                         │ │
│ │ • generateEmbeddings  │ │ • (future: hybridSearch)           │ │
│ │ • storeInGraph        │ │                                    │ │
│ │ • fullPipeline        │ │                                    │ │
│ └───────────────────────┘ └────────────────────────────────────┘ │
├──────────────────────────────────────────────────────────────────┤
│ Services Layer                                                   │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ textExtractor│ │ langchainSvc │ │ openaiService│ │ graphSvc  │ │
│ │ • parsePDF   │ │ • splitText  │ │ • embedText  │ │ • search  │ │
│ │ • parseTXT   │ │ • chunkText  │ │ • chatCompl  │ │ • store   │ │
│ │ • parseDoc   │ │              │ │ • embedBatch │ │ • entity  │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘ │
│ ┌───────────────┐                                                │
│ │ neo4jService  │                                                │
│ │ • connection  │                                                │
│ │ • queries     │                                                │
│ │ • transactions│                                                │
│ └───────────────┘                                                │
├──────────────────────────────────────────────────────────────────┤
│ Data Layer                                                       │
│ ┌─────────────────────────────────────┐                          │
│ │ Neo4j Graph Database                │                          │
│ │ ┌─────────────┐ ┌─────────────────┐ │                          │
│ │ │ Chunks      │ │ Entities        │ │                          │
│ │ │ • content   │ │ • name/type     │ │                          │
│ │ │ • embedding │ │ • description   │ │                          │
│ │ │ • metadata  │ │ • properties    │ │                          │
│ │ └─────────────┘ └─────────────────┘ │                          │
│ │ Relationships                       │                          │
│ │ • CONTAINS                          │                          │
│ │ • RELATES_TO                        │                          │
│ │ • Custom types                      │                          │
│ └─────────────────────────────────────┘                          │
└──────────────────────────────────────────────────────────────────┘
```

### Service Responsibilities

| Service | Purpose | Key Methods |
|---------|---------|-------------|
| **textExtractor** | Document parsing | `parseDocument()`, `extractMetadata()` |
| **langchainService** | Text processing | `splitText()`, `createChunks()` |
| **openaiService** | AI operations | `embedText()`, `chatCompletion()`, `embedTextsBatch()` |
| **neo4jService** | Database layer | `executeQuery()`, `initialize()`, `testConnection()` |
| **graphService** | Graph operations | `semanticSearchChunks()`, `storeExtractedNodesAndRelationships()` |

## 🛠️ Prerequisites

- **Node.js** 18+
- **Neo4j Database** 4.0+ (with optional GDS library)
- **OpenAI API Key**
- **TypeScript** knowledge

## 📦 Installation

```bash
# Clone the repository
git clone https://github.com/samowolabi/chatrag-knowledge-graph.git
cd chatrag-knowledge-graph

# Install dependencies
npm install

# Set up environment variables
cp .env.example .env
```

## ⚙️ Configuration

Create a `.env` file with the following variables:

```env
# Server Configuration
PORT=3000

# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here

# Neo4j Configuration
NEO4J_URI=neo4j://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_neo4j_password

# Optional: Neo4j AuraDB (cloud)
# NEO4J_URI=neo4j+s://your-instance.databases.neo4j.io
```
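
A minimal sketch of validating these variables at startup. The `AppConfig` shape and the `loadConfig` helper are hypothetical and are not taken from `src/config/environment.ts`; the fallback to port 3000 is also an assumption based on the sample value above.

```typescript
// Hypothetical config shape; the real src/config/environment.ts may differ.
interface AppConfig {
  port: number;
  openaiApiKey: string;
  neo4jUri: string;
  neo4jUsername: string;
  neo4jPassword: string;
}

type Env = Record<string, string | undefined>;

// Read the variables listed above from an env-like object and fail fast
// when a required one is missing or empty.
function loadConfig(env: Env): AppConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (value === undefined || value.trim() === "") {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
  };

  // Assumption: default to 3000 when PORT is not set.
  const port = Number(env["PORT"] ?? "3000");
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`PORT must be a positive integer, got: ${env["PORT"]}`);
  }

  return {
    port,
    openaiApiKey: required("OPENAI_API_KEY"),
    neo4jUri: required("NEO4J_URI"),
    neo4jUsername: required("NEO4J_USERNAME"),
    neo4jPassword: required("NEO4J_PASSWORD"),
  };
}
```

In a Node.js process this could be called once as `loadConfig(process.env)`, so that a missing key surfaces before the first request rather than during it.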

## 📥 Data Ingestion Pipeline

The ingestion system processes documents through multiple stages:

### Pipeline Stages

1. **Text Extraction** → Parse documents into structured text
2. **Text Chunking** → Split content into manageable pieces
3. **Embedding Generation** → Create vector representations
4. **Chunk Storage** → Save chunks with embeddings to Neo4j
5. **Entity Extraction** → Identify entities and relationships using AI
6. **Graph Construction** → Store entities and relationships in Neo4j

### Available Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/graphrag/extract-text` | POST | Parse documents (PDF/TXT/DOC) |
| `/graphrag/break-text` | POST | Split text into chunks |
| `/graphrag/embed-chunks-with-openai` | POST | Generate embeddings |
| `/graphrag/store-chunks-embeddings-graph` | POST | Store chunks in Neo4j |
| `/graphrag/extract-document-nodes-relationships-graph` | POST | Extract entities/relationships |
| `/graphrag/store-extracted-nodes-relationships-graph` | POST | Store graph data |
| `/graphrag/process-document-pipeline` | POST | **Full end-to-end processing** |

### Individual Steps Usage

#### 1. Extract Text from Document
```bash
curl -X POST http://localhost:3000/graphrag/extract-text \
  -H "Content-Type: application/json" \
  -d '{
    "filePath": "/path/to/document.pdf",
    "type": "pdf"
  }'
```

#### 2. Break Text into Chunks
```bash
curl -X POST http://localhost:3000/graphrag/break-text \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your document content here..."
  }'
```

#### 3. Generate Embeddings
```bash
curl -X POST http://localhost:3000/graphrag/embed-chunks-with-openai \
  -H "Content-Type: application/json" \
  -d '{
    "chunks": [
      {"id": "chunk_1", "content": "First chunk content"},
      {"id": "chunk_2", "content": "Second chunk content"}
    ]
  }'
```

#### 4. Store Chunks with Embeddings
```bash
curl -X POST http://localhost:3000/graphrag/store-chunks-embeddings-graph \
  -H "Content-Type: application/json" \
  -d '{
    "chunks": [
      {
        "id": "chunk_1",
        "content": "Content here",
        "embedding": [0.1, 0.2, 0.3, ...]
      }
    ]
  }'
```

#### 5. Full Pipeline (Recommended)
```bash
curl -X POST http://localhost:3000/graphrag/process-document-pipeline \
  -H "Content-Type: application/json" \
  -d '{
    "filePath": "/path/to/document.pdf",
    "type": "pdf"
  }'
```
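
The same full-pipeline request can be assembled from TypeScript. This is a sketch only: `buildPipelineRequest` and `JsonRequest` are hypothetical names, and the `"txt"` and `"doc"` type values are assumptions, since the examples here only show `"pdf"`.

```typescript
// Document types named in this README; only "pdf" appears in the examples,
// so the other two string values are assumptions.
type DocumentType = "pdf" | "txt" | "doc";

interface JsonRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the same request that the full-pipeline curl command sends.
function buildPipelineRequest(
  baseUrl: string,
  filePath: string,
  type: DocumentType
): JsonRequest {
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: `${baseUrl.replace(/\/+$/, "")}/graphrag/process-document-pipeline`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ filePath, type }),
  };
}

// Usage with the global fetch available in Node.js 18+:
//   const { url, ...init } = buildPipelineRequest(
//     "http://localhost:3000", "/path/to/document.pdf", "pdf");
//   const response = await fetch(url, init);
```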

## 🔍 Query Pipeline

The query system provides multiple ways to retrieve and generate responses:

### Available Query Types

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/query/semantic` | POST | Vector similarity search |
| `/query/rag` | POST | AI-generated answers with sources |

### Query Examples

#### Semantic Search
Find similar content using vector embeddings:

```bash
curl -X POST http://localhost:3000/query/semantic \
  -H "Content-Type: application/json" \
  -d '{
    "query": "artificial intelligence applications",
    "limit": 5
  }'
```

**Response:**
```json
{
  "success": true,
  "data": {
    "query": "artificial intelligence applications",
    "results": [
      {
        "id": "chunk_1",
        "content": "AI applications in healthcare include...",
        "similarity": 0.89,
        "score": 0.89
      }
    ],
    "count": 5
  }
}
```

#### RAG Query
Get AI-generated answers with source citations:

```bash
curl -X POST http://localhost:3000/query/rag \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the benefits of machine learning?",
    "limit": 3,
    "includeContext": true
  }'
```

**Response:**
```json
{
  "success": true,
  "data": {
    "query": "What are the benefits of machine learning?",
    "answer": "Based on the provided context, machine learning offers several key benefits [1][2]: automation of complex tasks, pattern recognition in large datasets...",
    "sources": [
      {
        "id": "chunk_1",
        "content": "Machine learning enables automated...",
        "similarity": 0.92,
        "sourceNumber": 1
      }
    ],
    "metadata": {
      "chunksRetrieved": 3,
      "avgSimilarity": 0.87
    }
  }
}
```
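
The `metadata` block can be derived from the `sources` array. The sketch below assumes `avgSimilarity` is the arithmetic mean of the source similarities rounded to two decimals; how the server actually computes it is not stated here, and `summarizeSources` is a hypothetical helper.

```typescript
// Shape of one entry in the "sources" array of the RAG response.
interface RagSource {
  id: string;
  content: string;
  similarity: number;
  sourceNumber: number;
}

interface RagMetadata {
  chunksRetrieved: number;
  avgSimilarity: number;
}

// Assumption: avgSimilarity is the mean similarity, rounded to two decimals.
function summarizeSources(sources: RagSource[]): RagMetadata {
  if (sources.length === 0) {
    return { chunksRetrieved: 0, avgSimilarity: 0 };
  }
  const total = sources.reduce((sum, source) => sum + source.similarity, 0);
  return {
    chunksRetrieved: sources.length,
    avgSimilarity: Math.round((total / sources.length) * 100) / 100,
  };
}
```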

## 🚀 Quick Start

### 1. Start the Server
```bash
npm run dev
```

### 2. Ingest Your First Document
```bash
curl -X POST http://localhost:3000/graphrag/process-document-pipeline \
  -H "Content-Type: application/json" \
  -d '{
    "filePath": "/path/to/your/document.pdf",
    "type": "pdf"
  }'
```

### 3. Query Your Data
```bash
curl -X POST http://localhost:3000/query/rag \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is this document about?",
    "limit": 5
  }'
```

## 📊 Neo4j Graph Schema

### Node Types
- **Chunk**: Document fragments with embeddings
- **Entity**: Extracted entities (PERSON, ORGANIZATION, LOCATION, CONCEPT)

### Relationship Types
- **CONTAINS**: Document contains entities
- **RELATES_TO**: Entity relationships
- **Custom types**: Based on extracted relationships

### Example Cypher Queries
```cypher
// View all chunks
MATCH (c:Chunk) RETURN c LIMIT 10

// Find entities of a specific type
MATCH (e:PERSON) RETURN e.name, e.description

// Explore relationships
MATCH (a)-[r]->(b) RETURN a.name, type(r), b.name LIMIT 10
```

## 🧪 Testing with Postman

### Import Collection
1. **Download**: [`ChatRAG API.postman_collection.json`](./ChatRAG%20API.postman_collection.json)
2. **Import to Postman**: File → Import → Upload the JSON file
3. **Set Environment**: Update `base_url` variable to `http://localhost:3000`

### Available Test Collections

The Postman collection includes comprehensive tests for:

#### **📥 Ingestion Endpoints**
- ✅ Health Check
- ✅ Extract Text from Document
- ✅ Break Text into Chunks
- ✅ Generate Embeddings
- ✅ Store Chunks & Embeddings
- ✅ Extract Entities & Relationships
- ✅ Store Graph Data
- ✅ **Full Pipeline Test** (End-to-end)

#### **🔍 Query Endpoints**
- ✅ Semantic Search
- ✅ RAG Query with Citations
- (Future: Hybrid Search, Entity Search)

### Test Sequence
1. **Health Check** → Verify server is running
2. **Full Pipeline** → Process a sample document
3. **Semantic Search** → Test vector search
4. **RAG Query** → Test AI response generation

### Sample Test Data Included
- Realistic document content examples
- Pre-configured embeddings for testing
- Entity/relationship samples
- Various query examples

## 🔧 Architecture Details

### Request/Response Flow

#### Ingestion Flow
```
1. POST /graphrag/process-document-pipeline
   ↓
2. ingestDataController.processDocumentPipeline()
   ↓
3. textExtractor.parseDocument() → Document object
   ↓
4. langchainService.splitText() → Text chunks
   ↓
5. openaiService.embedTextsBatch() → Vector embeddings
   ↓
6. graphService.storeBatchChunksWithEmbeddings() → Neo4j storage
   ↓
7. graphService.extractEntitiesAndRelationshipsFromText() → AI extraction
   ↓
8. graphService.storeExtractedNodesAndRelationships() → Graph relationships
```

#### Query Flow
```
1. POST /query/rag
   ↓
2. queryDataController.ragQuery()
   ↓
3. openaiService.embedText() → Query embedding
   ↓
4. graphService.semanticSearchChunks() → Similar chunks
   ↓
5. Build context from retrieved chunks
   ↓
6. openaiService.chatCompletion() → AI response
   ↓
7. Return answer with source citations
```
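
Step 5 of the query flow (building context from retrieved chunks) could look roughly like the sketch below. The helper names, the `ChatMessage` shape, and the prompt wording are all illustrative assumptions; the controller's actual prompt is not shown in this README.

```typescript
interface RetrievedChunk {
  id: string;
  content: string;
  similarity: number;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Number each chunk so the model can cite it as [1], [2], ...
function buildContext(chunks: RetrievedChunk[]): string {
  return chunks
    .map((chunk, index) => `[${index + 1}] ${chunk.content.trim()}`)
    .join("\n\n");
}

// Assemble chat messages for the completion call.
// The prompt wording here is illustrative only.
function buildRagMessages(query: string, chunks: RetrievedChunk[]): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "Answer using only the numbered context. Cite sources as [n]. " +
        "Say so if the context does not contain the answer.",
    },
    {
      role: "user",
      content: `Context:\n${buildContext(chunks)}\n\nQuestion: ${query}`,
    },
  ];
}
```

Numbering chunks in retrieval order keeps the `[n]` markers in the answer aligned with the `sourceNumber` values returned alongside it.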

### Core Services

#### TextExtractor Service
- **Purpose**: Multi-format document parsing
- **Formats**: PDF, TXT, DOC files
- **Methods**:
  - `parseDocument(filePath, type)` - Main parsing method
  - `extractMetadata()` - Document metadata extraction
- **Dependencies**: pdf-parse, fs

#### LangChain Service
- **Purpose**: Intelligent text chunking
- **Features**: Configurable chunk size, overlap handling
- **Methods**:
  - `splitText(text)` - Split text into chunks
  - `createChunks()` - Create chunk objects with IDs
- **Dependencies**: LangChain TextSplitter
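
To illustrate what chunk size and overlap mean, here is a plain TypeScript sketch that does not use LangChain. LangChain's splitters are more sophisticated (they prefer natural boundaries such as paragraphs), so this is not the service's implementation; the function name and the default sizes are assumptions.

```typescript
interface TextChunk {
  id: string;
  content: string;
}

// Fixed-size character windows where each window repeats the last
// `overlap` characters of the previous one.
function chunkWithOverlap(
  text: string,
  chunkSize = 1000, // assumed default, not taken from the repository
  overlap = 200 // assumed default, not taken from the repository
): TextChunk[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks: TextChunk[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push({
      id: `chunk_${chunks.length + 1}`,
      content: text.slice(start, start + chunkSize),
    });
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of embedding some text twice.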

#### OpenAI Service
- **Purpose**: AI operations and embeddings
- **Models**: text-embedding-ada-002, gpt-4
- **Methods**:
  - `embedText(text)` - Single text embedding
  - `embedTextsBatch(texts[])` - Batch embedding processing
  - `chatCompletion(messages, options)` - AI response generation
- **Features**: Automatic retry, error handling, batch optimization

#### Neo4j Service
- **Purpose**: Database connection and query execution
- **Features**: Connection pooling, transaction management
- **Methods**:
  - `initialize()` - Database setup
  - `executeQuery(query, params)` - Query execution
  - `testConnection()` - Health check
- **Configuration**: Supports local and AuraDB cloud instances

#### Graph Service
- **Purpose**: High-level graph operations
- **Features**: Vector search, entity management, relationship handling
- **Methods**:
  - `semanticSearchChunks(embedding, limit)` - Vector similarity search
  - `storeExtractedNodesAndRelationships()` - Graph data storage
  - `extractEntitiesAndRelationshipsFromText()` - AI-powered extraction
- **Fallbacks**: Manual cosine similarity if GDS not available

### Data Models

#### Chunk Model
```typescript
interface Chunk {
  id: string;
  content: string;
  embedding: number[];
  metadata: Record<string, unknown>;
}
```

#### Entity Model
```typescript
interface GraphEntity {
  id: string;
  name: string;
  type: string; // PERSON, ORGANIZATION, LOCATION, CONCEPT
  description: string;
  properties: Record<string, unknown>;
}
```

#### Relationship Model
```typescript
interface GraphRelationship {
  id: string;
  source: string;
  target: string;
  type: string; // RELATES_TO, WORKS_FOR, etc.
  description: string;
  properties: Record<string, unknown>;
}
```
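
AI-extracted relationships can point at entities that were never extracted. A small check before storage might look like the sketch below. It assumes that `source` and `target` hold entity `id` values, which this README does not confirm, and `findDanglingRelationships` is a hypothetical helper.

```typescript
// Minimal views of the models above; only the fields used by the check.
interface EntityRef {
  id: string;
}

interface RelationshipRef {
  id: string;
  source: string;
  target: string;
}

// Return relationships whose source or target is not a known entity ID.
// Assumption: source/target reference GraphEntity.id values.
function findDanglingRelationships<R extends RelationshipRef>(
  entities: EntityRef[],
  relationships: R[]
): R[] {
  const knownIds = new Set(entities.map((entity) => entity.id));
  return relationships.filter(
    (rel) => !knownIds.has(rel.source) || !knownIds.has(rel.target)
  );
}
```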

## 🐛 Troubleshooting

### Common Issues

#### Neo4j Connection Failed
```bash
# Check Neo4j status
neo4j status

# Start Neo4j
neo4j start

# Verify credentials in .env file
```

#### OpenAI API Errors
```bash
# Verify API key
echo $OPENAI_API_KEY

# Check API quota and billing
```

#### Vector Search Issues
If GDS functions fail, the system automatically falls back to manual cosine similarity calculation.
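
For reference, a manual cosine similarity ranking can be sketched as follows. This is not the repository's fallback code, which is not shown here; the function names and the `EmbeddedChunk` shape are assumptions.

```typescript
interface EmbeddedChunk {
  id: string;
  embedding: number[];
}

// cos(a, b) = (a · b) / (|a| * |b|); returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Score every chunk against the query and keep the `limit` best matches.
function rankBySimilarity(
  queryEmbedding: number[],
  chunks: EmbeddedChunk[],
  limit: number
): { id: string; similarity: number }[] {
  return chunks
    .map((chunk) => ({
      id: chunk.id,
      similarity: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((left, right) => right.similarity - left.similarity)
    .slice(0, limit);
}
```

A scan like this compares the query against every stored chunk, so it scales linearly with the number of chunks and is best treated as a fallback rather than the primary search path.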

#### Memory Issues with Large Documents
- Reduce chunk size in LangChain configuration
- Process documents in smaller batches
- Increase Node.js memory limit: `node --max-old-space-size=4096`

### Debug Mode
```bash
# Enable detailed logging
DEBUG=* npm run dev
```

### Health Check
```bash
curl http://localhost:3000/
```

## 📈 Performance Optimization

### Neo4j Optimization
```cypher
// Create vector index for faster similarity search (needs a Neo4j 5.x release with vector index support)
CREATE VECTOR INDEX chunk_embeddings FOR (c:Chunk) ON (c.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}}

// Create text index for keyword search
CREATE TEXT INDEX chunk_content FOR (c:Chunk) ON (c.content)
```

### Batch Processing
- Process documents in batches of 10-50 chunks
- Use Promise.all for parallel embedding generation
- Implement pagination for large query results
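
The batching advice above can be sketched as below. `EmbedBatch` is a hypothetical function type standing in for a batch call such as `embedTextsBatch()`, and the default batch size and concurrency are assumptions rather than values from the repository.

```typescript
// Split items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("Batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical signature: takes texts, resolves to one vector per text.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Run `concurrency` batches at a time with Promise.all, then start the
// next group. Promise.all preserves order, so results line up with `texts`.
async function embedInBatches(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize = 50,
  concurrency = 2
): Promise<number[][]> {
  const results: number[][] = [];
  for (const group of toBatches(toBatches(texts, batchSize), concurrency)) {
    const embedded = await Promise.all(group.map((batch) => embedBatch(batch)));
    for (const vectors of embedded) {
      results.push(...vectors);
    }
  }
  return results;
}
```

Capping the number of batches in flight, rather than passing every batch to a single `Promise.all`, is one way to stay under API rate limits on large documents.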

## 🤝 Contributing

### Development Setup
```bash
# Clone the repository
git clone https://github.com/samowolabi/chatrag-knowledge-graph.git
cd chatrag-knowledge-graph

# Install dependencies
npm install

# Set up environment
cp .env.example .env

# Start development server
npm run dev
```

### Adding New Features

#### Adding New Query Types
1. Implement controller in `src/controllers/queryDataController.ts`
2. Add route in `src/routes/queryRoutes.ts`
3. Update Postman collection
4. Update this README

#### Adding New Document Types
1. Extend `textExtractor` service in `src/services/textExtractor.ts`
2. Update type definitions
3. Test with sample documents
4. Add tests to Postman collection

#### Adding New Services
1. Create service file in `src/services/`
2. Implement proper error handling
3. Add to dependency injection in controllers
4. Document in README architecture section

### Code Style
- TypeScript strict mode
- Async/await pattern
- Proper error handling with try/catch
- Descriptive variable and function names
- JSDoc comments for public methods

### Pull Request Process
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Implement changes with proper testing
4. Update documentation (README, Postman collection)
5. Commit changes (`git commit -m 'Add amazing feature'`)
6. Push to branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request

## 📄 License

This project is licensed under the ISC License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **OpenAI** for powerful embeddings and language models
- **Neo4j** for graph database capabilities
- **LangChain** for text processing utilities
- **TypeScript** community for excellent tooling

---

**Repository**: [https://github.com/samowolabi/chatrag-knowledge-graph](https://github.com/samowolabi/chatrag-knowledge-graph)

**Happy querying! 🚀**

For support, please check the [troubleshooting section](#-troubleshooting) or create an issue in the repository.