An open API service indexing awesome lists of open source software.

https://github.com/ksm26/dr-x-nlp-pipeline

A fully offline NLP pipeline for extracting, chunking, embedding, querying, summarizing, and translating research documents using local LLMs. Inspired by the fictional mystery of Dr. X, the system supports multi-format files, local RAG-based Q&A, Arabic translation, and ROUGE-based summarization โ€” all without cloud dependencies.
https://github.com/ksm26/dr-x-nlp-pipeline

chromadb document-analysis llm local-llm modular-ai multilingual-ai-model nlp offlineai ollama opensource-ai rag textsummarization

Last synced: 6 months ago
JSON representation

A fully offline NLP pipeline for extracting, chunking, embedding, querying, summarizing, and translating research documents using local LLMs. Inspired by the fictional mystery of Dr. X, the system supports multi-format files, local RAG-based Q&A, Arabic translation, and ROUGE-based summarization โ€” all without cloud dependencies.

Awesome Lists containing this project

README

          

# ๐Ÿง  The Enigmatic Research of Dr. X โ€” NLP Pipeline (Local LLMs)

This project is a full-featured NLP pipeline designed to analyze the mysterious research documents left behind by **Dr. X**, a fictional scientist who vanished under mysterious circumstances. The goal is to extract, summarize, understand, and translate his research using **local, offline NLP tools** โ€” no internet or cloud APIs required.

---

## ๐Ÿš€ Features

- โœ… Multi-format file ingestion (`.pdf`, `.docx`, `.csv`, `.xlsx`, `.xls`, `.xlsm`)
- โœ… Token-based chunking with metadata (filename, page, chunk number)
- โœ… Local vector search using `ChromaDB`
- โœ… RAG Q&A system powered by **local LLaMA (via Ollama)**
- โœ… Automatic translation of English answers to **Arabic**
- โœ… Local summarization of full documents
- โœ… ROUGE metric evaluation
- โœ… Performance logging (tokens/sec for all major components)
- โœ… Fully modular & offline

---

## ๐Ÿงฑ Architecture

โ”œโ”€โ”€ file_reader.py # Extracts text & tables from all formats \
โ”œโ”€โ”€ chunker.py # Tokenizes and chunks text with cl100k_base \
โ”œโ”€โ”€ embedding_pipeline.py # Embeds chunks and stores in ChromaDB \
โ”œโ”€โ”€ rag_qa_system.py # Runs Q&A retrieval + local LLaMA generation \
โ”œโ”€โ”€ translation_utils.py # Translates answers to Arabic (offline) \
โ”œโ”€โ”€ summarizer.py # Summarizes files + evaluates with ROUGE \
โ””โ”€โ”€ requirements.txt # All dependencies\
๐Ÿ“ files/ \
โ””โ”€โ”€ All input files (.pdf, .docx, .csv, etc.)

---

## ๐Ÿง  Tech Stack

| Component | Tool/Library |
|------------------|--------------------------------------|
| **LLM (local)** | `Ollama` (e.g. `llama2`, `tinyllama`) |
| **Embedding** | `sentence-transformers` (`MiniLM`) |
| **Vector DB** | `ChromaDB (PersistentClient)` |
| **Translation** | `argos-translate` (EN โž AR) |
| **Summarization** | `Falconsai/text_summarization` |
| **Metrics** | `tiktoken`, `rouge-score`, `time` |

---

## ๐Ÿ’ก How It Works

1. **Extract** text + tables from PDFs, Word, and Excel files.
2. **Chunk** the text based on tokens (cl100k_base).
3. **Embed** chunks using MiniLM and store in a local ChromaDB.
4. **Ask Questions** via a CLI โ€” the system retrieves relevant chunks and generates an answer using LLaMA.
5. **Translate** the answer into Arabic.
6. **Summarize** full documents and measure summary quality with ROUGE.

---

## ๐Ÿงช Example: CLI Output

```bash
โ“ Ask a question about Dr. X's documents:
> What was his last known research?

๐Ÿ’ฌ English Answer:
Dr. Xโ€™s final study focused on zero-point energy manipulation using ancient resonance systems.

๐Ÿ—ฃ๏ธ Arabic Translation:
ุฑูƒุฒุช ุงู„ุฏุฑุงุณุฉ ุงู„ุฃุฎูŠุฑุฉ ู„ู„ุฏูƒุชูˆุฑ ุฅูƒุณ ุนู„ู‰ ุงู„ุชู„ุงุนุจ ุจุทุงู‚ุฉ ุงู„ู†ู‚ุทุฉ ุงู„ุตูุฑูŠุฉ ุจุงุณุชุฎุฏุงู… ุฃู†ุธู…ุฉ ุงู„ุฑู†ูŠู† ุงู„ู‚ุฏูŠู…ุฉ.
```

## ๐Ÿ“Š Performance Metrics

| Task | Tokens | Time | TPS |
|---------------|--------|----------|-----------|
| Embedding | 1,200 | 1.8 sec | ~666 TPS |
| RAG Generation| 620 | 1.2 sec | ~516 TPS |
| Summarization | 1,500 | 3.0 sec | ~500 TPS |

## Supported Formats
- โœ… PDF (.pdf) \
- โœ… Word (.docx) \
- โœ… Excel (.xlsx, .xls, .xlsm) \
- โœ… CSV (.csv) \
- โœ… Multi-sheet support with pandas \

## ๐Ÿ› ๏ธ Setup Instructions

### Install Requirements
```bash
pip install -r requirements.txt
```

### Setup Ollama
- install Ollama: https://ollama.com/download
```bash
ollama pull tinyllama
```

### Run Embedding
```bash
python embedding_pipeline.py
```

### Ask Questions (RAG + Arabic)
```bash
python rag_qa_system.py
```

### Summarize a Document
```bash
python summarizer.py
```

## โœ… Evaluation Criteria Coverage
- โœ… Executes correctly across all modules
- โœ… Efficient + logs tokens/sec
- โœ… Translates and summarizes with high fluency
- โœ… Handles all required file formats
- โœ… Uses appropriate local LLMs and vector DB
- โœ… Clean code, modular design, creative solution