https://github.com/zwh20081/bookdatamaker

A powerful CLI tool for extracting text from documents using DeepSeek OCR and generating high-quality datasets with LLM assistance.

dataset-generation knowledge-extraction llm-pipeline python-cli self-hosted-ocr

Host: GitHub
URL: https://github.com/zwh20081/bookdatamaker
Owner: zwh20081
License: mit
Created: 2025-11-13T04:45:21.000Z (7 months ago)
Default Branch: main
Last Pushed: 2025-11-21T04:09:16.000Z (6 months ago)
Last Synced: 2025-11-21T05:22:02.534Z (6 months ago)
Topics: dataset-generation, knowledge-extraction, llm-pipeline, python-cli, self-hosted-ocr
Language: Python
Homepage:
Size: 140 KB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Book Data Maker

A powerful CLI tool for extracting text from documents using DeepSeek OCR and generating high-quality datasets with LLM assistance.

## Table of Contents

### 🚀 Getting Started
- [Features](#features)
- [Quick Start](#quick-start)
- [Installation](#installation)

### 📖 User Guide
- [Extract Text (Stage 1)](#extract-text-stage-1)
- [Generate Dataset (Stage 2)](#generate-dataset-stage-2)
- [Export Dataset](#export-dataset)

### 🔧 Advanced
- [Position Distribution](#position-distribution)
- [Performance Tuning](#performance-tuning)
- [Interactive Chat](#interactive-chat)

---

## Features

- 📄 **Multi-Format Support**: PDF, EPUB, and images
- 🏠 **Self-Hosted OCR**: Local transformers for DeepSeek-OCR (no API costs)
- 🤖 **Parallel Generation**: Multiple LLM threads explore documents simultaneously
- 🎯 **Smart Distribution**: Control thread starting positions
- 💾 **SQLite Storage**: Real-time dataset storage with flexible export
- 📊 **Multiple Formats**: JSONL, Parquet, CSV, JSON
- 🌐 **Flexible Modes**: API or self-hosted for both stages
- 📈 **Progress Tracking**: Real-time progress bars

## Installation

### From PyPI (Recommended)

```bash
pip install bookdatamaker
```

### From Source

```bash
git clone https://github.com/zwh20081/bookdatamaker.git
cd bookdatamaker
pip install -r requirements.txt
pip install -e .
```

### Optional: Local Inference Support

```bash
# For self-hosted OCR and LLM generation
pip install bookdatamaker[local] # From PyPI
# OR
pip install -e ".[local]" # From source - installs transformers==4.46.3, torch, flash-attn, etc.
```

**Note**: The project requires `transformers==4.46.3` for optimal compatibility with DeepSeek-OCR. A warning will be displayed if a different version is detected.
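A check like the one described in the note could be sketched as follows. This is an illustration only: the function name `check_transformers_version` and the warning texts are assumptions, not the project's actual code.

```python
import warnings
from importlib import metadata
from typing import Optional

REQUIRED_TRANSFORMERS = "4.46.3"  # version pinned by this README


def check_transformers_version(installed: Optional[str] = None) -> bool:
    """Return True if the transformers version matches the pinned one.

    `installed` may be passed explicitly (handy for testing); otherwise the
    version is looked up in the current environment.
    """
    if installed is None:
        try:
            installed = metadata.version("transformers")
        except metadata.PackageNotFoundError:
            warnings.warn("transformers is not installed; local mode is unavailable")
            return False
    if installed != REQUIRED_TRANSFORMERS:
        warnings.warn(
            f"transformers=={installed} detected; "
            f"{REQUIRED_TRANSFORMERS} is recommended for DeepSeek-OCR"
        )
        return False
    return True
```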

### System Requirements

**For API Mode:**
- Python 3.10+
- API keys (OpenAI, DeepSeek, etc.)

**For Local Mode:**
- Python 3.10-3.12 (3.13 not supported due to vLLM compatibility)
- NVIDIA GPU with CUDA support (or CPU, though slower)
- 16GB+ VRAM recommended for GPU
- transformers==4.46.3
- Linux or WSL2 (recommended)

---

## Quick Start

### Prerequisites

```bash
# Set API keys (choose one based on your mode)
export OPENAI_API_KEY=your_openai_key # For API mode
export DEEPSEEK_API_KEY=your_deepseek_key # For API OCR mode
```

### Option 1: API Mode (Fastest Setup)

```bash
# 1. Install
pip install bookdatamaker

# 2. Extract → Generate → Export
bookdatamaker extract book.pdf -o ./extracted
bookdatamaker generate ./extracted -d dataset.db --distribution "10,10,20,30,20,10"
bookdatamaker export-dataset dataset.db -o output.parquet
```

### Option 2: Self-Hosted Mode (Free, Private)

```bash
# 1. Install with local dependencies
pip install bookdatamaker[local]

# 2. Extract with local OCR
bookdatamaker extract book.pdf --mode local --batch-size 8 -o ./extracted

# 3. Generate with vLLM
bookdatamaker generate ./extracted \
--mode vllm \
--vllm-model-path meta-llama/Llama-3-8B-Instruct \
--distribution "25,25,25,25" \
-d dataset.db

# 4. Export
bookdatamaker export-dataset dataset.db -o output.parquet
```

---


## Extract Text (Stage 1)

Extract text from documents using DeepSeek OCR.

### Supported Formats

- **PDF**: Text extraction or OCR from rendered pages
- **EPUB**: E-book text extraction
- **Images**: JPG, PNG, BMP, TIFF, WebP

### API Mode

**Note**: DeepSeek does not provide an official OCR API. You need to self-host DeepSeek-OCR using vLLM.

#### Setup vLLM OCR Server

Follow the [vLLM DeepSeek-OCR recipe](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-OCR.html) to set up your server.

#### Use the API

Once your vLLM server is running:

```bash
# Basic usage (default: http://localhost:8000/v1)
bookdatamaker extract book.pdf -o ./extracted

# Custom vLLM endpoint
bookdatamaker extract book.pdf \
--deepseek-api-url http://your-server:8000/v1 \
-o ./extracted

# Adjust concurrency for faster processing
bookdatamaker extract book.pdf \
--api-concurrency 8 \
-o ./extracted
```

**Performance Options:**
- `--api-concurrency N`: Number of concurrent API requests (default: 4)
  - Higher values = faster processing (if your server can handle it)
  - Adjust based on your vLLM server capacity and network bandwidth
  - Example: 8-16 for powerful servers, 2-4 for smaller setups
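To make the effect of a concurrency limit concrete, here is a minimal sketch of bounded concurrency with `asyncio.Semaphore`. It is not the tool's implementation: `run_with_concurrency` and `fake_ocr` are hypothetical names, and `fake_ocr` merely stands in for a real request to the OCR server.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_concurrency(
    pages: List[str],
    ocr_page: Callable[[str], Awaitable[str]],
    concurrency: int = 4,
) -> List[str]:
    """Run `ocr_page` over all pages with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(page: str) -> str:
        async with semaphore:  # waits while `concurrency` calls are active
            return await ocr_page(page)

    # gather() returns results in the same order as the input pages
    return await asyncio.gather(*(worker(page) for page in pages))


async def fake_ocr(page: str) -> str:
    """Hypothetical stand-in for a request to the OCR server."""
    await asyncio.sleep(0.01)
    return f"text of {page}"


if __name__ == "__main__":
    print(asyncio.run(run_with_concurrency(["p1", "p2", "p3"], fake_ocr, concurrency=2)))
```

Raising the limit only helps while the server still has spare capacity; beyond that point requests simply queue on the server side.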

### Local Mode (Transformers)

Use a local Transformers model for OCR (DeepSeek-OCR, no API calls):

```bash
# Basic usage - uses transformers AutoModel with flash_attention_2
bookdatamaker extract book.pdf --mode local -o ./extracted

# With custom batch size (adjust based on GPU memory)
bookdatamaker extract book.pdf --mode local --batch-size 12 -o ./extracted

# Use CPU instead of GPU
bookdatamaker extract book.pdf --mode local --device cpu -o ./extracted

# Use specific GPU
bookdatamaker extract book.pdf --mode local --device cuda:1 -o ./extracted

# Process directory of images
bookdatamaker extract ./images/ --mode local -o ./extracted
```

**Performance Options:**
- `--batch-size N`: Number of images to process in parallel (default: 8)
  - Higher values = faster processing but more GPU memory
  - Adjust based on available VRAM
  - Example: 4 for 8GB VRAM, 8-16 for 24GB+ VRAM

**Device Options:**
- `cuda` (default): Use default CUDA GPU
- `cuda:0`, `cuda:1`, etc.: Use specific GPU
- `cpu`: Use CPU (slower, no GPU required)
- `xpu`: Use Intel XPU

### Plain Text Mode (No OCR)

For PDFs with embedded text, skip OCR and extract the text directly (much faster):

```bash
# Extract plain text from PDF without OCR
bookdatamaker extract book.pdf --plain-text -o ./extracted
```

**Note**: EPUB files are **automatically extracted as plain text** (no OCR needed, no `--plain-text` flag required):

```bash
# EPUB always uses plain text extraction
bookdatamaker extract book.epub -o ./extracted
```

**When to use `--plain-text` (for PDF):**
- ✅ PDF with embedded text (e.g., born-digital documents)
- ✅ Fast extraction without GPU/API requirements
- ✅ Text-only documents

**When NOT to use `--plain-text`:**
- ❌ Scanned PDFs (images of text)
- ❌ PDFs with complex layouts requiring OCR
- ❌ Documents where text extraction quality is poor

### Output Structure

```
./extracted/
├── page_001/
│ ├── page_001.png # Page image
│ └── result.mmd # Extracted text in markdown
├── page_002/
│ ├── page_002.png
│ └── result.mmd
└── ...
```

**Note**: Each page is stored in its own subdirectory with the extracted text in `result.mmd` format.
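Given this layout, the extracted text can be loaded back with a few lines of Python. The helper below is a sketch that relies only on the `page_NNN/result.mmd` structure shown above; the name `load_pages` is an assumption and not part of the package.

```python
from pathlib import Path
from typing import Dict


def load_pages(extracted_dir: str) -> Dict[str, str]:
    """Map each page directory name (e.g. 'page_001') to its extracted text."""
    pages: Dict[str, str] = {}
    # sorted() keeps zero-padded names such as page_001, page_002 in order
    for page_dir in sorted(Path(extracted_dir).glob("page_*")):
        result_file = page_dir / "result.mmd"
        if result_file.is_file():
            pages[page_dir.name] = result_file.read_text(encoding="utf-8")
    return pages
```

For example, `load_pages("./extracted")["page_001"]` would return the Markdown text of the first page.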

---

## Generate Dataset (Stage 2)

Generate Q&A datasets using parallel LLM threads with **page-based navigation**.

### Navigation Model

The system uses **page navigation**:
- LLM threads navigate through document pages
- Tools available: `get_current_page`, `next_page`, `previous_page`, `jump_to_page`, `get_page_context`
- Each thread starts at a specific page based on distribution
- Threads can move forward/backward through pages to explore content
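The navigation tools can be pictured as a cursor over the list of pages. The class below is a hypothetical in-memory sketch, not the project's MCP server: the method names mirror the tool names above, pages are assumed to be 1-based, and `get_page_context` is assumed to return the current page plus its neighbours.

```python
from typing import List


class PageNavigator:
    """Hypothetical cursor over a list of page texts (1-based page numbers)."""

    def __init__(self, pages: List[str], start_page: int = 1) -> None:
        if not pages:
            raise ValueError("pages must not be empty")
        self.pages = pages
        self.current = self._clamp(start_page)

    def _clamp(self, page: int) -> int:
        # Keep the cursor inside [1, number of pages]
        return max(1, min(page, len(self.pages)))

    def get_current_page(self) -> str:
        return self.pages[self.current - 1]

    def next_page(self) -> str:
        self.current = self._clamp(self.current + 1)
        return self.get_current_page()

    def previous_page(self) -> str:
        self.current = self._clamp(self.current - 1)
        return self.get_current_page()

    def jump_to_page(self, page: int) -> str:
        self.current = self._clamp(page)
        return self.get_current_page()

    def get_page_context(self, radius: int = 1) -> List[str]:
        # Assumed meaning: current page plus up to `radius` neighbours per side
        lo = max(1, self.current - radius)
        hi = min(len(self.pages), self.current + radius)
        return self.pages[lo - 1 : hi]
```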

### Checkpoint & Resume

The generation process **automatically saves checkpoints** to the database:
- Thread state is saved after each successful Q&A submission
- If interrupted (Ctrl+C, crash, etc.), simply rerun the same command
- You'll be prompted to resume from checkpoint or start fresh

```bash
# First run (interrupted at 50%)
bookdatamaker generate ./extracted -d dataset.db --distribution "25,25,25,25"
# ^C (interrupted)

# Resume from checkpoint
bookdatamaker generate ./extracted -d dataset.db --distribution "25,25,25,25"
# ⚠️ Found 4 incomplete thread(s) in database:
# Thread 0: 8/20 pairs, last updated 2024-01-15 10:30:45
# Thread 1: 10/20 pairs, last updated 2024-01-15 10:30:48
# Thread 2: 12/20 pairs, last updated 2024-01-15 10:30:50
# Thread 3: 7/20 pairs, last updated 2024-01-15 10:30:43
#
# Do you want to resume from checkpoint? [Y/n]: y
# ✓ Resuming from checkpoint...
```

**Features:**
- 💾 Automatic checkpoint after each Q&A pair submission
- 🔄 Resume from last position in document
- 💬 Preserves conversation history
- 🎯 Tracks progress per thread
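Per-thread checkpointing of this kind can be sketched with the standard `sqlite3` module. The table name `thread_checkpoints`, its columns, and the helper functions are illustrative assumptions; the schema the tool actually uses is not documented here.

```python
import json
import sqlite3
from typing import List, Optional, Tuple


def init_checkpoints(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) checkpoint table if it does not exist."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS thread_checkpoints (
               thread_id INTEGER PRIMARY KEY,
               current_page INTEGER NOT NULL,
               pairs_done INTEGER NOT NULL,
               messages TEXT NOT NULL
           )"""
    )


def save_checkpoint(
    conn: sqlite3.Connection,
    thread_id: int,
    current_page: int,
    pairs_done: int,
    messages: List[dict],
) -> None:
    """Store the latest state of one thread, overwriting any earlier state."""
    conn.execute(
        "INSERT OR REPLACE INTO thread_checkpoints VALUES (?, ?, ?, ?)",
        (thread_id, current_page, pairs_done, json.dumps(messages)),
    )
    conn.commit()


def load_checkpoint(
    conn: sqlite3.Connection, thread_id: int
) -> Optional[Tuple[int, int, List[dict]]]:
    """Return (current_page, pairs_done, messages), or None if nothing was saved."""
    row = conn.execute(
        "SELECT current_page, pairs_done, messages "
        "FROM thread_checkpoints WHERE thread_id = ?",
        (thread_id,),
    ).fetchone()
    if row is None:
        return None
    return row[0], row[1], json.loads(row[2])
```

Calling `save_checkpoint` after every accepted Q&A pair means an interrupted run loses at most the pair that was in progress.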

### Basic Usage

```bash
# 6 threads (from distribution), 20 Q&A pairs per thread
bookdatamaker generate ./extracted \
-d dataset.db \
--distribution "10,10,20,30,20,10" \
--datasets-per-thread 20
```

**Key Concept**: Thread count is determined by the number of comma-separated values in `--distribution`.

### API Mode Examples

```bash
# OpenAI/Azure
bookdatamaker generate ./extracted \
-d dataset.db \
--openai-api-url https://api.openai.com/v1 \
--model gpt-4 \
--distribution "10,10,20,30,20,10"

# Custom API endpoint
bookdatamaker generate ./extracted \
--openai-api-url http://localhost:8000/v1 \
--model your-model-name \
--distribution "25,25,25,25"
```

### vLLM Direct Mode (Self-Hosted)

Use vLLM directly, without an API server:

```bash
# Single GPU
bookdatamaker generate ./extracted \
--mode vllm \
--vllm-model-path meta-llama/Llama-3-8B-Instruct \
--distribution "25,25,25,25" \
-d dataset.db

# Multi-GPU (4 GPUs, 6 threads)
bookdatamaker generate ./extracted \
--mode vllm \
--vllm-model-path meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--distribution "10,10,20,30,20,10" \
-d dataset.db
```

### Custom Prompts

Add specific instructions to guide LLM behavior:

```bash
# Language specification
bookdatamaker generate ./extracted \
--custom-prompt "Generate all Q&A in Chinese with simplified characters"

# Format specification
bookdatamaker generate ./extracted \
--custom-prompt "Questions should be multiple-choice with 4 options"

# Multiple requirements
bookdatamaker generate ./extracted \
--custom-prompt "Requirements:
1. Generate questions in English
2. Focus on practical applications
3. Include code examples
4. Answer length: 50-150 words
5. Difficulty: intermediate"
```

### Message History Management

Control conversation history to prevent token overflow:

```bash
# Limit conversation to 50 messages (keeps system prompt + last 10 when exceeded)
bookdatamaker generate ./extracted \
--max-messages 50 \
-d dataset.db

# For models with limited context windows
bookdatamaker generate ./extracted \
--max-messages 30 \
--model gpt-3.5-turbo
```

**How it works:**
- When message count exceeds `--max-messages`, history is pruned automatically
- System prompt is always preserved
- Last 10 messages are kept for continuity
- Prevents token overflow errors during long generation sessions
- Useful for models with limited context windows (e.g., 4K, 8K tokens)
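The pruning rule described above can be sketched as a small function. The name `prune_history` and the message shape (dicts with `role` and `content` keys, as in OpenAI-style chat APIs) are assumptions made for illustration.

```python
from typing import Dict, List

Message = Dict[str, str]


def prune_history(
    messages: List[Message], max_messages: int = 50, keep_last: int = 10
) -> List[Message]:
    """Return the history unchanged, or the system prompt plus the last `keep_last` messages."""
    if len(messages) <= max_messages:
        return messages
    # The system prompt, if present, is assumed to be the first message
    system = [m for m in messages[:1] if m.get("role") == "system"]
    tail = messages[len(system):][-keep_last:]
    return system + tail
```

With 61 messages and `max_messages=50`, the result has 11 entries: the system prompt followed by the 10 most recent messages.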

---

## Export Dataset

Export from SQLite database to your preferred format:

```bash
# Parquet (recommended for data analysis, default: zstd compression)
bookdatamaker export-dataset dataset.db -o output.parquet

# Parquet with different compression methods
bookdatamaker export-dataset dataset.db -o output.parquet -c snappy # Faster, larger files
bookdatamaker export-dataset dataset.db -o output.parquet -c gzip # Smaller, slower
bookdatamaker export-dataset dataset.db -o output.parquet -c brotli # Best compression
bookdatamaker export-dataset dataset.db -o output.parquet -c none # No compression

# JSON Lines (easy to stream)
bookdatamaker export-dataset dataset.db -o output.jsonl -f jsonl

# CSV (Excel-friendly)
bookdatamaker export-dataset dataset.db -o output.csv -f csv

# JSON with metadata
bookdatamaker export-dataset dataset.db -o output.json -f json --include-metadata
```
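An exported JSONL file can be consumed line by line from Python. The reader below is a generic sketch that makes no assumption about field names, since the record schema is not described here; `iter_jsonl` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `next(iter_jsonl("output.jsonl")).keys()` shows which fields the export contains.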

### Compression Comparison

**For Parquet files:**

| Method | Speed | Size | Use Case |
|--------|-------|------|----------|
| `zstd` (default) | Fast | Small | Best balance, recommended |
| `snappy` | Fastest | Larger | Real-time processing |
| `gzip` | Medium | Smaller | Network transfer |
| `brotli` | Slowest | Smallest | Archival storage |
| `none` | Instant | Largest | Debug/testing only |

## Position Distribution

Control where threads start in the document using distribution percentages.

### How It Works

```
Document: 100 pages
Distribution: "10,10,20,30,20,10" (6 threads)

Thread 0: Start at 0% → Page 1
Thread 1: Start at 10% → Page 10
Thread 2: Start at 20% → Page 20
Thread 3: Start at 40% → Page 40
Thread 4: Start at 70% → Page 70
Thread 5: Start at 90% → Page 90
```

### Distribution Strategies

```bash
# Even distribution (4 threads)
--distribution "25,25,25,25"
# Start at: 0%, 25%, 50%, 75%

# Front-heavy (4 threads) - focus on beginning
--distribution "40,30,20,10"
# Start at: 0%, 40%, 70%, 90%

# Middle-heavy (5 threads) - focus on middle
--distribution "10,20,40,20,10"
# Start at: 0%, 10%, 30%, 70%, 90%

# Dense sampling (10 threads) - fine-grained coverage
--distribution "10,10,10,10,10,10,10,10,10,10"
```
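The "Start at" values in these strategies follow a cumulative rule: each thread starts where the shares of the preceding threads end. A minimal sketch of that computation follows. The function names are hypothetical, and the mapping from percentage to page number (integer division, with page 1 as the minimum) is an assumption.

```python
from typing import List


def start_percentages(distribution: str) -> List[int]:
    """Cumulative start offsets, one per comma-separated share (one per thread)."""
    shares = [int(part) for part in distribution.split(",")]
    starts, running = [], 0
    for share in shares:
        starts.append(running)  # a thread starts where earlier shares end
        running += share
    return starts


def start_pages(distribution: str, total_pages: int) -> List[int]:
    """Map start percentages to 1-based pages (the rounding rule is an assumption)."""
    return [max(1, pct * total_pages // 100) for pct in start_percentages(distribution)]


print(start_percentages("25,25,25,25"))     # [0, 25, 50, 75]
print(start_percentages("40,30,20,10"))     # [0, 40, 70, 90]
print(start_percentages("10,20,40,20,10"))  # [0, 10, 30, 70, 90]
print(start_pages("25,25,25,25", 100))      # [1, 25, 50, 75]
```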

### Thread Count Guidelines

- **Small documents** (<50 pages): 2-4 threads
- **Medium documents** (50-200 pages): 4-8 threads
- **Large documents** (>200 pages): 8-16 threads

---

## Performance Tuning

Optimize extraction and generation speeds based on your hardware and requirements.

### Stage 1: OCR Extraction

**API Mode (vLLM):**
```bash
# Increase concurrent requests (default: 4)
bookdatamaker extract book.pdf --api-concurrency 8

# Guidelines:
# - 2-4: Small vLLM server (1-2 GPUs)
# - 4-8: Medium server (2-4 GPUs)
# - 8-16: Large server (4+ GPUs)
# - Monitor server load and adjust accordingly
```

**Local Mode (Transformers):**
```bash
# Increase batch size (default: 8)
bookdatamaker extract book.pdf --mode local --batch-size 16

# Guidelines based on GPU VRAM:
# - 8GB VRAM: batch-size 2-4
# - 16GB VRAM: batch-size 4-8
# - 24GB VRAM: batch-size 8-12
# - 40GB+ VRAM: batch-size 12-16
```

### Stage 2: Dataset Generation

**Thread Count:**
```bash
# More threads = faster generation (if LLM server can handle it)
bookdatamaker generate ./extracted \
--distribution "10,10,10,10,10,10,10,10,10,10" \
--threads 10

# Guidelines:
# - API mode: 4-16 threads (based on rate limits)
# - vLLM mode: 4-8 threads (based on GPU capacity)
# - Local mode: 2-4 threads (memory intensive)
```

**Message History Management:**
```bash
# Limit conversation history to prevent memory issues
bookdatamaker generate ./extracted \
--max-messages 20 \
-d dataset.db

# Default: 20 messages (system message + last 10 exchanges)
# Lower values = less memory, potentially less context
# Higher values = more memory, better context retention
```

**Duplicate Detection:**
- Automatically enabled with 95% similarity threshold
- Uses rapidfuzz for efficient fuzzy matching
- Prevents redundant Q&A pairs in the dataset
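The idea behind threshold-based duplicate detection can be sketched with the standard library. Note the substitution: this example uses `difflib.SequenceMatcher` as a dependency-free stand-in for rapidfuzz, so scores may differ slightly from what the tool computes. The function name `is_duplicate` and the lowercase/strip normalization are assumptions as well.

```python
from difflib import SequenceMatcher
from typing import Iterable


def is_duplicate(candidate: str, existing: Iterable[str], threshold: float = 95.0) -> bool:
    """True if `candidate` is at least `threshold` percent similar to any existing question."""
    normalized = candidate.strip().lower()
    for question in existing:
        # ratio() is in [0, 1]; scale to a 0-100 score to match the threshold
        score = 100.0 * SequenceMatcher(None, normalized, question.strip().lower()).ratio()
        if score >= threshold:
            return True
    return False
```

With rapidfuzz installed, `rapidfuzz.fuzz.ratio(a, b)` also returns a score on a 0-100 scale and could replace the `SequenceMatcher` call.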

### Performance Tips

1. **Start Small**: Test with small concurrency/batch sizes first
2. **Monitor Resources**: Watch GPU memory, CPU usage, and network
3. **Balance Quality vs Speed**: Higher concurrency may reduce quality
4. **Network Bandwidth**: API mode performance depends on network speed
5. **vLLM Configuration**: Use tensor parallelism for multi-GPU setups

---

## Interactive Chat

Chat with an LLM that can access your document through MCP tools. Perfect for exploring documents interactively or testing Q&A generation.

### Start Chat Session

```bash
# Basic chat with GPT-4
bookdatamaker chat ./extracted

# With vLLM server
bookdatamaker chat ./extracted \
--openai-api-url http://localhost:8000/v1 \
--model Qwen/Qwen3-4B-Thinking-2507

# With custom database
bookdatamaker chat ./extracted --db my_dataset.db
```

### Debug Mode

Set the `LOG_LEVEL` environment variable for verbose logging:

```bash
export LOG_LEVEL=DEBUG
bookdatamaker generate ./extracted -d dataset.db
```

---

## Development

### Project Structure

```
bookdatamaker/
├── src/bookdatamaker/
│ ├── cli.py # CLI interface
│ ├── ocr/
│ │ ├── extractor.py # OCR extraction
│ │ └── document_parser.py # Document parsing
│ ├── mcp/
│ │ └── server.py # MCP server
│ ├── llm/
│ │ └── parallel_generator.py # Parallel generation
│ ├── dataset/
│ │ ├── builder.py # Dataset building
│ │ └── dataset_manager.py # SQLite management
│ └── utils/
│ ├── page_manager.py # Page navigation
│ └── status.py # Progress indicators
└── tests/ # Test files
```

### Development Setup

```bash
# Clone repository
git clone https://github.com/zwh20081/bookdatamaker.git
cd bookdatamaker

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Code formatting
black src/
ruff check src/

# Type checking
mypy src/
```

### Contributing

Contributions welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Add tests for new features
4. Ensure all tests pass
5. Submit a pull request

### Testing

```bash
# Run all tests
pytest

# Run specific test file
pytest tests/test_ocr.py

# Run with coverage
pytest --cov=bookdatamaker tests/
```

---

## License

MIT License - see LICENSE file for details.
