https://github.com/pspdfkit/nutrient-pdf-mcp-server

A powerful Model Context Protocol server for LLM-driven PDF document analysis and exploration
https://github.com/pspdfkit/nutrient-pdf-mcp-server

ai-integration llm-tools mcp pdf pdf-parser pdf-tools python

Last synced: 6 months ago
JSON representation

A powerful Model Context Protocol server for LLM-driven PDF document analysis and exploration

Host: GitHub
URL: https://github.com/pspdfkit/nutrient-pdf-mcp-server
Owner: PSPDFKit
License: mit
Created: 2025-06-30T18:03:25.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-08-01T19:44:54.000Z (7 months ago)
Last Synced: 2025-08-01T21:44:34.138Z (7 months ago)
Topics: ai-integration, llm-tools, mcp, pdf, pdf-parser, pdf-tools, python
Language: Python
Homepage:
Size: 52.7 KB
Stars: 3
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Nutrient PDF MCP Server

> **A powerful Model Context Protocol server for LLM-driven PDF document analysis and exploration**

A [Model Context Protocol (MCP)](https://modelcontextprotocol.io) server for investigating PDF object trees with lazy loading support. This tool allows LLMs to efficiently explore PDF document structure without overwhelming token limits.

## Features

- **Lazy Loading**: Explore PDF structure without loading entire object trees
- **Path Navigation**: Navigate through PDF objects using dot notation (e.g., `Pages.Kids.0`)
- **Selective Resolution**: Resolve specific indirect objects on demand
- **Token Efficient**: Massive reduction in response sizes compared to full tree dumps
- **Type Safe**: Comprehensive type hints and error handling

## Installation

### Optional `asdf` setup

You'll need `python` and `nodejs` installed on your machine. You can optionally use `asdf`.

- [Install and configure `asdf` version manager](https://asdf-vm.com/guide/getting-started.html)
- [Install `asdf` `nodejs` plugin](https://github.com/asdf-vm/asdf-nodejs)
- [Install `asdf` `python` plugin](https://github.com/asdf-community/asdf-python)

Finally install required tools with:

```sh
git clone https://github.com/PSPDFKit/nutrient-pdf-mcp-server.git
cd nutrient-pdf-mcp-server
asdf install

# Install pipx for Python
python -m pip install --user pipx
```

Proceed with the rest of the installation after that.

### Quick Start

```bash
git clone https://github.com/PSPDFKit/nutrient-pdf-mcp-server.git
cd nutrient-pdf-mcp-server
make install-dev # Sets up development environment
```

### For Claude Code CLI

**Recommended: Build and Install**

```bash
pip install build
make build
pipx install dist/nutrient_pdf_mcp-1.0.0-py3-none-any.whl
claude mcp add nutrient-pdf-mcp nutrient-pdf-mcp
```

If using `asdf`, you might need to configure `pipx` with the following before running:

```sh
export PIPX_DEFAULT_PYTHON=$(asdf which python)
pipx install dist/nutrient_pdf_mcp-1.0.0-py3-none-any.whl
```

**Development Mode**

```bash
make install-dev
claude mcp add nutrient-pdf-mcp "$(pwd)/venv/bin/python" -m pdf_mcp.server
```

#### Manual Configuration

```json
{
"mcpServers": {
"nutrient-pdf-mcp": {
"command": "python",
"args": ["-m", "pdf_mcp.server"]
}
}
}
```

### Available Tools

#### `get_pdf_object_tree`

Nutrient PDF MCP Server - Get JSON representation of PDF object tree with lazy loading.

**Parameters:**

- `pdf_path` (required): Path to the PDF file
- `object_id` (optional): Specific object ID to retrieve (e.g., '1 0')
- `path` (optional): Object path to navigate (e.g., 'Pages.Kids.0')
- `mode` (optional): Parsing mode - 'lazy' (default) or 'full'

**Examples:**

```json
{
"pdf_path": "document.pdf",
"mode": "lazy"
}
```

```json
{
"pdf_path": "document.pdf",
"path": "Pages.Kids.0",
"mode": "lazy"
}
```

#### `resolve_indirect_object`

Nutrient PDF MCP Server - Resolve a specific indirect object by its object and generation numbers.

**Parameters:**

- `pdf_path` (required): Path to the PDF file
- `objnum` (required): PDF object number (e.g., 3)
- `gennum` (optional): PDF generation number (defaults to 0)
- `depth` (optional): Resolution depth - 'shallow' (default) or 'deep'

**Examples:**

```json
{
"pdf_path": "document.pdf",
"objnum": 3,
"gennum": 0,
"depth": "shallow"
}
```

### Command Line Usage

```bash
# Run the server
make serve

# Or run with debug logging
make serve-debug
```

## Architecture

### Core Components

- **`parser.py`**: Main PDF parsing logic with lazy loading support
- **`server.py`**: MCP server implementation
- **`types.py`**: Type definitions for PDF objects and responses
- **`exceptions.py`**: Custom exception classes

### Response Types

All PDF objects are serialized into a consistent JSON format:

```json
{
"type": "dict",
"value": {
"/Type": { "type": "name", "value": "/Pages" },
"/Kids": {
"type": "array",
"value": [{ "type": "indirect_ref", "objnum": 2, "gennum": 0 }]
}
}
}
```

### Token Efficiency

The lazy loading system provides massive token savings:

- **Lazy mode**: ~5-50 lines (minimal tokens)
- **Shallow resolution**: ~50-100 lines (reasonable tokens)
- **Deep resolution**: 500+ lines (use sparingly)

## Examples

### Exploring PDF Structure

1. **Get overview**: `get_pdf_object_tree(path="document.pdf", mode="lazy")`
2. **Navigate to pages**: `get_pdf_object_tree(path="document.pdf", path="Pages", mode="lazy")`
3. **Resolve specific page**: `resolve_indirect_object(objnum=3, gennum=0, depth="shallow")`
4. **Deep dive when needed**: `resolve_indirect_object(objnum=3, gennum=0, depth="deep")`

### Path Navigation Examples

- `"Pages"` - Navigate to Pages object
- `"Pages.Kids"` - Get Kids array from Pages
- `"Pages.Kids.0"` - Get first page
- `"Pages.Kids.0.MediaBox.2"` - Get width from MediaBox array

## Development

### Quick Start

```bash
# Set up development environment
make install-dev

# Run all quality checks (format, lint, typecheck, test)
make quality

# Or run individual commands
make test # Run tests
make format # Format code
make lint # Run linter
make typecheck # Type checking
```

### Project Structure

```
nutrient-pdf-mcp-server/
├── pdf_mcp/
│ ├── __init__.py
│ ├── server.py # MCP server
│ ├── parser.py # PDF parsing logic
│ ├── types.py # Type definitions
│ └── exceptions.py # Custom exceptions
├── tests/ # Test suite
├── res/ # Sample PDFs
├── pyproject.toml # Project configuration
└── README.md
```

## Publishing to PyPI

```bash
# Build the package
make build

# Upload to test PyPI first
twine upload --repository testpypi dist/*

# Upload to production PyPI
twine upload dist/*
```

After publishing, users can install with:

```bash
pipx install nutrient-pdf-mcp
# or
pip install --user nutrient-pdf-mcp
```

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes with tests
4. Ensure code quality checks pass
5. Submit a pull request

## License

MIT License - see LICENSE file for details.

## Related Projects

- [Model Context Protocol](https://modelcontextprotocol.io)
- [PyPDF](https://pypdf.readthedocs.io/)
- [Claude Code](https://claude.ai/code)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/pspdfkit/nutrient-pdf-mcp-server

Awesome Lists containing this project

README