https://github.com/rithulkamesh/docproc

Opinionated and Sophisticated Document Region Analyzer.
content-extraction data-extraction document-analysis document-parsing equation-detection layout-analysis machine-learning mathematical-symbols ocr pdf-processing pdf-text-extraction python region-detection text-classification text-extraction


# Docproc

A Python-based document region analyzer and content extraction tool.

> [!WARNING]
> This project is under active development, and most features are not yet implemented. This README is written to convey the project's intended scope.

## Overview

Docproc is an opinionated document region analyzer that helps extract text, equations, images and handwriting from documents. It provides both a library interface and a command-line tool.

## Repository Flow

```mermaid
flowchart TD
%% User Interface Layer
subgraph "User Interface"
UI["User Input"]:::cli
CLI["CLI"]:::cli
end
UI -->|"initiates"| CLI

%% Core Processing Layer
subgraph "Core Processing"
DA["Document Analyzer"]:::core
ED["Equations Detector"]:::core
HD["Handwriting Detector"]:::core
RD["Regions Detector"]:::core
end
CLI -->|"processes"| DA
DA -->|"detects"| ED
DA -->|"detects"| HD
DA -->|"detects"| RD

%% Output Generation Layer
subgraph "Output Generation"
CSV["CSV Writer"]:::writer
JSON["JSON Writer"]:::writer
SQLITE["SQLite Writer"]:::writer
FILE["Generic File Writer"]:::writer
end
DA -->|"exports"| CSV
DA -->|"exports"| JSON
DA -->|"exports"| SQLITE
DA -->|"exports"| FILE

%% Environment & Testing Layer
subgraph "Environment & Testing"
DE1["pyproject.toml"]:::env
DE2["shell.nix"]:::env
TS["Test Suite"]:::test
CI[".github Directory"]:::env
end
DE1 -.->|"env"| CLI
DE2 -.->|"env"| CLI
CI -.->|"CI"| CLI
TS -.->|"tests"| DA

%% Styles
classDef cli fill:#ADD8E6,stroke:#000,stroke-width:1px;
classDef core fill:#90EE90,stroke:#000,stroke-width:1px;
classDef writer fill:#FFD700,stroke:#000,stroke-width:1px;
classDef env fill:#D3D3D3,stroke:#000,stroke-width:1px;
classDef test fill:#FFB6C1,stroke:#000,stroke-width:1px;

%% Click Events
click CLI "https://github.com/rithulkamesh/docproc/blob/main/docproc/bin/cli.py"
click DA "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/analyzer.py"
click ED "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/equations.py"
click HD "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/handwriting.py"
click RD "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/regions.py"
click CSV "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/csv.py"
click JSON "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/json.py"
click SQLITE "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/sqlite.py"
click FILE "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/filewriter.py"
click DE1 "https://github.com/rithulkamesh/docproc/blob/main/pyproject.toml"
click DE2 "https://github.com/rithulkamesh/docproc/blob/main/shell.nix"
click TS "https://github.com/rithulkamesh/docproc/tree/main/tests/"
click CI "https://github.com/rithulkamesh/docproc/tree/main/.github"

```

This diagram was generated by [GitDiagram](https://gitdiagram.com); thanks to that project.

## Installation

```bash
# Using pip
pip install docproc
```

## Usage

### As a Command-line Tool

```bash
# Basic usage
docproc input.pdf

# Specify output format and file
docproc input.pdf -w csv -o output.csv
docproc input.pdf -w sqlite -o database.db
docproc input.pdf -w json -o output.json

# Extract only specific region types
docproc input.pdf --regions text equation
docproc input.pdf -r text image # Short form

# Enable verbose logging
docproc input.pdf -v
```

Supported output formats:

- CSV (default)
- SQLite
- JSON
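
The flags shown above map naturally onto an `argparse` parser. The following is a minimal sketch of that mapping, not docproc's actual CLI code: the function name `build_parser`, the `dest` names, the `None` default for `-o`, and the set of accepted region names (taken from the Overview) are assumptions made for illustration. Only flag spellings that appear in the usage examples are used.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags in the usage examples."""
    parser = argparse.ArgumentParser(prog="docproc")
    parser.add_argument("input", help="path to the document to analyze")
    # -w selects the writer; CSV is the documented default.
    parser.add_argument(
        "-w", dest="writer", choices=["csv", "sqlite", "json"], default="csv"
    )
    # -o names the output file; falling back to None is an assumption.
    parser.add_argument("-o", dest="output", default=None)
    # -r/--regions accepts one or more region types (assumed set of names).
    parser.add_argument(
        "-r",
        "--regions",
        nargs="+",
        choices=["text", "equation", "image", "handwriting"],
        default=None,
    )
    # -v turns on verbose logging.
    parser.add_argument("-v", dest="verbose", action="store_true")
    return parser


# Parse one of the example invocations from a list instead of sys.argv.
args = build_parser().parse_args(
    ["input.pdf", "-w", "json", "-o", "output.json", "-r", "text", "image"]
)
print(args.writer, args.output, args.regions)
```

Passing an explicit argument list to `parse_args` keeps the sketch testable without touching the real command line.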

### As a Library

```python
from docproc.doc.analyzer import DocumentAnalyzer
from docproc.writer import CSVWriter

# Using context manager (recommended)
with DocumentAnalyzer("input.pdf", CSVWriter, output_path="output.csv") as analyzer:
    regions = analyzer.detect_regions()
    analyzer.export_regions()
```
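
The example above passes a writer class (not an instance) to the analyzer and relies on the `with` block for cleanup. The sketch below illustrates that general pattern using only the standard library. It is a toy stand-in, not docproc's implementation: `Region`, `ToyCSVWriter`, and `ToyAnalyzer` are hypothetical names, the region fields are invented, the detection step returns hard-coded data, and an in-memory stream replaces `output_path` so the example runs without touching disk.

```python
import csv
import io
from dataclasses import astuple, dataclass


@dataclass
class Region:
    # Hypothetical fields; docproc's real region type may differ.
    kind: str
    page: int
    content: str


class ToyCSVWriter:
    """Writes regions as CSV rows to a text stream."""

    def __init__(self, stream):
        self._stream = stream
        self._writer = csv.writer(stream)
        self._writer.writerow(["kind", "page", "content"])  # header row

    def write(self, region):
        self._writer.writerow(astuple(region))

    def close(self):
        self._stream.flush()


class ToyAnalyzer:
    """Takes a writer class, builds the writer on entry, closes it on exit."""

    def __init__(self, source, writer_cls, stream):
        self.source = source
        self._writer_cls = writer_cls
        self._stream = stream
        self._writer = None
        self._regions = []

    def __enter__(self):
        self._writer = self._writer_cls(self._stream)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._writer.close()
        return False  # do not swallow exceptions

    def detect_regions(self):
        # Hard-coded stand-in for real layout analysis.
        self._regions = [
            Region("text", 1, "Hello"),
            Region("equation", 1, "E = mc^2"),
        ]
        return self._regions

    def export_regions(self):
        for region in self._regions:
            self._writer.write(region)


buffer = io.StringIO()
with ToyAnalyzer("input.pdf", ToyCSVWriter, buffer) as analyzer:
    analyzer.detect_regions()
    analyzer.export_regions()
print(buffer.getvalue())
```

Handing over the class rather than an instance lets the analyzer control when the writer is opened and closed, which is one plausible reason for the context-manager style recommended above.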

## Roadmap

The following features are planned for upcoming releases:

- **Handwriting Recognition**: Detect and extract handwritten content from documents

## Development

```bash
uv sync
```

## Contributing

Pull requests are welcome. Please ensure tests pass before submitting.

## Contact

For questions, feedback, or suggestions, please contact the author at [hi@rithul.dev](mailto:hi@rithul.dev).