https://github.com/rithulkamesh/docproc
Opinionated and Sophisticated Document Region Analyzer.
- Host: GitHub
- URL: https://github.com/rithulkamesh/docproc
- Owner: rithulkamesh
- License: mit
- Created: 2025-01-30T09:08:57.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-13T19:47:13.000Z (12 months ago)
- Last Synced: 2025-10-25T17:50:00.919Z (5 months ago)
- Topics: content-extraction, data-extraction, document-analysis, document-parsing, equation-detection, layout-analysis, machine-learning, mathematical-symbols, ocr, pdf-processing, pdf-text-extraction, python, region-detection, text-classification, text-extraction
- Language: Python
- Homepage:
- Size: 219 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE.md
# Docproc
A Python-based document region analyzer and content extraction tool.
> [!WARNING]
> This project is under active development, and most features are not yet implemented. This README describes the intended scope of the project.
## Overview
Docproc is an opinionated document region analyzer that helps extract text, equations, images and handwriting from documents. It provides both a library interface and a command-line tool.
## Repository Flow
```mermaid
flowchart TD
%% User Interface Layer
subgraph "User Interface"
UI["User Input"]:::cli
CLI["CLI"]:::cli
end
UI -->|"initiates"| CLI
%% Core Processing Layer
subgraph "Core Processing"
DA["Document Analyzer"]:::core
ED["Equations Detector"]:::core
HD["Handwriting Detector"]:::core
RD["Regions Detector"]:::core
end
CLI -->|"processes"| DA
DA -->|"detects"| ED
DA -->|"detects"| HD
DA -->|"detects"| RD
%% Output Generation Layer
subgraph "Output Generation"
CSV["CSV Writer"]:::writer
JSON["JSON Writer"]:::writer
SQLITE["SQLite Writer"]:::writer
FILE["Generic File Writer"]:::writer
end
DA -->|"exports"| CSV
DA -->|"exports"| JSON
DA -->|"exports"| SQLITE
DA -->|"exports"| FILE
%% Environment & Testing Layer
subgraph "Environment & Testing"
DE1["pyproject.toml"]:::env
DE2["shell.nix"]:::env
TS["Test Suite"]:::test
CI[".github Directory"]:::env
end
DE1 -.->|"env"| CLI
DE2 -.->|"env"| CLI
CI -.->|"CI"| CLI
TS -.->|"tests"| DA
%% Styles
classDef cli fill:#ADD8E6,stroke:#000,stroke-width:1px;
classDef core fill:#90EE90,stroke:#000,stroke-width:1px;
classDef writer fill:#FFD700,stroke:#000,stroke-width:1px;
classDef env fill:#D3D3D3,stroke:#000,stroke-width:1px;
classDef test fill:#FFB6C1,stroke:#000,stroke-width:1px;
%% Click Events
click CLI "https://github.com/rithulkamesh/docproc/blob/main/docproc/bin/cli.py"
click DA "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/analyzer.py"
click ED "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/equations.py"
click HD "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/handwriting.py"
click RD "https://github.com/rithulkamesh/docproc/blob/main/docproc/doc/regions.py"
click CSV "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/csv.py"
click JSON "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/json.py"
click SQLITE "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/sqlite.py"
click FILE "https://github.com/rithulkamesh/docproc/blob/main/docproc/writer/filewriter.py"
click DE1 "https://github.com/rithulkamesh/docproc/blob/main/pyproject.toml"
click DE2 "https://github.com/rithulkamesh/docproc/blob/main/shell.nix"
click TS "https://github.com/rithulkamesh/docproc/tree/main/tests/"
click CI "https://github.com/rithulkamesh/docproc/tree/main/.github"
```
This diagram was generated with [GitDiagram](https://gitdiagram.com). Thanks to its authors.
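The flow in the diagram can be read as a fan-out followed by an export step: the analyzer hands each page to several detectors and collects whatever they find. The sketch below illustrates that shape using only the standard library. Every name in it (`Region`, `detect_text`, `detect_equations`, `analyze`) is hypothetical and chosen for illustration. It is not docproc's actual code, and the detectors are trivial string checks standing in for real detection logic.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


# Hypothetical stand-in for a detected region; not docproc's real type.
@dataclass(frozen=True)
class Region:
    kind: str      # e.g. "text" or "equation"
    page: int      # 1-based page number
    content: str


Detector = Callable[[int, str], Iterable[Region]]


def detect_text(page: int, raw: str) -> Iterable[Region]:
    # Toy rule: every non-empty line without "=" counts as text.
    for line in raw.splitlines():
        if line.strip() and "=" not in line:
            yield Region("text", page, line.strip())


def detect_equations(page: int, raw: str) -> Iterable[Region]:
    # Toy rule: a line containing "=" counts as an equation.
    for line in raw.splitlines():
        if "=" in line:
            yield Region("equation", page, line.strip())


def analyze(pages: list[str], detectors: list[Detector]) -> list[Region]:
    """Fan each page out to every detector and collect the results."""
    found: list[Region] = []
    for number, raw in enumerate(pages, start=1):
        for detector in detectors:
            found.extend(detector(number, raw))
    return found


regions = analyze(["Intro\nE = mc^2", "Summary"], [detect_text, detect_equations])
for region in regions:
    print(region.kind, region.page, region.content)
```

Keeping each detector as a plain callable means a new region type can be added by appending one function to the list, which matches the one-analyzer, many-detectors layout in the diagram.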
## Installation
```bash
# Using pip
pip install docproc
```
## Usage
### As a Command-line Tool
```bash
# Basic usage
docproc input.pdf
# Specify output format and file
docproc input.pdf -w csv -o output.csv
docproc input.pdf -w sqlite -o database.db
docproc input.pdf -w json -o output.json
# Extract only specific region types
docproc input.pdf --regions text equation
docproc input.pdf -r text image # Short form
# Enable verbose logging
docproc input.pdf -v
```
Supported output formats:
- CSV (default)
- SQLite
- JSON
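The flags shown above map naturally onto an `argparse` parser. The sketch below is an assumption about how such an interface could be declared, not a copy of `docproc/bin/cli.py`. The long option names `--writer`, `--output` and `--verbose` are guesses, since this README only shows `-w`, `-o` and `-v`, and the list of region choices is inferred from the Overview.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(prog="docproc")
    parser.add_argument("input", help="path to the document to analyze")
    # CSV is listed as the default output format.
    parser.add_argument(
        "-w", "--writer", choices=["csv", "sqlite", "json"], default="csv"
    )
    parser.add_argument("-o", "--output", help="output file path")
    # nargs="+" lets several region types follow a single flag.
    parser.add_argument(
        "-r",
        "--regions",
        nargs="+",
        choices=["text", "equation", "image", "handwriting"],  # assumed set
        default=None,
    )
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


args = build_parser().parse_args(
    ["input.pdf", "-w", "json", "-o", "output.json", "-r", "text", "image"]
)
print(args.writer, args.output, args.regions)

defaults = build_parser().parse_args(["input.pdf"])
print(defaults.writer, defaults.regions, defaults.verbose)
```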
### As a Library
```python
from docproc.doc.analyzer import DocumentAnalyzer
from docproc.writer import CSVWriter

# Using a context manager (recommended)
with DocumentAnalyzer("input.pdf", CSVWriter, output_path="output.csv") as analyzer:
    regions = analyzer.detect_regions()
    analyzer.export_regions()
```
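Passing the writer class rather than an instance lets the analyzer own the writer's lifetime, and a `with` block can then guarantee that the output is closed even if detection fails. The following is a minimal sketch of that pattern under stated assumptions: `ToyAnalyzer` and `ToyCSVWriter` are hypothetical stand-ins written for this example, and the real `DocumentAnalyzer` and `CSVWriter` may be structured differently.

```python
import csv
import io


class ToyCSVWriter:
    """Hypothetical writer: rows go to any text stream."""

    def __init__(self, stream):
        self._csv = csv.writer(stream, lineterminator="\n")
        self._csv.writerow(["kind", "content"])  # header row
        self.closed = False

    def write(self, kind, content):
        self._csv.writerow([kind, content])

    def close(self):
        self.closed = True


class ToyAnalyzer:
    """Hypothetical analyzer that builds its writer and closes it on exit."""

    def __init__(self, lines, writer_cls, stream):
        self.lines = lines
        self.writer = writer_cls(stream)  # the analyzer instantiates the writer
        self.regions = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.writer.close()
        return False  # do not swallow exceptions

    def detect_regions(self):
        # Toy rule: a line containing "=" is an equation, anything else is text.
        self.regions = [
            ("equation" if "=" in line else "text", line) for line in self.lines
        ]
        return self.regions

    def export_regions(self):
        for kind, content in self.regions:
            self.writer.write(kind, content)


buffer = io.StringIO()
with ToyAnalyzer(["Intro", "x = 1"], ToyCSVWriter, buffer) as analyzer:
    analyzer.detect_regions()
    analyzer.export_regions()
print(buffer.getvalue())
```

Because the analyzer only calls `write` and `close`, swapping in a JSON or SQLite writer in this sketch would require no changes to the analyzer itself.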
## Roadmap
The following features are planned for upcoming releases:
- **Handwriting Recognition**: Detect and extract handwritten content from documents
## Development
```bash
uv sync
```
## Contributing
Pull requests are welcome. Please ensure tests pass before submitting.
## Contact
For questions, feedback, or suggestions, please contact the author at [hi@rithul.dev](mailto:hi@rithul.dev).