https://github.com/ranfysvalle02/hello-docling

"Hello world" but for docling.

Host: GitHub
URL: https://github.com/ranfysvalle02/hello-docling
Owner: ranfysvalle02
Created: 2024-11-08T03:33:16.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-11-08T03:43:14.000Z (11 months ago)
Last Synced: 2025-01-26T07:08:57.471Z (9 months ago)
Language: Python
Homepage:
Size: 748 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# hello-docling

![](docling.png)

[Docling](https://ds4sd.github.io/docling/) is an open-source Python library for efficient document conversion and analysis.

## How It Works

1. **PDF Parsing**: Extracts text tokens and renders page images
2. **AI Model Application**: Applies specialized models to each page
3. **Post-processing**: Augments metadata, detects language, corrects reading order
4. **Assembly**: Combines results into a structured document object
5. **Output**: Generates JSON or Markdown output

Docling runs entirely locally on commodity hardware, making it a powerful tool for document analysis and conversion without relying on external services.
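
The steps above can be sketched as a short script. This is a minimal example, assuming the Docling v2 `DocumentConverter` API (`convert`, `export_to_markdown`, `export_to_dict`); the helper names `output_paths` and `convert_pdf` and the `out` directory are hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def output_paths(source: str, out_dir: str = "out") -> dict:
    """Derive the Markdown and JSON output paths for a source document."""
    stem = Path(source).stem
    return {
        "markdown": Path(out_dir) / f"{stem}.md",
        "json": Path(out_dir) / f"{stem}.json",
    }


def convert_pdf(source: str, out_dir: str = "out") -> dict:
    """Convert one PDF and write both output formats (hypothetical helper)."""
    # Imported lazily so output_paths() works without docling installed.
    # Assumption: Docling v2 module layout; other releases may differ.
    from docling.document_converter import DocumentConverter

    # Steps 1-4: parsing, model application, post-processing, assembly.
    result = DocumentConverter().convert(source)

    # Step 5: export the structured document object.
    paths = output_paths(source, out_dir)
    paths["markdown"].parent.mkdir(parents=True, exist_ok=True)
    paths["markdown"].write_text(
        result.document.export_to_markdown(), encoding="utf-8"
    )
    paths["json"].write_text(
        json.dumps(result.document.export_to_dict(), indent=2), encoding="utf-8"
    )
    return paths
```

Calling `convert_pdf("paper.pdf")` would then write `out/paper.md` and `out/paper.json`, provided Docling is installed.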

## Key Features

### 1. PDF Conversion
- Converts PDF documents to JSON or Markdown format
- Fast and stable processing

### 2. Advanced Layout Analysis
- Uses an AI layout model trained on the DocLayNet dataset for detailed page layout understanding
- Recovers reading order and document structure

### 3. Table Structure Recognition
- Employs TableFormer AI model to extract and reconstruct table structures
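
To illustrate what "reconstructing" a table means once cell positions are known, here is a small sketch. It is not TableFormer and not Docling's internal code: the `(row, col, text)` cell format and the `cells_to_markdown` function are assumptions made for this example.

```python
def cells_to_markdown(cells):
    """Rebuild a Markdown table from (row, col, text) cells; row 0 is the header."""
    n_rows = max(r for r, _, _ in cells) + 1
    n_cols = max(c for _, c, _ in cells) + 1

    # Start from an empty grid so missing cells render as blanks.
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for r, c, text in cells:
        grid[r][c] = text.strip()

    lines = [
        "| " + " | ".join(grid[0]) + " |",
        "| " + " | ".join("---" for _ in range(n_cols)) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in grid[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    predicted = [(0, 0, "Name"), (0, 1, "Score"), (1, 0, "A"), (1, 1, "1")]
    print(cells_to_markdown(predicted))
```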

### 4. Metadata Extraction
- Automatically extracts document metadata:
  - Title
  - Authors
  - References
  - Language

### 5. OCR Capability
- Optional OCR for scanned PDFs
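
OCR can be toggled through the PDF pipeline options. The sketch below assumes the Docling v2 module layout (`PdfPipelineOptions`, `PdfFormatOption`, `InputFormat`); `build_converter` is a hypothetical helper name, and other Docling releases may expose these options differently.

```python
def build_converter(enable_ocr: bool = True):
    """Return a DocumentConverter with OCR switched on or off for PDFs."""
    # Lazy imports so this module loads even if docling is not installed.
    # Assumption: Docling v2 import paths.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = enable_ocr  # enable for scanned PDFs, disable for born-digital ones

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

Disabling OCR for PDFs that already contain a text layer typically shortens conversion time.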

### 6. Post-processing
- Removes overlapping bounding-box proposals
- Intersects predictions with text tokens for complete content units
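
The two post-processing ideas above can be illustrated in plain Python. This is a simplified sketch, not Docling's implementation: boxes are assumed to be `(x0, y0, x1, y1)` tuples, and the function names and thresholds are chosen for this example only.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; zero if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection(a, b):
    """Area shared by two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def iou(a, b):
    """Intersection over union of two boxes."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def remove_overlaps(proposals, threshold=0.5):
    """Keep the highest-scoring box among proposals that overlap heavily."""
    kept = []
    for box, score in sorted(proposals, key=lambda p: p[1], reverse=True):
        if all(iou(box, k) < threshold for k, _ in kept):
            kept.append((box, score))
    return kept


def assign_tokens(boxes, tokens, min_cover=0.5):
    """Attach each text token to the first box covering most of its area."""
    units = {i: [] for i in range(len(boxes))}
    for text, tbox in tokens:
        for i, box in enumerate(boxes):
            if area(tbox) > 0 and intersection(box, tbox) / area(tbox) >= min_cover:
                units[i].append(text)
                break
    return {i: " ".join(words) for i, words in units.items()}
```

Greedy suppression by score is one common way to drop duplicate proposals; the token step then turns each surviving box into a complete content unit.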

### 7. Integration
- Works with Quackling for optimized vector embedding and chunking
- Compatible with LLM frameworks like LlamaIndex
