https://github.com/DS4SD/docling

Get your documents ready for gen AI
ai convert document-parser document-parsing documents docx html markdown pdf pdf-converter pdf-to-json pdf-to-text pptx tables xlsx

Last synced: 6 months ago

Host: GitHub
URL: https://github.com/DS4SD/docling
Owner: DS4SD
License: mit
Created: 2024-07-09T07:50:26.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-01-03T16:05:11.000Z (6 months ago)
Last Synced: 2025-01-04T06:00:08.175Z (6 months ago)
Topics: ai, convert, document-parser, document-parsing, documents, docx, html, markdown, pdf, pdf-converter, pdf-to-json, pdf-to-text, pptx, tables, xlsx
Language: Python
Homepage: https://ds4sd.github.io/docling
Size: 47.4 MB
Stars: 17,283
Watchers: 82
Forks: 902
Open Issues: 124
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff
- Security: .github/SECURITY.md

Awesome Lists containing this project

awesomeLibrary - docling - Get your documents ready for gen AI. (language resource library / python)
awesome - DS4SD/docling - Get your documents ready for gen AI (Python)
awesome-LLM-resources - Docling
definitive-opensource - Docling
awesome-repositories - DS4SD/docling - Get your documents ready for gen AI (Python)
awesome-github-repos - DS4SD/docling - Get your documents ready for gen AI (Python)
jimsghstars - DS4SD/docling - Get your documents ready for gen AI (Python)
AiTreasureBox - DS4SD/docling - Get your docs ready for gen AI (Repos)
awesome-safety-critical-ai - `DS4SD/docling`

README

# Docling


[![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)

[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://ds4sd.github.io/docling/)

[![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)

[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/docling)](https://pypi.org/project/docling/)

[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)

[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)

[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)

[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)

[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)

[![PyPI Downloads](https://static.pepy.tech/badge/docling/month)](https://pepy.tech/projects/docling)

Docling parses documents and exports them to the desired format with ease and speed.

## Features

* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)

* 📑 Advanced PDF document understanding including page layout, reading order & table structures

* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format

* 🤖 Easy integration with 🦙 LlamaIndex & 🦜🔗 LangChain for powerful RAG / QA applications

* 🔍 OCR support for scanned PDFs

* 💻 Simple and convenient CLI

Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty of examples and unlock the full power of Docling!

### Coming soon

* ♾️ Equation & code extraction

* 📝 Metadata extraction, including title, authors, references & language

* 🦜🔗 Native LangChain extension

## Installation

To use Docling, simply install `docling` from your package manager, e.g. pip:

```bash
pip install docling
```

Docling works on macOS, Linux, and Windows, on both x86_64 and arm64 architectures.

More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs.
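
To confirm that the install succeeded, one option is to query the installed package version from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Docling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="docling"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None
```

Calling `installed_version()` returns a version string when `docling` is installed in the active environment and `None` otherwise.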

## Getting started

To convert individual documents, use `convert()`, for example:

```python
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
```

More [advanced usage options](https://ds4sd.github.io/docling/usage/) are available in the docs.
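
Building on the `convert()` example, the sketch below writes both a Markdown and a JSON export to disk. It assumes the converted document exposes `export_to_dict()` alongside `export_to_markdown()`; the helper names `export_document` and `convert_and_export`, the output directory, and the file stem are hypothetical choices for illustration.

```python
import json
from pathlib import Path


def export_document(document, out_dir, stem):
    """Write Markdown and JSON exports of a converted document into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    md_path = out / f"{stem}.md"
    json_path = out / f"{stem}.json"
    md_path.write_text(document.export_to_markdown(), encoding="utf-8")
    # Assumption: export_to_dict() returns a JSON-serializable dictionary.
    json_path.write_text(
        json.dumps(document.export_to_dict(), indent=2), encoding="utf-8"
    )
    return md_path, json_path


def convert_and_export(source, out_dir, stem):
    """Convert one source (local path or URL) and save both exports."""
    # Imported lazily so export_document stays usable without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return export_document(result.document, out_dir, stem)
```

For example, `convert_and_export("https://arxiv.org/pdf/2408.09869", "output", "docling-report")` would be expected to create `output/docling-report.md` and `output/docling-report.json`.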

## Documentation

Check out Docling's [documentation](https://ds4sd.github.io/docling/) for details on installation, usage, concepts, recipes, extensions, and more.

## Examples

Go hands-on with our [examples](https://ds4sd.github.io/docling/examples/), demonstrating how to address different application use cases with Docling.

## Integrations

To further accelerate your AI application development, check out Docling's native [integrations](https://ds4sd.github.io/docling/integrations/) with popular frameworks and tools.
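
Where a framework integration is not an option, the Markdown export can still be fed to a RAG pipeline by hand. The following is a rough sketch of heading-based chunking over Markdown text in plain Python. It is a simple stand-in, not Docling's own chunking or any integration's API, and the function name `chunk_markdown_by_heading` is hypothetical.

```python
import re

# ATX heading: one to six '#' characters followed by whitespace and the title.
_HEADING = re.compile(r"#{1,6}\s+(.*)")


def chunk_markdown_by_heading(markdown):
    """Split Markdown text into (heading, body) chunks at ATX headings."""
    chunks = []
    heading, body = "", []
    in_fence = False

    def flush():
        # Skip empty chunks (no heading and only blank lines).
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else _HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk pairs a section title with its text, so `chunk_markdown_by_heading(result.document.export_to_markdown())` yields units that can be embedded or indexed separately. Text before the first heading is returned with an empty heading.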

## Get help and support

Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).

## Technical report

For more details on Docling's inner workings, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).

## Contributing

Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.

## References

If you use Docling in your projects, please consider citing the following:

```bib
@techreport{Docling,
  author = {Deep Search Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}
```

## License

The Docling codebase is under the MIT license.

For individual model usage, please refer to the model licenses found in the original packages.

## IBM ❤️ Open Source AI

Docling has been brought to you by IBM.
