https://github.com/dothinking/pdf2docx

Open source Python library for converting PDF to DOCX.

Host: GitHub
URL: https://github.com/dothinking/pdf2docx
Owner: ArtifexSoftware
License: agpl-3.0
Created: 2019-06-20T07:32:24.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2024-09-23T22:31:55.000Z (9 months ago)
Last Synced: 2024-12-18T16:05:46.557Z (6 months ago)
Topics: docx, extract-table, pdf-converter, pdf-to-word, pymupdf
Language: Python
Homepage: https://pdf2docx.readthedocs.io
Size: 21.9 MB
Stars: 2,658
Watchers: 26
Forks: 388
Open Issues: 83
Metadata Files:
- Readme: README.md
- License: LICENSE

English | [中文](README_CN.md)

# pdf2docx 

![python-version](https://img.shields.io/badge/python->=3.6-green.svg)

[![codecov](https://codecov.io/gh/dothinking/pdf2docx/branch/master/graph/badge.svg)](https://codecov.io/gh/dothinking/pdf2docx)

[![pypi-version](https://img.shields.io/pypi/v/pdf2docx.svg)](https://pypi.python.org/pypi/pdf2docx/)

![license](https://img.shields.io/pypi/l/pdf2docx.svg)

![pypi-downloads](https://img.shields.io/pypi/dm/pdf2docx)

- Extract data from PDF with `PyMuPDF`, e.g. text, images and drawings 

- Parse layout with rule-based methods, e.g. sections, paragraphs, images and tables

- Generate docx with `python-docx`
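The three steps above are wrapped by the library's `Converter` class. The following is a minimal sketch, assuming `pdf2docx` is installed (`pip install pdf2docx`). `convert_pdf` and `default_docx_path` are hypothetical helper names used for illustration only; they are not part of the library.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Hypothetical helper: place the .docx next to the source PDF."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf(pdf_path, docx_path=None, start=0, end=None):
    """Convert pdf_path to DOCX and return the output path."""
    # Imported here so the helper above stays usable without pdf2docx installed.
    from pdf2docx import Converter

    out = Path(docx_path) if docx_path else default_docx_path(pdf_path)
    cv = Converter(str(pdf_path))
    try:
        # start/end select a page range by index (0 is the first page);
        # end=None runs through the last page.
        cv.convert(str(out), start=start, end=end)
    finally:
        cv.close()  # always release the underlying PDF file
    return out
```

Calling `convert_pdf("sample.pdf")` (a placeholder path) would write `sample.docx` beside the input file.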

## Features

- Parse and re-create page layout

    - page margin

    - section and column (1 or 2 columns only)

    - page header and footer [TODO]

- Parse and re-create paragraph

    - OCR text [TODO]

    - text in horizontal (left to right) or vertical (bottom to top) direction

    - font style, e.g. font name, size, weight, italic and color

    - text format, e.g. highlight, underline, strike-through

    - list style [TODO]

    - external hyperlink

    - paragraph horizontal alignment (left/right/center/justify) and vertical spacing

- Parse and re-create image

    - inline image

    - image in Gray/RGB/CMYK mode

    - transparent image

    - floating image, i.e. picture behind text

- Parse and re-create table

    - border style, e.g. width, color

    - shading style, i.e. background color

    - merged cells

    - cells with vertical text direction

    - table with partly hidden borders

    - nested tables

- Parsing pages with multi-processing
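Multi-processing is switched on through keyword arguments of `Converter.convert`. The sketch below assumes `pdf2docx` is installed. `pick_cpu_count` and `convert_parallel` are hypothetical helper names, and capping the worker count at the number of available cores is an illustrative choice rather than library behaviour.

```python
import os


def pick_cpu_count(requested=None):
    """Hypothetical helper: clamp the worker count to the available cores."""
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


def convert_parallel(pdf_path, docx_path, cpu_count=None):
    """Convert a whole PDF using several worker processes."""
    # Imported here so pick_cpu_count stays usable without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        cv.convert(
            docx_path,
            multi_processing=True,
            cpu_count=pick_cpu_count(cpu_count),
        )
    finally:
        cv.close()
```

On platforms that start worker processes by spawning (for example Windows), call `convert_parallel` from under an `if __name__ == "__main__":` guard so that workers do not re-run the conversion on import.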

*It can also be used as a tool to extract table contents, since both table content and format/style are parsed.*
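Table extraction can be sketched with `Converter.extract_tables`, which returns each table as a list of rows of cell text. This example assumes `pdf2docx` is installed and that merged cells may come back as `None`; `table_to_csv` and `extract_first_page_tables` are hypothetical helpers, not library functions.

```python
import csv
import io


def table_to_csv(table):
    """Hypothetical helper: render one extracted table as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        # Assumption: merged cells may be reported as None; write them as empty.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buf.getvalue()


def extract_first_page_tables(pdf_path):
    """Return the tables found on the first page as lists of rows."""
    # Imported here so table_to_csv stays usable without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        return cv.extract_tables(start=0, end=1)
    finally:
        cv.close()
```

Each table returned by `extract_first_page_tables` can then be passed to `table_to_csv` and written to a file.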

## Limitations

- Text-based PDF files only

- Left-to-right languages only

- Normal reading direction only; no word transformation or rotation

- The rule-based method cannot reproduce every PDF layout exactly
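Because only text-based PDFs are supported, it can help to check for a text layer before converting. The sketch below uses `PyMuPDF` (`fitz`) directly and is not part of `pdf2docx`. The helper names and the 20-character threshold are assumptions chosen for illustration, and the check is a rough heuristic rather than a reliable scanned-PDF detector.

```python
def has_text_layer(page_texts, min_chars=20):
    """Heuristic: True if the pages hold at least min_chars non-space characters."""
    total = sum(len("".join(text.split())) for text in page_texts)
    return total >= min_chars


def pdf_has_text_layer(pdf_path, min_chars=20):
    """Read every page's text with PyMuPDF and apply the heuristic above."""
    # Imported here so has_text_layer stays usable without PyMuPDF installed.
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        return has_text_layer((page.get_text() for page in doc), min_chars)
    finally:
        doc.close()
```

A scanned PDF would typically yield little or no text here, which suggests it needs OCR before conversion.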

## Documentation

- [Installation](https://pdf2docx.readthedocs.io/en/latest/installation.html)

- [Quickstart](https://pdf2docx.readthedocs.io/en/latest/quickstart.html)

    - [Convert PDF](https://pdf2docx.readthedocs.io/en/latest/quickstart.convert.html)

    - [Extract table](https://pdf2docx.readthedocs.io/en/latest/quickstart.table.html)

    - [Command Line Interface](https://pdf2docx.readthedocs.io/en/latest/quickstart.cli.html)

    - [Graphical User Interface](https://pdf2docx.readthedocs.io/en/latest/quickstart.gui.html)

- [Technical Documentation (In Chinese)](https://pdf2docx.readthedocs.io/en/latest/techdoc.html)

- [API Documentation](https://pdf2docx.readthedocs.io/en/latest/modules.html)

## Sample

![sample_compare.png](https://s1.ax1x.com/2020/08/04/aDryx1.png)
