https://github.com/badbye/docxpy
A pure python based utility to extract text and images from docx files.
- Host: GitHub
- URL: https://github.com/badbye/docxpy
- Owner: badbye
- License: mit
- Fork: true (ankushshah89/python-docx2txt)
- Created: 2017-03-02T06:58:15.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2022-10-23T00:09:40.000Z (about 2 years ago)
- Last Synced: 2024-09-10T23:18:05.692Z (3 months ago)
- Topics: docx, python, python3
- Language: Python
- Homepage:
- Size: 46.9 KB
- Stars: 5
- Watchers: 2
- Forks: 4
- Open Issues: 1
Metadata Files:
- Readme: README.rst
- License: LICENSE.txt
README
docxpy
======

|image0| |PyPI|

This project is forked from
`ankushshah89/python-docx2txt <https://github.com/ankushshah89/python-docx2txt>`__.
One feature has been added: extracting hyperlinks together with their
link text.

It is a pure Python utility for extracting text from docx files. The
code is taken and adapted from python-docx. It can, however, also
extract **text** from headers, footers and **hyperlinks**, and it can
now extract **images** as well.

How to install?
---------------

.. code:: bash

    pip install docxpy

How to run?
-----------

a. From command line:

.. code:: bash

    # extract text
    docx2txt file.docx
    # extract text and images
    docx2txt -i /tmp/img_dir file.docx

b. From python:

.. code:: python

    import docxpy

    file = 'file.docx'

    # extract text
    text = docxpy.process(file)

    # extract text and write images in /tmp/img_dir
    text = docxpy.process(file, "/tmp/img_dir")

    # if you want the hyperlinks
    doc = docxpy.DOCReader(file)
    doc.process()  # process file
    hyperlinks = doc.data['links']

.. |image0| image:: https://travis-ci.org/badbye/docxpy.svg?branch=master
.. |PyPI| image:: https://img.shields.io/pypi/pyversions/scrapy-corenlp.svg?style=flat-square
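
For readers curious about what "extracting text and hyperlinks" involves, here is a minimal standard-library sketch. It is not docxpy's implementation. It only assumes the usual docx layout: a ZIP archive whose body is in ``word/document.xml`` and whose link targets are in ``word/_rels/document.xml.rels``. The names ``read_docx_sketch`` and ``build_demo_docx`` are hypothetical helpers invented for this illustration, and the shape of the returned links is an assumption, not the documented format of ``doc.data['links']``.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by WordprocessingML and package relationships.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
RELS_NAME = "word/_rels/document.xml.rels"


def read_docx_sketch(source):
    """Return (text, links) for a .docx path or binary file-like object.

    Hypothetical helper for illustration only; not part of docxpy.
    """
    with zipfile.ZipFile(source) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        targets = {}
        if RELS_NAME in archive.namelist():
            rels = ET.fromstring(archive.read(RELS_NAME))
            # Map relationship ids (e.g. "rId1") to their targets.
            for rel in rels.iter(f"{{{REL_NS}}}Relationship"):
                targets[rel.get("Id")] = rel.get("Target")

    # One output line per paragraph, joining all text runs inside it.
    paragraphs = [
        "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        for para in body.iter(f"{{{W_NS}}}p")
    ]

    # Pair each hyperlink's visible text with its resolved target.
    links = []
    for link in body.iter(f"{{{W_NS}}}hyperlink"):
        label = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
        target = targets.get(link.get(f"{{{R_NS}}}id"))
        if target is not None:
            links.append((label, target))
    return "\n".join(paragraphs), links


def build_demo_docx():
    """Build a tiny in-memory archive with just the two parts read above."""
    document = (
        f'<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
        '<w:p><w:hyperlink r:id="rId1"><w:r><w:t>site</w:t></w:r>'
        "</w:hyperlink></w:p>"
        "</w:body></w:document>"
    )
    rels = (
        f'<Relationships xmlns="{REL_NS}">'
        '<Relationship Id="rId1" Type="hyperlink" '
        'Target="https://example.com"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
        archive.writestr(RELS_NAME, rels)
    buffer.seek(0)
    return buffer


text, links = read_docx_sketch(build_demo_docx())
```

The demo archive is deliberately incomplete (Word would not open it), and the sketch ignores headers, footers, images and nested text boxes, all of which a real extractor has to handle.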