https://github.com/dridk/all2txt
parallel converter to plain text for doc, docx, rtf, pdf files
- Host: GitHub
- URL: https://github.com/dridk/all2txt
- Owner: dridk
- License: gpl-3.0
- Created: 2024-02-25T09:49:26.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-26T21:02:10.000Z (about 1 year ago)
- Last Synced: 2025-01-31T17:52:47.525Z (3 months ago)
- Language: Rich Text Format
- Size: 375 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# all2txt
Parallel converter to plain text for doc, docx, rtf, and pdf files.

File conversion with LibreOffice is difficult to parallelize for massive batch conversion, and LibreOffice sometimes hangs, which makes it unusable for that purpose.

This Docker image converts all files (doc, docx, rtf, pdf) in the input folder to plain text, via HTML, using dedicated tools and GNU Parallel (https://www.gnu.org/software/parallel/).

Source code is available on GitHub: https://github.com/dridk/all2txt

- *.doc files are converted with **abiword**
- *.rtf files are converted with **rtf2html**
- *.docx files are converted with **pandoc**
- *.pdf files are converted with **pdf2html**

# Usage
- Create a folder named *input* and put all your documents inside it.
- Pull the image:
```docker pull dridk/all2txt```
- Run the conversion, using the same image name that was pulled:
```docker run -v "$(pwd)/input":/input --rm -it dridk/all2txt```
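
To illustrate how the per-extension routing listed above might work, here is a minimal shell sketch. It is not taken from the all2txt source: the function name `converter_for`, the case-insensitive matching, and the dry-run loop are assumptions made for illustration. Only the extension-to-tool mapping comes from this README, and the sketch prints tool names rather than invoking the converters.

```shell
#!/bin/sh
# Hypothetical helper (not from the all2txt source): prints the name of the
# converter that the README lists for a given file extension.
converter_for() {
    # Lower-case the name so REPORT.DOCX and report.docx are routed alike
    # (case-insensitive matching is an assumption of this sketch).
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *.doc)  echo "abiword" ;;
        *.rtf)  echo "rtf2html" ;;
        *.docx) echo "pandoc" ;;
        *.pdf)  echo "pdf2html" ;;
        *)      echo "unsupported"; return 1 ;;
    esac
}

# Dry run: show which converter each file in ./input would be routed to.
for f in input/*; do
    [ -f "$f" ] || continue   # skip if the folder is empty or missing
    printf '%s -> %s\n' "$f" "$(converter_for "$f")"
done
```

Running the script next to the *input* folder gives a quick preview of how a batch would be split among the tools before starting the container. Since each file is routed on its own, independently of the others, the work is well suited to GNU Parallel.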