https://github.com/miyako/4d-plugin-doctotext
4D implementation of DocToText.
- Host: GitHub
- URL: https://github.com/miyako/4d-plugin-doctotext
- Owner: miyako
- License: mit
- Created: 2021-12-02T12:14:42.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-12-13T00:33:09.000Z (over 4 years ago)
- Last Synced: 2025-01-08T17:57:24.920Z (over 1 year ago)
- Topics: 4d-plugin, doc, docx, rtf
- Language: C
- Homepage:
- Size: 117 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README



### Dependencies and Licensing
* the source code of this plugin, developed using the [4D Plug-in SDK](https://github.com/4d/4D-Plugin-SDK), is licensed under the MIT license.
* see [SILVERCODERS](https://silvercoders.com/en/products/doctotext/) for the licensing of **DocToText** (GPLv2 or commercial).
* the licensing of the binary product of this plugin is subject to the licensing of all its dependencies.
# 4d-plugin-doctotext
4D implementation of [DocToText](https://silvercoders.com/en/products/doctotext/).

### Abstract
the goal of this project is to support legacy Microsoft Word documents with the `.doc` file extension.
* [`wv`](https://sourceforge.net/projects/wvware/files/wv/) can load and parse Word 2000, 97, 95 and 6 file formats.
* [`wvware`](http://wvware.sourceforge.net) is a document converter that uses `wv` to import `.doc` files. the output formats include `.rtf`, `.txt`, `.tex`, `.pdf` and `.html`. see [unofficial mirror](https://github.com/remram44/wvware).
* [`abiword`](http://www.abisource.com) is a word processor that uses `wv` to import `.doc` files. it has a command line interface and a server mode, similar to OpenOffice, so it can be used as a document converter. `wvware` deprecated its own suite of converters in favour of `abiword`.
* [`wv2`](https://sourceforge.net/projects/wvware/files/wv2/) is the successor to `wv`. it depends on `zlib`, `libgsf`, `libbz2`, `libxml2`, `libiconv` and `glib`, which in turn depends on `libffi` and `libpcre`.
* [`doctotext`](http://silvercoders.com/en/products/doctotext/) is a document converter that uses `wv2` to import `.doc` files. additionally it uses [`libcharsetdetect`](https://github.com/batterseapower/libcharsetdetect), [`htmlcxx`](http://htmlcxx.sourceforge.net), `libmimetic` and `minizip` to support other input formats. the output format is always plain text.
* the [`pthread-win32`](https://github.com/GerHobbelt/pthread-win32) NuGet package might not work; it may need to be compiled from source.
### Features
extract plain text from various file types; the `format` option lists the supported file extensions.
### Syntax
```4d
status:=DocToText (document;options;attachments)
```
Parameter|Type|Description
------------|------|----
document|BLOB|
options|Object|see below
attachments|Array BLOB|
status|Object|
#### Options
Property|Type|Description
------------|------|----
xml | Text |`parse` (default) `fix` `strip`
table | Text | `table` (default) `row` `col`
url | Text | `underscored` (default) `text` `extended`
list | Text |` * ` (default) or any string
verbose | Boolean |`false` (default)
fallback | Boolean |`false` (default)
format | Text | `.doc` (default) `.rtf` `.docx` `.pptx` `.xlsx` `.fodt` `.fods` `.fodp` `.fodg` `.odt` `.ods` `.odp` `.odg` `.ppt` `.xls` `.xlsb` `.pages` `.numbers` `.key` `.html` `.pdf` `.eml`
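
#### Example
a minimal sketch of a call, based only on the syntax and options listed above. the file name and location are hypothetical, and the properties of `status` are not documented in this section, so the sketch does not read them. `attachments` is passed as an empty BLOB array; how the plugin uses it is not described here.

```4d
C_TEXT($path)
C_BLOB($document)
C_OBJECT($options; $status)
ARRAY BLOB($attachments; 0)

// hypothetical file on the desktop, for illustration only
$path:=System folder(Desktop)+"sample.doc"

// load the whole file into a BLOB, as the plugin takes a BLOB and not a path
DOCUMENT TO BLOB($path; $document)

// every property is optional; these values come from the options table
$options:=New object
$options.format:=".doc"
$options.table:="table"
$options.url:="text"
$options.list:=" * "
$options.verbose:=False

$status:=DocToText($document; $options; $attachments)
```

because the document is passed as a BLOB, the plugin cannot see the file name; setting `format` explicitly is presumably the way to tell it the input type when it is not `.doc`.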