![version](https://img.shields.io/badge/version-18%2B-EB8E5F)
![platform](https://img.shields.io/static/v1?label=platform&message=mac-intel%20|%20mac-arm%20|%20win-64&color=blue)
[![license](https://img.shields.io/github/license/miyako/4d-plugin-doctotext)](LICENSE)
![downloads](https://img.shields.io/github/downloads/miyako/4d-plugin-doctotext/total)

### Dependencies and Licensing

* the source code of this plugin, developed using the [4D Plug-in SDK](https://github.com/4d/4D-Plugin-SDK), is licensed under the MIT license.
* see [SILVERCODERS](https://silvercoders.com/en/products/doctotext/) for the licensing of **DocToText** (GPLv2 or commercial).
* the licensing of the binary product of this plugin is subject to the licensing of all its dependencies.

# 4d-plugin-doctotext
4D implementation of [DocToText](https://silvercoders.com/en/products/doctotext/).


### Abstract

the goal of this project is to support legacy Microsoft Word documents with the `.doc` file extension.

* [`wv`](https://sourceforge.net/projects/wvware/files/wv/) can load and parse Word 2000, 97, 95 and 6 file formats.

* [`wvware`](http://wvware.sourceforge.net) is a document converter that uses `wv` to import `.doc` files. the output formats include `.rtf`, `.txt`, `.tex`, `.pdf` and `.html`. see the [unofficial mirror](https://github.com/remram44/wvware).

* [`abiword`](http://www.abisource.com) is a word processor that uses `wv` to import `.doc` files. it has a command line interface and a server mode, similar to OpenOffice, that can be used as a document converter. `wvware` deprecated its own suite of converters in favour of `abiword`.

* [`wv2`](https://sourceforge.net/projects/wvware/files/wv2/) is the successor to `wv`. it depends on `zlib`, `libgsf`, `libbz2`, `libxml2`, `libiconv` and `glib`, which in turn depends on `libffi` and `libpcre`.

* [`doctotext`](http://silvercoders.com/en/products/doctotext/) is a document converter that uses `wv2` to import `.doc` files. additionally, it uses [`libcharsetdetect`](https://github.com/batterseapower/libcharsetdetect), [`htmlcxx`](http://htmlcxx.sourceforge.net), `libmimetic` and `minizip` to support other input formats. the output format is always plain text.

* [`pthread-win32`](https://github.com/GerHobbelt/pthread-win32): the NuGet package might not work, in which case it needs to be compiled from source.

### Features

extract plain text from various file types (see the `format` option under Options for the supported extensions).

### Syntax

```4d
status:=DocToText (document;options;attachments)
```

Parameter|Type|Description
------------|------|----
document|BLOB|
options|Object|see below
attachments|Array BLOB|
status|Object|
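
the call above can be sketched as follows. this is a minimal example, not taken from the plugin's own documentation: the file name `sample.doc` and its location in the Resources folder are assumptions, and because the properties of `status` are not described here, the object is simply dumped as JSON for inspection.

```4d
// minimal sketch: load a .doc file into a BLOB and pass it to the plugin
C_TEXT($path)
C_BLOB($document)
C_OBJECT($options;$status)
ARRAY BLOB($attachments;0)

// hypothetical sample file; replace with a real path
$path:=Get 4D folder(Current resources folder)+"sample.doc"
DOCUMENT TO BLOB($path;$document)

If (OK=1)  // OK is set to 1 when the file was loaded
	// ".doc" is the default format; it is passed explicitly for clarity
	$options:=New object("format";".doc")
	$status:=DocToText ($document;$options;$attachments)
	// the layout of status is not documented here, so just display it
	ALERT(JSON Stringify($status;*))
End if
```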

#### Options

Property|Type|Description
------------|------|----
xml | Text |`parse` (default) `fix` `strip`
table | Text | `table` (default) `row` `col`
url | Text | `underscored` (default) `text` `extended`
list | Text |` * ` (default) or any string
verbose | Boolean |`false` (default)
fallback | Boolean |`false` (default)
format | Text | `.doc` (default) `.rtf` `.docx` `.pptx` `.xlsx` `.fodt` `.fods` `.fodp` `.fodg` `.odt` `.ods` `.odp` `.odg` `.ppt` `.xls` `.xlsb` `.pages` `.numbers` `.key` `.html` `.pdf` `.eml`
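
as an illustration of the table above, an options object for an HTML input might be built like this. the property names and values come from the table; the particular combination, and the assumption that `$document` already holds the file contents, are only for the sake of the example.

```4d
// sketch: non-default options for an .html document
C_BLOB($document)  // assumed to be filled elsewhere, e.g. with DOCUMENT TO BLOB
C_OBJECT($options;$status)
ARRAY BLOB($attachments;0)

$options:=New object
$options.format:=".html"  // input type; the default is ".doc"
$options.xml:="strip"  // instead of the default "parse"
$options.table:="row"  // instead of the default "table"
$options.url:="text"  // instead of the default "underscored"
$options.list:=" - "  // any string; the default is " * "
$options.verbose:=False  // default
$options.fallback:=False  // default

$status:=DocToText ($document;$options;$attachments)
```

properties left out of the object should fall back to the defaults listed in the table.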