An open API service indexing awesome lists of open source software.

https://github.com/semvis123/go-catdoc

Go-catdoc, get text and metadata from .doc files.
https://github.com/semvis123/go-catdoc

catdoc doc extraction go golang imhex-pattern metadata msdoc text

Last synced: about 1 year ago
JSON representation

Go-catdoc, get text and metadata from .doc files.

Awesome Lists containing this project

README

          

## Go-catdoc, get text and metadata from .doc files.
[![GoDoc](https://godoc.org/github.com/semvis123/go-catdoc?status.svg)](https://godoc.org/github.com/semvis123/go-catdoc)
[![Tests](https://github.com/semvis123/go-catdoc/actions/workflows/go.yml/badge.svg)](https://github.com/semvis123/go-catdoc/actions/workflows/go.yml)

Uses Wazero to run catdoc as webassembly in Go.
The catdoc source is slightly modified to support reading metadata in `.doc`.
The `msdoc.hexpat` file is a pattern file for imhex that can parse the `summaryinformation` ole object inside the `.doc` file.

To compile the webassembly binary, go to ./catdoc/src/ and run `make catdoc-wasm`.
To run the tests, do `go test ./...`

Usage:
```
f, err := os.Open("test.doc")
text, err := gocatdoc.GetTextFromFile(f)
```