https://github.com/gamemaker1/office-text-extractor


# office-text-extractor

yet another library to extract text from docx, pptx, xlsx, and pdf files.

## similar libraries

there are other great libraries that do the same job and have inspired this
project, such as:

- [`any-text`](https://github.com/abhinaba-ghosh/any-text)
- [`officeparser`](https://github.com/harshankur/officeParser)
- [`textract`](https://www.npmjs.com/package/textract)

however, office-text-extractor has the following differences:

- parses a file based on its **mime type**, not its file extension.
- **does not spawn** a child process to use a tool installed on the device.
- reads and returns text from the file if it contains **plain text**.
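
to illustrate the first point, here is a minimal sketch of content-based detection. it is not this library's actual implementation: `sniffMime` is a hypothetical helper that only checks two well-known file signatures, and a real detector covers many more formats.

```ts
import { Buffer } from 'node:buffer'

// hypothetical helper, not part of office-text-extractor: guess a mime type
// from the first bytes of a file instead of trusting its name.
function sniffMime(data: Buffer): string | undefined {
  // pdf files start with the ascii bytes "%PDF-"
  if (data.subarray(0, 5).toString('latin1') === '%PDF-') {
    return 'application/pdf'
  }

  // docx, pptx and xlsx files are zip archives, which start with "PK\x03\x04";
  // telling them apart requires looking inside the archive.
  if (
    data.length >= 4 &&
    data[0] === 0x50 &&
    data[1] === 0x4b &&
    data[2] === 0x03 &&
    data[3] === 0x04
  ) {
    return 'application/zip'
  }

  return undefined
}

// a pdf renamed to `notes.txt` is still recognised by its content
const renamedPdf = Buffer.from('%PDF-1.7\n...')
console.log(sniffMime(renamedPdf)) // prints: application/pdf
```

an extension check would have treated that file as plain text, which is why detection by content is the more robust choice.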

## libraries used

this package relies on existing libraries that perform better than the parsers
originally written for this module:

- [`pdf-parse`](https://www.npmjs.com/package/pdf-parse), for parsing pdf files
- [`xlsx`](https://www.npmjs.com/package/xlsx), for parsing xlsx files
- [`mammoth`](https://www.npmjs.com/package/mammoth), for parsing docx files

a big thank you to the contributors of these projects!

## installation

#### node

> from version 2.0.0 onwards, this package is pure esm. please read
> [this article](https://gist.github.com/sindresorhus/a39789f98801d908bbc7ff3ecc99d99c)
> for a guide on how to ensure your project can import this library.

to use office-text-extractor in a node project, install it using `npm`/`pnpm`/`yarn`:

```sh
# run the command for your package manager
npm install office-text-extractor
pnpm add office-text-extractor
yarn add office-text-extractor
```

#### ~browser~

the library currently cannot be used in the browser because it depends on the `node:buffer`
module. pull requests that replace `node:buffer` with a browser-compatible alternative are welcome!

## usage

an example of using the library to extract text is as follows:

```ts
import { readFile } from 'node:fs/promises'
import { getTextExtractor } from 'office-text-extractor'

// this function returns a new instance of the `TextExtractor` class, with the default
// extraction methods (docx, pptx, xlsx, pdf) registered.
const extractor = getTextExtractor()

// extract text from a url, because that's a neat first example :p
const url = 'https://raw.githubusercontent.com/gamemaker1/office-text-extractor/rewrite/test/fixtures/docs/pptx.pptx'
const textFromUrl = await extractor.extractText({ input: url, type: 'url' })

// you can extract text from a file too, like so:
const path = 'stuff/boring.pdf'
const textFromFile = await extractor.extractText({ input: path, type: 'file' })

// if you have a buffer with the file in it, you can pass that too:
const buffer = await readFile(path)
const textFromBuffer = await extractor.extractText({ input: buffer, type: 'buffer' })

console.log(textFromUrl, textFromFile, textFromBuffer)
```
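
if the source of a document is only known at runtime, a small wrapper can pick the `type` field for you. the sketch below is an assumption layered on the three input types shown above (`url`, `file`, `buffer`): `toExtractionInput` and the `ExtractionInput` type are hypothetical names and are not exported by this library.

```ts
import { Buffer } from 'node:buffer'

// the three input shapes used in the usage example; declared locally here,
// the library's own option type may differ.
type ExtractionInput =
  | { input: string; type: 'url' }
  | { input: string; type: 'file' }
  | { input: Buffer; type: 'buffer' }

// hypothetical helper, not part of office-text-extractor
function toExtractionInput(source: string | Buffer): ExtractionInput {
  // buffers are passed through untouched
  if (Buffer.isBuffer(source)) {
    return { input: source, type: 'buffer' }
  }

  // strings that start with http:// or https:// are treated as urls
  if (/^https?:\/\//i.test(source)) {
    return { input: source, type: 'url' }
  }

  // everything else is assumed to be a path on disk
  return { input: source, type: 'file' }
}
```

the result can then be handed to the extractor, for example `await extractor.extractText(toExtractionInput('stuff/boring.pdf'))`.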

the following is an example of how to create and use your own text extraction method:

```ts
import { type Buffer } from 'node:buffer'
import { TextExtractor, type TextExtractionMethod } from 'office-text-extractor'

/**
 * Extracts text from images.
 */
class ImageExtractor implements TextExtractionMethod {
  /**
   * The mime types of the file that the extractor accepts.
   */
  mimes = ['image/png', 'image/jpeg']

  /**
   * Extracts text from the image file passed by the user.
   * `processImage` is a placeholder for your own ocr routine; it is not
   * provided by this library.
   */
  apply = async (input: Buffer): Promise<string> => {
    const text = await processImage(input)
    return text
  }
}

// create a new extractor and register our extraction method
const extractor = new TextExtractor()
extractor.addMethod(new ImageExtractor())

// then use it like you would normally
const text = await extractor.extractText({ input: '...', type: '...' })
console.log(text)
```
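
since the image example depends on an ocr routine you have to supply yourself, here is a smaller sketch that runs without any extra code. it assumes an extraction method only needs a `mimes` list and an async `apply` function, as in the example above. `ExtractionMethodShape`, `HtmlExtractor` and `stripTags` are hypothetical names, and the interface is a local stand-in for the library's `TextExtractionMethod` type.

```ts
import { Buffer } from 'node:buffer'

// local stand-in for the library's `TextExtractionMethod` type, assumed from
// the example above: a list of mime types plus an async `apply` function.
interface ExtractionMethodShape {
  mimes: string[]
  apply: (input: Buffer) => Promise<string>
}

// hypothetical helper: replaces tags with spaces and collapses whitespace.
// a naive regex is enough for a demo, but is no substitute for an html parser.
function stripTags(html: string): string {
  return html
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim()
}

/**
 * Extracts the visible text from html documents.
 */
class HtmlExtractor implements ExtractionMethodShape {
  mimes = ['text/html']

  apply = async (input: Buffer): Promise<string> => {
    return stripTags(input.toString('utf8'))
  }
}
```

if the shape matches the library's type, an instance could be registered with `extractor.addMethod(new HtmlExtractor())` in the same way as the `ImageExtractor` above.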

## license

this project is licensed under the ISC license. please see [`license.md`](./license.md)
for more details.