Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gamemaker1/office-text-extractor
Yet another library to extract text from MS Office and PDF files
docx get-text ms-excel ms-office ms-powerpoint ms-word parser pdf pptx text-extraction xlsx
- Host: GitHub
- URL: https://github.com/gamemaker1/office-text-extractor
- Owner: gamemaker1
- License: isc
- Created: 2021-03-04T11:13:13.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2024-07-23T08:09:37.000Z (6 months ago)
- Last Synced: 2025-01-11T09:08:22.878Z (13 days ago)
- Topics: docx, get-text, ms-excel, ms-office, ms-powerpoint, ms-word, parser, pdf, pptx, text-extraction, xlsx
- Language: TypeScript
- Homepage: https://npm.im/office-text-extractor
- Size: 2.15 MB
- Stars: 68
- Watchers: 2
- Forks: 7
- Open Issues: 6
Metadata Files:
- Readme: readme.md
- License: license.md
README
# office-text-extractor

yet another library to extract text from docx, pptx, xlsx, and pdf files.
## similar libraries
there are other great libraries that do the same job and have inspired this
project, such as:

- [`any-text`](https://github.com/abhinaba-ghosh/any-text)
- [`officeparser`](https://github.com/harshankur/officeParser)
- [`textract`](https://www.npmjs.com/package/textract)

however, office-text-extractor has the following differences:
- parses file based on its **mime type**, not its file extension.
- **does not spawn** a child process to use a tool installed on the device.
- reads and returns text from the file if it contains **plain text**.

## libraries used
this package uses some amazing existing libraries that perform better than the
ones that originally existed in this module, and are therefore used instead:

- [`pdf-parse`](https://www.npmjs.com/package/pdf-parse), for parsing pdf files
- [`xlsx`](https://www.npmjs.com/package/xlsx), for parsing xlsx files
- [`mammoth`](https://www.npmjs.com/package/mammoth), for parsing docx files

a big thank you to the contributors of these projects!
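
as an aside on the **mime type** point made earlier: detecting a file's type from its contents rather than its extension usually means inspecting the first few bytes. the snippet below is only a minimal sketch of that idea, not how office-text-extractor does it internally. `sniffMime` is a hypothetical helper that is not part of this library's api. it relies on two well-known signatures: pdf files start with `%PDF-`, and docx, pptx and xlsx files are zip archives that start with `PK\x03\x04`.

```ts
import { Buffer } from 'node:buffer'

// hypothetical helper for illustration only; not exported by office-text-extractor.
function sniffMime(input: Buffer): string | undefined {
	// pdf files begin with the ascii bytes `%PDF-`.
	if (input.subarray(0, 5).toString('latin1') === '%PDF-') {
		return 'application/pdf'
	}

	// docx, pptx and xlsx files are zip containers, which begin with `PK\x03\x04`.
	const zipSignature = [0x50, 0x4b, 0x03, 0x04]
	if (input.length >= 4 && zipSignature.every((byte, index) => input[index] === byte)) {
		return 'application/zip'
	}

	// anything else is left undetected in this sketch.
	return undefined
}
```

note that the zip signature alone cannot tell a docx from a pptx or an xlsx. telling them apart means looking at the entries inside the archive, which this sketch does not attempt.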
## installation
#### node
> from version 2.0.0 onwards, this package is pure esm. please read
> [this article](https://gist.github.com/sindresorhus/a39789f98801d908bbc7ff3ecc99d99c)
> for a guide on how to ensure your project can import this library.

to use office-text-extractor in a Node project, install it using `npm`/`pnpm`/`yarn`:
```sh
npm install office-text-extractor
pnpm add office-text-extractor
yarn add office-text-extractor
```

#### ~browser~
the library currently cannot be used in the browser due to its usage of the `node:buffer`
library. pull requests that can replace `node:buffer` with a different library are welcome!

## usage
an example of using the library to extract text is as follows:
```ts
import { readFile } from 'node:fs/promises'
import { getTextExtractor } from 'office-text-extractor'

// this function returns a new instance of the `TextExtractor` class, with the default
// extraction methods (docx, pptx, xlsx, pdf) registered.
const extractor = getTextExtractor()

// extract text from a url, because that's a neat first example :p
const url = 'https://raw.githubusercontent.com/gamemaker1/office-text-extractor/rewrite/test/fixtures/docs/pptx.pptx'
const textFromUrl = await extractor.extractText({ input: url, type: 'url' })

// you can extract text from a file too, like so:
const path = 'stuff/boring.pdf'
const textFromFile = await extractor.extractText({ input: path, type: 'file' })

// if you have a buffer with the file in it, you can pass that too:
const buffer = await readFile(path)
const textFromBuffer = await extractor.extractText({ input: buffer, type: 'buffer' })

console.log(textFromUrl, textFromFile, textFromBuffer)
```

the following is an example of how to create and use your own text extraction method:
```ts
import { type Buffer } from 'node:buffer'
import { TextExtractor, type TextExtractionMethod } from 'office-text-extractor'

/**
 * Extracts text from images.
 */
class ImageExtractor implements TextExtractionMethod {
	/**
	 * The mime types of the file that the extractor accepts.
	 */
	mimes = ['image/png', 'image/jpeg']

	/**
	 * Extracts text from the image file passed by the user.
	 */
	apply = async (input: Buffer): Promise<string> => {
		// `processImage` stands in for your own image-to-text routine; it is not part of this library.
		const text = await processImage(input)
		return text
	}
}

// create a new extractor and register our extraction method
const extractor = new TextExtractor()
extractor.addMethod(new ImageExtractor())

// then use it like you would normally
const text = await extractor.extractText({ input: '...', type: '...' })
console.log(text)
```

## license
this project is licensed under the ISC license. please see [`license.md`](./license.md)
for more details.