{"id":15645280,"url":"https://github.com/gamemaker1/office-text-extractor","last_synced_at":"2026-03-16T11:38:21.432Z","repository":{"id":37935228,"uuid":"344447034","full_name":"gamemaker1/office-text-extractor","owner":"gamemaker1","description":"Yet another library to extract text from MS Office and PDF files","archived":false,"fork":false,"pushed_at":"2024-07-23T08:09:37.000Z","size":2253,"stargazers_count":73,"open_issues_count":7,"forks_count":7,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-10T00:17:30.168Z","etag":null,"topics":["docx","get-text","ms-excel","ms-office","ms-powerpoint","ms-word","parser","pdf","pptx","text-extraction","xlsx"],"latest_commit_sha":null,"homepage":"https://npm.im/office-text-extractor","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"isc","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gamemaker1.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"license.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-04T11:13:13.000Z","updated_at":"2025-03-26T09:06:12.000Z","dependencies_parsed_at":"2024-10-03T12:09:45.883Z","dependency_job_id":"cd477927-0210-4638-8a93-f617fd5a5316","html_url":"https://github.com/gamemaker1/office-text-extractor","commit_stats":null,"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamemaker1%2Foffice-text-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamemaker1%2Foffice-text-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamemaker1%2Foffice-text
-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamemaker1%2Foffice-text-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gamemaker1","download_url":"https://codeload.github.com/gamemaker1/office-text-extractor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248131318,"owners_count":21052820,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docx","get-text","ms-excel","ms-office","ms-powerpoint","ms-word","parser","pdf","pptx","text-extraction","xlsx"],"created_at":"2024-10-03T12:05:42.015Z","updated_at":"2025-12-26T23:06:35.128Z","avatar_url":"https://github.com/gamemaker1.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# \u003cdiv align=\"center\"\u003e office-text-extractor \u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\nyet another library to extract text from docx, pptx, xlsx, and pdf files.\n\n\u003c/div\u003e\n\n## similar libraries\n\nthere are other great libraries that do the same job and have inspired this\nproject, such as:\n\n- [`any-text`](https://github.com/abhinaba-ghosh/any-text)\n- [`officeparser`](https://github.com/harshankur/officeParser)\n- [`textract`](https://www.npmjs.com/package/textract)\n\nhowever, office-text-extractor has the following differences:\n\n- parses file based on its **mime type**, not its file extension.\n- **does not spawn** a child process to use a tool installed on the device.\n- reads and returns text from the file if it 
contains **plain text**.\n\n## libraries used\n\nthis package uses some amazing existing libraries that perform better than the\nones that originally existed in this module, and are therefore used instead:\n\n- [`pdf-parse`](https://www.npmjs.com/package/pdf-parse), for parsing pdf files\n- [`xlsx`](https://www.npmjs.com/package/xlsx), for parsing xlsx files\n- [`mammoth`](https://www.npmjs.com/package/mammoth), for parsing docx files\n\na big thank you to the contributors of these projects!\n\n## installation\n\n#### node\n\n\u003e from version 2.0.0 onwards, this package is pure esm. please read\n\u003e [this article](https://gist.github.com/sindresorhus/a39789f98801d908bbc7ff3ecc99d99c)\n\u003e for a guide on how to ensure your project can import this library.\n\nto use this package in a node project, install it using a package manager such\nas `npm`/`pnpm`/`bun`:\n\n```sh\n\u003e npm install office-text-extractor\n\u003e pnpm add office-text-extractor\n\u003e bun add office-text-extractor\n```\n\n#### ~browser~\n\nthe library currently cannot be used in the browser due to my inability to figure\nout how to properly bundle the library with its dependencies. 
pull requests are\nwelcome and appreciated!\n\n## usage\n\nan example of using the library to extract text is as follows:\n\n```ts\nimport { readFile } from 'node:fs/promises'\nimport { getTextExtractor } from 'office-text-extractor'\n\n// this function returns a new instance of the `TextExtractor` class, with the default\n// extraction methods (docx, pptx, xlsx, pdf) registered.\nconst extractor = getTextExtractor()\n\n// extract text from a url, because that's a neat first example :p\nconst url = 'https://raw.githubusercontent.com/gamemaker1/office-text-extractor/rewrite/test/fixtures/pptx.pptx'\nconst textFromUrl = await extractor.extractText({ input: url, type: 'url' })\n\n// you can extract text from a file too, like so:\nconst path = 'stuff/boring.pdf'\nconst textFromFile = await extractor.extractText({ input: path, type: 'file' })\n\n// if you have a buffer (Uint8Array) with the file in it, you can pass that too:\nconst buffer = await readFile(path)\nconst textFromBuffer = await extractor.extractText({ input: buffer, type: 'buffer' })\n\nconsole.log(textFromUrl, textFromFile, textFromBuffer)\n```\n\nthe following is an example of how to create and use your own text extraction method:\n\n```ts\nimport { TextExtractor, type TextExtractionMethod } from 'office-text-extractor'\n\n/**\n * Extracts text from images.\n */\nclass ImageExtractor implements TextExtractionMethod {\n  /**\n   * The mime types of the file that the extractor accepts.\n   */\n  mimes = ['image/png', 'image/jpeg']\n\n  /**\n   * Extracts text from the image file passed by the user.\n   */\n  apply = async (input: Uint8Array): Promise\u003cstring\u003e =\u003e {\n    // `processImage` stands in for your own image-to-text routine; it is not defined here.\n    const text = await processImage(input)\n    return text\n  }\n}\n\n// create a new extractor and register our extraction method.\nconst extractor = new TextExtractor()\nextractor.addMethod(new ImageExtractor())\n\n// then use it like you would normally.\nconst text = await extractor.extractText(...)\nconsole.log(text)\n```\n\n## license\n\nthis project is licensed under the ISC license. 
please see\n[`license.md`](./license.md) for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgamemaker1%2Foffice-text-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgamemaker1%2Foffice-text-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgamemaker1%2Foffice-text-extractor/lists"}