https://github.com/xcrap-cloud/image-text-extractor
Xcrap Image Text Extractor is a package of the Xcrap framework that abstracts the extraction of texts from images using the node-tesseract-ocr library.
- Host: GitHub
- URL: https://github.com/xcrap-cloud/image-text-extractor
- Owner: Xcrap-Cloud
- License: mit
- Created: 2025-04-10T16:11:27.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-04-10T16:38:48.000Z (about 1 year ago)
- Last Synced: 2025-04-10T18:00:16.316Z (about 1 year ago)
- Topics: extractor, image, javascript, nodejs, scraping, tesseract, text, typescript, web, xcrap
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/@xcrap/image-text-extractor
- Size: 83 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 🕷️ Xcrap Image Text Extractor
**Xcrap Image Text Extractor** is a package of the Xcrap framework that abstracts the extraction of texts from images using the [node-tesseract-ocr](https://www.npmjs.com/package/node-tesseract-ocr) library.
## 📦 Installation
Installation is straightforward: use your preferred package manager. Here is an example using npm:
```cmd
npm i @xcrap/image-text-extractor
```
## 🚀 Usage
**Xcrap Image Text Extractor** provides an *async extractor* that can be used in an HTML parsing model just like any other extractor:
```ts
import { extractImageText } from "@xcrap/image-text-extractor"
import { HtmlParsingModel } from "@xcrap/parser"

// Run OCR on every <img> matched by the query and collect the extracted texts
const parsingModel = new HtmlParsingModel({
    imageTexts: {
        query: "img",
        multiple: true,
        extractor: extractImageText({ lang: "eng" })
    }
})
```
To transform the `src` of the images, for example to resolve relative paths, pass a `transformSrc` function in the options:
```ts
// Assumes HtmlParsingModel and extractImageText are imported as in the previous example
const parsingModel = new HtmlParsingModel({
    imageTexts: {
        query: "img",
        multiple: true,
        extractor: extractImageText({
            lang: "eng",
            transformSrc: (originalSrc) => {...}
        })
    }
})
```
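As an illustration of what a `transformSrc` callback might look like, here is a minimal sketch that resolves relative image paths against the URL of the scraped page using the standard `URL` API. The `resolveSrc` helper and the `PAGE_URL` constant are hypothetical names introduced for this example, and the sketch assumes that `transformSrc` receives the original `src` string and returns the string to use in its place.

```ts
// Hypothetical URL of the page being scraped (an assumption for this sketch).
const PAGE_URL = "https://example.com/articles/post.html"

// Hypothetical helper: resolves a possibly relative `src` against a base URL.
// Absolute URLs are returned unchanged, because the URL constructor ignores
// the base when the first argument is already absolute.
function resolveSrc(originalSrc: string, baseUrl: string = PAGE_URL): string {
    return new URL(originalSrc, baseUrl).href
}
```

Under that assumption, the helper could be passed as `transformSrc: (originalSrc) => resolveSrc(originalSrc)`.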
> Check out more options at [node-tesseract-ocr](https://www.npmjs.com/package/node-tesseract-ocr).
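
For reference, a few of the options documented by node-tesseract-ocr are `lang`, `oem`, and `psm`. The sketch below shows what such an options object might look like. The `OcrOptionsSketch` type is a hypothetical local type written for this example, not the package's own, and whether every option is forwarded unchanged by `extractImageText` is an assumption to verify against the package source.

```ts
// Hypothetical local type for this sketch; the package's real option type may differ.
interface OcrOptionsSketch {
    lang: string
    oem?: number
    psm?: number
}

const ocrOptions: OcrOptionsSketch = {
    lang: "eng", // language data Tesseract should load
    oem: 1,      // OCR engine mode (1 = LSTM neural network engine only)
    psm: 3       // page segmentation mode (3 = fully automatic, without orientation/script detection)
}
```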
## 🤝 Contributing
Want to contribute? Follow these steps:
- Fork the repository.
- Create a new branch (`git checkout -b feature-new`).
- Commit your changes (`git commit -m 'Add new feature'`).
- Push to the branch (`git push origin feature-new`).
- Open a Pull Request.
## 📝 License
This project is licensed under the MIT License.