# tesseract-wasm

![npm package](https://img.shields.io/npm/v/tesseract-wasm)

A WebAssembly build of the [Tesseract](https://github.com/tesseract-ocr/tesseract)
OCR engine for use in the browser and Node.

tesseract-wasm can detect and recognize text in document images. It supports multiple languages via different [trained models](https://tesseract-ocr.github.io/tessdoc/Data-Files).

👉 [**Try the demo**](https://robertknight.github.io/tesseract-wasm/) (Currently supports English)

## Features

This Tesseract build has been optimized for use in the browser by:

- Stripping functionality that is not needed in a browser environment (e.g.
code to parse various image formats) to reduce download size and improve
startup performance. The library and English training data require a ~2.1MB
download (with Brotli compression).

- Using [WebAssembly SIMD](https://v8.dev/features/simd) when available
(Chrome >= 91, Firefox >= 90, Safari >= 16.4) to improve text
recognition performance.

- Providing a high-level API that can be used to run OCR in web pages without
blocking interaction, and a low-level API that provides more control over
execution.
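
To illustrate the SIMD point, WebAssembly SIMD support can be feature-detected by validating a tiny module that contains SIMD instructions. This is a minimal sketch and not necessarily how tesseract-wasm selects a binary internally. The `supportsSimd` and `pickCoreBinary` helpers are hypothetical names, and treating `tesseract-core.wasm` as the SIMD build is an assumption based on the file names the library ships.

```js
// A tiny module whose only function uses SIMD instructions
// (i8x16.splat followed by i8x16.popcnt) and returns a v128 value.
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, // "\0asm" magic number + version 1
  1, 5, 1, 96, 0, 1, 123, // type section: () -> v128
  3, 2, 1, 0, // function section: one function of type 0
  10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11, // code section
]);

// Hypothetical helper: true if the runtime accepts SIMD bytecode.
function supportsSimd() {
  return typeof WebAssembly === 'object' && WebAssembly.validate(SIMD_PROBE);
}

// Hypothetical helper: choose a binary name based on SIMD support.
// Assumption: the non-fallback file is the SIMD build.
function pickCoreBinary(simd = supportsSimd()) {
  return simd ? 'tesseract-core.wasm' : 'tesseract-core-fallback.wasm';
}

console.log(pickCoreBinary());
```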

## Setup

1. Add the tesseract-wasm library to your project:

```sh
npm install tesseract-wasm
```

2. Serve the `tesseract-core.wasm`, `tesseract-core-fallback.wasm` and
`tesseract-worker.js` files from `node_modules/tesseract-wasm/dist` alongside
your JavaScript bundle.

3. Get the training data file(s) for the languages you want to support from the
[tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast) repo and
serve them from a URL that your JavaScript can load. For example, the
`eng.traineddata` file supports English and also works with many documents in
other languages that use the same script.
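
If you serve these files from your own server, it helps to send `.wasm` files with the `application/wasm` content type, which browsers require for streaming compilation via `WebAssembly.instantiateStreaming`. The sketch below shows one way to map file names to content types. `contentTypeFor` is a hypothetical helper, and whether tesseract-wasm relies on streaming compilation is not stated here, so treat this as a precaution and not a requirement.

```js
// Content types for the files mentioned in the setup steps.
const CONTENT_TYPES = {
  '.wasm': 'application/wasm',
  '.js': 'text/javascript',
  '.traineddata': 'application/octet-stream',
};

// Hypothetical helper: look up a content type by file extension,
// falling back to a generic binary type for unknown extensions.
function contentTypeFor(fileName) {
  const dot = fileName.lastIndexOf('.');
  const ext = dot === -1 ? '' : fileName.slice(dot).toLowerCase();
  return CONTENT_TYPES[ext] ?? 'application/octet-stream';
}

console.log(contentTypeFor('tesseract-core.wasm'));
```

Many static file servers and bundler dev servers already apply these types, in which case no extra configuration is needed.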

## Usage

tesseract-wasm provides two APIs: a high-level asynchronous API (`OCRClient`)
and a lower-level synchronous API (`OCREngine`). The high-level API is the most
convenient way to run OCR on an image in a web page. It handles running the OCR
engine inside a Web Worker to avoid blocking page interaction. The low-level API
is useful if more control is needed over where/how the code runs and has lower
latency per API call.

### Using OCRClient in a web page

```js
import { OCRClient } from 'tesseract-wasm';

async function runOCR() {
  // Fetch document image and decode it into an ImageBitmap.
  const imageResponse = await fetch('./test-image.jpg');
  const imageBlob = await imageResponse.blob();
  const image = await createImageBitmap(imageBlob);

  // Initialize the OCR engine. This will start a Web Worker to do the
  // work in the background.
  const ocr = new OCRClient();

  try {
    // Load the appropriate OCR training data for the image(s) we want to
    // process.
    await ocr.loadModel('eng.traineddata');

    await ocr.loadImage(image);

    // Perform text recognition and return text in reading order.
    const text = await ocr.getText();

    console.log('OCR text: ', text);
  } finally {
    // Once all OCR-ing has been done, shut down the Web Worker and free up
    // resources.
    ocr.destroy();
  }
}

runOCR();
```
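
When several images need processing, the same client can likely be reused so that the Web Worker and model are set up only once. The following is a minimal sketch under that assumption. `recognizeAll` is a hypothetical helper, and it uses only the `loadImage` and `getText` methods shown in the example above.

```js
// Hypothetical helper: run OCR on several images with a single client.
// `ocr` is expected to expose async `loadImage` and `getText` methods, as
// OCRClient does in the example above, and to already have a model loaded.
async function recognizeAll(ocr, images) {
  const texts = [];
  for (const image of images) {
    // Process images one at a time. Assumption: loading a new image
    // replaces the one loaded previously.
    await ocr.loadImage(image);
    texts.push(await ocr.getText());
  }
  return texts;
}
```

Inside `runOCR`, this could be called as `await recognizeAll(ocr, [imageA, imageB])` once `loadModel` has resolved, with `ocr.destroy()` still running in the `finally` block.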

## Examples and documentation

See the `examples/` directory for projects that show usage of the library in
the browser and Node.

See the [API documentation](https://robertknight.github.io/tesseract-wasm/api/)
for detailed usage information.

See the Tesseract [User Manual](https://tesseract-ocr.github.io/tessdoc/) for
information on how Tesseract works, as well as advice on [improving
recognition](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html).

## Development

### Prerequisites

To build this library locally, you will need:

- A C++ build toolchain (e.g. via the `build-essential` package on Ubuntu or Xcode on macOS)
- [CMake](https://cmake.org)
- [Ninja](https://ninja-build.org)

The [Emscripten](https://emscripten.org) toolchain used to compile C++ to
WebAssembly is downloaded as part of the build process.

To install CMake and Ninja:

#### On macOS

```sh
brew install cmake ninja
```

#### On Ubuntu

```sh
sudo apt-get install cmake ninja-build
```

### Building the library

```sh
git clone https://github.com/robertknight/tesseract-wasm
cd tesseract-wasm

# Build WebAssembly binaries and JS library in dist/ folder
make lib

# Run tests
make test
```

To test your local build of the library with the example projects, or your own
projects, you can use [yalc](https://www.npmjs.com/package/yalc).

```sh
# In this project
yalc publish

# In the project where you want to use your local build of tesseract-wasm
yalc link tesseract-wasm
```