https://github.com/iwstkhr/python-tesseract-ocr-example

This repository is intended for validation of Japanese OCR accuracy using Tesseract OCR + pytesseract.

japanese pytesseract tesseract-ocr


Host: GitHub
URL: https://github.com/iwstkhr/python-tesseract-ocr-example
Owner: iwstkhr
License: mit
Created: 2022-06-04T13:11:11.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2025-02-22T02:06:55.000Z (4 months ago)
Last Synced: 2025-02-23T20:42:27.736Z (4 months ago)
Topics: japanese, pytesseract, tesseract-ocr
Language: Python
Homepage:
Size: 789 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


## Introduction

This repository demonstrates how to perform OCR on Japanese PDFs using [**Tesseract OCR v4**](https://github.com/tesseract-ocr/tesseract) and [**pytesseract**](https://pypi.org/project/pytesseract/).

The source text is [**Run, Melos!**](http://pddlib.v.wol.ne.jp/literature/dazai/meros.htm) by Osamu Dazai, a work that is now in the public domain.

## Prerequisites

Before we start, ensure the following libraries are installed:

- [pdf2image](https://pypi.org/project/pdf2image/)
- [pytesseract](https://pypi.org/project/pytesseract/)

Additionally, Tesseract OCR itself must be installed. Follow the instructions on the [official repository](https://github.com/tesseract-ocr/tesseract) to set it up.
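Japanese OCR also needs the Japanese language data for Tesseract. The following is a minimal sketch of a check, assuming the `tesseract` binary is on `PATH` and that the `jpn` language pack is the one wanted; the helper names `parse_langs` and `installed_langs` are hypothetical and not part of this repository.

```python
import shutil
import subprocess


def parse_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages ..." header.
    return [line for line in lines if line and not line.startswith("List of")]


def installed_langs() -> list[str]:
    """Return installed language codes, or [] if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    proc = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some releases print the list to stderr, so read both streams.
    return parse_langs(proc.stdout + proc.stderr)


if __name__ == "__main__":
    print("jpn available:", "jpn" in installed_langs())
```

If `jpn` is missing, install the Japanese language data as described in the Tesseract documentation for your platform.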

## Steps to Perform OCR

The Python script [app.py](app.py) performs the following steps:

1. Convert a PDF to PNG image data using pdf2image.
2. Extract characters from the converted data using Tesseract OCR and pytesseract.
3. Save the result in a text file.
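The three steps above can be sketched roughly as follows. This is an illustration rather than the contents of `app.py`: the function names, the default output path, and the `jpn` language code are assumptions, and pdf2image additionally requires poppler to be installed.

```python
from pathlib import Path


def strip_newlines(text: str) -> str:
    """Remove line breaks (and page-break form feeds) from OCR output."""
    return "".join(text.splitlines())


def ocr_pdf(pdf_path: str, out_path: str = "result.txt", lang: str = "jpn") -> str:
    """Run OCR over every page of a PDF and save the text (hypothetical helper)."""
    # Imported here so strip_newlines stays usable without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    # 1. Render each PDF page to an in-memory image via PNG.
    pages = convert_from_path(pdf_path, fmt="png")
    # 2. Recognize the text on each page using the given language data.
    texts = [pytesseract.image_to_string(page, lang=lang) for page in pages]
    # 3. Join the pages, drop newlines, and save the result as UTF-8.
    result = strip_newlines("".join(texts))
    Path(out_path).write_text(result, encoding="utf-8")
    return result
```

Calling `ocr_pdf("input.pdf")` with a placeholder file name would write the recognized text to `result.txt` in the working directory.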

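Run the following commands to build the Docker image, run the script in a container, and copy `result.txt` to the current directory: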

```shell
NAME=pytesseract-sample
docker build -t $NAME .
docker run --name $NAME $NAME

# The OCR result is written to `result.txt` inside the container; copy it out to view it.
docker cp $NAME:/usr/src/app/result.txt ./
less result.txt

# Clean up
docker container rm $NAME
docker image rm $NAME
```

## Result

- **Original Text**: [original.txt](original.txt)
- **OCR Result**: [result.txt](result.txt)

### Notes on Diff Tools

Because the script removes newlines from the output, a line-based comparison with the `diff` command is of little use. Consider a tool that highlights character-level differences instead, such as the GUI-based [Araxis Merge](https://www.araxis.com/merge/) or [WinMerge](https://winmerge.org/).
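As a scriptable alternative, a character-level comparison can be sketched with Python's standard `difflib` module. The helper `char_diff` and the sample strings are hypothetical and not part of this repository; the second string is a made-up misrecognition, not actual OCR output.

```python
from difflib import SequenceMatcher


def char_diff(original: str, ocr: str) -> tuple[float, list[tuple[str, str, str]]]:
    """Return a similarity ratio and the character-level differences."""
    # autojunk=False disables the popularity heuristic, which can skew long texts.
    matcher = SequenceMatcher(None, original, ocr, autojunk=False)
    changes = [
        (tag, original[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), changes


if __name__ == "__main__":
    ratio, changes = char_diff("メロスは激怒した", "メロスは激努した")
    print(f"similarity: {ratio:.3f}")
    for tag, expected, actual in changes:
        print(tag, repr(expected), "->", repr(actual))
```

To compare the real files, read `original.txt` and `result.txt` as UTF-8 and strip newlines from both before calling `char_diff`. `SequenceMatcher` can be slow on long inputs, so treat this as a starting point.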

## Conclusion

Using **Tesseract OCR** and **pytesseract**, we can efficiently extract text from Japanese PDFs. While the results are promising, the accuracy may vary based on the complexity of the PDF layout.

Happy Coding! 🚀
