https://github.com/omaxel/pdf-ocr
Recognize page content of a PDF as text using Tesseract and Ghostscript.
csharp ghostscript ocr pdf pdf-ocr-extraction tesseract-ocr
- Host: GitHub
- URL: https://github.com/omaxel/pdf-ocr
- Owner: omaxel
- License: mit
- Created: 2018-01-09T07:57:05.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2018-01-09T16:33:33.000Z (about 8 years ago)
- Last Synced: 2025-04-10T00:51:16.131Z (10 months ago)
- Topics: csharp, ghostscript, ocr, pdf, pdf-ocr-extraction, tesseract-ocr
- Language: C#
- Homepage:
- Size: 173 KB
- Stars: 7
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# pdf-ocr
Recognize page content of a PDF as text using [Tesseract](https://github.com/charlesw/tesseract) and [Ghostscript](https://www.ghostscript.com/).
## Prerequisites
* Install [Visual Studio 2015 Runtime](https://www.microsoft.com/en-us/download/details.aspx?id=48145) (both x86 & x64)
* Install [Ghostscript](https://www.ghostscript.com/download/gsdnld.html) (x86 or x64, depending on your computer)
## Installation
* Clone or download this repository.
* Open the solution in Visual Studio and run `Install-Package Tesseract -Version 3.0.2` from the `Package Manager Console`.
* Download the language data files for Tesseract 3.04 from the [tessdata repository](https://github.com/tesseract-ocr/tessdata/archive/3.04.00.zip) and add them to the `tessdata` folder of your project. Set `Copy to output directory` to `Always` for all the copied files. You can copy only the language files you need (e.g. all the files that start with `eng` for English).
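
A missing language file is an easy mistake to make at this step, so a small check at startup can help. The following is a minimal sketch, not code from this repository: the `TessdataCheck` class and `HasLanguageData` method are hypothetical names, and it assumes the `tessdata` folder is copied next to the executable, as the `Copy to output directory` setting above would do.

```csharp
using System;
using System.IO;

static class TessdataCheck
{
    // Returns true when "<language>.traineddata" exists in the given folder.
    // Hypothetical helper: only the main trained data file is checked, not
    // any auxiliary files that ship with some languages.
    public static bool HasLanguageData(string tessdataDir, string language)
    {
        string trainedData = Path.Combine(tessdataDir, language + ".traineddata");
        return File.Exists(trainedData);
    }

    static void Main()
    {
        // Assumption: "tessdata" sits next to the compiled executable.
        string tessdataDir = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "tessdata");
        string language = "eng";

        if (HasLanguageData(tessdataDir, language))
            Console.WriteLine("Language data found for '" + language + "'.");
        else
            Console.WriteLine("Missing " + language + ".traineddata in " + tessdataDir);
    }
}
```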
## Configuration
| Setting | Variable name | Default | Description |
|--------------------------------------|-----------------|----------------------------------------|-------------|
| **Input PDF file** | `inputPdfFile` | `test.pdf`, included in the repository | The PDF file containing the page to recognize as text. |
| **Page number** | `pageNumber` | `1` | The number of the page whose content will be recognized as text. |
| **Recognition language** | `ocrLanguage` | `"eng"` | The language Tesseract uses to recognize text. When you change this value, make sure you add the matching language data files to the `tessdata` folder. See the [Installation section](#installation). |
| **DPI for converting PDF page to image** | `pdfToImageDPI` | `150` | Tesseract cannot read PDF pages directly, so the page is first converted to an image. This value sets the DPI used for that conversion. |
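
To show how these four settings could fit together, here is a minimal sketch of the render-then-recognize flow. It is not the repository's implementation. It assumes the Ghostscript console executable `gswin64c.exe` is on the `PATH` (the project may call Ghostscript differently), and `PdfOcrSketch` and `BuildGhostscriptArgs` are hypothetical names. The Tesseract calls come from the `Tesseract` NuGet package installed above.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using Tesseract; // NuGet package "Tesseract" 3.0.2

static class PdfOcrSketch
{
    // Builds Ghostscript arguments that render a single page to a 24-bit PNG.
    public static string BuildGhostscriptArgs(string pdf, int page, int dpi, string outputPng)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "-q -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r{0} " +
            "-dFirstPage={1} -dLastPage={1} -sOutputFile=\"{2}\" \"{3}\"",
            dpi, page, outputPng, pdf);
    }

    static void Main()
    {
        // The four settings from the table, with their documented defaults.
        string inputPdfFile = "test.pdf";
        int pageNumber = 1;
        string ocrLanguage = "eng";
        int pdfToImageDPI = 150;

        string pagePng = Path.Combine(Path.GetTempPath(), "pdf-ocr-page.png");

        // Step 1: render the selected page to an image with Ghostscript.
        // Assumption: the 64-bit console executable is reachable via PATH.
        var startInfo = new ProcessStartInfo
        {
            FileName = "gswin64c.exe",
            Arguments = BuildGhostscriptArgs(inputPdfFile, pageNumber, pdfToImageDPI, pagePng),
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process gs = Process.Start(startInfo))
        {
            gs.WaitForExit();
            if (gs.ExitCode != 0)
                throw new InvalidOperationException("Ghostscript exited with code " + gs.ExitCode);
        }

        // Step 2: run Tesseract on the rendered image and print the text.
        using (var engine = new TesseractEngine("./tessdata", ocrLanguage, EngineMode.Default))
        using (Pix image = Pix.LoadFromFile(pagePng))
        using (Page page = engine.Process(image))
        {
            Console.WriteLine(page.GetText());
        }
    }
}
```

As a rough guide to what `pdfToImageDPI` controls: at the default of 150 DPI, a US Letter page (8.5 × 11 in) renders to 1275 × 1650 pixels. A higher value produces a larger image, which may help with small print but takes longer to render and recognize.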
## Tesseract usage
If you need more information on Tesseract usage, please visit [its own repository](https://github.com/charlesw/tesseract).