https://github.com/Pouyaexe/Farsi_PDF

Make editable Persian (Farsi) PDFs from non-editable ones using OCR.

Host: GitHub
URL: https://github.com/Pouyaexe/Farsi_PDF
Owner: Pouyaexe
Created: 2022-08-31T13:17:27.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-06-21T06:32:47.000Z (about 2 years ago)
Last Synced: 2024-11-20T01:33:24.490Z (8 months ago)
Language: Jupyter Notebook
Size: 55.7 KB
Stars: 6
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Farsi_PDF

# PDF OCR

This repository contains a Jupyter Notebook (`PDF_OCR.ipynb`) that converts PDF files into searchable PDFs using OCR (Optical Character Recognition) technology.

## Requirements

To run the notebook, you need to install the following dependencies:

- `tesseract-ocr`: OCR engine for text recognition.
- `libtesseract-dev`: Development files for the Tesseract OCR library.
- `tesseract-ocr-fas`: Tesseract OCR language data for Persian (Farsi).
- `pytesseract`: Python wrapper for Tesseract OCR.
- `ghostscript`: Interpreter for the PostScript language and the PDF file format.
- `ocrmypdf`: Python tool to add OCR text to PDFs.

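You can install these dependencies by running the following commands in a notebook cell on a Debian- or Ubuntu-based system. The leading `!` tells Jupyter to run the line as a shell command; in a regular terminal, drop the `!` and prefix the `apt` commands with `sudo` if needed.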

```bash
!apt install tesseract-ocr
!apt install libtesseract-dev
!apt-get install tesseract-ocr-fas
!pip install pytesseract
!apt install ghostscript
!pip install ocrmypdf==13.7.0
```
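Before running the notebook, it can help to confirm that the tools listed above are actually reachable. The following is a minimal sketch, not part of the repository: the helper names (`missing_tools`, `parse_languages`, `installed_languages`) are hypothetical, and it assumes the executables are called `tesseract`, `gs` (Ghostscript), and `ocrmypdf` on your system.

```python
import shutil
import subprocess

# Assumed executable names; "gs" is the usual name of the Ghostscript binary.
REQUIRED_TOOLS = ("tesseract", "gs", "ocrmypdf")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


def parse_languages(list_langs_output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = (line.strip() for line in list_langs_output.splitlines())
    # Language codes are single tokens; the header line contains spaces
    # and ends with a colon, so it is filtered out here.
    return {
        line for line in lines
        if line and " " not in line and not line.endswith(":")
    }


def installed_languages():
    """Ask the local Tesseract binary which language packs it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Read both streams, in case a given release reports the list on stderr.
    return parse_languages(result.stdout + "\n" + result.stderr)
```

Calling `missing_tools()` should return an empty list once everything is installed, and `"fas" in installed_languages()` indicates that the Persian language data from `tesseract-ocr-fas` is visible to Tesseract.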

## Usage

1. Clone the repository:

```bash
git clone https://github.com/Pouyaexe/Farsi_PDF.git
```

2. Move to the repository directory:

```bash
cd Farsi_PDF
```

3. Launch Jupyter Notebook:

```bash
jupyter notebook
```

4. Open the `PDF_OCR.ipynb` notebook in your Jupyter environment.

5. Follow the instructions in the notebook to convert your PDF files to searchable PDFs using OCR.

Note: Place the PDF files you want to convert in the `input_papers` directory. The converted output files are saved in the `output_papers` directory.
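The README does not show what the notebook runs internally, so the following is only a sketch of how a batch conversion over these two directories could look. It assumes the `ocrmypdf` command-line tool is on the PATH and uses its `-l` option to select the Persian language data (`fas`); the function names are hypothetical and not taken from the notebook.

```python
import subprocess
from pathlib import Path


def plan_jobs(input_dir="input_papers", output_dir="output_papers"):
    """Pair every PDF in input_dir with a same-named target in output_dir."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    return [(src, out_dir / src.name) for src in sorted(in_dir.glob("*.pdf"))]


def build_command(src, dst, language="fas"):
    """Build the ocrmypdf command for one file; -l selects the OCR language."""
    return ["ocrmypdf", "-l", language, str(src), str(dst)]


def convert_all(input_dir="input_papers", output_dir="output_papers", language="fas"):
    """Run OCR on every planned job and return the paths that were written."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for src, dst in plan_jobs(input_dir, output_dir):
        # check=True raises CalledProcessError if ocrmypdf exits with an error.
        subprocess.run(build_command(src, dst, language), check=True)
        written.append(dst)
    return written
```

With the dependencies installed, `convert_all()` would process every `*.pdf` file in `input_papers` and write a searchable copy with the same file name to `output_papers`. Keeping the job planning separate from the subprocess call makes it easy to inspect what would be converted before any OCR runs.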

Feel free to customize the notebook to suit your requirements and to explore other features of the OCR tools it uses.
