https://github.com/Pouyaexe/Farsi_PDF
Make editable Persian(Farsi) PDF from Non-Editable ones, using OCR.
- Host: GitHub
- URL: https://github.com/Pouyaexe/Farsi_PDF
- Owner: Pouyaexe
- Created: 2022-08-31T13:17:27.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-06-21T06:32:47.000Z (over 1 year ago)
- Last Synced: 2024-05-30T12:45:38.807Z (6 months ago)
- Language: Jupyter Notebook
- Size: 55.7 KB
- Stars: 6
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Farsi_PDF
# PDF OCR
This repository contains a Jupyter Notebook (`PDF_OCR.ipynb`) that converts PDF files into searchable PDFs using OCR (Optical Character Recognition) technology.
## Requirements
To run the notebook, you need to install the following dependencies:
- `tesseract-ocr`: OCR engine for text recognition.
- `libtesseract-dev`: Development files for the Tesseract OCR library.
- `tesseract-ocr-fas`: Tesseract OCR language data for Persian (Farsi).
- `pytesseract`: Python wrapper for Tesseract OCR.
- `ghostscript`: Interpreter for the PostScript language and the PDF file format.
- `ocrmypdf`: Python tool to add OCR text to PDFs.

You can install these dependencies by running the following commands:
```bash
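# The leading "!" runs each command from a Jupyter notebook cell.
# In a regular shell, drop the "!" and use sudo for the apt commands if needed.
!apt install tesseract-ocr
!apt install libtesseract-dev
!apt-get install tesseract-ocr-fas
!pip install pytesseract
!apt install ghostscript
!pip install ocrmypdf==13.7.0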
```
## Usage
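Before working through the steps, it can help to confirm that the requirements are visible from Python. The following is a minimal sketch, not part of the repository: the helper name `missing_requirements` is hypothetical, and `gs` is assumed to be the Ghostscript executable name, which is typical on Linux but may differ on other platforms.

```python
import importlib.util
import shutil

# Names taken from the requirements list; "gs" is an assumed executable name.
EXECUTABLES = ["tesseract", "gs"]
MODULES = ["pytesseract", "ocrmypdf"]


def missing_requirements(executables=EXECUTABLES, modules=MODULES):
    """Return the executables and Python modules that cannot be found."""
    # shutil.which returns None when the program is not on PATH.
    missing = [exe for exe in executables if shutil.which(exe) is None]
    # find_spec returns None when a top-level module is not installed.
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing


if __name__ == "__main__":
    missing = missing_requirements()
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

This only checks that the tools can be located; it does not verify that the Persian language data (`tesseract-ocr-fas`) is installed.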
1. Clone the repository:
```bash
git clone https://github.com/Pouyaexe/Farsi_PDF.git
```
2. Move to the repository directory:
```bash
cd Farsi_PDF
```
3. Launch Jupyter Notebook:
```bash
jupyter notebook
```
4. Open the `PDF_OCR.ipynb` notebook in your Jupyter environment.
5. Follow the instructions in the notebook to convert your PDF files to searchable PDFs using OCR.
Note: Place the PDF files you want to convert in the `input_papers` directory. The converted files are saved in the `output_papers` directory.
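The directory convention above can be sketched outside the notebook as follows. This is an illustration under assumptions, not the notebook's actual code: the function names `plan_conversions` and `convert_all` are hypothetical, and the options the notebook passes to `ocrmypdf` may differ. It uses the `ocrmypdf.ocr` Python API with the Tesseract language code `fas` for Persian.

```python
from pathlib import Path


def plan_conversions(input_dir="input_papers", output_dir="output_papers"):
    """Pair every PDF in input_dir with an output path of the same name."""
    out = Path(output_dir)
    return [(pdf, out / pdf.name) for pdf in sorted(Path(input_dir).glob("*.pdf"))]


def convert_all(input_dir="input_papers", output_dir="output_papers", language="fas"):
    """Run OCR on each planned pair (assumed options; the notebook may differ)."""
    import ocrmypdf  # imported lazily so planning works without it installed

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for source, target in plan_conversions(input_dir, output_dir):
        # language takes a list of Tesseract language codes; "fas" is Persian.
        ocrmypdf.ocr(source, target, language=[language])


if __name__ == "__main__":
    for source, target in plan_conversions():
        print(source, "->", target)
```

Calling `convert_all()` from the repository root processes every file found. If an input PDF already contains a text layer, `ocrmypdf` may refuse to process it unless an option such as `skip_text` or `force_ocr` is set.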
Feel free to customize the notebook as per your requirements and explore other features of the OCR tools used.