https://github.com/teleprint-me/text-extraction

A Python package for extracting text from various file formats such as images, PDFs, and HTML documents.

Host: GitHub
URL: https://github.com/teleprint-me/text-extraction
Owner: teleprint-me
License: agpl-3.0
Created: 2024-02-09T20:08:02.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-07-11T16:30:10.000Z (about 1 year ago)
Last Synced: 2025-01-03T10:45:48.594Z (6 months ago)
Language: Python
Size: 124 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Text Extraction

Text Extraction is a Python package for extracting text from various file formats such as images, PDFs, and HTML documents.

## Getting Started

### Prerequisites

Make sure you have Tesseract OCR installed on your system.

#### Arch Linux

```sh
sudo pacman -S tesseract
```

#### Ubuntu

```sh
sudo apt install tesseract-ocr
```
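
Before running the OCR tool, it can help to confirm that the Tesseract binary is actually reachable. The following is a minimal sketch using only the standard library. The helper name `tesseract_available` is hypothetical and not part of this package, and the sketch assumes the executable is named `tesseract`.

```python
import shutil


def tesseract_available(executable: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found at:", shutil.which("tesseract"))
    else:
        print("Tesseract not found on PATH; install it before running OCR.")
```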

### Setup

Clone the repository, create a virtual environment, activate it, and install the required dependencies using pip.

```sh
git clone https://github.com/teleprint-me/text-extraction
cd text-extraction
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Usage

Once the package and its dependencies are installed, you can use the command-line tools provided by the package to extract text from different file formats.

```sh
# Example command for extracting text from an image
python -m text_extraction.cli.ocr --path_image

# Example command for extracting text from a PDF
python -m text_extraction.cli.pdf --path_input

# Example command for extracting text from an HTML file
python -m text_extraction.cli.html --dir-path
```
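
To process many files, the OCR command can be driven from a script. The sketch below only builds the argument list and leaves execution to the caller. The helper name `build_ocr_command` and the `scans` directory are hypothetical, and the sketch assumes `--path_image` takes a single image path as its value, which this README implies but does not state.

```python
import sys
from pathlib import Path


def build_ocr_command(image_path: Path) -> list[str]:
    """Build the argument list for the OCR command-line tool.

    Assumption: --path_image accepts one image path as its value.
    """
    return [
        sys.executable,  # interpreter of the currently active environment
        "-m",
        "text_extraction.cli.ocr",
        "--path_image",
        str(image_path),
    ]


if __name__ == "__main__":
    import subprocess

    # "scans" is a hypothetical directory of PNG images.
    for image in sorted(Path("scans").glob("*.png")):
        subprocess.run(build_ocr_command(image), check=True)
```

Passing a list to `subprocess.run` rather than a single string avoids shell quoting problems with paths that contain spaces.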

## Contributions

Contributions are welcome! Feel free to submit bug reports, feature requests, or pull requests to help improve the package.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
