https://github.com/teleprint-me/text-extraction
A Python package for extracting text from various file formats such as images, PDFs, and HTML documents.
- Host: GitHub
- URL: https://github.com/teleprint-me/text-extraction
- Owner: teleprint-me
- License: agpl-3.0
- Created: 2024-02-09T20:08:02.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-11T16:30:10.000Z (10 months ago)
- Last Synced: 2025-01-03T10:45:48.594Z (4 months ago)
- Language: Python
- Size: 124 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Text Extraction
Text Extraction is a Python package for extracting text from various file formats such as images, PDFs, and HTML documents.
## Getting Started
### Prerequisites
Make sure you have Tesseract OCR installed on your system.
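To check whether Tesseract is already available before installing anything, a small script along these lines may help. This is a minimal sketch and not part of the package: the helper names `find_tesseract` and `tesseract_version` are hypothetical, and it assumes the executable is named `tesseract` and is expected on your `PATH`.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the `tesseract` executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable: str) -> str:
    """Return the first line printed by `<executable> --version`.

    Some Tesseract releases print the version to stderr rather than stdout,
    so both streams are checked.
    """
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
    )
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; install it using the steps below.")
    else:
        print(f"Found Tesseract at {path}: {tesseract_version(path)}")
```

If the script reports that Tesseract was not found, install it with your distribution's package manager.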
#### Arch Linux
```sh
sudo pacman -S tesseract
```
#### Ubuntu
```sh
# Install Tesseract from the Ubuntu package repositories
sudo apt install tesseract-ocr
```
### Setup
Clone the repository, create a virtual environment, activate it, and install the required dependencies using pip.
```sh
git clone https://github.com/teleprint-me/text-extraction
cd text-extraction
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## Usage
Once the package and its dependencies are installed, you can use the command-line tools provided by the package to extract text from different file formats.
```sh
# Example command for extracting text from an image
python -m text_extraction.cli.ocr --path_image

# Example command for extracting text from a PDF
python -m text_extraction.cli.pdf --path_input

# Example command for extracting text from an HTML file
python -m text_extraction.cli.html --dir-path
```
## Contributions
Contributions are welcome! Feel free to submit bug reports, feature requests, or pull requests to help improve the package.
## License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the [LICENSE](LICENSE) file for details.