https://github.com/teleprint-me/text
A Python package for extracting text from various file formats such as images, PDFs, and HTML documents.
- Host: GitHub
- URL: https://github.com/teleprint-me/text
- Owner: teleprint-me
- License: agpl-3.0
- Created: 2024-02-09T20:08:02.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-20T05:13:37.000Z (about 2 months ago)
- Last Synced: 2025-03-20T06:23:23.178Z (about 2 months ago)
- Language: Python
- Size: 137 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# **Text**
Text is a Python package for extracting, parsing, and automating text pipelines from various file formats, including **plain text, Markdown, HTML, PDFs, and images**.
> _**DISCLAIMER:** It is your responsibility to comply with copyright laws. This library's primary purpose is to facilitate dataset preparation and analysis for advanced modeling techniques._
> _**NOTE:** This repository is currently a work in progress. Text extraction, parsing, and mining are incredibly challenging: each dataset comes with its own set of edge cases, making generalization difficult._
## **Getting Started**
### **Prerequisites**
Make sure you have **Tesseract OCR** and **Poppler** installed on your system.
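One way to confirm the prerequisites are in place is to look for their executables on `PATH`. The sketch below is a minimal illustration and not part of this package. The executable names are assumptions: `tesseract` is the Tesseract command-line program, and `pdftoppm` is one of the utilities that ships with Poppler. `find_missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed executable names: `tesseract` for Tesseract OCR and `pdftoppm`
# as a representative Poppler utility. Adjust if your system differs.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the given tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All required system tools were found.")
```

If anything is reported missing, install it with your distribution's package manager.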
#### **Arch Linux**
```sh
sudo pacman -S poppler tesseract
```
#### **Ubuntu**
```sh
# TODO: Add installation instructions for Ubuntu
```
### **Setup**
Clone the repository, create a virtual environment, activate it, and install dependencies:
```sh
git clone https://github.com/teleprint-me/text
cd text
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## **Usage**
Once installed, you can use the CLI tools to extract text from different file formats:
```sh
# Extract text from an image
python -m text.cli.ocr -i -o

# Extract text from a PDF
python -m text.cli.pdf -i -o

# Extract text from an HTML file or directory
python -m text.cli.html -i

# Extract text from a web page and cache results
python -m text.cli.web --cache
```
## **Contributing**
Contributions are welcome!
Feel free to submit **bug reports, feature requests, or pull requests** to help improve the package.
## **License**
This project is licensed under the **AGPL License** – see the [LICENSE](LICENSE) file for details.