https://github.com/teleprint-me/text

A Python package for extracting text from various file formats such as images, PDFs, and HTML documents.

Host: GitHub
URL: https://github.com/teleprint-me/text
Owner: teleprint-me
License: agpl-3.0
Created: 2024-02-09T20:08:02.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-20T05:13:37.000Z (4 months ago)
Last Synced: 2025-03-20T06:23:23.178Z (4 months ago)
Language: Python
Size: 137 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# **Text**

Text is a Python package for extracting and parsing text, and for automating text-processing pipelines, across various file formats, including **plain text, Markdown, HTML, PDFs, and images**.

> _**DISCLAIMER:** It is your responsibility to comply with copyright laws. This library's primary purpose is to facilitate dataset preparation and analysis for advanced modeling techniques._

> _**NOTE:** This repository is currently a work in progress. Text extraction, parsing, and mining are incredibly challenging: each dataset comes with its own set of edge cases, which makes generalization difficult._

## **Getting Started**

### **Prerequisites**

Make sure you have **Tesseract OCR** installed on your system. The Arch Linux command below also installs **Poppler**.

#### **Arch Linux**
```sh
sudo pacman -S poppler tesseract
```

#### **Ubuntu**
```sh
# TODO: Add installation instructions for Ubuntu
```
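
Whichever distribution you use, it can help to confirm that the system tools are actually on your `PATH` before running the package. The following is a minimal sketch, not part of this repository. The executable names are assumptions: `tesseract` is the Tesseract command-line program, and `pdftotext` and `pdftoppm` ship with Poppler's utilities. This README does not say which Poppler tools the package calls, so adjust the list as needed.

```python
import shutil

# Assumed executable names (not taken from this repository):
# `tesseract` is the Tesseract CLI; `pdftotext` and `pdftoppm` come with Poppler.
REQUIRED_TOOLS = ("tesseract", "pdftotext", "pdftoppm")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the executable names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All prerequisite executables were found.")
```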

### **Setup**
Clone the repository, create a virtual environment, activate it, and install dependencies:

```sh
git clone https://github.com/teleprint-me/text
cd text
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## **Usage**
Once installed, you can use the CLI tools to extract text from different file formats:

```sh
# Extract text from an image
python -m text.cli.ocr -i -o

# Extract text from a PDF
python -m text.cli.pdf -i -o

# Extract text from an HTML file or directory
python -m text.cli.html -i

# Extract text from a web page and cache results
python -m text.cli.web --cache
```
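
To call these tools from a script, one option is to assemble the same `python -m text.cli.<tool>` invocations in Python. The sketch below is hypothetical and not part of the package. `build_command` is an illustrative helper, and it assumes that `-i` takes an input path and `-o` an output path, which the commands above suggest but do not state. It only builds the argument list and does not run anything.

```python
import sys


def build_command(tool, input_path=None, output_path=None, extra=()):
    """Build an argv list for `python -m text.cli.<tool>` without running it.

    Assumption: `-i` is followed by an input path and `-o` by an output path.
    """
    cmd = [sys.executable, "-m", f"text.cli.{tool}"]
    if input_path is not None:
        cmd += ["-i", str(input_path)]
    if output_path is not None:
        cmd += ["-o", str(output_path)]
    cmd += list(extra)  # pass-through flags such as "--cache"
    return cmd


if __name__ == "__main__":
    # File names here are placeholders for illustration only.
    print(build_command("ocr", "page.png", "page.txt"))
    print(build_command("web", extra=["--cache"]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the activated virtual environment. Check each tool's `--help` output first to confirm its actual arguments.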

## **Contributing**
Contributions are welcome!
Feel free to submit **bug reports, feature requests, or pull requests** to help improve the package.

## **License**
This project is licensed under the **AGPL-3.0 License**. See the [LICENSE](LICENSE) file for details.
