https://github.com/emcf/thepipe
Extract clean data from anywhere, powered by vision-language models ⚡
- Host: GitHub
- URL: https://github.com/emcf/thepipe
- Owner: emcf
- License: mit
- Created: 2024-03-22T03:33:26.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-30T01:30:24.000Z (7 months ago)
- Last Synced: 2024-10-30T03:57:52.503Z (7 months ago)
- Topics: gpt-4, gpt-4o, large-language-models, multimodal, pdf, scrapers, vision-transformer, web
- Language: Python
- Homepage: https://thepi.pe
- Size: 4.11 MB
- Stars: 1,159
- Watchers: 11
- Forks: 73
- Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE
## Extract clean data from tricky documents ⚡
thepi.pe is a package that scrapes clean markdown, multimodal media, and structured data from complex documents. It uses vision-language models (VLMs) under the hood for superior output quality, and works out of the box with any LLM, VLM, or vector database. It can extract well-formatted data from a wide range of sources, including PDFs, URLs, Word docs, PowerPoints, Python notebooks, videos, audio, and more.
## Features 🌟
- Scrape clean markdown, tables, and images from any document
- Scrape text, images, video, and audio from any file or URL
- Works out of the box with vision-language models, vector databases, and RAG frameworks
- AI-native filetype detection, layout analysis, and structured data extraction
- Accepts a wide range of sources, including PDFs, URLs, Word docs, PowerPoints, Python notebooks, GitHub repos, videos, audio, and more

## Get started in 5 minutes 🚀
Thepipe can be installed via the command line:
```bash
pip install thepipe-api
```

If you need full functionality with media-rich sources such as webpages, video, and audio, install the following additional dependencies:
```bash
apt-get update && apt-get install -y git ffmpeg
python -m playwright install --with-deps chromium
```

### Default setup (OpenAI)
By default, thepipe uses the [OpenAI API](https://platform.openai.com/docs/overview), so VLM features will work out of the box provided you have the `OPENAI_API_KEY` environment variable set.
### Custom VLM server setup (OpenRouter, OpenLLM, etc.)
If you wish to use a local vision-language model or a different cloud provider, set the `LLM_SERVER_BASE_URL` environment variable (for example, `https://openrouter.ai/api/v1` for [OpenRouter](https://openrouter.ai/), or `http://localhost:3000/v1` for a local server such as [OpenLLM](https://github.com/bentoml/OpenLLM)). Set the `LLM_SERVER_API_KEY` environment variable for authentication with a non-OpenAI cloud provider, and the `DEFAULT_AI_MODEL` environment variable to specify the model used for VLM features (for OpenAI, this defaults to `gpt-4o`).
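For instance, you could set these variables from Python before invoking thepipe (the values below are placeholders, not real credentials or model names):

```python
import os

# Point thepipe at a custom OpenAI-compatible server (placeholder values)
os.environ["LLM_SERVER_BASE_URL"] = "https://openrouter.ai/api/v1"
os.environ["LLM_SERVER_API_KEY"] = "your-api-key"
os.environ["DEFAULT_AI_MODEL"] = "your/model-name"
```

Depending on when thepipe reads its configuration, you may need to set these before importing the package.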
### Scraping
```python
from thepipe.scraper import scrape_file

# scrape clean markdown and images from a PDF
chunks = scrape_file(filepath="paper.pdf", ai_extraction=True)
```

### Chunking
To satisfy token-limit constraints, the following chunking methods are available to split the content into smaller chunks:
- `chunk_by_document`: Returns one chunk with the entire content of the file.
- `chunk_by_page`: Returns one chunk for each page (for example: each webpage, PDF page, or PowerPoint slide).
- `chunk_by_length`: Splits chunks by length.
- `chunk_by_section`: Splits chunks by markdown section.
- `chunk_by_keyword`: Splits chunks at keywords.
- `chunk_semantic` (experimental, requires [sentence transformers](https://pypi.org/project/sentence-transformers/)): Returns chunks split at spikes in semantic change, with a configurable threshold.
- `chunk_agentic` (experimental, requires [OpenAI](https://pypi.org/project/openai/)): Returns chunks split by an LLM agent that attempts to find semantically meaningful sections.

For example:
```python
from thepipe.scraper import scrape_file
from thepipe.chunker import chunk_by_document, chunk_by_page

# optionally, pass in chunking_method
# chunk_by_document returns one chunk for the entire document
chunks = scrape_file(filepath="paper.pdf", chunking_method=chunk_by_document)

# you can also re-chunk later.
# chunk_by_page returns one chunk for each page (for example: each webpage, PDF page, or PowerPoint slide)
chunks = chunk_by_page(chunks)
```

### OpenAI Integration 🤖
```python
from openai import OpenAI
from thepipe.core import chunks_to_messages

# Initialize OpenAI client
client = OpenAI()

# Use OpenAI-formatted chat messages
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": "What is the paper about?"
    }]
}]

# Simply add the scraped chunks to the messages
messages += chunks_to_messages(chunks)

# Call LLM
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
)
```

`chunks_to_messages` takes an optional `text_only` parameter to output only the text from the source document. This is useful for downstream use with LLMs that lack multimodal capabilities.
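As a rough sketch of what `text_only` does to the message payload (the `Chunk` class and `to_messages` helper below are simplified stand-ins for illustration, not thepipe's actual implementation):

```python
from dataclasses import dataclass, field

# Stand-in for a scraped chunk; thepipe's real Chunk class differs
@dataclass
class Chunk:
    texts: list = field(default_factory=list)
    image_urls: list = field(default_factory=list)

def to_messages(chunks, text_only=False):
    """Build one OpenAI-style user message from a list of chunks."""
    content = []
    for chunk in chunks:
        content += [{"type": "text", "text": t} for t in chunk.texts]
        if not text_only:
            content += [{"type": "image_url", "image_url": {"url": u}}
                        for u in chunk.image_urls]
    return [{"role": "user", "content": content}]

chunk = Chunk(texts=["Page 1 text"], image_urls=["https://example.com/p1.png"])
msgs = to_messages([chunk])                  # text part + image part
text_msgs = to_messages([chunk], text_only=True)  # text part only
```

With `text_only=True`, the image parts are dropped, leaving a payload any text-only LLM can accept.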
> ⚠️ **It is important to be mindful of your model's token limit.**
> Be sure your prompt is within your model's token limit. You can use chunking to split your messages into smaller chunks.

### LlamaIndex Integration 🦙
A chunk can be converted to a LlamaIndex `Document` or `ImageDocument` with `.to_llamaindex`.
### Structured extraction 🗂️
```python
from thepipe.extract import extract

schema = {
    "description": "string",
    "amount_usd": "float"
}

results, tokens_used = extract(
    chunks=chunks,
    schema=schema,
    multiple_extractions=True,  # extract multiple rows of data per chunk
)
```

## Sponsors
Please consider supporting thepipe by [becoming a sponsor](mailto:[email protected]).
Your support helps me maintain and improve the project while helping the open source community discover your work.

Visit [Cal.com](https://cal.com/) for an open source scheduling tool that helps you book meetings with ease. It's the perfect solution for busy professionals who want to streamline their scheduling process.
Looking for enterprise-ready document processing and intelligent automation? Discover
how [Trellis AI](https://runtrellis.com/) can streamline your workflows and enhance productivity.

## How it works 🛠️
thepipe uses a combination of computer vision models and heuristics to scrape clean content from the source and process it for downstream use with [large language models](https://en.wikipedia.org/wiki/Large_language_model) or [vision-language models](https://en.wikipedia.org/wiki/Vision_transformer). You can feed these messages directly into the model, or chunk them for storage in a vector database such as ChromaDB, in LlamaIndex, or in an equivalent RAG framework.
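As a toy illustration of that downstream storage-and-retrieval step, here is a minimal in-memory index (bag-of-words counts stand in for real embeddings; a production pipeline would use an embedding model and a store such as ChromaDB):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; real pipelines use a trained model
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# "Index" the text of each scraped chunk
docs = [
    "scrape clean markdown from pdf documents",
    "store embeddings in a vector database",
]
index = [(embed(d), d) for d in docs]

# Retrieve the document most similar to a query
query = embed("pdf markdown")
best = max(index, key=lambda item: cosine(query, item[0]))
```

The same shape (embed chunks, store vectors, rank by similarity) carries over directly to a real vector database.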
## Supported File Types 📚
| Source | Input types | Multimodal | Notes |
| ---------------------------- | ------------------------------------------------------------------------------------ | ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Webpage | URLs starting with `http`, `https`, `ftp` | ✔️ | Scrapes markdown, images, and tables from web pages. `ai_extraction` available for AI content extraction from the webpage's screenshot |
| PDF | `.pdf` | ✔️ | Extracts page markdown and page images. `ai_extraction` available to use a VLM for complex or scanned documents |
| Word Document | `.docx` | ✔️ | Extracts text, tables, and images |
| PowerPoint | `.pptx` | ✔️ | Extracts text and images from slides |
| Video | `.mp4`, `.mov`, `.wmv` | ✔️ | Uses Whisper for transcription and extracts frames |
| Audio | `.mp3`, `.wav` | ✔️ | Uses Whisper for transcription |
| Jupyter Notebook | `.ipynb` | ✔️ | Extracts markdown, code, outputs, and images |
| Spreadsheet | `.csv`, `.xls`, `.xlsx` | ❌ | Converts each row to JSON format, including row index for each |
| Plaintext | `.txt`, `.md`, `.rtf`, etc. | ❌ | Simple text extraction |
| Image | `.jpg`, `.jpeg`, `.png` | ✔️ | Uses VLM for OCR in text-only mode |
| ZIP File | `.zip` | ✔️ | Extracts and processes contained files |
| Directory | any `path/to/folder` | ✔️ | Recursively processes all files in directory. Optionally use `inclusion_pattern` to pass regex strings for file inclusion rules. |
| YouTube Video (known issues) | YouTube video URLs starting with `https://youtube.com` or `https://www.youtube.com`. | ✔️ | Uses pytube for video download and Whisper for transcription. For consistent extraction, you may need to modify your `pytube` installation to send a valid user agent header (see [this issue](https://github.com/pytube/pytube/issues/399)). |
| Tweet | URLs starting with `https://twitter.com` or `https://x.com` | ✔️ | Uses unofficial API, may break unexpectedly |
| GitHub Repository | GitHub repo URLs starting with `https://github.com` or `https://www.github.com` | ✔️ | Requires the `GITHUB_TOKEN` environment variable |

## Configuration & Environment
Set these environment variables to control API keys, hosting, and model defaults:
```bash
# If you want longer-term image storage and hosting (saves to ./images and serves via HOST_URL)
export HOST_IMAGES=true

# GitHub token for scraping private/public repos via `scrape_url`
export GITHUB_TOKEN=ghp_...

# Base URL + key for any custom LLM server (used in extract/scrape)
export LLM_SERVER_BASE_URL=https://openrouter.ai/api/v1
export LLM_SERVER_API_KEY=or-...

# Control scraping defaults
export DEFAULT_AI_MODEL=gpt-4o
export FILESIZE_LIMIT_MB=50
```

## CLI Reference
```shell
# Basic usage: scrape a file or URL
thepipe <source> [options]

# Options:
--ai_extraction           Use AI for PDF/image/text extraction
--text_only               Only output text (no images)
--inclusion_pattern=REGEX Only include files matching REGEX when scraping directories
--verbose                 Print detailed progress messages
```

## Contributing
We welcome contributions! To get started:
1. Fork the repo and create a feature branch:
```bash
git checkout -b feature/my-new-feature
```
2. Install dependencies & run tests:
```bash
pip install -r requirements.txt
python -m unittest discover
```

3. Make your changes, format them, and commit them:
```bash
black .
git add .
git commit -m "..."
```

4. Push to your fork:
```bash
git push origin feature/my-new-feature
```

5. Submit a pull request to the main repository.
6. Wait for review and feedback from the maintainers. This may take some time, so please be patient!