{"id":21127190,"url":"https://github.com/emcf/thepipe","last_synced_at":"2025-05-14T03:07:55.767Z","repository":{"id":229421045,"uuid":"775790548","full_name":"emcf/thepipe","owner":"emcf","description":"Extract clean data from anywhere, powered by vision-language models ⚡","archived":false,"fork":false,"pushed_at":"2024-10-30T01:30:24.000Z","size":4306,"stargazers_count":1159,"open_issues_count":16,"forks_count":73,"subscribers_count":11,"default_branch":"main","last_synced_at":"2024-10-30T03:57:52.503Z","etag":null,"topics":["gpt-4","gpt-4o","large-language-models","multimodal","pdf","scrapers","vision-transformer","web"],"latest_commit_sha":null,"homepage":"https://thepi.pe","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/emcf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-22T03:33:26.000Z","updated_at":"2024-10-30T01:40:45.000Z","dependencies_parsed_at":"2024-04-06T01:30:24.430Z","dependency_job_id":"794f6a8f-6f5e-420a-a697-1f904c1d7d36","html_url":"https://github.com/emcf/thepipe","commit_stats":null,"previous_names":["emcf/thepipe"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emcf%2Fthepipe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emcf%2Fthepipe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emcf%2Fthepipe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emcf%2Fthepipe/manifests","owner_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/owners/emcf","download_url":"https://codeload.github.com/emcf/thepipe/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248288362,"owners_count":21078903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpt-4","gpt-4o","large-language-models","multimodal","pdf","scrapers","vision-transformer","web"],"created_at":"2024-11-20T04:47:01.451Z","updated_at":"2025-05-14T03:07:55.761Z","avatar_url":"https://github.com/emcf.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\r\n  \u003ca href=\"https://thepi.pe/\"\u003e\r\n    \u003cimg src=\"https://rpnutzemutbrumczwvue.supabase.co/storage/v1/object/public/assets/pipeline_small%20(1).png\" alt=\"Pipeline Illustration\" style=\"width:96px; height:72px; vertical-align:middle;\"\u003e\r\n    \u003ch1\u003ethepi.pe\u003c/h1\u003e\r\n  \u003c/a\u003e\r\n  \u003ca\u003e\r\n    \u003cimg src=\"https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg\" alt=\"python-gh-action\"\u003e\r\n  \u003c/a\u003e\r\n    \u003ca href=\"https://codecov.io/gh/emcf/thepipe\"\u003e\r\n    \u003cimg src=\"https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=OE7CUEFUL9\" alt=\"codecov\"\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://raw.githubusercontent.com/emcf/thepipe/main/LICENSE\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/badge/license-MIT-green\" alt=\"MIT license\"\u003e\r\n  \u003c/a\u003e\r\n  \u003ca href=\"https://www.pepy.tech/projects/thepipe-api\"\u003e\r\n    \u003cimg 
src=\"https://static.pepy.tech/badge/thepipe-api\" alt=\"PyPI\"\u003e\r\n  \u003c/a\u003e\r\n\u003c/div\u003e\r\n\r\n## Extract clean data from tricky documents ⚡\r\n\r\nthepi.pe is a package that can scrape clean markdown, multimodal media, and structured data from complex documents. It uses vision-language models (VLMs) under the hood for superior output quality, and works out-of-the-box with any LLM, VLM, or vector database. It can extract well-formatted data from a wide range of sources, including PDFs, URLs, Word docs, Powerpoints, Python notebooks, videos, audio, and more.\r\n\r\n## Features 🌟\r\n\r\n- Scrape clean markdown, tables, and images from any document\r\n- Scrape text, images, video, and audio from any file or URL\r\n- Works out-of-the-box with vision-language models, vector databases, and RAG frameworks\r\n- AI-native filetype detection, layout analysis, and structured data extraction\r\n- Accepts a wide range of sources, including PDFs, URLs, Word docs, Powerpoints, Python notebooks, GitHub repos, videos, audio, and more\r\n\r\n## Get started in 5 minutes 🚀\r\n\r\nThepipe can be installed via the command line:\r\n\r\n```bash\r\npip install thepipe-api\r\n```\r\n\r\nIf you need full functionality with media-rich sources such as webpages, video, and audio, you can choose to install the following dependencies:\r\n\r\n```bash\r\napt-get update \u0026\u0026 apt-get install -y git ffmpeg\r\npython -m playwright install --with-deps chromium\r\n```\r\n\r\n### Default setup (OpenAI)\r\n\r\nBy default, thepipe uses the [OpenAI API](https://platform.openai.com/docs/overview), so VLM features will work out of the box provided you have the `OPENAI_API_KEY` environment variable set.\r\n\r\n### Custom VLM server setup (OpenRouter, OpenLLM, etc.)\r\n\r\nIf you wish to use a local vision-language model or a different cloud provider, you can set the `LLM_SERVER_BASE_URL` environment variable, for example, `https://openrouter.ai/api/v1` for 
[OpenRouter](https://openrouter.ai/), or `http://localhost:3000/v1` for a local server such as [OpenLLM](https://github.com/bentoml/OpenLLM). You may set the `LLM_SERVER_API_KEY` environment variable for authentication to a non-OpenAI cloud provider. You can set the `DEFAULT_AI_MODEL` environment variable to specify the model to use for VLM features (for OpenAI, this defaults to `gpt-4o`).\r\n\r\n### Scraping\r\n\r\n```python\r\nfrom thepipe.scraper import scrape_file\r\n\r\n# scrape clean markdown and images from a PDF\r\nchunks = scrape_file(filepath=\"paper.pdf\", ai_extraction=True)\r\n```\r\n\r\n### Chunking\r\n\r\nTo satisfy token limit constraints, the following chunking methods are available to split the content into smaller chunks.\r\n\r\n- `chunk_by_document`: Returns one chunk with the entire content of the file.\r\n- `chunk_by_page`: Returns one chunk for each page (for example: each webpage, PDF page, or PowerPoint slide).\r\n- `chunk_by_length`: Splits chunks by length.\r\n- `chunk_by_section`: Splits chunks by markdown section.\r\n- `chunk_by_keyword`: Splits chunks at keywords.\r\n- `chunk_semantic` (experimental, requires [sentence transformers](https://pypi.org/project/sentence-transformers/)): Returns chunks split by spikes in semantic changes, with a configurable threshold.\r\n- `chunk_agentic` (experimental, requires [OpenAI](https://pypi.org/project/openai/)): Returns chunks split by an LLM agent that attempts to find semantically meaningful sections.\r\n\r\nFor example,\r\n\r\n```python\r\nfrom thepipe.scraper import scrape_file\r\nfrom thepipe.chunker import chunk_by_document, chunk_by_page\r\n\r\n# optionally, pass in chunking_method\r\n# chunk_by_document returns one chunk for the entire document\r\nchunks = scrape_file(filepath=\"paper.pdf\", chunking_method=chunk_by_document)\r\n\r\n# you can also re-chunk later.\r\n# chunk_by_page returns one chunk for each page (for example: each webpage, PDF page, or PowerPoint slide).\r\nchunks = 
chunk_by_page(chunks)\r\n```\r\n\r\n### OpenAI Integration 🤖\r\n\r\n```python\r\nfrom openai import OpenAI\r\nfrom thepipe.core import chunks_to_messages\r\n\r\n# Initialize OpenAI client\r\nclient = OpenAI()\r\n\r\n# Use OpenAI-formatted chat messages\r\nmessages = [{\r\n  \"role\": \"user\",\r\n  \"content\": [{\r\n      \"type\": \"text\",\r\n      \"text\": \"What is the paper about?\"\r\n    }]\r\n}]\r\n\r\n# Simply add the scraped chunks to the messages\r\nmessages += chunks_to_messages(chunks)\r\n\r\n# Call LLM\r\nresponse = client.chat.completions.create(\r\n    model=\"gpt-4o\",\r\n    messages=messages,\r\n)\r\n```\r\n\r\n`chunks_to_messages` accepts an optional `text_only` parameter to output only the text from the source document. This is useful for downstream use with LLMs that lack multimodal capabilities.\r\n\r\n\u003e ⚠️ **It is important to be mindful of your model's token limit.**\r\n\u003e Be sure your prompt is within the token limit of your model. You can use chunking to split your messages into smaller chunks.\r\n\r\n### LlamaIndex Integration 🦙\r\n\r\nA chunk can be converted to a LlamaIndex Document/ImageDocument with `.to_llamaindex`.\r\n\r\n### Structured extraction 🗂️\r\n\r\n```python\r\nfrom thepipe.extract import extract\r\n\r\nschema = {\r\n  \"description\": \"string\",\r\n  \"amount_usd\": \"float\"\r\n}\r\n\r\nresults, tokens_used = extract(\r\n    chunks=chunks,\r\n    schema=schema,\r\n    multiple_extractions=True, # extract multiple rows of data per chunk\r\n)\r\n```\r\n\r\n## Sponsors\r\n\r\nPlease consider supporting thepipe by [becoming a sponsor](mailto:emmett@thepi.pe).\r\nYour support helps me maintain and improve the project while helping the open source community discover your work.\r\n\r\nVisit [Cal.com](https://cal.com/) for an open source scheduling tool that helps you book meetings with ease. 
It's the perfect solution for busy professionals who want to streamline their scheduling process.\r\n\r\n\u003ca href=\"https://cal.com/emmett-mcf/30min\"\u003e\u003cimg alt=\"Book us with Cal.com\" src=\"https://cal.com/book-with-cal-dark.svg\" /\u003e\u003c/a\u003e\r\n\r\nLooking for enterprise-ready document processing and intelligent automation? Discover\r\nhow [Trellis AI](https://runtrellis.com/) can streamline your workflows and enhance productivity.\r\n\r\n## How it works 🛠️\r\n\r\nthepipe uses a combination of computer vision models and heuristics to scrape clean content from the source and process it for downstream use with [large language models](https://en.wikipedia.org/wiki/Large_language_model), or [vision-language models](https://en.wikipedia.org/wiki/Vision_transformer). You can feed these messages directly into the model, or alternatively you can chunk these messages for downstream storage in a vector database such as ChromaDB, LLamaIndex, or equivalent RAG framework.\r\n\r\n## Supported File Types 📚\r\n\r\n| Source                       | Input types                                                                          | Multimodal | Notes                                                                                                                                                                                                                                         |\r\n| ---------------------------- | ------------------------------------------------------------------------------------ | ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\r\n| Webpage                      | URLs starting with `http`, `https`, `ftp`                                            | ✔️         | Scrapes markdown, images, and tables from web pages. 
`ai_extraction` available for AI content extraction from the webpage's screenshot                                                                                                        |\r\n| PDF                          | `.pdf`                                                                               | ✔️         | Extracts page markdown and page images. `ai_extraction` available to use a VLM for complex or scanned documents                                                                                                                               |\r\n| Word Document                | `.docx`                                                                              | ✔️         | Extracts text, tables, and images                                                                                                                                                                                                             |\r\n| PowerPoint                   | `.pptx`                                                                              | ✔️         | Extracts text and images from slides                                                                                                                                                                                                          |\r\n| Video                        | `.mp4`, `.mov`, `.wmv`                                                               | ✔️         | Uses Whisper for transcription and extracts frames                                                                                                                                                                                            |\r\n| Audio                        | `.mp3`, `.wav`                                                                       | ✔️         | Uses Whisper for transcription                                                                                                                                               
                                                                 |\r\n| Jupyter Notebook             | `.ipynb`                                                                             | ✔️         | Extracts markdown, code, outputs, and images                                                                                                                                                                                                  |\r\n| Spreadsheet                  | `.csv`, `.xls`, `.xlsx`                                                              | ❌         | Converts each row to JSON format, including row index for each                                                                                                                                                                                |\r\n| Plaintext                    | `.txt`, `.md`, `.rtf`, etc                                                           | ❌         | Simple text extraction                                                                                                                                                                                                                        |\r\n| Image                        | `.jpg`, `.jpeg`, `.png`                                                              | ✔️         | Uses VLM for OCR in text-only mode                                                                                                                                                                                                            |\r\n| ZIP File                     | `.zip`                                                                               | ✔️         | Extracts and processes contained files                                                                                                                                                                                                        |\r\n| Directory                    | any 
`path/to/folder`                                                                 | ✔️         | Recursively processes all files in directory. Optionally use `inclusion_pattern` to pass regex strings for file inclusion rules.                                                                                                              |\r\n| YouTube Video (known issues) | YouTube video URLs starting with `https://youtube.com` or `https://www.youtube.com`. | ✔️         | Uses pytube for video download and Whisper for transcription. For consistent extraction, you may need to modify your `pytube` installation to send a valid user agent header (see [this issue](https://github.com/pytube/pytube/issues/399)). |\r\n| Tweet                        | URLs starting with `https://twitter.com` or `https://x.com`                          | ✔️         | Uses unofficial API, may break unexpectedly                                                                                                                                                                                                   |\r\n| GitHub Repository            | GitHub repo URLs starting with `https://github.com` or `https://www.github.com`      | ✔️         | Requires GITHUB_TOKEN environment variable                                                                                                                                                                                                    |\r\n\r\n## Configuration \u0026 Environment\r\n\r\nSet these environment variables to control API keys, hosting, and model defaults:\r\n\r\n```bash\r\n# If you want longer-term image storage and hosting (saves to ./images and serves via HOST_URL)\r\nexport HOST_IMAGES=true\r\n\r\n# GitHub token for scraping private/public repos via `scrape_url`\r\nexport GITHUB_TOKEN=ghp_...\r\n\r\n# Base URL + key for any custom LLM server (used in extract/scrape)\r\nexport LLM_SERVER_BASE_URL=https://openrouter.ai/api/v1\r\nexport 
LLM_SERVER_API_KEY=or-...\r\n\r\n# Control scraping defaults\r\nexport DEFAULT_AI_MODEL=gpt-4o\r\nexport FILESIZE_LIMIT_MB=50\r\n```\r\n\r\n## CLI Reference\r\n\r\n```shell\r\n# Basic usage: scrape a file or URL\r\nthepipe \u003csource\u003e [options]\r\n\r\n# Options:\r\n--ai_extraction       Use AI for PDF/image/text extraction\r\n--text_only           Only output text (no images)\r\n--inclusion_pattern=REGEX Only include files matching REGEX when scraping directories\r\n--verbose             Print detailed progress messages\r\n```\r\n\r\n## Contributing\r\n\r\nWe welcome contributions! To get started:\r\n\r\n1. Fork the repo and create a feature branch:\r\n\r\n   ```bash\r\n   git checkout -b feature/my-new-feature\r\n   ```\r\n\r\n2. Install dependencies \u0026 run tests:\r\n\r\n   ```bash\r\n   pip install -r requirements.txt\r\n   python -m unittest discover\r\n   ```\r\n\r\n3. Make your changes, format them, and commit them:\r\n\r\n   ```bash\r\n   black .\r\n   git add .\r\n   git commit -m \"...\"\r\n   ```\r\n\r\n4. Push the branch to your fork:\r\n\r\n   ```bash\r\n   git push origin feature/my-new-feature\r\n   ```\r\n\r\n5. Submit a pull request to the main repository.\r\n\r\n6. Wait for review and feedback from the maintainers. This may take some time, so please be patient!\r\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femcf%2Fthepipe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Femcf%2Fthepipe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femcf%2Fthepipe/lists"}