https://github.com/patrickloeber/llm-data-scrapers
A list of useful Open Source tools and scrapers to gather data for LLMs
- Host: GitHub
- URL: https://github.com/patrickloeber/llm-data-scrapers
- Owner: patrickloeber
- Created: 2025-02-23T12:07:54.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-24T20:33:20.000Z (8 months ago)
- Last Synced: 2025-03-03T05:02:45.594Z (8 months ago)
- Size: 10.7 KB
- Stars: 204
- Watchers: 4
- Forks: 20
- Open Issues: 1
# LLM Data Scrapers 🚀
A list of useful Open Source tools and scrapers to gather data for LLMs:
| Name | Description |
| :------| :------------|
| [gitingest](https://github.com/cyclotruc/gitingest) | Replace `hub` with `ingest` in any GitHub URL to get a prompt-friendly extract of a codebase |
| [repomix](https://github.com/yamadashy/repomix) | Packs your entire repository into a single, AI-friendly file |
| [llm-scraper](https://github.com/mishushakov/llm-scraper) | Turn any webpage into structured data using LLMs |
| [crawl4ai](https://github.com/unclecode/crawl4ai) | LLM-friendly web crawler & scraper |
| [trafilatura](https://github.com/adbar/trafilatura) | Python & Command-line tool to gather text and metadata on the web |
| [RepoToTextForLLMs](https://github.com/Doriandarko/RepoToTextForLLMs) | Simple Python script to fetch repo content |
| [marker](https://github.com/VikParuchuri/marker) | Convert PDF to markdown or JSON quickly |
| [reader](https://github.com/jina-ai/reader) | Convert any URL to an LLM-friendly input with a simple prefix `https://r.jina.ai/` |
| [files-to-prompt](https://github.com/simonw/files-to-prompt) | Concatenate a directory full of files into a single prompt for use with LLMs |
| [docling](https://github.com/DS4SD/docling) | Simplifies document processing and parsing of diverse formats |
| [firecrawl](https://github.com/mendableai/firecrawl) | API to turn websites into LLM-ready markdown or structured data, can be self-hosted (with limitations) |
| [llmstxt-generator](https://github.com/mendableai/llmstxt-generator) | API to generate `llms.txt` files from websites for LLM training and inference |
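
Two entries in the table (gitingest and reader) work by rewriting a URL rather than installing anything. The sketch below shows those two rewrites as plain string operations. The function names `to_gitingest_url` and `to_reader_url` are hypothetical helpers invented for this illustration, not part of either project, and the `gitingest.com` host is simply what the `hub` to `ingest` rule produces when applied to `github.com`. The code only builds URLs; it makes no network requests.

```python
from urllib.parse import urlsplit, urlunsplit

# Prefix quoted in the table entry for reader.
READER_PREFIX = "https://r.jina.ai/"


def to_gitingest_url(github_url: str) -> str:
    """Apply the `hub` -> `ingest` rule to a GitHub URL (hypothetical helper)."""
    parts = urlsplit(github_url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {github_url}")
    # Swap only the host so the owner/repo path is left untouched.
    return urlunsplit(parts._replace(netloc="gitingest.com"))


def to_reader_url(url: str) -> str:
    """Prepend the reader prefix to any URL (hypothetical helper)."""
    return READER_PREFIX + url


if __name__ == "__main__":
    print(to_gitingest_url("https://github.com/cyclotruc/gitingest"))
    print(to_reader_url("https://example.com/post"))
```

Swapping the host instead of calling `str.replace("hub", "ingest")` on the whole URL avoids accidentally rewriting a repository or owner name that happens to contain `hub`.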
## More
- https://github.com/mlabonne/llm-datasets: Curated list of datasets and tools specifically for post-training.