https://github.com/patrickloeber/llm-data-scrapers

# LLM Data Scrapers 🚀

A list of useful Open Source tools and scrapers to gather data for LLMs:

| Name | Description |
| :------| :------------|
| [gitingest](https://github.com/cyclotruc/gitingest) | Replace `hub` with `ingest` in any GitHub URL to get a prompt-friendly extract of a codebase |
| [repomix](https://github.com/yamadashy/repomix) | Packs your entire repository into a single, AI-friendly file |
| [llm-scraper](https://github.com/mishushakov/llm-scraper) | Turn any webpage into structured data using LLMs |
| [crawl4ai](https://github.com/unclecode/crawl4ai) | LLM-friendly web crawler & scraper |
| [trafilatura](https://github.com/adbar/trafilatura) | Python package & command-line tool to gather text and metadata on the web |
| [RepoToTextForLLMs](https://github.com/Doriandarko/RepoToTextForLLMs) | Simple Python script to fetch repo content |
| [marker](https://github.com/VikParuchuri/marker) | Convert PDF to markdown or JSON quickly |
| [reader](https://github.com/jina-ai/reader) | Convert any URL to an LLM-friendly input with a simple prefix `https://r.jina.ai/` |
| [files-to-prompt](https://github.com/simonw/files-to-prompt) | Concatenate a directory full of files into a single prompt for use with LLMs |
| [docling](https://github.com/DS4SD/docling) | Simplifies document processing and parsing of diverse formats |
| [firecrawl](https://github.com/mendableai/firecrawl) | API to turn websites into LLM-ready markdown or structured data; can be self-hosted (with limitations) |
| [llmstxt-generator](https://github.com/mendableai/llmstxt-generator) | API to generate `llms.txt` files from websites for LLM training and inference |

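Two entries in the table work purely by rewriting a URL: gitingest (swap `hub` for `ingest` in the host) and reader (prepend `https://r.jina.ai/`). A minimal sketch of both rewrites follows. The helper names `to_gitingest_url` and `to_reader_url` are hypothetical, nothing here contacts either service, and it assumes the host swap and the prefix are all each service needs.

```python
from urllib.parse import urlsplit, urlunsplit

# Prefix quoted in the table row for jina-ai/reader.
JINA_READER_PREFIX = "https://r.jina.ai/"


def to_gitingest_url(github_url: str) -> str:
    """Swap the github.com host for gitingest.com, keeping the path intact.

    Hypothetical helper: it only rewrites the string and does not check
    that the repository exists.
    """
    parts = urlsplit(github_url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {github_url!r}")
    return urlunsplit(
        (parts.scheme, "gitingest.com", parts.path, parts.query, parts.fragment)
    )


def to_reader_url(url: str) -> str:
    """Prepend the reader prefix to an absolute http(s) URL (hypothetical helper)."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"expected an absolute http(s) URL: {url!r}")
    return JINA_READER_PREFIX + url


if __name__ == "__main__":
    print(to_gitingest_url("https://github.com/cyclotruc/gitingest"))
    print(to_reader_url("https://example.com/post"))
```

The resulting URLs can be opened in a browser or fetched with any HTTP client. Rate limits and response formats are set by each service and are not covered here.
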
## More
- https://github.com/mlabonne/llm-datasets: Curated list of datasets and tools specifically for post-training.