Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alexfazio/devdocs-to-llm
Turn any developer documentation into a GPT
- Host: GitHub
- URL: https://github.com/alexfazio/devdocs-to-llm
- Owner: alexfazio
- License: mit
- Created: 2024-08-22T21:09:29.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-09-29T16:41:48.000Z (4 months ago)
- Last Synced: 2025-01-15T12:59:54.678Z (12 days ago)
- Topics: crawler, crawling, firecrawl, scraper, scraping
- Language: Jupyter Notebook
- Homepage:
- Size: 121 KB
- Stars: 81
- Watchers: 1
- Forks: 11
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Turn any developer documentation into a specialized GPT.
## Overview
DevDocs to LLM is a tool that allows you to crawl developer documentation, extract content, and process it into a format suitable for use with large language models (LLMs) like ChatGPT. This enables you to create specialized assistants tailored to specific documentation sets.
## Features
- Web crawling with customizable options
- Content extraction in Markdown format
- Rate limiting to respect server constraints
- Retry mechanism for failed scrapes (see the sketch after this list)
- Export options:
  - Rentry.co for quick sharing
  - Google Docs for larger documents
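The rate-limiting and retry behaviour listed above can be approximated with a small wrapper like the one below. This is an illustrative sketch, not the notebook's actual code: `scrape_page` is a hypothetical stand-in for whatever scraping call you use, and the parameter names mirror the configuration options described later.

```python
import time

def scrape_with_retries(scrape_page, urls, pages_per_minute=10, retry_attempts=3):
    """Scrape each URL while pacing requests and retrying failures.

    `scrape_page` is a placeholder for the real scraping call
    (e.g. a Firecrawl scrape); only the pacing/retry pattern is shown here.
    """
    delay = 60.0 / pages_per_minute  # seconds to wait between requests
    results = {}
    for url in urls:
        for attempt in range(1, retry_attempts + 1):
            try:
                results[url] = scrape_page(url)
                break  # success, move on to the next URL
            except Exception as exc:
                print(f"Attempt {attempt}/{retry_attempts} failed for {url}: {exc}")
                if attempt == retry_attempts:
                    results[url] = None  # give up on this page
        time.sleep(delay)  # respect the pages-per-minute budget
    return results
```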
## Usage
1. Set up the Firecrawl environment
2. Crawl a website and generate a sitemap
3. Extract content from crawled pages
4. Export the processed content (a condensed sketch of these steps follows)
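Condensed into one cell, those steps look roughly like this with the `firecrawl-py` SDK. Treat it as a sketch: the SDK's method names, parameters, and return shapes have changed between versions, and `https://docs.example.com` is a placeholder, so check the Firecrawl documentation for the exact API you have installed.

```python
# Sketch only — firecrawl-py method names and return shapes vary by version.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_FIRECRAWL_API_KEY")  # step 1: set up Firecrawl

sub_url = "https://docs.example.com"  # documentation root (placeholder URL)
limit = 25                            # maximum number of pages to crawl

# Step 2: crawl the site; Firecrawl discovers pages under sub_url.
crawl_result = app.crawl_url(sub_url, params={"limit": limit})

# Step 3: collect the Markdown extracted from each crawled page.
# Older SDK versions return a list of page dicts; newer ones wrap it in "data".
pages = crawl_result["data"] if isinstance(crawl_result, dict) else crawl_result
markdown_docs = [page.get("markdown", "") for page in pages]

# Step 4: export — here simply concatenated into one Markdown document.
combined_markdown = "\n\n---\n\n".join(markdown_docs)
print(f"Collected {len(markdown_docs)} pages ({len(combined_markdown)} characters)")
```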
## Requirements
- Firecrawl API key
- Google Docs API credentials (optional, for Google Docs export)
## Installation
This project is designed to run in a Jupyter notebook environment, particularly Google Colab. No local installation is required.
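The Firecrawl SDK itself can be installed from inside the notebook; assuming the package is published on PyPI as `firecrawl-py`, that is a single cell:

```python
# Notebook cell; assumes the SDK is published on PyPI as firecrawl-py.
%pip install firecrawl-py
```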
## Configuration
Before running the notebook, you'll need to set a few parameters (an example cell follows this list):
- `sub_url`: The URL of the documentation you want to crawl
- `limit`: Maximum number of pages to crawl
- `scrape_option`: Choose to scrape all pages or a specific number
- `num_pages`: Number of pages to scrape if not scraping all
- `pages_per_minute`: Rate limiting parameter
- `wait_time_between_chunks`: Delay between scraping chunks
- `retry_attempts`: Number of retries for failed scrapes
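Collected in one place, a configuration cell might look like this (the values shown are illustrative, not recommendations):

```python
# Illustrative values only — adjust to the documentation set you are crawling.
sub_url = "https://docs.example.com"   # URL of the documentation to crawl
limit = 100                            # maximum number of pages to crawl
scrape_option = "all"                  # scrape all pages or a specific number
num_pages = 50                         # pages to scrape when not scraping all
pages_per_minute = 10                  # rate-limiting parameter
wait_time_between_chunks = 60          # delay between scraping chunks (seconds assumed)
retry_attempts = 3                     # retries for failed scrapes
```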
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
[MIT](https://opensource.org/licenses/MIT)
Copyright (c) 2024-present, Alex Fazio
---
[![Watch the video](https://i.imgur.com/VKRoApP.png)](https://x.com/alxfazio/status/1826731977283641615)