https://github.com/alexfazio/devdocs-to-llm
Turn any developer documentation into a GPT
- Host: GitHub
- URL: https://github.com/alexfazio/devdocs-to-llm
- Owner: alexfazio
- License: mit
- Created: 2024-08-22T21:09:29.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-03-03T11:02:33.000Z (3 months ago)
- Last Synced: 2025-03-24T09:08:43.158Z (about 2 months ago)
- Topics: crawler, crawling, firecrawl, scraper, scraping
- Language: Jupyter Notebook
- Homepage:
- Size: 150 KB
- Stars: 89
- Watchers: 2
- Forks: 12
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Turn any developer documentation into a specialized GPT.
## Overview
DevDocs to LLM is a tool that allows you to crawl developer documentation, extract content, and process it into a format suitable for use with large language models (LLMs) like ChatGPT. This enables you to create specialized assistants tailored to specific documentation sets.
## Features
- Web crawling with customizable options
- Content extraction in Markdown format
- Rate limiting to respect server constraints
- Retry mechanism for failed scrapes
- Export options:
- Rentry.co for quick sharing
- Google Docs for larger documents
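As a rough illustration of how the rate-limiting and retry options above could fit together, here is a minimal, hypothetical Python sketch; the helper name, signature, and structure are assumptions and not taken from the notebook itself:

```python
import time

def scrape_with_retries(scrape_fn, urls, pages_per_minute=10,
                        wait_time_between_chunks=30, retry_attempts=3):
    """Scrape URLs in rate-limited chunks, retrying failures (hypothetical helper)."""
    results, failed = {}, []
    # Process URLs in chunks sized to the per-minute budget.
    for start in range(0, len(urls), pages_per_minute):
        chunk = urls[start:start + pages_per_minute]
        for url in chunk:
            for attempt in range(1, retry_attempts + 1):
                try:
                    results[url] = scrape_fn(url)  # returns Markdown for one page
                    break
                except Exception:
                    if attempt == retry_attempts:
                        failed.append(url)  # give up after the configured retries
        # Pause between chunks so the target server is not overwhelmed.
        if start + pages_per_minute < len(urls):
            time.sleep(wait_time_between_chunks)
    return results, failed
```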
## Usage
1. Set up the Firecrawl environment
2. Crawl a website and generate a sitemap
3. Extract content from crawled pages
4. Export the processed content
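A compressed sketch of steps 1–3 with the Firecrawl Python SDK might look like the following. Method names and parameters differ between SDK versions, and the URL and API key are placeholders, so treat this as an outline rather than the notebook's actual code:

```python
# Assumes the Firecrawl SDK is installed, e.g. `pip install firecrawl-py`.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR-KEY")   # 1. set up the Firecrawl environment

# 2. Crawl the documentation site; `limit` caps the number of pages.
crawl = app.crawl_url("https://docs.example.com", limit=50)

# 3. Gather the Markdown that Firecrawl extracted for each crawled page.
pages = [doc.markdown for doc in crawl.data if getattr(doc, "markdown", None)]
combined_markdown = "\n\n---\n\n".join(pages)
```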
## Requirements
- Firecrawl API key
- Google Docs API credentials (optional, for Google Docs export)
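For the optional Google Docs export, a hedged sketch using the official `google-api-python-client` library could look like this (credential setup is omitted, and this is not necessarily how the notebook implements it):

```python
from googleapiclient.discovery import build

def export_to_google_docs(creds, title, text):
    """Create a Google Doc and insert the extracted text (illustrative helper)."""
    service = build("docs", "v1", credentials=creds)
    doc = service.documents().create(body={"title": title}).execute()
    service.documents().batchUpdate(
        documentId=doc["documentId"],
        body={"requests": [
            {"insertText": {"location": {"index": 1}, "text": text}}
        ]},
    ).execute()
    return f"https://docs.google.com/document/d/{doc['documentId']}/edit"
```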
## Installation
This project is designed to run in a Jupyter notebook environment, particularly Google Colab. No local installation is required.
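Inside the notebook, dependencies are installed in a cell rather than on your machine; the exact package list depends on the notebook version, but it is likely along these lines:

```python
# Run in a Colab/Jupyter cell (package names are assumptions based on the features above).
!pip install firecrawl-py                           # Firecrawl SDK for crawling and scraping
!pip install google-api-python-client google-auth   # only needed for the Google Docs export
```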
## Configuration
Before running the notebook, you'll need to set a few parameters (example values are sketched after this list):
- `sub_url`: The URL of the documentation you want to crawl
- `limit`: Maximum number of pages to crawl
- `scrape_option`: Choose to scrape all pages or a specific number
- `num_pages`: Number of pages to scrape if not scraping all
- `pages_per_minute`: Rate limiting parameter
- `wait_time_between_chunks`: Delay between scraping chunks
- `retry_attempts`: Number of retries for failed scrapes
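Illustrative values for these parameters might look like the cell below; the names mirror the list above, but every value is only an example:

```python
# Example configuration (illustrative values only; tune for the target site).
sub_url = "https://docs.example.com/api"  # documentation root to crawl
limit = 100                               # maximum number of pages to crawl
scrape_option = "all"                     # scrape every crawled page, or a set number
num_pages = 25                            # used only when not scraping all pages
pages_per_minute = 10                     # rate limit toward the target server
wait_time_between_chunks = 30             # delay (seconds) between scraping chunks
retry_attempts = 3                        # retries for each failed scrape
```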
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
[MIT](https://opensource.org/licenses/MIT)
Copyright (c) 2024-present, Alex Fazio
---
<https://x.com/alxfazio/status/1826731977283641615>