Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alexfazio/devdocs-to-llm
Turn any developer documentation into a GPT
- Host: GitHub
- URL: https://github.com/alexfazio/devdocs-to-llm
- Owner: alexfazio
- License: mit
- Created: 2024-08-22T21:09:29.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-09-29T16:41:48.000Z (4 months ago)
- Last Synced: 2025-01-15T12:59:54.678Z (12 days ago)
- Topics: crawler, crawling, firecrawl, scraper, scraping
- Language: Jupyter Notebook
- Homepage:
- Size: 121 KB
- Stars: 81
- Watchers: 1
- Forks: 11
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Turn any developer documentation into a specialized GPT.
## Overview
DevDocs to LLM is a tool that allows you to crawl developer documentation, extract content, and process it into a format suitable for use with large language models (LLMs) like ChatGPT. This enables you to create specialized assistants tailored to specific documentation sets.
## Features
- Web crawling with customizable options
- Content extraction in Markdown format
- Rate limiting to respect server constraints
- Retry mechanism for failed scrapes (see the sketch after this list)
- Export options:
  - Rentry.co for quick sharing
  - Google Docs for larger documents
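The rate-limiting and retry behaviour listed above can be approximated with a small wrapper like the one below. This is an illustrative sketch, not the notebook's actual code: `scrape_page` is a hypothetical stand-in for whatever scraping call you use, and the parameter names mirror the configuration options described later.

```python
import time

def scrape_with_retries(scrape_page, urls, pages_per_minute=10, retry_attempts=3):
    """Scrape each URL while pacing requests and retrying failures.

    `scrape_page` is a placeholder for the real scraping call
    (e.g. a Firecrawl scrape); only the pacing/retry pattern is shown here.
    """
    delay = 60.0 / pages_per_minute  # seconds to wait between requests
    results = {}
    for url in urls:
        for attempt in range(1, retry_attempts + 1):
            try:
                results[url] = scrape_page(url)
                break  # success, move on to the next URL
            except Exception as exc:
                print(f"Attempt {attempt}/{retry_attempts} failed for {url}: {exc}")
                if attempt == retry_attempts:
                    results[url] = None  # give up on this page
        time.sleep(delay)  # respect the pages-per-minute budget
    return results
```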
## Usage
1. Set up the Firecrawl environment
2. Crawl a website and generate a sitemap
3. Extract content from crawled pages
4. Export the processed content (a condensed sketch of these steps follows)
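Condensed into one cell, those steps look roughly like this with the `firecrawl-py` SDK. Treat it as a sketch: the SDK's method names, parameters, and return shapes have changed between versions, and `https://docs.example.com` is a placeholder, so check the Firecrawl documentation for the exact API you have installed.

```python
# Sketch only — firecrawl-py method names and return shapes vary by version.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_FIRECRAWL_API_KEY")  # step 1: set up Firecrawl

sub_url = "https://docs.example.com"  # documentation root (placeholder URL)
limit = 25                            # maximum number of pages to crawl

# Step 2: crawl the site; Firecrawl discovers pages under sub_url.
crawl_result = app.crawl_url(sub_url, params={"limit": limit})

# Step 3: collect the Markdown extracted from each crawled page.
# Older SDK versions return a list of page dicts; newer ones wrap it in "data".
pages = crawl_result["data"] if isinstance(crawl_result, dict) else crawl_result
markdown_docs = [page.get("markdown", "") for page in pages]

# Step 4: export — here simply concatenated into one Markdown document.
combined_markdown = "\n\n---\n\n".join(markdown_docs)
print(f"Collected {len(markdown_docs)} pages ({len(combined_markdown)} characters)")
```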
## Requirements
- Firecrawl API key
- Google Docs API credentials (optional, for Google Docs export)
## Installation
This project is designed to run in a Jupyter notebook environment, particularly Google Colab. No local installation is required.
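The Firecrawl SDK itself can be installed from inside the notebook; assuming the package is published on PyPI as `firecrawl-py`, that is a single cell:

```python
# Notebook cell; assumes the SDK is published on PyPI as firecrawl-py.
%pip install firecrawl-py
```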
## Configuration
Before running the notebook, you'll need to set a few parameters (an example cell follows this list):
- `sub_url`: The URL of the documentation you want to crawl
- `limit`: Maximum number of pages to crawl
- `scrape_option`: Choose to scrape all pages or a specific number
- `num_pages`: Number of pages to scrape if not scraping all
- `pages_per_minute`: Rate limiting parameter
- `wait_time_between_chunks`: Delay between scraping chunks
- `retry_attempts`: Number of retries for failed scrapes
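Collected in one place, a configuration cell might look like this (the values shown are illustrative, not recommendations):

```python
# Illustrative values only — adjust to the documentation set you are crawling.
sub_url = "https://docs.example.com"   # URL of the documentation to crawl
limit = 100                            # maximum number of pages to crawl
scrape_option = "all"                  # scrape all pages or a specific number
num_pages = 50                         # pages to scrape when not scraping all
pages_per_minute = 10                  # rate-limiting parameter
wait_time_between_chunks = 60          # delay between scraping chunks (seconds assumed)
retry_attempts = 3                     # retries for failed scrapes
```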
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
[MIT](https://opensource.org/licenses/MIT)
Copyright (c) 2024-present, Alex Fazio
---
[![Watch the video](https://i.imgur.com/VKRoApP.png)](https://x.com/alxfazio/status/1826731977283641615)