https://github.com/compiler-inc/doc-scraper

Scrape API docs into beautiful markdown
api docs python scraper

Host: GitHub
URL: https://github.com/compiler-inc/doc-scraper
Owner: Compiler-Inc
Created: 2025-03-05T00:54:59.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-06T03:16:55.000Z (over 1 year ago)
Last Synced: 2025-04-14T03:09:41.313Z (about 1 year ago)
Topics: api, docs, python, scraper
Language: Python
Homepage:
Size: 16.6 KB
Stars: 9
Watchers: 3
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md

README

# Doc Scraper

A flexible documentation crawler that can scrape and process documentation from any website.

## Installation

First install dependencies:

```bash
pip install -r requirements.txt
```

Then install the package in editable mode:

```bash
pip install -e .
```

The `-e` flag installs the package in "editable" mode, which means:

- The package is registered in your Python environment

- Python imports the package directly from the project directory instead of from a copied version

- Changes to the source code take effect without reinstalling

- The package can be run as a module with `python -m`

### Environment Setup

Create a `.env` file in the project root:

```bash
OPENAI_API_KEY=your_api_key_here
```

⚠️ The OpenAI API key is required for the crawler to process documentation.
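
To confirm the key is visible before starting a long crawl, a small check like the one below may help. It is only a sketch using the standard library: `read_env_file` and `require_api_key` are hypothetical helper names, and the project itself may load `.env` differently (this README does not say how).

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(env_path=".env"):
    """Return OPENAI_API_KEY from the environment or the .env file, else raise."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key and Path(env_path).is_file():
        key = read_env_file(env_path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env")
    return key
```

Calling `require_api_key()` from the project root raises a `RuntimeError` early if the key is missing, instead of failing partway through a crawl.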

## Usage

From the `src` directory, run the scraper with the URL of the documentation site you want to scrape:

```bash
cd src
python main.py https://docs.example.com
```

### Optional Arguments

- `-o, --output`: Output directory (default: output_docs)

- `-m, --max-pages`: Maximum pages to scrape (default: 1000)

- `-c, --concurrent`: Number of concurrent pages to scrape (default: 1)

Example with all options:

```bash
python main.py https://docs.example.com -o my_docs -m 500 -c 2
```
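
For readers who want to see how flags like these are typically wired up, here is a minimal `argparse` sketch that mirrors the documented options and defaults. It is an illustration, not the project's actual `main.py`; `build_parser` is a hypothetical name and the help strings are paraphrased from the list above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented CLI options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Scrape API docs into markdown")
    parser.add_argument("url", help="Starting URL of the documentation site")
    parser.add_argument("-o", "--output", default="output_docs",
                        help="Output directory")
    parser.add_argument("-m", "--max-pages", type=int, default=1000,
                        help="Maximum pages to scrape")
    parser.add_argument("-c", "--concurrent", type=int, default=1,
                        help="Number of concurrent pages to scrape")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(
    ["https://docs.example.com", "-o", "my_docs", "-m", "500", "-c", "2"]
)
print(args.url, args.output, args.max_pages, args.concurrent)
```

Note that `argparse` exposes `--max-pages` as `args.max_pages`, replacing the hyphen with an underscore.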

### Troubleshooting

If you get a `ModuleNotFoundError`, make sure that you:

1. Have run `pip install -e .` from the project root

2. Are running the command from the `src` directory

## Configuration

The crawler accepts the following parameters:

- `base_url`: The starting URL to crawl

- `output_dir`: Directory where scraped docs will be saved

- `max_pages`: Maximum number of pages to crawl

- `max_concurrent_pages`: Number of concurrent pages to process
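
One way to picture these parameters is as a single validated settings object. The sketch below is an assumption for illustration: `CrawlerConfig` is a hypothetical class, not the crawler's real constructor, and the defaults are borrowed from the command-line defaults listed under Optional Arguments.

```python
from dataclasses import dataclass


@dataclass
class CrawlerConfig:
    """Hypothetical container for the four documented crawler parameters."""
    base_url: str
    output_dir: str = "output_docs"   # assumed to match the CLI default
    max_pages: int = 1000             # assumed to match the CLI default
    max_concurrent_pages: int = 1     # assumed to match the CLI default

    def __post_init__(self):
        # Reject values that could not describe a meaningful crawl.
        if not self.base_url.startswith(("http://", "https://")):
            raise ValueError("base_url must be an http(s) URL")
        if self.max_pages < 1 or self.max_concurrent_pages < 1:
            raise ValueError("page limits must be at least 1")


config = CrawlerConfig("https://docs.example.com", output_dir="my_docs", max_pages=500)
print(config)
```

Validating in `__post_init__` surfaces a bad URL or a zero page limit before any browser is launched.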

## Requirements

- Python 3.8+

- Chrome/Chromium browser (for Selenium)
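
A quick preflight check for both requirements might look like the sketch below. The helper names and the list of executable names are assumptions; Chrome is often not on `PATH` on macOS or Windows, so a `None` result does not prove the browser is missing.

```python
import shutil
import sys

# Assumed executable names; these vary by platform and distribution.
BROWSER_CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def python_ok(version=None, minimum=(3, 8)):
    """Return True if the given (or running) Python version meets the minimum."""
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    return version >= minimum


def find_browser(candidates=BROWSER_CANDIDATES):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


print("Python 3.8+:", python_ok())
print("Browser on PATH:", find_browser())
```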
