https://github.com/compiler-inc/doc-scraper
Scrape API docs into beautiful markdown
- Host: GitHub
- URL: https://github.com/compiler-inc/doc-scraper
- Owner: Compiler-Inc
- Created: 2025-03-05T00:54:59.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-06T03:16:55.000Z (over 1 year ago)
- Last Synced: 2025-04-14T03:09:41.313Z (about 1 year ago)
- Topics: api, docs, python, scraper
- Language: Python
- Homepage:
- Size: 16.6 KB
- Stars: 9
- Watchers: 3
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Doc Scraper
A flexible documentation crawler that can scrape and process documentation from any website.
## Installation
First install dependencies:
```bash
pip install -r requirements.txt
```
Then install the package in editable mode:
```bash
pip install -e .
```
The `-e` flag installs the package in "editable" mode, which means:
- The package is installed in your Python environment
- Python looks for the package in your current directory instead of copying files
- Changes to the source code take effect immediately without reinstalling
- The package becomes importable, which is needed to run it as a module with `python -m`
### Environment Setup
Create a `.env` file in the project root:
```bash
OPENAI_API_KEY=your_api_key_here
```
⚠️ The OpenAI API key is required for the crawler to process documentation.
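As a rough illustration of how a key stored in `.env` could be picked up, the sketch below parses simple `KEY=value` lines with the standard library and fails early if `OPENAI_API_KEY` is missing. The helper names `load_env_file` and `require_api_key` are hypothetical, and the project itself may load the file differently (for example through a library such as python-dotenv).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(path=".env"):
    """Return the key from the process environment or the .env file, else raise."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env")
    return key
```

Calling `require_api_key()` before starting a crawl would surface a missing key immediately rather than partway through a run.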
## Usage
From the `src` directory, run the scraper with the URL of the documentation site you want to scrape:
```bash
cd src
python main.py https://docs.example.com
```
### Optional Arguments
- `-o`, `--output`: Output directory (default: `output_docs`)
- `-m`, `--max-pages`: Maximum number of pages to scrape (default: `1000`)
- `-c`, `--concurrent`: Number of pages to scrape concurrently (default: `1`)
Example with all options:
```bash
python main.py https://docs.example.com -o my_docs -m 500 -c 2
```
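For readers who want to see how flags like these are typically wired up, here is a minimal `argparse` sketch that mirrors the documented options and defaults. It is an assumption about the shape of the command-line interface, not the project's actual `main.py`; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Scrape API docs into markdown")
    parser.add_argument("url", help="Starting URL to crawl")
    parser.add_argument("-o", "--output", default="output_docs",
                        help="Output directory")
    parser.add_argument("-m", "--max-pages", type=int, default=1000,
                        help="Maximum pages to scrape")
    parser.add_argument("-c", "--concurrent", type=int, default=1,
                        help="Number of concurrent pages to scrape")
    return parser


# Parse the same arguments as the "all options" example.
args = build_parser().parse_args(
    ["https://docs.example.com", "-o", "my_docs", "-m", "500", "-c", "2"]
)
print(args.url, args.output, args.max_pages, args.concurrent)
```

Note that `argparse` exposes `--max-pages` as `args.max_pages`, converting the hyphen to an underscore.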
### Troubleshooting
If you get a `ModuleNotFoundError`, make sure you:
1. Have run `pip install -e .` from the project root
2. Are running the command from the `src` directory
## Configuration
The crawler accepts the following parameters:
- `base_url`: The starting URL to crawl
- `output_dir`: Directory where scraped docs will be saved
- `max_pages`: Maximum number of pages to crawl
- `max_concurrent_pages`: Number of concurrent pages to process
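To make the roles of these four parameters concrete, the sketch below shows one way a page limit and a concurrency limit can interact in a crawl loop. Everything here is illustrative: `CrawlConfig`, `crawl`, and the `fetch_links` callback are hypothetical names, the defaults are borrowed from the command-line options, and the real crawler renders pages with Selenium rather than calling a stand-in function.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CrawlConfig:
    base_url: str
    output_dir: str = "output_docs"   # not used in this sketch
    max_pages: int = 1000
    max_concurrent_pages: int = 1


async def crawl(config, fetch_links):
    """Breadth-first crawl; fetch_links(url) is an async callable returning child URLs."""
    semaphore = asyncio.Semaphore(config.max_concurrent_pages)
    seen = {config.base_url}
    frontier = [config.base_url]
    visited = []

    async def visit(url):
        async with semaphore:  # at most max_concurrent_pages fetches in flight
            return await fetch_links(url)

    while frontier and len(visited) < config.max_pages:
        # Never schedule more pages than the remaining budget allows.
        batch = frontier[: config.max_pages - len(visited)]
        frontier = frontier[len(batch):]
        results = await asyncio.gather(*(visit(u) for u in batch))
        visited.extend(batch)
        for links in results:
            for link in links:
                # Stay under the base URL and skip pages already queued.
                if link.startswith(config.base_url) and link not in seen:
                    seen.add(link)
                    frontier.append(link)
    return visited
```

In this sketch `max_pages` caps the total number of pages visited, while `max_concurrent_pages` only bounds how many are fetched at the same time.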
## Requirements
- Python 3.8+
- Chrome/Chromium browser (for Selenium)
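A quick preflight check along these lines can confirm both requirements before a run. This is only a sketch: the function names are hypothetical, and the list of executable names is an assumption that covers common Linux installs. On macOS and Windows, Chrome is often not on `PATH`, so a `None` result there does not prove the browser is missing.

```python
import shutil
import sys

# Assumed executable names; adjust for your platform.
CHROME_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
    "chrome",
)


def python_ok(version=None, minimum=(3, 8)):
    """Return True if the given (or running) Python version meets the minimum."""
    version = version or sys.version_info
    return tuple(version[:2]) >= minimum


def find_chrome(candidates=CHROME_CANDIDATES):
    """Return the path of the first browser executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print("Python 3.8+:", python_ok())
    print("Chrome/Chromium on PATH:", find_chrome())
```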