{"id":33388985,"url":"https://github.com/utsavbalar1231/sus","last_synced_at":"2026-05-11T01:23:14.051Z","repository":{"id":325532019,"uuid":"1096026830","full_name":"UtsavBalar1231/sus","owner":"UtsavBalar1231","description":"SUS - Simple Universal Scraper for docs","archived":false,"fork":false,"pushed_at":"2025-11-21T20:57:53.000Z","size":872,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-21T22:22:31.224Z","etag":null,"topics":["docs","mkdocs","mypy","pydantic-v2","python","ruff","scraper","sus"],"latest_commit_sha":null,"homepage":"https://utsavbalar1231.github.io/sus/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UtsavBalar1231.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-13T20:57:52.000Z","updated_at":"2025-11-21T20:57:57.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/UtsavBalar1231/sus","commit_stats":null,"previous_names":["utsavbalar1231/sus"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/UtsavBalar1231/sus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UtsavBalar1231%2Fsus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UtsavBalar1231%2Fsus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UtsavBalar1231%2Fsus/releases","manifests_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UtsavBalar1231%2Fsus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UtsavBalar1231","download_url":"https://codeload.github.com/UtsavBalar1231/sus/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UtsavBalar1231%2Fsus/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285914727,"owners_count":27252968,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-23T02:00:06.149Z","response_time":135,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docs","mkdocs","mypy","pydantic-v2","python","ruff","scraper","sus"],"created_at":"2025-11-23T07:00:24.750Z","updated_at":"2026-05-11T01:23:14.038Z","avatar_url":"https://github.com/UtsavBalar1231.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SUS - Simple Universal Scraper\n\n[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)\n[![Type Checked](https://img.shields.io/badge/mypy-strict-blue.svg)]()\n\n**High-performance async web scraper for converting documentation sites to Markdown.**\n\nBuilt with Python 3.12+, httpx, and asyncio. 
Designed for scraping documentation websites with HTTP/2 support, intelligent checkpoint/resume, and HTTP-first smart routing for optimal performance.\n\n---\n\n## Quick Install\n\n```bash\n# Using uv (recommended)\nuv pip install sus\n\n# Or using pip\npip install sus\n\n# Verify installation\nsus --version\n```\n\n**Requirements:** Python 3.12+\n\n---\n\n## 30-Second Example\n\nCreate a config file `docs.yaml`:\n\n```yaml\nname: my-docs\n\nsite:\n  start_urls:\n    - https://docs.python.org/3/\n  allowed_domains:\n    - docs.python.org\n\noutput:\n  base_dir: ./output\n\ncrawling:\n  max_pages: 50\n```\n\nRun the scraper:\n\n```bash\nsus scrape --config docs.yaml\n```\n\nOutput:\n\n```\noutput/\n├── docs/           # Markdown files with frontmatter\n├── assets/         # Downloaded images, CSS, JS\n└── checkpoint.json # Resume state (if enabled)\n```\n\n---\n\n## Key Features\n\n**Core Capabilities:**\n- **Config-driven YAML** - Define scraping behavior declaratively\n- **Async architecture** - Built on httpx and asyncio for maximum performance\n- **HTTP/2 support** - Connection pooling, multiplexing, 60-80% overhead reduction\n- **Type-safe** - Full mypy --strict compliance with Pydantic 2.9+ validation\n\n**Scraping Features:**\n- **Checkpoint/Resume** - Incremental scraping with crash recovery (JSON or SQLite backends)\n- **JavaScript rendering** - Playwright integration for SPA sites\n- **Sitemap parsing** - Fast URL discovery via sitemap.xml\n- **Authentication** - Built-in support for Basic, Cookie, Header, and OAuth2 auth\n- **Content filtering** - Regex/glob/prefix URL patterns, CSS selectors\n- **Asset handling** - Download and rewrite image/CSS/JS references with deduplication\n\n**Performance \u0026 Reliability:**\n- **Rate limiting** - Token bucket algorithm with burst support\n- **HTTP caching** - RFC 9111 compliant caching for development\n- **Pipeline mode** - Multi-stage processing with memory-aware queues (3-10x speedup)\n- **Error handling** - 
Retry logic with exponential backoff, graceful degradation\n\n**Extensibility:**\n- **Plugin system** - 5 lifecycle hooks with built-in plugins (code highlighting, image optimization, link validation)\n- **Custom backends** - Pluggable checkpoint storage (JSON for \u003c10K pages, SQLite for larger)\n\n---\n\n## Performance\n\nSUS automatically optimizes for different site types:\n\n| Mode | Speed | Use Case |\n|------|-------|----------|\n| **HTTP-only** | 25-50 pages/sec | Static sites, server-side rendered docs |\n| **Auto (HTTP-first)** | 10-30 pages/sec | Mixed content - tries HTTP, falls back to JS |\n| **JS Rendering** | 2-8 pages/sec | SPAs (React, Vue, Next.js) requiring JavaScript |\n\n**Performance features:**\n- **Pipeline architecture** - Concurrent fetch + process for 3-10x throughput\n- **HTTP-first routing** - Auto-detects when JavaScript rendering is needed\n- **Adaptive concurrency** - 50 global / 10 per-domain connections\n- **HTTP/2 multiplexing** - 60-80% connection overhead reduction\n\nConfigure JavaScript mode in your YAML:\n```yaml\ncrawling:\n  javascript:\n    mode: auto  # disabled | enabled | auto (recommended)\n```\n\n---\n\n## Documentation\n\n**[Full Documentation →](https://UtsavBalar1231.github.io/sus/)**\n\n### Quick Links\n\n- **[Getting Started](https://UtsavBalar1231.github.io/sus/getting-started/)** - Installation and first scrape tutorial\n- **[Configuration Reference](https://UtsavBalar1231.github.io/sus/configuration/)** - Complete YAML schema documentation\n- **[Examples](https://UtsavBalar1231.github.io/sus/examples/)** - Real-world configuration examples\n- **[API Reference](https://UtsavBalar1231.github.io/sus/api/overview/)** - Python API documentation\n\n---\n\n## CLI Overview\n\n```bash\n# Run scraper with config\nsus scrape --config FILE\n\n# Validate config syntax\nsus validate FILE\n\n# Interactive config wizard\nsus init [OUTPUT]\n\n# List example configs\nsus list\n\n# Common options\nsus scrape --config FILE 
\\\n  --output DIR \\\n  --max-pages N \\\n  --resume \\\n  --reset-checkpoint \\\n  --clear-cache\n\n# What each option does:\n#   --output DIR          Override output directory\n#   --max-pages N         Limit page count\n#   --resume              Resume from checkpoint\n#   --reset-checkpoint    Start fresh\n#   --clear-cache         Clear HTTP cache\n```\n\nSee the [Getting Started guide](https://UtsavBalar1231.github.io/sus/getting-started/) for detailed usage.\n\n---\n\n## Development\n\n### Setup\n\n```bash\n# Clone repository\ngit clone https://github.com/UtsavBalar1231/sus.git\ncd sus\n\n# Install with dev dependencies\nuv sync --group dev\n\n# Optional: Install JavaScript rendering support\nuv sync --group js\n\n# Optional: Install plugin dependencies\nuv sync --group plugins\n```\n\n### Quality Checks\n\n```bash\n# Run all checks (lint + type-check + test)\njust check\n\n# Individual commands\njust lint          # ruff check\njust lint-fix      # ruff check --fix\njust format        # ruff format\njust type-check    # mypy --strict\njust test          # pytest\njust test-cov      # pytest with coverage\n```\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for development workflow and coding standards.\n\n---\n\n## Architecture\n\nSUS implements a six-stage pipeline:\n\n1. **Configuration** - Pydantic 2.9+ models with YAML validation and type coercion\n2. **Crawler** - httpx async client with token bucket rate limiting and robots.txt compliance\n3. **URL Filtering** - lxml link extraction with regex/glob/prefix pattern matching\n4. **Content Conversion** - html-to-markdown (Rust-powered) converter with YAML frontmatter generation\n5. **CLI Interface** - Typer commands with Rich progress bars and real-time statistics\n
6. **Testing** - Comprehensive pytest suite with pytest-asyncio and pytest-httpx\n\n**Advanced features:**\n- **Backend system** - Pluggable checkpoint storage (JSON/SQLite) via `StateBackend` protocol\n- **Plugin architecture** - 5 lifecycle hooks (PRE_CRAWL, POST_FETCH, POST_CONVERT, POST_SAVE, POST_CRAWL)\n- **Pipeline mode** - Producer-consumer architecture with memory-aware queues\n\nSee [CLAUDE.md](CLAUDE.md) for comprehensive technical documentation.\n\n---\n\n## Project Status\n\n**Production-ready features:**\n- Checkpoint/resume with JSON and SQLite backends\n- JavaScript rendering via Playwright\n- Authentication (Basic, Cookie, Header, OAuth2)\n- Plugin system with 3 built-in plugins\n- Sitemap parsing and HTTP caching\n- Memory monitoring and graceful degradation\n\n**Quality:**\n- mypy --strict type checking (zero errors)\n- Comprehensive test coverage with pytest\n- Tested on Python 3.12+ (Linux)\n\n---\n\n## Contributing\n\nContributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for:\n- Development setup\n- Coding standards and conventions\n- Testing requirements\n- Pull request process\n\n**Report issues:** [GitHub Issues](https://github.com/UtsavBalar1231/sus/issues)\n\n---\n\n## License\n\nThis project is currently unlicensed. Please contact the maintainer for licensing information.\n\n---\n\n**SUS** - Simple Universal Scraper\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Futsavbalar1231%2Fsus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Futsavbalar1231%2Fsus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Futsavbalar1231%2Fsus/lists"}