{"id":49666755,"url":"https://github.com/aavision/wpr-crawler","last_synced_at":"2026-05-06T17:03:56.881Z","repository":{"id":349057764,"uuid":"1197226421","full_name":"AAVision/wpr-crawler","owner":"AAVision","description":"A Crawler to get content from a website","archived":false,"fork":false,"pushed_at":"2026-04-10T21:02:58.000Z","size":7395,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-10T23:11:33.792Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AAVision.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-31T12:25:18.000Z","updated_at":"2026-04-10T21:03:03.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/AAVision/wpr-crawler","commit_stats":null,"previous_names":["aavision/workplacerelations-crawler","aavision/wpr-crawler"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/AAVision/wpr-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AAVision%2Fwpr-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AAVision%2Fwpr-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AAVision%2Fwpr-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AAVision%2
Fwpr-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AAVision","download_url":"https://codeload.github.com/AAVision/wpr-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AAVision%2Fwpr-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32703532,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-06T08:33:17.875Z","status":"ssl_error","status_checked_at":"2026-05-06T08:33:17.221Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-06T17:03:55.634Z","updated_at":"2026-05-06T17:03:56.873Z","avatar_url":"https://github.com/AAVision.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Workplace Relations Scraping Pipeline\n\nA production-grade pipeline to scrape Irish legal decisions, store them in MinIO, and transform HTML content.\n\n![Dagster Orchestration Diagram](dragster.png)\n\n## Dashboards \u0026 Monitoring\n\nAfter starting the infrastructure with `docker compose up -d`, the following tools are available:\n\n- **Dagster UI (Orchestration)**: [http://localhost:3000](http://localhost:3000)\n- **MinIO Console (File Storage)**: [http://localhost:9001](http://localhost:9001)\n    - Username: `minioadmin` | 
Password: `minioadmin` (see `.env`)\n- **Mongo Express (Database Web UI)**: [http://localhost:8081](http://localhost:8081)\n    - Username: `admin` | Password: `pass`\n\n## Quick Start\n\n1. Copy `.env.example` to `.env` and adjust as needed.\n2. Run `docker compose up -d`\n3. Access Dagster UI at http://localhost:3000\n4. Launch the `full_pipeline` job via **Launchpad** with your desired date range.\n\n## Features\n\n- Scrapy with Playwright to bypass Cloudflare.\n- Rotating user agents, retries, and rate limiting.\n- Metadata stored in MongoDB.\n- Files stored in MinIO object storage.\n- Idempotent: uses file hashes to avoid duplicates.\n- Orchestrated with Dagster.\n\n## Database \u0026 Storage Architecture\n\nThe pipeline uses a decoupled storage architecture to maintain a robust data lake.\n\n### MongoDB (`workplace_relations`)\nStores all metadata, tracking states, and parsed data.\n- **`landing_documents`**: Stores the raw metadata from the Scrapy spiders (Total records as of last scan: **43,032**).\n  - Core fields: `identifier`, `title`, `date` (stored as `YYYY-MM-DD` string), `body` (tribunal name).\n  - Storage mapping: `file_path`, `file_hash`, `document_type`, `version`.\n- **`transformed_documents`**: Stores final processed documents after `transform.py` unifies them.\n\n### MinIO (`wrc-data`)\nObject storage acts as our Data Lake for raw files. 
Documents are dynamically partitioned based on their historical decision dates.\n- **Bucket**: `wrc-data`\n- **Structure**: `{body_name}/{YYYY-MM}/{identifier}.{ext}` (e.g., `Workplace_Relations_Commission/2000-03/ADJ-0001.html`)\n\n## Configuration\n\nAll settings are configured via environment variables (`.env`) or `config.yaml`.\n\n## Running Manually\n\n```bash\ndocker compose up -d mongodb minio create-buckets\ndocker compose build dagster-webserver\n\n# Run for a specific body + date range\ndocker compose run --rm dagster-webserver \\\n  scrapy crawl wr_spider \\\n  -a start_date=2025-01-01 \\\n  -a end_date=2025-02-01 \\\n  -a body_id=15376 \\\n  -a body_name=\"Workplace Relations Commission\" \\\n  -a partition_date=2025-01\n\n# Or run the full orchestrated scraper (all bodies, monthly partitions)\ndocker compose run --rm scraper \\\n  python /opt/dagster/scripts/run_scraper.py --start-date 2025-10-01 --end-date 2025-11-01\n```\n\n## Testing \u0026 Coverage\n\nThe test suite covers the orchestration mechanisms, HTML extraction, and scrapers.\n\nTo run the test suite and view the coverage breakdown, run the following from your local virtual environment:\n\n```bash\n# Ensure your virtual environment is active or use the binary directly:\nPYTHONPATH=. ./venv/bin/pytest tests/ --cov=. --cov-report=term-missing\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faavision%2Fwpr-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faavision%2Fwpr-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faavision%2Fwpr-crawler/lists"}