{"id":21447978,"url":"https://github.com/kvdomingo/douglas-crawler","last_synced_at":"2025-03-17T02:09:16.440Z","repository":{"id":263846712,"uuid":"888900411","full_name":"kvdomingo/douglas-crawler","owner":"kvdomingo","description":"Simple script \u0026 web app for crawling product pages on douglas.de","archived":false,"fork":false,"pushed_at":"2024-12-02T13:02:30.000Z","size":777,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-23T11:45:14.503Z","etag":null,"topics":["beautifulsoup","cloud-run","fastapi","python","supabase","terraform","web-crawler"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kvdomingo.png","metadata":{"files":{"readme":"docs/README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-15T08:16:31.000Z","updated_at":"2024-12-02T17:40:35.000Z","dependencies_parsed_at":"2025-01-23T11:45:09.901Z","dependency_job_id":"ce7a4425-4557-493c-bedf-e0abe84d62a0","html_url":"https://github.com/kvdomingo/douglas-crawler","commit_stats":null,"previous_names":["kvdomingo/douglas-crawler"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvdomingo%2Fdouglas-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvdomingo%2Fdouglas-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kvdomingo%2Fdouglas-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/kvdomingo%2Fdouglas-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kvdomingo","download_url":"https://codeload.github.com/kvdomingo/douglas-crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243960665,"owners_count":20375104,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","cloud-run","fastapi","python","supabase","terraform","web-crawler"],"created_at":"2024-11-23T03:13:38.118Z","updated_at":"2025-03-17T02:09:16.421Z","avatar_url":"https://github.com/kvdomingo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Douglas Crawler\n\n## Overview\n\nThe crawler job is written in Python using HTTPX and BeautifulSoup, and is deployed as a GCP Cloud Run job. The job is\ntriggered manually, which crawls the default URL, extracts the necessary information, and stores it in a Postgres\ndatabase hosted on Supabase.\n\nThe web API is powered by FastAPI and is deployed as a GCP Cloud Run service.\n\n![Architecture](./images/architecture.png)\n\nAll infrastructure is managed using Terraform, and CI/CD is orchestrated via GitHub Actions.\n\n## Directory Structure\n\n- `.github` - Configuration files for GitHub Actions.\n- `docs` - Documentation and README files.\n- `douglas` - Python code for the crawler and web API.\n  - `internal` - Internal modules (i.e. 
business logic).\n  - `models` - ORM definitions.\n  - `schemas` - Pydantic data models.\n- `infra` - IaC via Terraform.\n- `migrations` - SQL and scripts for database migrations.\n- `scripts` - Python scripts for running the crawler locally.\n- `tests` - Unit tests for the crawler and web API.\n\n## Usage\n\n### Web API\n\nExplore the Swagger UI at https://douglas-crawler-api-lhebzk57ca-ew.a.run.app/api/docs.\n\n\u003e [!NOTE]\n\u003e The web API is configured to autoscale to 0 instances when no traffic is received within a certain\n\u003e time window, in order to save costs. If the web API takes a while to load, it is probably undergoing\n\u003e a cold start.\n\nHere you will find 3 useful endpoints:\n\n- `/api/crawl` - Crawl a specific product page on [douglas.de](https://www.douglas.de).\n- `/api/products` - List all product information stored in the database.\n- `/api/products/{ean}` - Retrieve product information stored in the database using the product `ean`.\n\n### CLI\n\n#### Prerequisites\n\n- [Mise](https://mise.jdx.dev)\n- [Docker](https://www.docker.com)\n\n#### Setup\n\n1. Install prerequisites.\n2. Install additional prerequisites via Mise. This will automatically install Python, Poetry, Task, and Terraform. You\n   may not want to use Mise if you already have these tools installed or if you use a different environment manager. At\n   minimum, you only need to have Docker.\n    ```shell\n    mise install\n    ```\n3. Copy the contents of `.env.example` into a new file `.env` in the same directory, and fill in the necessary\n   environment variables.\n4. 
Launch the Docker containers:\n    ```shell\n    task\n\n    # Alternatively, without Task\n    docker compose --project-name douglas-crawler up --detach --build\n    ```\n\n#### Running\n\nRun the crawler:\n\n```shell\ntask crawl\n\n# Without Task\ndocker compose --project-name douglas-crawler exec -t api poetry run python -m scripts.crawl\n```\n\nThe crawler script has a `-u`/`--url` parameter, which defaults to\nthis [category page](https://www.douglas.de/de/c/gesicht/gesichtsmasken/feuchtigkeitsmasken/120308). To use a different\ncategory page:\n\n```shell\ntask crawl -- -u https://www.douglas.de/\u003cother-page\u003e\n```\n\nA local copy of the web API is available at http://localhost:8000/api/docs.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkvdomingo%2Fdouglas-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkvdomingo%2Fdouglas-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkvdomingo%2Fdouglas-crawler/lists"}