{"id":23286424,"url":"https://github.com/httparchive/dataform","last_synced_at":"2025-10-07T02:12:45.520Z","repository":{"id":254829834,"uuid":"847666784","full_name":"HTTPArchive/dataform","owner":"HTTPArchive","description":"The data pipeline for HTTP Archive orchestrated by Dataform","archived":false,"fork":false,"pushed_at":"2025-10-05T19:20:37.000Z","size":1027,"stargazers_count":5,"open_issues_count":3,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-10-05T21:16:51.156Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HTTPArchive.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-08-26T10:04:04.000Z","updated_at":"2025-10-05T19:20:40.000Z","dependencies_parsed_at":"2024-08-26T12:44:36.819Z","dependency_job_id":"18ce8d9b-daf6-414b-a286-b96fdc9de5b7","html_url":"https://github.com/HTTPArchive/dataform","commit_stats":{"total_commits":85,"total_committers":2,"mean_commits":42.5,"dds":"0.10588235294117643","last_synced_commit":"c8d8bcc2ba27167a62d585919ec8726a279e9f03"},"previous_names":["httparchive/dataform"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/HTTPArchive/dataform","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fdataform","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fdataform/tags","releases
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fdataform/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fdataform/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HTTPArchive","download_url":"https://codeload.github.com/HTTPArchive/dataform/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fdataform/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278708004,"owners_count":26031932,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-20T02:12:25.305Z","updated_at":"2025-10-07T02:12:45.491Z","avatar_url":"https://github.com/HTTPArchive.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HTTP Archive datasets pipeline\n\nThis repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves this to the `httparchive` dataset in BigQuery.\n\n## Pipelines\n\nThe pipelines are run in Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. 
The code in the `main` branch is used on each triggered pipeline run.\n\n### HTTP Archive Crawl\n\nTag: `crawl_complete`\n\n- Crawl dataset `httparchive.crawl.*`\n\n  Consumers:\n\n  - public dataset and [BQ Sharing Listing](https://console.cloud.google.com/bigquery/analytics-hub/discovery/projects/httparchive/locations/us/dataExchanges/httparchive/listings/crawl)\n\n- Blink Features Report `httparchive.blink_features.usage`\n\n  Consumers:\n\n  - [chromestatus.com](https://chromestatus.com/metrics/feature/timeline/popularity/2089)\n\n### HTTP Archive Technology Report\n\nTag: `crux_ready`\n\n- `httparchive.reports.cwv_tech_*` and `httparchive.reports.tech_*`\n\n  Consumers:\n\n  - [HTTP Archive Tech Report](https://httparchive.org/reports/techreport/landing)\n\n## Schedules\n\n1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataform-service-crawl-complete?authuser=2\u0026project=httparchive) PubSub subscription\n\n    Tags: [\"crawl_complete\"]\n\n2. [bq-poller-crux-ready](https://console.cloud.google.com/cloudscheduler/jobs/edit/us-central1/bq-poller-crux-ready?authuser=7\u0026project=httparchive) Scheduler\n\n    Tags: [\"crux_ready\"]\n\n### Triggering workflows\n\nIn order to unify the workflow triggering mechanism, we use [a Cloud Run function](./infra/README.md) that can be invoked in a number of ways (e.g. 
via PubSub messages), perform intermediate checks, and trigger the particular Dataform workflow execution configuration.\n\n## Cloud resources overview\n\n```mermaid\ngraph TB;\n    subgraph Cloud Run\n        dataform-service[dataform-service service]\n        bigquery-export[bigquery-export job]\n    end\n\n    subgraph PubSub\n        crawl-complete[crawl-complete topic]\n        dataform-service-crawl-complete[dataform-service-crawl-complete subscription]\n        crawl-complete --\u003e dataform-service-crawl-complete\n    end\n\n    dataform-service-crawl-complete --\u003e dataform-service\n\n    subgraph Cloud_Scheduler\n        bq-poller-crux-ready[bq-poller-crux-ready Poller Scheduler Job]\n        bq-poller-crux-ready --\u003e dataform-service\n    end\n\n    subgraph Dataform\n        dataform[Dataform Repository]\n        dataform_release_config[dataform Release Configuration]\n        dataform_workflow[dataform Workflow Execution]\n    end\n\n    dataform-service --\u003e dataform[Dataform Repository]\n    dataform --\u003e dataform_release_config\n    dataform_release_config --\u003e dataform_workflow\n\n    subgraph BigQuery\n        bq_jobs[BigQuery jobs]\n        bq_datasets[BigQuery table updates]\n        bq_jobs --\u003e bq_datasets\n    end\n\n    dataform_workflow --\u003e bq_jobs\n\n    bq_jobs --\u003e bigquery-export\n\n    subgraph Monitoring\n        cloud_run_logs[Cloud Run logs]\n        dataform_logs[Dataform logs]\n        bq_logs[BigQuery logs]\n        alerting_policies[Alerting Policies]\n        slack_notifications[Slack notifications]\n\n        cloud_run_logs --\u003e alerting_policies\n        dataform_logs --\u003e alerting_policies\n        bq_logs --\u003e alerting_policies\n        alerting_policies --\u003e slack_notifications\n    end\n\n    dataform-service --\u003e cloud_run_logs\n    dataform_workflow --\u003e dataform_logs\n    bq_jobs --\u003e bq_logs\n    bigquery-export --\u003e cloud_run_logs\n```\n\n## 
Development Setup\n\n1. Install dependencies:\n\n    ```bash\n    npm install\n    ```\n\n2. Available Scripts:\n\n    - `npm run format` - Format code using Standard.js, fix Markdown issues, and format Terraform files\n    - `npm run lint` - Run linting checks on JavaScript, Markdown files, and compile Dataform configs\n    - `make tf_apply` - Apply Terraform configurations\n\n## Code Quality\n\nThis repository uses:\n\n- Standard.js for JavaScript code style\n- Markdownlint for Markdown file formatting\n- Dataform's built-in compiler for SQL validation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhttparchive%2Fdataform","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhttparchive%2Fdataform","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhttparchive%2Fdataform/lists"}