{"id":23286429,"url":"https://github.com/httparchive/data-pipeline","last_synced_at":"2025-08-21T17:32:21.622Z","repository":{"id":37382801,"uuid":"454271288","full_name":"HTTPArchive/data-pipeline","owner":"HTTPArchive","description":"The new HTTP Archive data pipeline built entirely on GCP","archived":false,"fork":false,"pushed_at":"2024-04-09T03:11:08.000Z","size":1216,"stargazers_count":3,"open_issues_count":45,"forks_count":0,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-04-14T00:31:54.147Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HTTPArchive.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-02-01T05:22:29.000Z","updated_at":"2022-05-27T12:04:55.000Z","dependencies_parsed_at":"2023-09-25T23:33:08.315Z","dependency_job_id":"db9559e3-b78a-4e4c-9576-f1dcf0d857f5","html_url":"https://github.com/HTTPArchive/data-pipeline","commit_stats":{"total_commits":187,"total_committers":8,"mean_commits":23.375,"dds":0.5668449197860963,"last_synced_commit":"d0479063a3d7cc59723f4c71947c041276262852"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fdata-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fdata-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fdata-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/HTTPArchive%2Fdata-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HTTPArchive","download_url":"https://codeload.github.com/HTTPArchive/data-pipeline/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230523771,"owners_count":18239446,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-20T02:12:29.544Z","updated_at":"2025-08-21T17:32:21.611Z","avatar_url":"https://github.com/HTTPArchive.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"**Deprecated**. 
The next version of the data pipeline: https://github.com/HTTPArchive/dataform\n\n# data-pipeline\n\nThe new HTTP Archive data pipeline built entirely on GCP\n\n![GitHub branch checks state](https://github.com/HTTPArchive/data-pipeline/actions/workflows/code-static-analysis.yml/badge.svg?branch=main)\n![GitHub branch checks state](https://github.com/HTTPArchive/data-pipeline/actions/workflows/linter.yml/badge.svg?branch=main)\n![GitHub branch checks state](https://github.com/HTTPArchive/data-pipeline/actions/workflows/unittest.yml/badge.svg?branch=main)\n![Coverage badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/wiki/HTTPArchive/data-pipeline/python-coverage-comment-action-badge.json)\n\n- [Diagrams](#diagrams)\n  * [GCP Workflows pipeline execution](#gcp-workflows-pipeline-execution)\n  * [Development workflow](#development-workflow)\n  * [Manually running the pipeline](#manually-running-the-pipeline)\n- [Run the pipeline](#run-the-pipeline)\n  * [Locally using the `run_*.sh` scripts](#locally-using-the-run_sh-scripts)\n  * [Running a flex template from the Cloud Console](#running-a-flex-template-from-the-cloud-console)\n  * [Publishing a Pub/Sub message](#publishing-a-pubsub-message)\n  * [Pipeline types](#pipeline-types)\n- [Inputs](#inputs)\n  * [Generating HAR manifest files](#generating-har-manifest-files)\n- [Outputs](#outputs)\n- [Builds and Deployments](#builds-and-deployments)\n  * [Build inputs and artifacts](#build-inputs-and-artifacts)\n  * [To build and deploy manually](#to-build-and-deploy-manually)\n- [Known issues](#known-issues)\n  * [Data Pipeline](#data-pipeline)\n    + [Temp table cleanup](#temp-table-cleanup)\n    + [Streaming pipeline](#streaming-pipeline)\n  * [Dataflow](#dataflow)\n    + [Logging](#logging)\n  * [Response cache-control max-age](#response-cache-control-max-age)\n  * [New file formats](#new-file-formats)\n  * [mimetypes and file 
extensions](#mimetypes-and-file-extensions)\n\n\u003csmall\u003e\u003ci\u003e\u003ca href='http://ecotrust-canada.github.io/markdown-toc/'\u003eTable of contents generated with markdown-toc\u003c/a\u003e\u003c/i\u003e\u003c/small\u003e\n\n## Introduction\n\nThis repo handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the `httparchive` dataset in BigQuery.\n\nA secondary pipeline is responsible for populating the Technology Report Firestore collections.\n\nThere are currently two main pipelines:\n\n- The `all` pipeline which saves data to the new `httparchive.all` dataset\n- The `combined` pipeline which saves data to the legacy tables. This processes both the `summary` tables (`summary_pages` and `summary_requests`) and the `non-summary` tables (`pages`, `requests`, `response_bodies`, etc.)\n\nThe secondary `tech_report` pipeline saves data to a Firestore database (e.g. `tech-report-apis-prod`) across various collections ([see `TECHNOLOGY_QUERIES` in constants.py](modules/constants.py))\n\nThe pipelines are run in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion, based on the code in the `main` branch which is deployed to GCP on each merge.\n\nThe [`data-pipeline` workflow](https://console.cloud.google.com/workflows/workflow/us-west1/data-pipeline/executions?project=httparchive) as defined by the [data-pipeline-workflows.yaml](./data-pipeline-workflows.yaml) file, runs the whole process from start to finish, including generating the manifest file for each of the two runs (desktop and mobile) and then starting the four dataflow jobs (desktop all, mobile all, desktop combined, mobile combined) in sequence to upload the HAR files to the BigQuery tables. 
This can be rerun in case of failure by [publishing a crawl-complete message](#publishing-a-pubsub-message), providing no data was saved to the final BigQuery tables.\n\nThe four [dataflow jobs](https://console.cloud.google.com/dataflow/jobs?project=httparchive) can also be [rerun](#run-the-pipeline) individually in case of failure, but the BigQuery tables need to be cleared down first (including any [lingering temp tables](https://github.com/HTTPArchive/data-pipeline/tree/update-readme-with-more-info#temp-table-cleanup))\n\nThe dataflow jobs can also be [run locally](#locally-using-the-run_sh-scripts), whereby the local code is uploaded to GCP for that particular run.\n\n## Diagrams\n\n### GCP Workflows pipeline execution\n\n```mermaid\nsequenceDiagram\n    participant PubSub\n    participant Workflows\n    participant Monitoring\n    participant Cloud Storage\n    participant Cloud Build\n    participant BigQuery\n    participant Dataflow\n\n    PubSub-\u003e\u003eWorkflows: crawl-complete event\n    loop until crawl queue is empty\n        Workflows-\u003e\u003eMonitoring: check crawl queue\n    end\n    rect rgb(191, 223, 255)\n        Note right of Workflows: generate HAR manifest\n        break when manifest already exists\n            Workflows-\u003e\u003eCloud Storage: check if HAR manifest exists\n        end\n        Workflows-\u003e\u003eCloud Build: trigger job\n        Cloud Build-\u003e\u003eCloud Build: list HAR files and generate manifest file\n        Cloud Build-\u003e\u003eCloud Storage: upload HAR manifest to GCS\n    end\n    rect rgb(191, 223, 255)\n        Note right of Workflows: check BigQuery and run Dataflow jobs\n        break when BigQuery records exist for table and date\n            Workflows-\u003e\u003eBigQuery: check all/combined tables for records in the given date\n        end\n        loop run jobs until retry limit is reached\n            Workflows-\u003e\u003eDataflow: run flex template\n            loop until job is 
complete\n                Workflows--\u003eDataflow: wait for job completion\n            end\n        end\n    end\n```\n\n### Development workflow\n\n```mermaid\nsequenceDiagram\n    autonumber\n    actor developer\n    participant Local as Local Environment / IDE\n    participant Dataflow\n    participant Cloud Build\n    participant Workflows\n\n    developer-\u003e\u003eLocal: create/update Dataflow code\n    developer-\u003e\u003eLocal: run Dataflow job with DirectRunner via run_*.py\n    developer-\u003e\u003eDataflow: run Dataflow job with DataflowRunner via run_pipeline_*.sh\n    developer-\u003e\u003eCloud Build: run build_flex_template.sh\n    developer-\u003e\u003eWorkflows: update flexTemplateBuildTag\n```\n\n### Manually running the pipeline\n\n```mermaid\nsequenceDiagram\n    actor developer\n    participant Local as Local Environment / IDE\n    participant Dataflow\n    participant PubSub\n    participant Workflows\n\n    alt run Dataflow job from local environment using the Dataflow runner\n        developer-\u003e\u003eLocal: clone repository and execute run_pipeline_*.sh\n    else run Dataflow job as a flex template\n        alt from local environment\n            developer-\u003e\u003eDataflow: clone repository and execute run_flex_template.sh\n        else from Google Cloud Console\n            developer-\u003e\u003eDataflow: use the Google Cloud Console to run a flex template as documented by GCP\n        end\n    else trigger a Google Workflows execution\n        alt\n            developer-\u003e\u003ePubSub: create a new message containing a HAR manifest path from GCS\n        else\n            developer-\u003e\u003eWorkflows: rerun a previously failed Workflows execution\n        end\n    end\n```\n\n## Run the pipeline\n\nDataflow jobs can be triggered several ways:\n- Locally using bash scripts (this can be used to test uncommitted code, or code on a non-`main` branch)\n- From the Google Cloud Console in Dataflow section by choosing to 
run a flex template (this can be used to run committed code for a particular dataflow pipeline only)\n- From the Google Cloud Console in Workflow section by choosing to execute a failed `data-pipeline` workflow again (this can be used to rerun failed parts of the workflow after the reason for the failure is fixed)\n- By publishing a Pub/Sub message to run the whole workflow (this kicks off the whole workflow and not just the pipeline, so it is how the batch kicks off jobs when the crawl is done, and can also be used to rerun the whole process manually when the manifest file was not generated)\n\n### Locally using the `run_*.sh` scripts\n\nThis method is best used when developing locally, as a convenience for running the pipeline's python scripts and GCP CLI commands.\n\n```shell\n# run the pipeline locally\n./run_pipeline_combined.sh\n./run_pipeline_all.sh\n\n# run the pipeline using a flex template\n./run_flex_template.sh all [...]\n./run_flex_template.sh combined [...]\n./run_flex_template.sh tech_report [...]\n```\n\n### Running a flex template from the Cloud Console\n\nThis method is useful for running individual dataflow jobs from the web console since it does not require a development environment.\n\nFlex templates accept additional parameters as mentioned in the GCP documentation below, while custom parameters are defined in `flex_template_metadata_*.json`\n\nhttps://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#specify-options\n\nhttps://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#run-a-flex-template-pipeline\n\nSteps:\n1. Locate the desired build tag (e.g. see `flexTemplateBuildTag` in the [data-pipeline.workflows.yaml](data-pipeline.workflows.yaml))\n2. From the Google Cloud Console, navigate to the Dataflow \u003e Jobs page\n3. Click \"CREATE JOB FROM TEMPLATE\" at the top of the page.\n4. Provide a \"Job name\"\n5. Change region to `us-west1` (as that's where we have most compute capacity)\n6. 
Choose \"Custom Template\" from the bottom of the \"Dataflow template\" drop down.\n7. Browse to the template directory by pasting `httparchive/dataflow/templates/` into the \"Template path\", ignoring the error saying this is not a file, and then clicking Browse to choose the actual file from that directory.\n8. Choose the pipeline type (e.g. all or combined) for the chosen build tag (e.g. `data-pipeline-combined-2023-02-10_03-55-04.json` - choose the latest one for `all` or `combined`)\n9. Expand \"Optional Parameters\" and provide an input for the \"GCS input file\" pointing to the manifest file (e.g. `gs://httparchive/crawls_manifest/chrome-Jul_1_2023.txt` for Desktop for July 2023 or `gs://httparchive/crawls_manifest/android-Jul_1_2023.txt` for Mobile for July 2023).\n10. (Optional) provide values for any additional parameters\n11. Click \"RUN JOB\"\n\n### Rerunning a failed workflow\n\nThis method is useful for running the entire workflow from the web console since it does not require a development environment. It is useful when part of the workflow failed for known reasons that have since been resolved. Previous steps should be skipped, as the workflow checks whether they have already been run.\n\nSteps:\n1. From the Google Cloud Console, navigate to the Workflow \u003e Workflows page\n2. Select the `data-pipeline` workflow\n3. 
In the Actions column click the three dots and select \"Execute again\"\n\n### Publishing a Pub/Sub message\n\nThis method is best used for serverlessly running the entire workflow, including logic to\n- block execution when the crawl is still running, by waiting for the crawl's Pub/Sub queue to drain\n- skip jobs where BigQuery tables have already been populated\n- automatically retry failed jobs\n\nPublishing a message containing the crawl's GCS path(s) will trigger a GCP workflow, including generating the HAR zip file for that run.\n\n``` shell\n# single path\ngcloud pubsub topics publish projects/httparchive/topics/crawl-complete --message \"gs://httparchive/crawls/android-Nov_1_2022\"\n\n# multiple paths must be comma separated, without spaces\ngcloud pubsub topics publish projects/httparchive/topics/crawl-complete --message \"gs://httparchive/crawls/chrome-Feb_1_2023,gs://httparchive/crawls/android-Feb_1_2023\"\n```\n\nNote that this can be run for an individual crawl (first example), or for both crawls (second example).\n\n### Pipeline types\n\nRunning the `combined` pipeline will produce summary and non-summary tables by default.\nSummary and non-summary outputs can be controlled using the `--pipeline_type` argument.\n\n```shell\n# example\n./run_pipeline_combined.sh --pipeline_type=summary\n\n./run_flex_template.sh combined --parameters pipeline_type=summary\n```\n\n## Inputs\n\nThis pipeline can read individual HAR files, or a single file containing a list of HAR file paths.\n\n```shell\n# Run the `all` pipeline on both desktop and mobile using their pre-generated manifests.\n./run_flex_template.sh all --parameters input_file=gs://httparchive/crawls_manifest/*-Nov_1_2022.txt\n\n# Run the `combined` pipeline on mobile using its manifest.\n./run_flex_template.sh combined --parameters input_file=gs://httparchive/crawls_manifest/android-Nov_1_2022.txt\n\n# Run the `combined` pipeline on desktop using its individual HAR files (much slower, not 
encouraged).\n./run_flex_template.sh combined --parameters input=gs://httparchive/crawls/chrome-Nov_1_2022\n```\n\nNote that the `run_pipeline_combined.sh` and `run_pipeline_all.sh` scripts use the parameters defined in the scripts themselves, and these cannot be overridden with command line parameters. These scripts are often useful for local testing of changes (local testing still results in the processing happening in GCP, but using code copied from your local machine).\n\nTo save to different tables for testing, temporarily edit `modules/constants.py` to prefix all the tables with `experimental_` (note that `experimental_parsed_css` is the current production table, so use `experimental_gc_parsed_css` instead for now).\n\n### Generating HAR manifest files\n\nThe pipeline can read a manifest file (text file containing GCS file paths separated by new lines for each HAR file). Follow the example to generate a manifest file:\n\n```shell\n# generate manifest files\nnohup gsutil ls gs://httparchive/crawls/chrome-Nov_1_2022 \u003e chrome-Nov_1_2022.txt 2\u003e chrome-Nov_1_2022.err \u0026\nnohup gsutil ls gs://httparchive/crawls/android-Nov_1_2022 \u003e android-Nov_1_2022.txt 2\u003e android-Nov_1_2022.err \u0026\n\n# watch for completion (i.e. file sizes will stop changing)\n#   if the err file increases in size, open and check for issues\nwatch ls -l ./*Nov*\n\n# upload to GCS\ngsutil -m cp ./*Nov*.txt gs://httparchive/crawls_manifest/\n```\n\n## Outputs\n\n- GCP DataFlow \u0026 Monitoring metrics - TODO: runtime metrics and dashboards\n- Dataflow temporary and staging artifacts in GCS\n- BigQuery (final landing zone)\n\n## Builds and Deployments\n\n[GitHub actions](.github/workflows/) are used to automate the build and deployment of Google Cloud Workflows and Dataflow Flex Templates. 
Actions are triggered on merges to the `main` branch, for specific files, and when other related GitHub actions have completed successfully.\n\n- [Deploy Dataflow Flex Template](.github/workflows/deploy-dataflow-flex-template.yml) will trigger when files related to the data pipeline are updated (e.g. python, Dockerfile, flex template metadata). This will build and upload the new builds (where they _can_ be used) and update the [data-pipeline workflows YAML](data-pipeline.workflows.yaml) with the latest build tag (based on datetime) and open a PR to merge that (so the new builds _will_ be used by the batch).\n- [Deploy Cloud Workflow](.github/workflows/deploy-cloud-workflow.yml) action will trigger when the [data-pipeline workflows YAML](data-pipeline.workflows.yaml) is updated, _or_ when the [Deploy Dataflow Flex Template](.github/workflows/deploy-dataflow-flex-template.yml) action has completed successfully.\n\nPRs with a title of `Bump dataflow flex template build tag` should be merged providing they are only updating the build datetime in the `flexTemplateBuildTag`. 
Check it has not zeroed the build datetime out (this can happen if the job errors in unusual ways).\n\n### Build inputs and artifacts\n\nGCP's documentation for creating and building Flex Templates are [linked here](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_and_build_a_container_image)\n\nThe following files are used for building and deploying Dataflow Flex Templates:\n- [.gcloudignore](.gcloudignore) excludes files from uploading to GCS for Cloud Build\n- [build_flex_template.sh](build_flex_template.sh) a helper script to initiate the Cloud Build\n- [cloudbuild.yaml](cloudbuild.yaml) is the configuration file for Cloud Build to create containers and template files in GCS (artifacts listed further below)\n- [Dockerfile](Dockerfile) used to create the job graph and start the Dataflow job\n- [flex_template_metadata_all.json](flex_template_metadata_all.json) and [flex_template_metadata_combined.json](flex_template_metadata_combined.json) define custom parameters to be validated when the template is run\n- [run_flex_template.sh](run_flex_template.sh) a helper script to run a Flex Template pipeline\n\n[Cloud Build](cloudbuild.yaml) is used to create Dataflow flex templates and upload them to Artifact Registry and Google Cloud Storage\n- Cloud Build [linked here](https://console.cloud.google.com/cloud-build/builds?project=httparchive)\n- Artifact Registry images [linked here](https://console.cloud.google.com/artifacts/docker/httparchive/us-west1/data-pipeline?project=httparchive)\n- Flex templates in GCS [gs://httparchive/dataflow/templates](https://console.cloud.google.com/storage/browser/httparchive/dataflow/templates?project=httparchive)\n\n### To build and deploy manually\n\nThe GitHub Actions can be triggered manually from the repository by following the documentation here for [Manually running a workflow](https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow).\n\n```mermaid\nflowchart LR\n    
Start((Start))\n    End((End))\n    A{Updating Dataflow?}\n    B[Run 'Deploy Dataflow Flex Template']\n    DDFTA[['Deploy Dataflow Flex Template' executes]]\n    C{Updating Cloud Workflows?}\n    D[Run 'Deploy Cloud Workflow']\n    DCWA[['Deploy Cloud Workflow' executes]]\n\n    Start --\u003e A\n    Start --\u003e C\n    A --\u003e B\n    B --\u003eDDFTA\n    DDFTA --\u003e|automatically triggers| DCWA\n    C --\u003e D\n    D --\u003e DCWA\n    DCWA --\u003e End\n```\n\nAlternatively, a combination of bash scripts and the Google Cloud Console can be used to manually deploy Cloud Workflows and Dataflow Flex Templates.\n\n```mermaid\nflowchart LR\n    Start((Start))\n    End((End))\n    A{Updating Dataflow?}\n    B[Run build_flex_template.sh]\n    C{Updating Cloud Workflows?}\n    D[Note the latest build tag from the script output]\n    E[Update the 'data-pipeline' workflow via the Cloud Console]\n\n    Start --\u003e A\n    Start --\u003e C\n    A --\u003e|Yes| B\n    A --\u003e|No| C\n    B --\u003e D\n    D --\u003e E\n    C --\u003e|Yes| E\n    E --\u003e End\n```\n\nThis can be started by making changes locally and then running the `run_pipeline_all.sh` or `run_pipeline_combined.sh` scripts (after changing input parameters in those scripts). 
Local code is copied to GCP for each run, so your shell needs to be authenticated to GCP and have permissions to run.\n\nTo save to different tables for testing, temporarily edit `modules/constants.py` to prefix all the tables with `experimental_` (note that `experimental_parsed_css` is the current production table, so use `experimental_gc_parsed_css` instead for now).\n\n## Logs\n\n- Error logs can be seen in the [Error reporting](https://console.cloud.google.com/errors;time=P30D?project=httparchive) screen of GCP.\n- Jobs can be seen in the [Dataflow -\u003e Jobs](https://console.cloud.google.com/dataflow/jobs?project=httparchive) screen of GCP.\n- Workflows can be seen in the [Workflows -\u003e Workflows](https://console.cloud.google.com/workflows?project=httparchive) screen of GCP.\n\n## Known issues\n\n### Data Pipeline\n\n#### Temp table cleanup\n\nSince this pipeline uses the `FILE_LOADS` BigQuery insert method, failures will leave behind temporary tables.\nUse the saved query below and replace the dataset name as desired.\n\nhttps://console.cloud.google.com/bigquery?sq=226352634162:82dad1cd1374428e8d6eaa961d286559\n\n```sql\nFOR field IN\n    (SELECT table_schema, table_name\n    FROM lighthouse.INFORMATION_SCHEMA.TABLES\n    WHERE table_name like 'beam_bq_job_LOAD_%')\nDO\n    EXECUTE IMMEDIATE format(\"drop table %s.%s;\", field.table_schema, field.table_name);\nEND FOR;\n```\n\n#### Streaming pipeline\n\nInitially this pipeline was developed to stream data into tables as individual HAR files became available in GCS from a live/running crawl. This allowed for results to be viewed faster, but came with additional burdens. 
For example:\n- Job failures and partial recovery/cleaning of tables.\n- Partial table population mid-crawl led to consumer confusion since they were previously accustomed to full tables being available.\n- The Dataflow API for streaming inserts buried some low-level configuration, leading to errors which were opaque and difficult to troubleshoot.\n\n### Dataflow\n\n#### Logging\n\n\u003e The work item requesting state read is no longer valid on the backend\n\nThis log message is benign and expected when using an auto-scaling pipeline\nhttps://cloud.google.com/dataflow/docs/guides/common-errors#work-item-not-valid\n\n### Response cache-control max-age\n\nVarious parsing issues due to unhandled cases\n\n### New file formats\n\nNew file formats from responses will be noted in WARNING logs\n\n### mimetypes and file extensions\n\nUsing ported custom logic from legacy PHP rather than standard libraries produces missing values and inconsistencies\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhttparchive%2Fdata-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhttparchive%2Fdata-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhttparchive%2Fdata-pipeline/lists"}