{"id":13528037,"url":"https://github.com/HTTPArchive/bigquery","last_synced_at":"2025-04-01T11:30:53.931Z","repository":{"id":8780051,"uuid":"10468186","full_name":"HTTPArchive/bigquery","owner":"HTTPArchive","description":"BigQuery import and processing pipelines","archived":false,"fork":false,"pushed_at":"2024-04-05T10:28:42.000Z","size":2193,"stargazers_count":65,"open_issues_count":13,"forks_count":19,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-04-14T00:31:56.816Z","etag":null,"topics":["bigquery"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HTTPArchive.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null},"funding":{"github":null,"patreon":null,"open_collective":"httparchive","ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"custom":null}},"created_at":"2013-06-04T00:58:31.000Z","updated_at":"2024-01-21T06:44:21.000Z","dependencies_parsed_at":"2024-04-13T16:00:23.732Z","dependency_job_id":null,"html_url":"https://github.com/HTTPArchive/bigquery","commit_stats":{"total_commits":343,"total_committers":8,"mean_commits":42.875,"dds":0.4868804664723032,"last_synced_commit":"4eb9354aa665f0bf237374ef5980dc97c6a14c33"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fbigquery","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fbigquery/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fbigquery/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HTTPArchive%2Fbigquery/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HTTPArchive","download_url":"https://codeload.github.com/HTTPArchive/bigquery/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246631644,"owners_count":20808723,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery"],"created_at":"2024-08-01T06:02:10.896Z","updated_at":"2025-04-01T11:30:53.295Z","avatar_url":"https://github.com/HTTPArchive.png","language":"Jupyter Notebook","funding_links":["https://opencollective.com/httparchive"],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"# HTTP Archive + BigQuery data import\n\n_Note: you don't need to import this data yourself, the BigQuery dataset is public! [Getting started](https://github.com/HTTPArchive/httparchive.org/blob/master/docs/gettingstarted_bigquery.md)._\n\nHowever, if you do want your own private copy of the dataset... The following import and sync scripts will help you import the [HTTP Archive dataset](http://httparchive.org/downloads.php) into BigQuery and keep it up to date.\n\n```bash\n$\u003e sh sync.sh Jun_15_2013\n$\u003e sh sync.sh mobile_Jun_15_2013\n```\n\nThat's all there is to it. 
The sync script handles all the necessary processing:\n\n* Archives are fetched from archive.org (and cached locally)\n* Archived CSV is transformed to BigQuery-compatible escaping\n  * You will need `pigz` installed for parallel compression\n* Request files are split into \u003c1GB compressed CSVs\n* The resulting pages and requests data are synced to a Google Storage bucket\n* A BigQuery import is kicked off for each of the compressed archives on Google Storage\n\nAfter the upload is complete, a copy of the latest tables can be made with:\n\n```bash\n$\u003e bq.py cp runs.2013_06_15_pages runs.latest_pages\n$\u003e bq.py cp runs.2013_06_15_pages_mobile runs.latest_pages_mobile\n$\u003e bq.py cp runs.2013_06_15_requests runs.latest_requests\n$\u003e bq.py cp runs.2013_06_15_requests_mobile runs.latest_requests_mobile\n```\n\n(MIT License) - Copyright (c) 2013 Ilya Grigorik\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHTTPArchive%2Fbigquery","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FHTTPArchive%2Fbigquery","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHTTPArchive%2Fbigquery/lists"}