Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/HTTPArchive/bigquery
BigQuery import and processing pipelines
- Host: GitHub
- URL: https://github.com/HTTPArchive/bigquery
- Owner: HTTPArchive
- Created: 2013-06-04T00:58:31.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2024-03-07T05:39:10.000Z (8 months ago)
- Last Synced: 2024-03-23T01:14:32.298Z (8 months ago)
- Topics: bigquery
- Language: Jupyter Notebook
- Homepage:
- Size: 2.07 MB
- Stars: 65
- Watchers: 14
- Forks: 19
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
README
# HTTP Archive + BigQuery data import
_Note: you don't need to import this data yourself, the BigQuery dataset is public! [Getting started](https://github.com/HTTPArchive/httparchive.org/blob/master/docs/gettingstarted_bigquery.md)._
However, if you do want your own private copy of the dataset... The following import and sync scripts will help you import the [HTTP Archive dataset](http://httparchive.org/downloads.php) into BigQuery and keep it up to date.
```bash
$> sh sync.sh Jun_15_2013
$> sh sync.sh mobile_Jun_15_2013
```

That's all there is to it. The sync script handles all the necessary processing:
* Archives are fetched from archive.org (and cached locally)
* Archived CSV is transformed to use BigQuery-compatible escaping
* You will need `pigz` installed for parallel compression
* Request files are split into compressed CSVs of less than 1 GB each
* The resulting pages and request data are synced to a Google Storage bucket
* A BigQuery import is kicked off for each of the compressed archives on Google Storage

After the upload is complete, a copy of the latest tables can be made with:
```bash
$> bq.py cp runs.2013_06_15_pages runs.latest_pages
$> bq.py cp runs.2013_06_15_pages_mobile runs.latest_pages_mobile
$> bq.py cp runs.2013_06_15_requests runs.latest_requests
$> bq.py cp runs.2013_06_15_requests_mobile runs.latest_requests_mobile
```

(MIT License) - Copyright (c) 2013 Ilya Grigorik
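
The table names in the copy step appear to follow from the crawl label passed to `sync.sh` (for example, `Jun_15_2013` becomes `runs.2013_06_15_pages`). The sketch below prints the matching `bq.py cp` commands for a label, without running them. The function name `latest_copy_commands` is hypothetical and not part of this repository. The label-to-table mapping and the zero-padding of single-digit days are assumptions inferred from the examples in this README, so check the output against your actual table names before using it.

```bash
# Hypothetical helper (not part of this repo): print, but do not run, the
# bq.py cp commands that would refresh the "latest" tables for one crawl label.
# Assumes labels look like "Jun_15_2013" or "mobile_Jun_15_2013".
# Written in POSIX sh, so its variables are global to the calling shell.
latest_copy_commands() {
  label=$1
  suffix=""

  # A leading "mobile_" marks the mobile crawl; its tables end in "_mobile".
  case "$label" in
    mobile_*) suffix="_mobile"; label=${label#mobile_} ;;
  esac

  # Split "Mon_DD_YYYY" on underscores.
  mon=${label%%_*}
  rest=${label#*_}
  day=${rest%%_*}
  year=${rest#*_}

  # Map the month abbreviation to a two-digit number.
  case "$mon" in
    Jan) mm=01 ;; Feb) mm=02 ;; Mar) mm=03 ;; Apr) mm=04 ;;
    May) mm=05 ;; Jun) mm=06 ;; Jul) mm=07 ;; Aug) mm=08 ;;
    Sep) mm=09 ;; Oct) mm=10 ;; Nov) mm=11 ;; Dec) mm=12 ;;
    *) echo "unrecognized label: $1" >&2; return 1 ;;
  esac

  # Assumption: single-digit days are zero-padded in table names.
  case "$day" in
    [0-9]) day="0$day" ;;
    [0-9][0-9]) ;;
    *) echo "unrecognized label: $1" >&2; return 1 ;;
  esac

  # Require a four-digit year.
  case "$year" in
    [0-9][0-9][0-9][0-9]) ;;
    *) echo "unrecognized label: $1" >&2; return 1 ;;
  esac

  for kind in pages requests; do
    echo "bq.py cp runs.${year}_${mm}_${day}_${kind}${suffix} runs.latest_${kind}${suffix}"
  done
}
```

Calling `latest_copy_commands mobile_Jun_15_2013` prints the two mobile copy commands shown above, which you can review and then paste into your shell. Printing the commands instead of executing them keeps the sketch safe to try without BigQuery credentials.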