# HTTP Archive + BigQuery data import

_Note: you don't need to import this data yourself; the BigQuery dataset is public! [Getting started](https://github.com/HTTPArchive/httparchive.org/blob/master/docs/gettingstarted_bigquery.md)._
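
If all you need is to explore the data, you can query the public tables directly. A minimal example, assuming the public project is named `httparchive` and exposes the `runs.latest_pages` table described below:

```bash
$> bq.py query "SELECT COUNT(*) FROM [httparchive:runs.latest_pages]"
```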

However, if you do want your own private copy of the dataset, the following import and sync scripts will help you import the [HTTP Archive dataset](http://httparchive.org/downloads.php) into BigQuery and keep it up to date.

```bash
$> sh sync.sh Jun_15_2013
$> sh sync.sh mobile_Jun_15_2013
```

That's all there is to it. The sync script handles all the necessary processing (a rough sketch of the equivalent commands follows the list):

* Archives are fetched from archive.org (and cached locally)
* Archived CSVs are transformed to BigQuery-compatible escaping
* You will need `pigz` installed for parallel compression
* Request files are split into <1GB compressed CSVs
* Resulting pages and requests data is synced to a Google Storage bucket
* A BigQuery import is kicked off for each of the compressed archives on Google Storage
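
For orientation, the list above corresponds roughly to the shell commands below. This is a hedged sketch, not the contents of `sync.sh`: the archive URL, the `fix_escaping.sh` helper, the bucket name, and the chunk size are illustrative assumptions, and it expects GNU `split`, `pigz`, `gsutil`, and `bq.py` on the PATH.

```bash
# Illustrative sketch of one sync run; names and URLs are assumptions.
LABEL=Jun_15_2013

# Fetch the archive from archive.org, skipping the download if cached locally
wget -nc "https://archive.org/download/httparchive/httparchive_${LABEL}_requests.csv.gz"

# Re-escape the CSV for BigQuery (fix_escaping.sh is a hypothetical helper),
# then split on line boundaries and recompress each chunk in parallel with
# pigz so the compressed CSVs stay under the 1GB target noted above
gunzip -c "httparchive_${LABEL}_requests.csv.gz" \
  | ./fix_escaping.sh \
  | split -C 3G --filter='pigz > "$FILE.csv.gz"' - "requests_${LABEL}_"

# Sync the chunks to a Google Storage bucket
gsutil -m cp "requests_${LABEL}_"*.csv.gz gs://my-httparchive-bucket/

# Kick off a BigQuery import for the compressed archives
# (assumes the destination table and its schema already exist)
bq.py load runs.2013_06_15_requests "gs://my-httparchive-bucket/requests_${LABEL}_*.csv.gz"
```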

After the upload is complete, a copy of the latest tables can be made with:

```bash
$> bq.py cp runs.2013_06_15_pages runs.latest_pages
$> bq.py cp runs.2013_06_15_pages_mobile runs.latest_pages_mobile
$> bq.py cp runs.2013_06_15_requests runs.latest_requests
$> bq.py cp runs.2013_06_15_requests_mobile runs.latest_requests_mobile
```
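
To sanity-check the copies, you can list the tables in the dataset and inspect one of them with the standard `bq` commands:

```bash
$> bq.py ls runs
$> bq.py show runs.latest_pages
```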

(MIT License) - Copyright (c) 2013 Ilya Grigorik