BigQuery-Python
===============

Simple Python client for interacting with Google BigQuery.

This client provides an API for retrieving and inserting BigQuery data by wrapping Google's low-level API client library. It also provides facilities that make it convenient to access data that is tied to an App Engine appspot, such as request logs.

[Documentation](http://tylertreat.github.io/BigQuery-Python/)

# Installation

`pip install bigquery-python`

# Basic Usage

```python
from bigquery import get_client

# BigQuery project id as listed in the Google Developers Console.
project_id = 'project_id'

# Service account email address as listed in the Google Developers Console.
service_account = '[email protected]'

# PKCS12 or PEM key provided by Google.
key = 'key.pem'

client = get_client(project_id, service_account=service_account,
                    private_key_file=key, readonly=True)

# JSON key provided by Google
json_key = 'key.json'

client = get_client(json_key_file=json_key, readonly=True)

# Submit an async query.
job_id, _results = client.query('SELECT * FROM dataset.my_table LIMIT 1000')

# Check if the query has finished running.
complete, row_count = client.check_job(job_id)

# Retrieve the results.
results = client.get_query_rows(job_id)
```
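
Each row returned by `get_query_rows` is a plain Python dict keyed by column name, so the results can be consumed directly; a minimal sketch:

```python
# Each row is a dict mapping column names to values.
for row in results:
    print(row)
```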

# Executing Queries

The BigQuery client allows you to execute raw queries against a dataset. The `query` method inserts a query job into BigQuery. By default, `query` runs asynchronously with a `timeout` of `0`; when a non-zero timeout is specified, the call waits for the results and raises a `BigQueryTimeoutException` on timeout.

When you run an async query, you can use the returned `job_id` to poll for job status later with `check_job`.

```python
# Submit an async query
job_id, _results = client.query('SELECT * FROM dataset.my_table LIMIT 1000')

# Do other work.

# Poll for query completion.
complete, row_count = client.check_job(job_id)

# Retrieve the results.
if complete:
    results = client.get_query_rows(job_id)
```

You can also specify a non-zero timeout value if you want your query to be synchronous.

```python
from bigquery.errors import BigQueryTimeoutException

# Submit a synchronous query.
try:
    _job_id, results = client.query('SELECT * FROM dataset.my_table LIMIT 1000', timeout=10)
except BigQueryTimeoutException:
    print("Timeout")
```

## Query Builder

The `query_builder` module provides an API for generating query strings that can be run using the BigQuery client.

```python
from bigquery.query_builder import render_query

selects = {
    'start_time': {
        'alias': 'Timestamp',
        'format': 'INTEGER-FORMAT_UTC_USEC'
    }
}

conditions = [
    {
        'field': 'Timestamp',
        'type': 'INTEGER',
        'comparators': [
            {
                'condition': '>=',
                'negate': False,
                'value': 1399478981
            }
        ]
    }
]

grouping = ['Timestamp']

having = [
    {
        'field': 'Timestamp',
        'type': 'INTEGER',
        'comparators': [
            {
                'condition': '==',
                'negate': False,
                'value': 1399478981
            }
        ]
    }
]

order_by = {'fields': ['Timestamp'], 'direction': 'desc'}

query = render_query(
    'dataset',
    ['table'],
    select=selects,
    conditions=conditions,
    groupings=grouping,
    having=having,
    order_by=order_by,
    limit=47
)

job_id, _ = client.query(query)
```
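
`render_query` returns a plain SQL string, so it can be inspected before being submitted or run synchronously with a timeout, as shown earlier; a minimal sketch reusing the client from the basic-usage example:

```python
# Inspect the generated SQL, then run it synchronously; the 10-second timeout
# is an arbitrary choice for this sketch.
print(query)
_, results = client.query(query, timeout=10)
```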

# Managing Tables

The BigQuery client provides facilities to manage dataset tables, including creating and deleting tables, checking whether a table exists, and retrieving table metadata.

```python
# Create a new table.
schema = [
    {'name': 'foo', 'type': 'STRING', 'mode': 'nullable'},
    {'name': 'bar', 'type': 'FLOAT', 'mode': 'nullable'}
]
created = client.create_table('dataset', 'my_table', schema)

# Delete an existing table.
deleted = client.delete_table('dataset', 'my_table')

# Check if a table exists.
exists = client.check_table('dataset', 'my_table')

# Get a table's full metadata. Includes numRows, numBytes, etc.
# See: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables
metadata = client.get_table('dataset', 'my_table')
```
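
These calls compose into a create-if-missing pattern; a minimal sketch using only the methods above:

```python
# Create the table only when it does not already exist.
if not client.check_table('dataset', 'my_table'):
    created = client.create_table('dataset', 'my_table', schema)
```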

There is also functionality for retrieving tables associated with a Google App Engine appspot, assuming table names are of the form appid_YYYY_MM or YYYY_MM_appid. This allows tables within a date range to be selected and queried.

```python
# Get appspot tables falling within a start and end time.
from datetime import datetime, timedelta
range_end = datetime.utcnow()
range_start = range_end - timedelta(weeks=12)
tables = client.get_tables('dataset', 'appid', range_start, range_end)
```

# Inserting Data

The client provides an API for inserting data into a BigQuery table. The optional last parameter names the field whose value is used as an insert ID to avoid duplicate entries.

```python
# Insert data into table.
rows = [
    {'one': 'ein', 'two': 'zwei'},
    {'id': 'NzAzYmRiY', 'one': 'uno', 'two': 'dos'},
    {'id': 'NzAzYmRiY', 'one': 'ein', 'two': 'zwei'}  # duplicate entry
]

inserted = client.push_rows('dataset', 'table', rows, 'id')
```
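
`push_rows` returns a boolean indicating whether the insert was accepted, so failures can be detected; a minimal sketch:

```python
# push_rows returns True when the insert was accepted.
if not inserted:
    print("Insert failed")
```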

# Write Query Results to Table

You can write query results directly to a table. When either the dataset or table parameter is omitted, the results are written to a temporary table.
```python
# Write to a permanent table.
job = client.write_to_table('SELECT * FROM dataset.original_table LIMIT 100',
                            'dataset',
                            'table')
try:
    job_resource = client.wait_for_job(job, timeout=60)
    print(job_resource)
except BigQueryTimeoutException:
    print("Timeout")

# Write to a permanent table with a UDF in the query string.
external_udf_uris = ["gs://bigquery-sandbox-udf/url_decode.js"]
query = """SELECT requests, title
FROM
  urlDecode(
    SELECT
      title, sum(requests) AS num_requests
    FROM
      [fh-bigquery:wikipedia.pagecounts_201504]
    WHERE language = 'fr'
    GROUP EACH BY title
  )
WHERE title LIKE '%ç%'
ORDER BY requests DESC
LIMIT 100
"""
job = client.write_to_table(
    query,
    'dataset',
    'table',
    external_udf_uris=external_udf_uris
)

try:
    job_resource = client.wait_for_job(job, timeout=60)
    print(job_resource)
except BigQueryTimeoutException:
    print("Timeout")

# Write to a temporary table.
job = client.write_to_table('SELECT * FROM dataset.original_table LIMIT 100')
try:
    job_resource = client.wait_for_job(job, timeout=60)
    print(job_resource)
except BigQueryTimeoutException:
    print("Timeout")

```
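
The value returned by `wait_for_job` is a standard BigQuery job resource, so the destination table (including a temporary table chosen by BigQuery) can be read from it; a minimal sketch, assuming the job completed successfully:

```python
# For query jobs the destination table is recorded under
# configuration.query.destinationTable in the job resource.
dest = job_resource['configuration']['query']['destinationTable']
print(dest['datasetId'], dest['tableId'])
```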

# Import data from Google Cloud Storage

```python
# JOB_SOURCE_FORMAT_JSON is assumed to be defined in bigquery.client.
from bigquery.client import JOB_SOURCE_FORMAT_JSON

schema = [{"name": "username", "type": "string", "mode": "nullable"}]
job = client.import_data_from_uris(['gs://mybucket/mydata.json'],
                                   'dataset',
                                   'table',
                                   schema,
                                   source_format=JOB_SOURCE_FORMAT_JSON)

try:
    job_resource = client.wait_for_job(job, timeout=60)
    print(job_resource)
except BigQueryTimeoutException:
    print("Timeout")
```

# Export data to Google Cloud Storage

```python
job = client.export_data_to_uris(['gs://mybucket/mydata.json'],
                                 'dataset',
                                 'table')
try:
    job_resource = client.wait_for_job(job, timeout=60)
    print(job_resource)
except BigQueryTimeoutException:
    print("Timeout")
```

# Managing Datasets

The client provides an API for listing, creating, deleting, updating and patching datasets.

```python
# List datasets
datasets = client.get_datasets()

# Create dataset
dataset = client.create_dataset('mydataset', friendly_name="My Dataset", description="A dataset created by me")

# Get dataset
client.get_dataset('mydataset')

# Delete dataset
client.delete_dataset('mydataset')
client.delete_dataset('mydataset', delete_contents=True) # delete even if it contains data

# Update dataset
client.update_dataset('mydataset', friendly_name="mon Dataset") # description is deleted

# Patch dataset
client.patch_dataset('mydataset', friendly_name="mon Dataset") # friendly_name changed; description is preserved

# Check if dataset exists.
exists = client.check_dataset('mydataset')
```
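
As with tables, these calls compose into an ensure-exists pattern; a minimal sketch using only the methods above:

```python
# Create the dataset only when it does not already exist.
if not client.check_dataset('mydataset'):
    client.create_dataset('mydataset', friendly_name="My Dataset")
```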

# Creating a schema from a sample record

```python
from bigquery import schema_from_record

schema_from_record({"id": 123, "posts": [{"id": 123, "text": "this is a post"}], "username": "bob"})
```
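
The generated schema is a list of field definitions in the shape expected by `create_table`, so the two can be combined to create a table directly from a sample record; a minimal sketch (the exact type names produced may differ, and `dataset`/`my_table` are placeholders):

```python
# Derive a schema from a sample record and use it to create a new table.
record = {"id": 123, "posts": [{"id": 123, "text": "this is a post"}], "username": "bob"}
schema = schema_from_record(record)
created = client.create_table('dataset', 'my_table', schema)
```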

# Contributing

Requirements for contributing:

- Branch off master, PR back to master.
- Your code should pass [Flake8](http://flake8.readthedocs.org/en/latest/).
- Unit test coverage is required.
- Good docstrings are required.
- Good [commit messages](http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html) are required.