Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jehiah/socrata_to_bigquery
A tool to copy public data to BigQuery
https://github.com/jehiah/socrata_to_bigquery
bigquery opendata socrata
Last synced: 2 months ago
JSON representation
A tool to copy public data to BigQuery
- Host: GitHub
- URL: https://github.com/jehiah/socrata_to_bigquery
- Owner: jehiah
- License: mit
- Created: 2019-10-12T21:03:00.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-09-20T20:47:06.000Z (3 months ago)
- Last Synced: 2024-10-15T14:41:07.427Z (2 months ago)
- Topics: bigquery, opendata, socrata
- Language: Go
- Size: 108 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# socrata_to_bigquery
This tool facilitates replicating Open Data from the [Socrata Platform](https://socrata.com/) to [Google BigQuery](https://cloud.google.com/bigquery/)
WARNING: This is Alpha Release Software. It might be useful, but it will be rough around the edgesMany Governemnt Open-Data projects are hosted on Socrata, and searchable through the [Open Data Network](https://www.opendatanetwork.com/)
* https://opendata.cityofnewyork.us/
* https://data.ny.gov/
* etc...## Installing
```bash
go get github.com/jehiah/socrata_to_bigquery/...
```## Quick Start
1. `socrata_to_bigquery init`
2. `socrata_to_bigquery download`
3. `socrata_to_bigquery sync`
## Documentation
### `init`
`init` initializes a yaml config file for synchronizing a Socrata dataset to BigQuery.
Usage: `init -api-endpoint=https://path/to/api [-project-id -bq-dataset]`
i.e. `socrata_to_bigquery init -api-endpoint=https://data.cityofnewyork.us/resource/nc67-uf89 -data-dir=/path/to/data`
API endpoint is the published Socrata API endpoint for a dataset.
This config file defines all fields that will be loaded to BigQuery, and the target bigquery project and dataset. Optionally it can defines custom conversion from TEXT socrata field to richer DATE or TIME field types. It also defines the target bigquery field names.
For example, this `issue_date` is a `"text"` format in Socrata but it will be parsed using the Go format string `"01/02/2006"` and stored in a `DATE` column. `on_error = "SKIP_ROW"` indicates that any rows that do not meet this date format will be skipped.
```
[schema.issue_date]
bigquery_type = "DATE"
description = "Issue Date"
# example_values = "\"03/06/2017\", \"10/07/2017\", \"05/29/2016\""# SKIP_VALUE | SKIP_ROW | ERROR
on_error = "SKIP_ROW"
required = true
source_field = "issue_date"
source_field_type = "text"# the time.Parse format string
time_format = "01/02/2006"
``````
Usage of socrata_to_bigquery init:
-api-endpoint string
The URL to the socrata dataset
-bq-dataset string
BigQuery Dataset
-data-dir string
directory to create config file in
-debug
show debug output
-filename string
defaults to ${NAME}-${ID}.toml
-project-id string
Google Cloud Project ID
-socrata-app-token string
Socrata App Token (also src SOCRATA_APP_TOKEN env)
```### `download`
Download does an initial copy from Socrata to Bigquery
Usage: `socrata_to_bigquery download /path/to/config.yaml`
i.e. `socrata_to_bigquery download open-parking-and-camera-violations-nc67-uf89.toml`
### `sync`
Sync does a periodic copy of new records from Socrata to BigQuery copying only new records since the most recent record in BigQuery.
Usage: `socrata_to_bigquery sync /path/to/config.yaml`
i.e. `socrata_to_bigquery sync open-parking-and-camera-violations-nc67-uf89.toml`
## Setup
Socrata API Token
https://dev.socrata.com/docs/authentication.html
```bash
export SOCRATA_APP_TOKEN=...
```