https://github.com/tinybirdco/data-rise-gcp

Repo for the GCP-Tinybird workshop

Host: GitHub
URL: https://github.com/tinybirdco/data-rise-gcp
Owner: tinybirdco
Created: 2023-06-21T10:28:19.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2023-06-26T21:55:41.000Z (almost 3 years ago)
Last Synced: 2025-02-27T05:24:49.934Z (over 1 year ago)
Language: Python
Size: 19.5 KB
Stars: 0
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# data-rise-gcp

Repo for the GCP-Tinybird workshop

## Part 1: Creating the first API Endpoint

For this first part of the workshop the plan is to ingest dimensional data from BigQuery, historical data from Google Cloud Storage, and realtime data from Pub/Sub.

Then, mixing the 3 sources, we will create a dynamic API Endpoint.

### Create your account and your first Workspace

Go to [https://ui.tinybird.co/signup](https://ui.tinybird.co/signup) to log in / sign up, and create a new Workspace.

Choose your region and keep the default empty Workspace; we will not be using any Starter Kits for this workshop.

### Ingest dimensional data from BigQuery

Upload the [products CSV](./aux/products.csv) file to BigQuery.

In your Tinybird Workspace, create a new BigQuery Data Source from that BigQuery table following the [BigQuery connector documentation](https://www.tinybird.co/docs/ingest/bigquery).

### Ingest from GCS

Let's do this step first, because the backfill may take a bit longer.

First, let's create a Data Source from this small [parquet file](./aux/ecom_events.parquet) we have as a sample. Just drag and drop the file into the Tinybird UI. You can adjust the data types, the Sorting Key, and so on.

Now that the Data Source is created, we will ingest some bigger parquet files from GCS.

Copy your admin token (or a new one with append rights to your newly created _ecom_events_ Data Source), edit the [backfill_gcs.sh](./aux/backfill_gcs.sh) script, and run it.

Note: for private files, or to ingest every time there are new files in the bucket, you can follow the [Ingest from GCS guide](https://www.tinybird.co/docs/guides/ingest-from-google-gcs).
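
For orientation, a backfill like this usually boils down to asking the Tinybird Data Sources API to append from a remote Parquet URL. The sketch below is not the contents of `backfill_gcs.sh`; it only builds (and does not send) such a request in Python. The API host depends on your region, and the bucket URL and token are placeholders, so treat every value here as an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions: adjust the host to your Workspace region; the file URL and
# token used below are placeholders, not values taken from this repo.
TINYBIRD_HOST = "https://api.tinybird.co"


def build_append_request(token: str, datasource: str, file_url: str) -> Request:
    """Build (but do not send) a POST that appends a remote Parquet file."""
    params = urlencode({
        "name": datasource,
        "mode": "append",
        "format": "parquet",
        "url": file_url,
    })
    return Request(
        f"{TINYBIRD_HOST}/v0/datasources?{params}",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


req = build_append_request(
    token="<your-token>",
    datasource="ecom_events",
    file_url="https://storage.googleapis.com/<bucket>/<file>.parquet",
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Looping this over several file URLs gives you a simple backfill; check the response of each call before moving on to the next file.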

### Send data from Pub/Sub

Follow the steps in the [Ingest from Pub/Sub guide](https://www.tinybird.co/docs/guides/ingest-from-google-pubsub).

Note: do not use the sample script from the guide; use [this one](./aux/pub_sub_demo.py) instead, editing lines 8 and 9 with your project ID and topic.

```python
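project_id = "your-project-id"  # placeholder: replace with your GCP project ID
topic_id = "your-topic-id"  # placeholder: replace with your Pub/Sub topic ID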
```

Note 2: do not create a Materialized View to decode the messages yet; we will do that at query time.
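
To see why a decode step is needed at all, here is a minimal sketch of how a payload travels through a Pub/Sub push delivery: the event is base64-encoded inside `message.data`. The event fields and the subscription path are hypothetical, chosen only for illustration.

```python
import base64
import json

# Hypothetical event; the field names are illustrative, not taken from the repo.
event = {"event": "sale", "product_id": "p-001", "price": 19.9}

# Shape of a Pub/Sub push delivery: the payload is base64-encoded
# inside message.data.
envelope = {
    "message": {
        "data": base64.b64encode(json.dumps(event).encode("utf-8")).decode("ascii"),
        "messageId": "1",
    },
    "subscription": "projects/<project>/subscriptions/<subscription>",
}

# The same two steps the query-time node will perform: base64-decode the
# data field, then extract values from the resulting JSON.
decoded = json.loads(base64.b64decode(envelope["message"]["data"]))
print(decoded["event"], decoded["product_id"])
```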

### Create an API Endpoint

Let's create a Pipe with several nodes:

1. A first node to decode the messages from Pub/Sub. You'll need to use `base64Decode()` and the `JSONExtract` functions as shown in the [example](https://www.tinybird.co/docs/guides/ingest-from-google-pubsub.html#step-4-decode-message-data).
1. A second node that queries the previous node, where we decoded the Pub/Sub messages, and keeps only the _sale_ events for products in the _long sleeve_ category.
1. A third node that applies the same filter to the historical data and keeps only today's sales.
1. A fourth node to make a `union all` of nodes 2 and 3, and an aggregation (a `count()` is fine) to know the number of sales.
1. A fifth node to enrich the ranking: show the product _name_ instead of its _id_, and add _total_revenue_ (price * units sold).

And let's create an API Endpoint from there.
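
If the node chain feels abstract, the following is a rough Python model of nodes 2 to 5 over in-memory rows. It is not Tinybird SQL, and the column names, the sample rows, and the choice to look up the category in the products table are all assumptions made for the sketch.

```python
from collections import Counter
from datetime import date

# Hypothetical rows; column names and values are made up for illustration.
today = date.today()
products = {
    "p1": {"name": "Blue long sleeve", "category": "long sleeve", "price": 20.0},
    "p2": {"name": "Red cap", "category": "caps", "price": 10.0},
}
realtime = [  # stands in for node 1 output: already-decoded Pub/Sub messages
    {"event": "sale", "product_id": "p1", "date": today},
    {"event": "view", "product_id": "p1", "date": today},
    {"event": "sale", "product_id": "p2", "date": today},
]
historical = [
    {"event": "sale", "product_id": "p1", "date": today},
    {"event": "sale", "product_id": "p1", "date": date(2023, 1, 1)},
]


def keep(row, event="sale", category="long sleeve"):
    """Shared filter for nodes 2 and 3."""
    return row["event"] == event and products[row["product_id"]]["category"] == category


node_2 = [r for r in realtime if keep(r)]
node_3 = [r for r in historical if keep(r) and r["date"] == today]
# Node 4: union all of nodes 2 and 3, then count() grouped by product.
node_4 = Counter(r["product_id"] for r in node_2 + node_3)
# Node 5: swap id for name and compute total_revenue = price * units sold.
node_5 = [
    {"name": products[pid]["name"], "sales": n, "total_revenue": products[pid]["price"] * n}
    for pid, n in node_4.most_common()
]
print(node_5)
```

Each Python variable maps to one node of the Pipe, which is also a handy way to debug: inspect the intermediate nodes one by one before publishing the last one.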

### Make it dynamic

Make the endpoint accept query params with the templating language. Check the syntax [here](https://www.tinybird.co/docs/query-parameters).

For example, let's make the `category` and `event` types dynamic, and let's document them so our frontend colleagues know which values they can pass.
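
From the consumer side, calling the dynamic endpoint could look like the sketch below. The Pipe name `sales_ranking` is hypothetical, the host depends on your region, and the token is a placeholder; only the `category` and `event` parameter names come from the example above.

```python
from urllib.parse import urlencode

# Assumptions: the host depends on your region; "sales_ranking" is a
# hypothetical Pipe name; the token is a placeholder.
TINYBIRD_HOST = "https://api.tinybird.co"


def endpoint_url(pipe: str, token: str, **params: str) -> str:
    """Return the JSON endpoint URL for a published Pipe with query params."""
    query = urlencode({**params, "token": token})
    return f"{TINYBIRD_HOST}/v0/pipes/{pipe}.json?{query}"


url = endpoint_url("sales_ranking", "<read-token>", category="long sleeve", event="sale")
print(url)
```

Omitting a parameter from the call is a quick way to check that the defaults you set in the template behave as documented.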

## Part 2: Some optimizations with Materialized Views

### Create a Materialized View to decode Pub/Sub messages

With [Materialized Views](https://www.tinybird.co/docs/concepts/materialized-views) we can take the result of a Pipe and persist it in a Data Source.
Choose Sorting Key and Data Types wisely. Recommended reads after the workshop: [Best Practices for faster SQL](https://www.tinybird.co/docs/guides/best-practices-for-faster-sql) and [Thinking in Tinybird](https://www.tinybird.co/blog-posts/thinking-in-tinybird).

Compare processed data —using [Service Data Sources](https://www.tinybird.co/docs/monitoring/service-datasources) like `tinybird.pipe_stats_rt`— to see the difference between querying the MV and having to decode at query time.
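
One possible way to run that comparison is a query against `tinybird.pipe_stats_rt` through the Query API. The sketch below only builds the SQL and the URL. The two Pipe names are hypothetical, the host depends on your region, and the column names (`pipe_name`, `read_bytes`, `duration`, `start_datetime`) are written from memory, so verify them against the Service Data Sources docs before relying on them.

```python
from urllib.parse import urlencode

TINYBIRD_HOST = "https://api.tinybird.co"  # assumption: depends on your region


def stats_query(pipe_names: list[str]) -> str:
    """SQL comparing average read bytes and duration per Pipe over the last hour."""
    names = ", ".join(f"'{name}'" for name in pipe_names)
    return (
        "SELECT pipe_name, count() AS requests, "
        "avg(read_bytes) AS avg_read_bytes, avg(duration) AS avg_duration "
        "FROM tinybird.pipe_stats_rt "
        f"WHERE pipe_name IN ({names}) "
        "AND start_datetime > now() - INTERVAL 1 HOUR "
        "GROUP BY pipe_name FORMAT JSON"
    )


# Hypothetical Pipe names: one decodes at query time, the other reads the MV.
sql = stats_query(["ranking_decode_at_query_time", "ranking_from_mv"])
# The request also needs a token, e.g. in an Authorization: Bearer header.
url = f"{TINYBIRD_HOST}/v0/sql?{urlencode({'q': sql})}"
print(sql)
```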

### Create a MV to aggregate by time (hour, day…)

AggregatingMergeTrees 101. Check [this guide](https://www.tinybird.co/docs/guides/master-materialized-views.html#doing-aggregations-the-right-way-with-materialized-views) to learn about State and Merge modifiers.

Note that if you create the MV from the UI, Tinybird will add the `State` modifier for you, but you will still need to use `Merge` and group by at query time.

Create a MV that aggregates the sales, views, or carts per product and hour/day (tip: `toStartOfHour()` and `toDate()` are your allies here).

Compare the same queries against the raw data and against the aggregated MV.
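
Why do we still need `Merge` and a group by at query time? The sketch below models the idea in plain Python: each ingestion batch writes its own partial state per key, so the same (product, hour) pair can show up in several stored rows until they are combined. The events and function names are hypothetical; this imitates the behaviour conceptually and is not how ClickHouse stores states internally.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sale events, arriving in two separate ingestion batches.
batch_1 = [("p1", datetime(2023, 6, 26, 10, 5)), ("p1", datetime(2023, 6, 26, 10, 40))]
batch_2 = [("p1", datetime(2023, 6, 26, 10, 55)), ("p2", datetime(2023, 6, 26, 11, 1))]


def to_start_of_hour(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def count_state(batch):
    """Like countState(): one partial count per (product, hour) in this batch."""
    state = defaultdict(int)
    for product_id, ts in batch:
        state[(product_id, to_start_of_hour(ts))] += 1
    return dict(state)


# The MV stores one set of partial states per batch; rows are not guaranteed
# to be collapsed yet, so the same key can appear more than once.
stored_rows = list(count_state(batch_1).items()) + list(count_state(batch_2).items())


def count_merge(rows):
    """Like countMerge() with GROUP BY: combine the partial states per key."""
    merged = defaultdict(int)
    for key, partial in rows:
        merged[key] += partial
    return dict(merged)


result = count_merge(stored_rows)
print(result[("p1", datetime(2023, 6, 26, 10))])
```

Reading `stored_rows` directly would report the 10:00 hour for `p1` twice, which is the mistake the `Merge` plus group by step prevents.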

## Part 3: Data as Code with data projects and CLI

### Download the CLI and check the Data project

You have already seen in the docs some resources (Data Sources and Pipes) in text format. Let's download the [CLI](https://www.tinybird.co/docs/cli) and start working with it.

```bash
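tb auth               # authenticate with your Workspace token

tb init               # scaffold the data project folders

tb workspace current  # confirm which Workspace you are connected to

tb pull --auto        # download the resources into their default folders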
```

Edit a Pipe that ends in an endpoint and send it back to the Workspace with `tb push`.

### Push some resources to feed a dashboard

Go to the branch called _chart-branch_ and copy its content (/pipes, /datasources, and /endpoints) into your data project.

```bash
git checkout chart-branch
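# The trailing "/." copies folder contents, avoiding a nested ./pipes/pipes
mkdir -p pipes datasources endpoints
cp -r ./data-project/pipes/. ./pipes/
cp -r ./data-project/datasources/. ./datasources/
cp -r ./data-project/endpoints/. ./endpoints/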
```

Push the resources.

```bash
tb push pipes/events_by_*.pipe --push-deps --populate
tb push endpoints/api_*.pipe
```

Get your _dashboard_ token, go to this [webpage](https://ecommerce-svelte-tremor-dashboard.vercel.app/), and paste it in the Token input. You can start playing with the filters, hours...

Note: we are assuming that the GCS Data Source is called `prods`, and some types may not match. To see the demo fully working you can check this [repo](https://github.com/tinybirdco/ecommerce-svelte).

## Extra: what we left outside the workshop

- [Apigee](https://www.tinybird.co/docs/publish/api-gateways.html#google-cloud-apigee)
- [Kafka connector](https://www.tinybird.co/docs/ingest/kafka)
- [Snowflake connector](https://www.tinybird.co/docs/ingest/snowflake), very similar to BQ.
- [Tokens](https://www.tinybird.co/docs/concepts/auth-tokens)
- [Multitenancy](https://www.tinybird.co/blog-posts/multi-tenant-saas-options), [sharing data sources between workspaces](https://www.tinybird.co/blog-posts/new-feature-sharing-data-sources-across-workspaces).
- [Copy Pipes](https://www.tinybird.co/docs/publish/copy-pipes.html)
- [Time Series](https://www.tinybird.co/blog-posts/announcing-time-series)
- Visualizing in [Grafana](https://www.tinybird.co/docs/guides/consume-api-endpoints-in-grafana) or sending data to [Datadog](https://www.tinybird.co/blog-posts/how-to-monitor-tinybird-using-datadog-with-vector-dev) using vector.dev
