# preprocessd

Simple example showing how to use Cloud Run to pre-process events before persisting them to a backing store (e.g. BigQuery). This is a common use case where raw data (e.g. submitted through a REST API) needs to be pre-processed (e.g. decorated with additional attributes, classified, or simply validated) before saving.

Cloud Run is a great platform for building these kinds of ingestion or pre-processing services:

* Write each pre-processing step in the most appropriate (or your favorite) development language
* Bring your own runtime (or even a specific version of that runtime) along with custom libraries
* Dynamically scale up and down with your PubSub event load
* Scale to zero, and pay nothing, when there is nothing to process
* Use granular access control with service accounts and IAM policy bindings

## Event Source

In this example we will use synthetic events published to a PubSub topic by the [pubsub-event-maker](https://github.com/mchmarny/pubsub-event-maker) utility. We will use it to mock `utilization` data from `3` devices and publish it to Cloud PubSub on the `eventmaker` topic in your project. The PubSub payload looks something like this:

```json
{
  "source_id": "device-1",
  "event_id": "eid-b6569857-232c-4e6f-bd51-cda4e81f3e1f",
  "event_ts": "2019-06-05T11:39:50.403778Z",
  "label": "utilization",
  "mem_used": 34.47265625,
  "cpu_used": 6.5,
  "load_1": 1.55,
  "load_5": 2.25,
  "load_15": 2.49,
  "random_metric": 94.05090880450125
}
```

The instructions on how to configure `pubsub-event-maker` to start sending these events are [here](https://github.com/mchmarny/pubsub-event-maker).
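If the `eventmaker` topic does not already exist in your project (the linked instructions cover this as well), you can create it up front. A minimal sketch, assuming `gcloud` is already configured for your project:

```shell
# Create the PubSub topic the event maker publishes to
gcloud pubsub topics create eventmaker
```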

## Prerequisites

### GCP Project and gcloud SDK

If you don't have one already, start by creating a new project and configuring the [Google Cloud SDK](https://cloud.google.com/sdk/docs/). Similarly, if you have not done so already, you will have to [set up Cloud Run](https://cloud.google.com/run/docs/setup).
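A minimal sketch of that configuration, assuming `$PROJECT_ID` is your GCP project ID (the exact set of APIs to enable is covered in the linked setup guides):

```shell
# Point the gcloud SDK at your project
gcloud config set project $PROJECT_ID

# Enable the services used in this example
gcloud services enable \
  run.googleapis.com \
  cloudbuild.googleapis.com \
  pubsub.googleapis.com
```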

## Setup

### Build Container Image

Cloud Run runs container images. To build one we are going to use the included [Dockerfile](./Dockerfile) and submit the build job to Cloud Build using the [bin/image](./bin/image) script.

> Note: you should review each of the provided scripts for the complete content of these commands

```shell
bin/image
```

> If this is the first time you are using the build service, you may be prompted to enable the Cloud Build API
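The script wraps something along these lines (a sketch; the actual image name and project variables come from the script itself):

```shell
# Submit the Dockerfile in the current directory to Cloud Build
# and tag the resulting image (image name is illustrative)
gcloud builds submit --tag gcr.io/$PROJECT_ID/preprocessd .
```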

### Service Account and IAM Policies

In this example we are going to follow the [principle of least privilege](https://searchsecurity.techtarget.com/definition/principle-of-least-privilege-POLP) (POLP) to ensure our Cloud Run service has only the necessary rights and nothing more:

* `run.invoker` - required to execute Cloud Run service
* `pubsub.editor` - required to create and publish to Cloud PubSub
* `logging.logWriter` - required for Stackdriver logging
* `cloudtrace.agent` - required for Stackdriver tracing
* `monitoring.metricWriter` - required to write custom metrics to Stackdriver

To do that we will create a GCP service account and assign the necessary IAM policies and roles using the [bin/account](./bin/account) script:

```shell
bin/account
```

### Cloud Run Service

Once you have configured the service account, you can deploy a new Cloud Run service, set it to run under that account, and prevent unauthenticated access using the [bin/service](./bin/service) script:

```shell
bin/service
```

## PubSub Subscription

To enable PubSub to send topic data to the Cloud Run service, we will need to create a PubSub subscription and configure it to "push" events to the Cloud Run service we deployed above.

```shell
bin/pubsub
```

## Log

You can see the raw data and all of the application log entries made by the service in the Cloud Run service logs.

*(screenshot: Cloud Run log)*
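You can also read the same entries from the CLI; a sketch using Cloud Logging (the service name matches the illustrative one assumed in the deploy step):

```shell
# Read recent log entries for the Cloud Run service
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="preprocessd"' \
  --limit 20
```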

## Saving Results

The process of saving the resulting data from this service will depend on your target (the place where you want to save the data). GCP has a number of existing connectors and templates so, in most cases, you do not even have to write any code. Here is an example of a Dataflow template that streams PubSub topic data to BigQuery:

```shell
gcloud dataflow jobs run JOB_NAME \
--gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
--parameters \
inputTopic=projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME,\
outputTableSpec=YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME
```

This approach automatically deals with back-pressure, retries, and monitoring, and is not subject to the batch insert quota limits.
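If the BigQuery target referenced by `outputTableSpec` does not exist yet, you can create it with the `bq` CLI. A sketch, where the dataset, table, and schema are illustrative and would need to match your payload:

```shell
# Create a dataset and a table whose schema mirrors the event payload
# (names and fields are an illustrative subset)
bq mk --dataset YOUR_PROJECT_ID:YOUR_DATASET
bq mk --table YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME \
  source_id:STRING,event_ts:TIMESTAMP,label:STRING,cpu_used:FLOAT,mem_used:FLOAT
```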

## Cleanup

To clean up all resources created by this sample, execute the [bin/cleanup](bin/cleanup) script.

```shell
bin/cleanup
```

## Disclaimer

This is my personal project and it does not represent my employer. I take no responsibility for issues caused by this code. I do my best to ensure that everything works, but if something goes wrong, my apologies are all you will get.