Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pacuna/snowplow-pipeline
End-to-end Snowplow Analytics Pipeline for real time events
- Host: GitHub
- URL: https://github.com/pacuna/snowplow-pipeline
- Owner: pacuna
- Created: 2018-09-01T14:59:16.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-07-18T04:42:08.000Z (over 1 year ago)
- Last Synced: 2024-11-01T12:42:19.559Z (about 2 months ago)
- Topics: analytics, big-data, bigquery, docker, docker-compose, kafka, kubernetes, production, real-time, snowplow, snowplowanalytics, streaming
- Language: Python
- Homepage:
- Size: 24.4 KB
- Stars: 28
- Watchers: 2
- Forks: 16
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# Snowplow Analytics Pipeline
This is an example of an end-to-end Snowplow pipeline to track events using a Kafka broker.
The pipeline works in the following way:
1. A request is sent to the Scala Collector
2. The raw event (Thrift event) is put into the `snowplow_raw_good` (or bad) topic
3. The enricher grabs the raw events, parses them, and puts them into the `snowplow_parsed_good` (or bad) topic
4. A custom events processor grabs the parsed event, which is in a tab-delimited/JSON hybrid format, and turns it into a proper JSON event using the Snowplow Python Analytics SDK. The event is then put into a final topic called `snowplow_json_event` (a sketch of this step follows the list).
5. (WIP) A custom script grabs the final JSON events and loads them into a storage solution (such as BigQuery or Redshift).
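Step 4 is the only custom component in the stream, so here is a minimal sketch of what such a processor can look like, assuming the kafka-python client and the Snowplow Python Analytics SDK. The topic names match the ones above, but the broker address and client choice are assumptions, and this is not the code from this repo's `app.py`.
```python
# Minimal sketch of the step-4 events processor (illustrative only, not this
# repo's app.py). The broker address and the kafka-python client are assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer
import snowplow_analytics_sdk.event_transformer as event_transformer

consumer = KafkaConsumer(
    "snowplow_parsed_good",
    bootstrap_servers="localhost:9092",               # assumed broker address
    value_deserializer=lambda v: v.decode("utf-8"),   # enriched events arrive as TSV strings
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    # transform() converts the tab-delimited enriched event into a JSON-friendly dict
    event = event_transformer.transform(message.value)
    producer.send("snowplow_json_event", event)
```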
## Run the pipeline
Execute:
```sh
docker-compose up -d
```
After that, make sure all the containers are up with `docker-compose ps`; if some are not, try running the command again. This command creates all the pipeline components as well as a simple web application that sends pageviews and other events to the collector.
Check out the `docker-compose.yml` file for more details.
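If you want to sanity-check the collector without the bundled webapp, you can send a pageview hit straight to its pixel endpoint with a few lines of Python. This is only a rough smoke test; the host and port are assumptions and should be matched to the collector service in `docker-compose.yml`.
```python
# Hypothetical smoke test: send a pageview directly to the Scala collector's
# pixel endpoint. Host and port are assumptions; check docker-compose.yml.
import requests

resp = requests.get(
    "http://localhost:8080/i",            # collector pixel endpoint (assumed port)
    params={
        "e": "pv",                        # "pv" = pageview in the Snowplow tracker protocol
        "url": "http://example.com/",
        "page": "Example page",
    },
)
print(resp.status_code)                   # 200 means the collector accepted the hit
```
A raw event should then show up in `snowplow_raw_good`, and eventually a JSON event in `snowplow_json_event`.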
## Run in production (Kubernetes)
The configuration files use the Kafka endpoints provided by [kubernetes-kafka](https://github.com/Yolean/kubernetes-kafka). You can configure your own brokers in the collector and enricher configuration files (ConfigMaps).
Assuming you have configured Kafka for all the components:
1. Deploy the collector: `kubectl apply -f ./k8s/collector`. This will create the configuration, the deployment, and a service that exposes the collector's endpoint through a load balancer.
2. Deploy the enricher: `kubectl apply -f ./k8s/stream-enrich`. This will create the configurations and deployment.
3. Build a Docker image for the events processor, push it to a registry, and reference it in `k8s/events-processor/deploy.yml`. Check out the files in `k8s/events-processor` before building the image and change the broker configuration in `app.py` (see the broker-configuration sketch after this list).
4. Deploy the events processor application using the previous `k8s/events-processor/deploy.yml` file.
5. Run the webapp example locally and change the collector's address to the load balancer IP created for the collector. The collector listens on port 80, so remove the port and leave just the IP address.
6. Once everything is running, open the webapp and refresh the page a couple of times. Check the logs of the events processor pod to confirm that the JSON events are being created correctly.
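As a hedged illustration of step 3 (not how this repo's `app.py` is necessarily written), the broker configuration can be read from an environment variable so that the Deployment decides which Kafka to use. `KAFKA_BROKERS` is an assumed variable name, and `bootstrap.kafka:9092` is the bootstrap service conventionally exposed by kubernetes-kafka.
```python
# Hypothetical broker configuration for the events processor. KAFKA_BROKERS is
# an assumed environment variable; the default points at the bootstrap service
# usually created by kubernetes-kafka.
import os

from kafka import KafkaConsumer

brokers = os.environ.get("KAFKA_BROKERS", "bootstrap.kafka:9092").split(",")

consumer = KafkaConsumer(
    "snowplow_parsed_good",
    bootstrap_servers=brokers,
)
```
The matching Deployment in `k8s/events-processor/deploy.yml` would then set `KAFKA_BROKERS` in its `env` section, letting the same image run unchanged against the docker-compose Kafka locally and kubernetes-kafka in the cluster.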