{"id":17429832,"url":"https://github.com/pacuna/snowplow-pipeline","last_synced_at":"2025-04-15T22:30:44.620Z","repository":{"id":73891813,"uuid":"147007692","full_name":"pacuna/snowplow-pipeline","owner":"pacuna","description":"End-to-end Snowplow Analytics Pipeline for real time events","archived":false,"fork":false,"pushed_at":"2023-07-18T04:42:08.000Z","size":25,"stargazers_count":29,"open_issues_count":1,"forks_count":16,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-29T03:22:36.718Z","etag":null,"topics":["analytics","big-data","bigquery","docker","docker-compose","kafka","kubernetes","production","real-time","snowplow","snowplowanalytics","streaming"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pacuna.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-01T14:59:16.000Z","updated_at":"2025-03-11T22:11:54.000Z","dependencies_parsed_at":null,"dependency_job_id":"2685ec57-b807-4587-871a-14df215e3c40","html_url":"https://github.com/pacuna/snowplow-pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pacuna%2Fsnowplow-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pacuna%2Fsnowplow-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pacuna%2Fsnowplow-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
pacuna%2Fsnowplow-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pacuna","download_url":"https://codeload.github.com/pacuna/snowplow-pipeline/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249165874,"owners_count":21223341,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","big-data","bigquery","docker","docker-compose","kafka","kubernetes","production","real-time","snowplow","snowplowanalytics","streaming"],"created_at":"2024-10-17T07:10:01.149Z","updated_at":"2025-04-15T22:30:44.225Z","avatar_url":"https://github.com/pacuna.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Snowplow Analytics Pipeline\n\nThis is an example of an end-to-end Snowplow pipeline to track events using a Kafka broker.\n\nThe pipeline works in the following way:\n\n1. A request is sent to the Scala Collector\n2. The raw event (a Thrift event) is put into the `snowplow_raw_good` (or bad) topic\n3. The enricher grabs the raw events, parses them and puts them into the `snowplow_parsed_good` (or bad) topic\n4. A custom events processor grabs the parsed event, which is in a tab-delimited/JSON hybrid format, and turns it into a proper\nJSON event using the Python Analytics SDK from Snowplow. This event is then put into a final topic called `snowplow_json_event`.\n5. 
(WIP) A custom script grabs the final JSON events and loads them into some storage solution (such as BigQuery or Redshift)\n\n\n## Run the pipeline\n\nExecute:\n\n```sh\ndocker-compose up -d\n```\nAfter that, make sure all the containers are up with `docker-compose ps`. If not, try running the command again.\n\nThis command will create all the components and also a simple web application that sends pageviews and other events to the collector.\nPlease check out the `docker-compose.yml` file for more details.\n\n## Run in production (Kubernetes)\n\nThe configuration files use the Kafka endpoints provided by [kubernetes-kafka](https://github.com/Yolean/kubernetes-kafka). You can configure your own brokers in the collector and enrich configuration files (configmaps).\n\nAssuming you have configured Kafka for all the components:\n\n1. Deploy the collector: `kubectl apply -f ./k8s/collector`. This will create the configuration, deployment and a service that uses a Load Balancer to access the collector's endpoint. \n2. Deploy the enricher: `kubectl apply -f ./k8s/stream-enrich`. This will create the configurations and deployment.\n3. Build a Docker image for the events processor, upload it to some registry and add it to `k8s/events-processor/deploy.yml`. Check out the files in `k8s/events-processor` before building the image and change the broker configuration in `app.py`.\n4. Deploy the events processor application using the previous `k8s/events-processor/deploy.yml` file.\n5. Run the webapp example locally and change the collector's address to the load balancer IP address created for the collector. The collector uses port 80, so remove the port and just leave the IP address.\n6. Once everything is running, open the webapp and refresh the page a couple of times. 
Check out the logs of the events processor pod to see if the JSON events are created correctly.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpacuna%2Fsnowplow-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpacuna%2Fsnowplow-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpacuna%2Fsnowplow-pipeline/lists"}