{"id":24175835,"url":"https://github.com/iht/splittable-dofns-python","last_synced_at":"2025-08-11T05:17:38.932Z","repository":{"id":47304513,"uuid":"514177198","full_name":"iht/splittable-dofns-python","owner":"iht","description":"This repository contains the code samples used for the workshop \"Splittable DoFns in Python\" of the Beam Summit 2022","archived":false,"fork":false,"pushed_at":"2022-07-20T17:27:04.000Z","size":1475,"stargazers_count":3,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-03T14:11:10.107Z","etag":null,"topics":["beam","python"],"latest_commit_sha":null,"homepage":"https://2022.beamsummit.org/sessions/splittable-dofns-in-python/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iht.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-15T07:32:39.000Z","updated_at":"2023-04-28T11:14:00.000Z","dependencies_parsed_at":"2022-09-06T14:00:59.565Z","dependency_job_id":null,"html_url":"https://github.com/iht/splittable-dofns-python","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/iht/splittable-dofns-python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fsplittable-dofns-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fsplittable-dofns-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fsplittable-dofns-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fsplittable-dofns-python/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iht","download_url":"https://codeload.github.com/iht/splittable-dofns-python/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fsplittable-dofns-python/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269833310,"owners_count":24482423,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-11T02:00:10.019Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beam","python"],"created_at":"2025-01-13T02:33:13.759Z","updated_at":"2025-08-11T05:17:38.851Z","avatar_url":"https://github.com/iht.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Splittable DoFns in Python\n\nThis repository contains the code samples used for the workshop \"Splittable \nDoFns in Python\" of the Beam Summit 2022:\n* https://2022.beamsummit.org/sessions/splittable-dofns-in-python/\n\nThere are two branches in this repo:\n* `main`: template to follow the workshop; write your code here.\n* `solution`: full solutions provided in this branch. 
Check only after you have tried to write your own code.\n\nThe slides used during the workshop are available [here](docs/slides.pdf).\n\n# Dependencies\n\nCheck the `requirements.txt` file and install those dependencies before \ntrying to run the examples in this repo.\n\nYou will need Python 3.7, 3.8 or 3.9. Other versions of Python will not work.\n\nIf you want to use Kafka as well as the synthetic pipelines, you will need \nto install minikube, or alternatively, provide a Kafka server of your own. \nInstructions for installing minikube and Kafka are given below.\n\n# Synthetic pipelines\n\nThere are two pipelines in this repo using synthetic data: one for a batch \nexample, and another one for a streaming example.\n\n## Batch pipeline\n\nTo launch the batch pipeline, simply run \n\n`python my_batch_pipeline.py`\n\nThe pipeline generates some pseudo-files, and reads the files in chunks \nusing a splittable DoFn. The code of the `DoFn` is in \n`mydofns/synthetic_sdfn_batch.py`.\n\n**You need to write your solution for that splittable DoFn in that file**.\n\n## Streaming pipeline\n\nTo launch the streaming pipeline, simply run \n\n`python my_streaming_synth_pipeline.py`\n\nIn the file `mydofns/synthetic_sdfn_streaming.py`, in line 62, you can set \nthe number of partitions for this streaming synthetic connector. By default, it is `NUM_PARTITIONS = 4`.\n\n**You need to write your solution for that splittable DoFn in that file**.\n\n# Pipeline using Kafka\n\nBefore you can use the pipeline with Kafka, you will need a Kafka server. 
In \nthe next section you will find instructions for running Kafka locally with minikube.\n\n## Install Kafka\n\nIf you want to test your code against an actual Kafka server, follow the \nnext steps to install Kafka in a local minikube cluster.\n\n* Install minikube: https://minikube.sigs.k8s.io/docs/start/\n* Make sure that you have an alias for `kubectl`:\n  - `alias k=kubectl`\n* Create a namespace for Kafka: \n  - `k create namespace kafka`\n* Install the Kafka operator:\n  - `k create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka`\n* Install a single ephemeral cluster:\n  - `k apply -f manifests/kafka-cluster.yaml -n kafka`\n* Find out the port where Kafka is listening, and take note of it:\n  - `k get service my-cluster-kafka-external-bootstrap -o=jsonpath='{.spec.ports[0].nodePort}{\"\\n\"}' -n kafka`\n* Find out the local IP where Kafka is listening, and take note of it:\n  - `k get node minikube -o=jsonpath='{range .status.addresses[*]}{.type}{\"\\t\"}{.address}{\"\\n\"}'`\n\nFor your Kafka clients configuration, the bootstrap server will be `IP:PORT`.\n\n## Topic creation and population\n\nTo test your pipeline against Kafka, you will need to write some data to \nKafka. 
For that, the first step is to create a topic.\n\nThere is a Python script that can help you with the creation of the topic \nand the population with data.\n\nFind out your bootstrap server details, and create an environment variable:\n\n`export BOOTSTRAP=192.168.64.3:31457`\n\n(in this example the IP is `192.168.64.3` and the port is `31457`; your details \nwill be different, so use the IP of your minikube cluster and the \nport of your Kafka service, see above for more details)\n\nTo create the topic, run\n\n`./kafka_single_client.py --bootstrap $BOOTSTRAP --create`\n\nAnd to populate it with data\n\n`./kafka_single_client.py --bootstrap $BOOTSTRAP`\n\nIf you want to check that the topic is working correctly, you can run a \nconsumer and check if there is data:\n\n`./kafka_single_client.py --consumer --bootstrap $BOOTSTRAP`\n\n## Pipeline using Kafka\n\nTo run the pipeline, use the command below. The number of partitions is 4 by default. If you change the\nnumber of partitions in `kafka_single_client.py`, make sure you use the same value in the Kafka `DoFn` too.\n\n`python my_streaming_kafka_pipeline.py --bootstrap $BOOTSTRAP`\n\nThe code of the `DoFn` functions is located in \n`mydofns/kafka_sdfn_streaming.py`.\n\n**You need to write your solution for that splittable DoFn in that file**.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiht%2Fsplittable-dofns-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiht%2Fsplittable-dofns-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiht%2Fsplittable-dofns-python/lists"}