{"id":24175838,"url":"https://github.com/iht/python-profiling-beam-summit-2021","last_synced_at":"2025-09-20T20:31:19.940Z","repository":{"id":74375658,"uuid":"392600478","full_name":"iht/python-profiling-beam-summit-2021","owner":"iht","description":"This repository contains a streaming Dataflow pipeline written in Python with Apache Beam, reading data from PubSub.","archived":false,"fork":false,"pushed_at":"2021-08-16T21:04:37.000Z","size":986,"stargazers_count":9,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-03T14:11:54.367Z","etag":null,"topics":["apache","beam","profiling","python"],"latest_commit_sha":null,"homepage":"https://2021.beamsummit.org/sessions/profiling-python-pipelines/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iht.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-08-04T07:56:26.000Z","updated_at":"2022-03-18T16:39:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"8ee1db0d-9018-4376-b2fd-02dfd46e904f","html_url":"https://github.com/iht/python-profiling-beam-summit-2021","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/iht/python-profiling-beam-summit-2021","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fpython-profiling-beam-summit-2021","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fpython-profiling-beam-summit-2021/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/iht%2Fpython-profiling-beam-summit-2021/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fpython-profiling-beam-summit-2021/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iht","download_url":"https://codeload.github.com/iht/python-profiling-beam-summit-2021/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fpython-profiling-beam-summit-2021/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":276153065,"owners_count":25594323,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-20T02:00:10.207Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","beam","profiling","python"],"created_at":"2025-01-13T02:33:17.564Z","updated_at":"2025-09-20T20:31:19.926Z","avatar_url":"https://github.com/iht.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sample streaming Dataflow pipeline written in Python\n\nThis repository contains a streaming Dataflow pipeline written in Python with\nApache Beam, reading data from PubSub.\n\nFor more details, see the following Beam Summit 2021 talk:\n* https://2021.beamsummit.org/sessions/profiling-python-pipelines/\n* [Slides in PDF](docs/beam_summit_2021_profiling_python_pipelines.pdf)\n\nTo run this pipeline, you 
need to have the Google Cloud SDK installed, and a project in \nGoogle Cloud Platform, even if you run the pipeline locally with the direct \nrunner:\n* https://cloud.google.com/sdk/docs/quickstart\n* https://cloud.google.com/free\n\n## Description of the pipeline\n\n### Data input\n\nHere we use a public PubSub topic with data, so we don't need to set up\nour own to run this pipeline.\n\nThe topic is `projects/pubsub-public-data/topics/taxirides-realtime`.\n\nThat topic contains messages from the NYC Taxi Ride dataset. Here is a sample\nof the data contained in a message in that topic:\n\n```json\n{\n  \"ride_id\": \"328bec4b-0126-42d4-9381-cb1dbf0e2432\",\n  \"point_idx\": 305,\n  \"latitude\": 40.776270000000004,\n  \"longitude\": -73.99111,\n  \"timestamp\": \"2020-03-27T21:32:51.48098-04:00\",\n  \"meter_reading\": 9.403651,\n  \"meter_increment\": 0.030831642,\n  \"ride_status\": \"enroute\",\n  \"passenger_count\": 1\n}\n```\n\nThe messages also contain metadata that is useful for streaming pipelines.\nIn this case, the messages contain an attribute named `ts`, which contains\nthe same timestamp as the `timestamp` field in the data. Remember that\nPubSub treats the data as just a string of bytes, so it does not *know*\nanything about the data itself. 
The metadata fields are normally used to publish\nmessages with specific ids and/or timestamps.\n\nTo inspect the messages from this topic, you can create a subscription, and then\npull some messages.\n\nTo create a subscription, use the `gcloud` CLI utility (installed by default in\nthe Cloud Shell):\n\n```\nexport TOPIC=projects/pubsub-public-data/topics/taxirides-realtime\ngcloud pubsub subscriptions create taxis --topic $TOPIC\n```\n\nTo pull messages:\n\n```gcloud pubsub subscriptions pull taxis --limit 3```\n\nor, if you have [jq](https://stedolan.github.io/jq/) (for pretty printing of \nJSON):\n\n```gcloud pubsub subscriptions pull taxis --limit 3 | grep \" {\" | cut -f 2 -d ' ' | jq```\n\nPay special attention to the Attributes column (metadata). You will see that\nthe timestamp is included as a field in the metadata, as well as in the\ndata. We will leverage that metadata field for the timestamps used in\nour streaming pipeline.\n\n### Data output\n\nThis pipeline writes the output to BigQuery, in streaming append-only mode.\n\nThe destination tables must exist prior to running the pipeline.\n\nIf you have the `gcloud` CLI utility installed (for instance, it is installed\nby default in the Cloud Shell), you can create the tables from the command line.\n\nYou need to create a BigQuery dataset too, in the same region:\n* https://cloud.google.com/bigquery/docs/datasets\n\nAfter that, you can create the destination tables with the provided script:\n\n`./scripts/create_tables.sh taxi_rides`\n\n## Algorithm / business rules\n\nWe are using a session window with a gap of 10 seconds. That means that all\nthe messages with the same `ride_id` will be grouped together, as long as\ntheir timestamps are within 10 seconds of each other. 
Any message with a\ntimestamp more than 10 seconds apart will be discarded (for old timestamps) or\nwill open a new window (for newer timestamps).\n\nWith the messages inside each window (that is, each different `ride_id` will be\npart of a different window), we will calculate the duration of the session, as\nthe difference between the min and max timestamps in the window. We will also\ncalculate the number of events in that session.\n\nWe will use a `GroupByKey` to operate on all the messages in a window. This\nwill load all the messages in the window into memory. This is fine, as in\nBeam streaming, a window is always processed in a single worker (windows cannot be\nsplit across different workers).\n\nThis is an example of the kind of logic that can be implemented leveraging\nwindows in streaming pipelines. This grouping of messages by `ride_id` and\nevent timestamps is automatically done by the pipeline, and we just need to\nexpress the generic operations to be performed with each window, as part of our\npipeline.\n\n## Running the pipeline\n\n### Prerequisites\n\nYou need to have a Google Cloud project, and the `gcloud` SDK configured to \nrun the pipeline. For instance, you could run it from the Cloud Shell in \nGoogle Cloud Platform (`gcloud` would be automatically configured).\n\nThen you need to create a Google Cloud Storage bucket, with the same name as \nyour project id, and in the same region where you will run Dataflow:\n* https://cloud.google.com/storage/docs/creating-buckets\n\nMake sure that you have a Python environment with Python 3 (\u003c3.9), for \ninstance a virtualenv, and install `apache-beam[gcp]` and `python-dateutil` \nin your local environment. 
For instance, assuming that you are running in a \nvirtualenv:\n\n`pip install \"apache-beam[gcp]\" python-dateutil`\n\n### Run the pipeline\n\nOnce the tables are created and the dependencies installed, edit \n`scripts/launch_dataflow_runner.sh` and set your project id and region, and \nthen run it with:\n\n`./scripts/launch_dataflow_runner.sh`\n\nThe outputs will be written to the BigQuery tables, and in the `profile` \ndirectory in your bucket you should see Python `gprof` files with profiling \ninformation.\n\n## CPU profiling\n\nBeam uses the Python profiler to produce files in Python `gprof` format. You \nwill need some scripting to interpret those files and extract insights from \nthem.\n\nIn this repository, you will find some sample output in `data/beam.prof` \nthat you can use to check what the profiling output looks like. Use the \nfollowing Colab notebook with an example analyzing that sample profiling data:\n* https://colab.research.google.com/drive/1fmefgXctJWxyVv0_CXsQ9Hyfep488yfN#scrollTo=XvBvFs-fEcbh\n\nRefer to this post for more details about how to interpret that file:\n* https://medium.com/google-cloud/profiling-apache-beam-python-pipelines-d3cac8644fa4\n\n\n## License\n\nCopyright 2021 Israel Herraiz\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the 
License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiht%2Fpython-profiling-beam-summit-2021","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiht%2Fpython-profiling-beam-summit-2021","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiht%2Fpython-profiling-beam-summit-2021/lists"}