{"id":15221696,"url":"https://github.com/googlecloudplatform/dataflow-metrics-exporter","last_synced_at":"2025-10-08T05:09:40.319Z","repository":{"id":207670267,"uuid":"706781392","full_name":"GoogleCloudPlatform/dataflow-metrics-exporter","owner":"GoogleCloudPlatform","description":"CLI tool to collect Dataflow resource \u0026 execution metrics and export them to either BigQuery or Google Cloud Storage. The tool is useful for comparing \u0026 visualizing the metrics while benchmarking Dataflow pipelines with various data formats, resource configurations, etc.","archived":false,"fork":false,"pushed_at":"2025-04-22T18:44:07.000Z","size":69,"stargazers_count":2,"open_issues_count":1,"forks_count":3,"subscribers_count":21,"default_branch":"main","last_synced_at":"2025-05-14T20:52:25.865Z","etag":null,"topics":["apache-beam","google-cloud-dataflow"],"latest_commit_sha":null,"homepage":"https://cloud.google.com/dataflow/docs/overview","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudPlatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-18T15:50:08.000Z","updated_at":"2024-12-19T03:30:13.000Z","dependencies_parsed_at":"2023-11-17T02:31:13.528Z","dependency_job_id":"6d4abcf8-6f60-4412-a369-9679cb4dc858","html_url":"https://github.com/GoogleCloudPlatform/dataflow-metrics-exporter","commit_stats":null,"previous_names":["googlecloudplatform/dataflow-metrics-exporter"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/GoogleCloudPlatform/dataflow-metrics-exporte
r","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdataflow-metrics-exporter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdataflow-metrics-exporter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdataflow-metrics-exporter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdataflow-metrics-exporter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudPlatform","download_url":"https://codeload.github.com/GoogleCloudPlatform/dataflow-metrics-exporter/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdataflow-metrics-exporter/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278891753,"owners_count":26063858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-08T02:00:06.501Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-beam","google-cloud-dataflow"],"created_at":"2024-09-28T15:06:49.874Z","updated_at":"2025-10-08T05:09:40.280Z","avatar_url":"https://github.com/GoogleCloudPlatform.png","language":"Java","readme":"## Overview\n\u003e Command line tool that fetches the Dataflow job metrics, estimates job cost 
and stores the results in the output store (i.e., BigQuery or a file system)\n\n## Commands\n\nThe tool supports multiple functionalities through the following command types:\n\n| Command Type         | Description                                           |\n|----------------------|:------------------------------------------------------|\n| COLLECT_METRICS      | For complete details, refer [here](#CollectMetrics)   |\n| LAUNCH_AND_COLLECT   | For complete details, refer [here](#LaunchCollect)    |\n\n## Output Stores\nUsers can choose an appropriate output store through the command line options output_type and output_location:\n\n| Output Type | Output Location                                              |\n|-------------|:-------------------------------------------------------------|\n| BIGQUERY    | project-id.dataset.table                                     |\n| FILE        | /path/to/location \u003cbr/\u003e(Supports GCS and Local File Systems) |\n\nTo store the metrics in BigQuery, create the table using the schema specified in\n```scripts/metrics_bigquery_schema.json```.\n\nSample commands for creating the dataset and table are provided in ```scripts/bigquery.sh```.\n\n## Getting Started\n\n### Requirements\n\n* Java 11\n* Maven 3\n* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the\n  following commands:\n    * `gcloud auth login`\n    * `gcloud auth application-default login`\n\n- Set the following environment variable:\n```bash\n$ export GOOGLE_APPLICATION_CREDENTIALS=\u003c/path/to/application_default_credentials.json\u003e\n```\n\n- To format the code, execute the command below from the project root directory:\n```bash\n$ mvn spotless:apply\n```\n\n- To build the fat jar, execute the command below from the project root directory; it generates\n  ```{project_dir}/target/dataflow-metrics-exporter-${version}.jar```\n\n```bash\n$ mvn clean package\n```\n\n### BIGQUERY OUTPUT STORE\nBigQuery can be used to store the metrics collected by 
the tool. To store the metrics, create the table\nin BigQuery using the schema specified in ```scripts/metrics_bigquery_schema.json```.\n\n### \u003ca name=\"CollectMetrics\"\u003eCOLLECT_METRICS\u003c/a\u003e\nCollects the metrics for a given job id and stores the results in the output store.\n\n### Creating the Configuration File\nThe configuration file provides the project_id, region and job_id, along with optional pricing details, for the job for\nwhich the metrics need to be collected.\n\n#### Example Configuration File\nBelow is an example configuration file. The pricing section is optional; if pricing is skipped, the estimated cost of the\njob will not be calculated.\n\n```json\n{\n  \"project\": \"test-project\",\n  \"region\": \"us-central1\",\n  \"jobId\": \"2023-09-05_12_40_15-5693029202890139182\",\n  \"pricing\": {\n    \"vcpuPerHour\": 0.056,\n    \"memoryGbPerHour\": 0.003557,\n    \"pdGbPerHour\": 0.000054,\n    \"ssdGbPerHour\": 10,\n    \"shuffleDataPerGb\": 0.011,\n    \"streamingDataPerGb\": 0.005\n  }\n}\n```\n\n### Execute the command\n```bash\n# To store results in BigQuery\n$ java -jar /path/to/dataflow-metrics-exporter-${version}.jar --command COLLECT_METRICS --conf /path/to/config.json \\\n--output_type BIGQUERY --output_location projectid:datasetid.tableid\n\n# To store results in GCS / local file system\n$ java -jar /path/to/dataflow-metrics-exporter-${version}.jar --command COLLECT_METRICS --conf /path/to/config.json \\\n--output_type FILE --output_location /path/to/output/location\n```\n\n### Sample Output\n```json\n{\n   \"run_timestamp\": \"2023-09-18T22:50:43.419455Z\",\n   \"pipeline_name\": \"word-count-benchmark-20230908224412968\",\n   \"job_create_timestamp\": \"2023-09-08T22:44:14.475764Z\",\n   \"job_type\": \"JOB_TYPE_BATCH\",\n   \"sdk_version\": \"2.49.0\",\n   \"sdk\": \"Apache Beam SDK for Java\",\n   \"metrics\": {\n      \"TotalDpuUsage\": 0.050666635590718774,\n      \"TotalSeCuUsage\": 0.0,\n      \"TotalStreamingDataProcessed\": 0.0,\n      \"TotalDcuUsage\": 
50666.63559071875,\n      \"BillableShuffleDataProcessed\": 1.603737473487854E-5,\n      \"TotalPdUsage\": 2279.0,\n      \"TotalGpuTime\": 0.0,\n      \"EstimatedJobCost\": 0.007,\n      \"TotalVcpuTime\": 182.0,\n      \"TotalShuffleDataProcessed\": 6.414949893951416E-5,\n      \"TotalMemoryUsage\": 747068.0,\n      \"TotalSsdUsage\": 0.0\n   }\n}\n```\n\n### \u003ca name=\"LaunchCollect\"\u003eLAUNCH_AND_COLLECT\u003c/a\u003e\nLaunches either a Dataflow [classic](https://cloud.google.com/dataflow/docs/guides/templates/creating-templates)\nor [flex](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates) template, waits for the job\nto finish or time out (in the case of streaming), collects the metrics, and stores the results in the output store.\n\n* To cancel streaming jobs, provide an appropriate timeout via the timeoutInMinutes property in the supplied config.\n  Refer to the config file below for an example.\n\n### Creating the Configuration File\nThe configuration file provides the project_id, template name, template type, template spec, and pipeline options.\n\n#### Example Configuration File\nBelow is an example configuration file with minimal runtime environment options.\n\n```json\n{\n  \"project\": \"test-project\",\n  \"region\": \"us-central1\",\n  \"templateName\": \"SampleWordCount\",\n  \"templateVersion\": \"1.0\",\n  \"templateType\": \"classic\",\n  \"templateSpec\": \"gs://dataflow-templates/latest/Word_Count\",\n  \"jobPrefix\": \"WordCountBenchmark\",\n  \"timeoutInMinutes\" : 30,\n  \"pipelineOptions\": {\n    \"inputFile\": \"gs://dataflow-samples/shakespeare/kinglear.txt\",\n    \"output\": \"gs://bucket-name/output/wordcount/\"\n  },\n  \"environmentOptions\": {\n    \"tempLocation\": \"gs://bucket-name/temp/\"\n  }\n}\n```\nBelow is another example configuration file with additional runtime environment options and pricing details.\n\n```json\n{\n  \"project\": \"test-project\",\n  \"region\": \"us-central1\",\n  \"templateName\": 
\"SampleWordCount\",\n  \"templateVersion\": \"1.0\",\n  \"templateType\": \"classic\",\n  \"templateSpec\": \"gs://dataflow-templates/latest/Word_Count\",\n  \"jobPrefix\": \"WordCountBenchmark\",\n  \"timeoutInMinutes\" : 30,\n  \"pipelineOptions\": {\n    \"inputFile\": \"gs://dataflow-samples/shakespeare/kinglear.txt\",\n    \"output\": \"gs://bucket-name/output/wordcount/\"\n  },\n  \"environmentOptions\": {\n    \"numWorkers\" : 2,\n    \"maxWorkers\" : 10,\n    \"service_account_email\" : \"my-svc-acct@project-id.iam.gserviceaccount.com\",\n    \"zone\" : \"us-central1-b\",\n    \"autoscalingAlgorithm\" : \"AUTOSCALING_ALGORITHM_BASIC\",\n    \"subnetwork\": \"https://www.googleapis.com/compute/v1/projects/test-project/regions/us-central1/subnetworks/default\",\n    \"tempLocation\": \"gs://bucket-name/temp/\",\n    \"ipConfiguration\": \"WORKER_IP_PRIVATE\",\n    \"additionalExperiments\": [\"enable_prime\"],\n    \"additionalUserLabels\": {\n      \"topology\":\"pointtopoint\",\n      \"dataformats\":\"text\",\n      \"datasizecategory\":\"medium\",\n      \"datasizeingb\":\"100\"\n    }\n  },\n  \"pricing\": {\n    \"vcpuPerHour\": 0.056,\n    \"memoryGbPerHour\": 0.003557,\n    \"pdGbPerHour\": 0.000054,\n    \"ssdGbPerHour\": 10,\n    \"shuffleDataPerGb\": 0.011,\n    \"streamingDataPerGb\": 0.005\n  }\n}\n```\nFor the complete list of runtime environment options for the various template types, check the commands below:\n```bash\n# For classic templates\n$ gcloud dataflow jobs run --help\n\n# For flex templates\n$ gcloud dataflow flex-template run --help\n```\n\n### Execute the command\n```bash\n# To store results in BigQuery\n$ java -jar /path/to/dataflow-metrics-exporter-${version}.jar --command LAUNCH_AND_COLLECT --conf /path/to/config.json \\\n--output_type BIGQUERY --output_location projectid:datasetid.tableid\n\n# To store results in GCS / local file system\n$ java -jar /path/to/dataflow-metrics-exporter-${version}.jar --command LAUNCH_AND_COLLECT --conf 
/path/to/config.json \\\n--output_type FILE --output_location /path/to/output/location\n```\n\n### Sample Output\n```json\n{\n  \"run_timestamp\": \"2023-09-18T22:50:43.419455Z\",\n  \"pipeline_name\": \"word-count-benchmark-20230908224412968\",\n  \"job_create_timestamp\": \"2023-09-08T22:44:14.475764Z\",\n  \"job_type\": \"JOB_TYPE_BATCH\",\n  \"template_name\": \"SampleWordCount\",\n  \"template_version\": \"1.0\",\n  \"sdk_version\": \"2.49.0\",\n  \"template_type\": \"classic\",\n  \"sdk\": \"Apache Beam SDK for Java\",\n  \"metrics\": {\n    \"TotalDpuUsage\": 0.050666635590718774,\n    \"TotalSeCuUsage\": 0.0,\n    \"TotalStreamingDataProcessed\": 0.0,\n    \"TotalDcuUsage\": 50666.63559071875,\n    \"BillableShuffleDataProcessed\": 1.603737473487854E-5,\n    \"TotalPdUsage\": 2279.0,\n    \"TotalGpuTime\": 0.0,\n    \"EstimatedJobCost\": 0.045,\n    \"TotalVcpuTime\": 182.0,\n    \"TotalShuffleDataProcessed\": 6.414949893951416E-5,\n    \"TotalMemoryUsage\": 747068.0,\n    \"TotalSsdUsage\": 0.0\n  },\n  \"parameters\": {\n    \"output\": \"gs://bucket-name/output/wordcount/\",\n    \"tempLocation\": \"gs://bucket-name/temp/\",\n    \"additionalExperiments\": \"[enable_prime, workerMachineType=n1-standard-2, minNumWorkers=2]\",\n    \"inputFile\": \"gs://dataflow-samples/shakespeare/kinglear.txt\",\n    \"maxWorkers\": \"10\",\n    \"additionalUserLabels\": \"{sources=gcs, sinks=bigtable, topology=pointtopoint, dataformats=text, datasizecategory=medium, datasizeingb=100}\",\n    \"subnetwork\": \"https://www.googleapis.com/compute/v1/projects/test-project/regions/us-central1/subnetworks/default\",\n    \"numWorkers\": \"2\"\n  }\n}\n```\n\n## Output Metrics\n\n| MetricName                   | Metric Description                                                                                      |\n|------------------------------|:--------------------------------------------------------------------------------------------------------|\n| TotalVcpuTime        
        | The total vCPU seconds used by the Dataflow job                                                         |\n| TotalGpuTime                 | The total proportion of time in which the GPU was used by the Dataflow job                              |\n| TotalMemoryUsage             | The total GB seconds of memory allocated to the Dataflow job                                            |\n| TotalPdUsage                 | The total GB seconds for all persistent disk used by all workers associated with the Dataflow job       |\n| TotalSsdUsage                | The total GB seconds for all SSD used by all workers associated with the Dataflow job                   |\n| TotalShuffleDataProcessed    | The total bytes of shuffle data processed by the Dataflow job                                           |\n| TotalStreamingDataProcessed  | The total bytes of streaming data processed by the Dataflow job                                         |\n| BillableShuffleDataProcessed | The billable bytes of shuffle data processed by the Dataflow job                                        |\n| TotalDcuUsage                | The total amount of DCUs (Data Compute Units) used by the Dataflow job since it was launched            |\n| TotalElapsedTimeSec          | Total duration, in seconds, for which the job is in the RUNNING state                                   |\n| EstimatedJobCost             | Estimated cost of the Dataflow job (if pricing info is provided). 
Not applicable for Prime-enabled jobs |\n\nFor the complete list of metrics, refer to [Dataflow Monitoring Metrics](https://cloud.google.com/monitoring/api/metrics_gcp#gcp-dataflow)\n\n## Cost Estimation\n\nJob cost can be estimated by providing the resource pricing information inside the config file, as shown below:\n\n```json\n{\n   \"project\": \"test-project\",\n   \"region\": \"us-central1\",\n   \"templateName\": \"SampleWordCount\",\n   \"pipelineOptions\": {},\n   \"environmentOptions\": {},\n   \"pricing\": {\n      \"vcpuPerHour\": 0.056,\n      \"memoryGbPerHour\": 0.003557,\n      \"pdGbPerHour\": 0.000054,\n      \"ssdGbPerHour\": 10,\n      \"shuffleDataPerGb\": 0.011,\n      \"streamingDataPerGb\": 0.005\n   }\n}\n```\nPricing for various resources by region can be obtained from the [Dataflow Pricing](https://cloud.google.com/dataflow/pricing) docs.\n\nNote:\n* The estimated cost is approximate and doesn't factor in any discounts.\n* For Prime-enabled jobs, the estimated cost feature is not supported and the value remains 0.0.\n\n## Contributing\n\nCheck [CONTRIBUTING.md](CONTRIBUTING.md) for details.\n\n## License\n\nApache 2.0; check [LICENSE](LICENSE) for details.\n\n## Disclaimer\n\nThis project is not an official Google project. It is not supported by\nGoogle, and Google disclaims all warranties as to its quality, merchantability, or\nfitness for a particular purpose.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fdataflow-metrics-exporter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgooglecloudplatform%2Fdataflow-metrics-exporter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fdataflow-metrics-exporter/lists"}