{"id":13906355,"url":"https://github.com/LucaCanali/sparkMeasure","last_synced_at":"2025-07-18T04:31:02.280Z","repository":{"id":41565986,"uuid":"85240663","full_name":"LucaCanali/sparkMeasure","owner":"LucaCanali","description":"This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.","archived":false,"fork":false,"pushed_at":"2024-08-13T08:25:43.000Z","size":1946,"stargazers_count":704,"open_issues_count":2,"forks_count":145,"subscribers_count":34,"default_branch":"master","last_synced_at":"2024-10-29T14:53:32.057Z","etag":null,"topics":["apache-spark","performance-metrics","performance-troubleshooting","python","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LucaCanali.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-16T20:53:26.000Z","updated_at":"2024-10-25T05:24:07.000Z","dependencies_parsed_at":"2023-01-22T04:02:54.241Z","dependency_job_id":"5a77725b-70a0-4a4b-82b9-7d29c8384e8f","html_url":"https://github.com/LucaCanali/sparkMeasure","commit_stats":{"total_commits":316,"total_committers":17,"mean_commits":18.58823529411765,"dds":0.09810126582278478,"last_synced_commit":"c1298114d18ee87369849aff72632ca4bad2a92f"},"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCanali%2FsparkMeasure","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCanali%2FsparkMeasure/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCanali%2FsparkMeasure/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCanali%2FsparkMeasure/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LucaCanali","download_url":"https://codeload.github.com/LucaCanali/sparkMeasure/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226344611,"owners_count":17610176,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","performance-metrics","performance-troubleshooting","python","scala","spark"],"created_at":"2024-08-06T23:01:34.054Z","updated_at":"2025-07-18T04:31:02.245Z","avatar_url":"https://github.com/LucaCanali.png","language":"Scala","readme":"# SparkMeasure - a performance tool for Apache Spark\n\n[![Test](https://github.com/LucaCanali/sparkmeasure/actions/workflows/build_with_scala_and_python_tests.yml/badge.svg)](https://github.com/LucaCanali/sparkmeasure/actions/workflows/build_with_scala_and_python_tests.yml)\n[![Maven 
Central](https://maven-badges.herokuapp.com/maven-central/ch.cern.sparkmeasure/spark-measure_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/ch.cern.sparkmeasure/spark-measure_2.12)\n[![DOI](https://zenodo.org/badge/85240663.svg)](https://doi.org/10.5281/zenodo.7891017)\n[![PyPI](https://img.shields.io/pypi/v/sparkmeasure.svg)](https://pypi.org/project/sparkmeasure/)\n[![PyPI - Downloads](https://img.shields.io/pypi/dm/sparkmeasure)](https://pypistats.org/packages/sparkmeasure)\n[![API Documentation](https://img.shields.io/badge/API-Documentation-brightgreen)](docs/Reference_SparkMeasure_API_and_Configs.md)\n\nSparkMeasure is a tool and a library designed to ease performance measurement and troubleshooting of \nApache Spark jobs. It focuses on easing the collection and analysis of Spark metrics, \nmaking it a practical choice for both developers and data engineers.\nWith sparkMeasure, users gain a deeper understanding of their Spark job performance,\nenabling faster and more reliable data processing workflows.\n\n### ✨ Highlights\n- **Interactive Troubleshooting:** Ideal for real-time analysis of Spark workloads in notebooks \nand spark-shell/pyspark environments.\n- **Development \u0026 CI/CD Integration:** Facilitates testing, measuring, and comparing execution metrics\n  of Spark jobs under various configurations or code changes.\n- **Batch Job Analysis:** With Flight Recorder mode sparkMeasure transparently records batch job metrics \n  for later analysis.\n- **Monitoring Capabilities:** Integrates with external systems like Apache Kafka, InfluxDB,\n  and Prometheus Push Gateway for extensive monitoring.\n- **Educational Tool:** Serves as a practical example of implementing Spark Listeners for the collection\n  of detailed Spark task metrics.\n- **Language Compatibility:** Fully supports Scala, Java, and Python, making it versatile for a wide range\n  of Spark applications.\n\n### 📚 Table of Contents\n- [Getting started with 
sparkMeasure](#getting-started-with-sparkmeasure)\n  - [Demo](#demo) \n  - [Examples of sparkMeasure on notebooks](#examples-of-sparkmeasure-on-notebooks)  \n  - [Examples of sparkMeasure on the CLI](#examples-of-sparkmeasure-on-the-cli)\n- [Setting up SparkMeasure with Spark](#setting-up-sparkmeasure-with-spark)\n  - [Version compatibility for SparkMeasure](#version-compatibility-for-sparkmeasure)\n  - [Downloading sparkMeasure](#downloading-sparkmeasure)\n  - [Setup examples](#setup-examples)\n- [Notes on Metrics](#notes-on-metrics)\n- [Documentation and API reference](#documentation-api-and-examples)\n- [Architecture diagram](#architecture-diagram)\n- [Concepts and FAQ](#main-concepts-underlying-sparkmeasure-implementation)\n\n### Links to related work on Spark Performance\n\n- **[Building an Apache Spark Performance Lab](https://db-blog.web.cern.ch/node/195)**  \n  Guide to setting up a Spark performance testing environment.\n- **[TPC-DS Benchmark with PySpark](https://github.com/LucaCanali/Miscellaneous/tree/master/Performance_Testing/TPCDS_PySpark)**  \n  Tool for running TPC-DS with PySpark, instrumented with `sparkMeasure`.\n- **[Spark Monitoring Dashboard](https://github.com/cerndb/spark-dashboard)**  \n  Custom monitoring solution with real-time dashboards for Spark.\n- **[Flamegraphs for Profiling Spark Jobs](https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Tools_Spark_Pyroscope_FlameGraph.md)**  \n  Guide to profiling Spark with Pyroscope and Flamegraphs.\n- **[Advanced Notes on Apache Spark](https://github.com/LucaCanali/Miscellaneous/tree/master/Spark_Notes)**  \n  Tips, configuration, and troubleshooting for Spark.\n- **[Introductory Course on Apache Spark](https://sparktraining.web.cern.ch/)**  \n  Beginner-friendly course on Spark fundamentals.\n\nMain author and contact: Luca.Canali@cern.ch\n\n---\n## 🚀 Quick start\n\n[![Watch the video](https://www.youtube.com/s/desktop/050e6796/img/favicon_32x32.png) Watch sparkMeasure's 
getting started demo tutorial](https://www.youtube.com/watch?v=NEA1kkFcZWs)\n\n### Examples of sparkMeasure on notebooks\n- Run locally or on hosted resources like Google Colab, Databricks, GitHub Codespaces, etc on Jupyter notebooks\n\n- [\u003cimg src=\"https://raw.githubusercontent.com/googlecolab/open_in_colab/master/images/icon128.png\" height=\"50\"\u003e Jupyter notebook on Google Colab Research](https://colab.research.google.com/github/LucaCanali/sparkMeasure/blob/master/examples/SparkMeasure_Jupyter_Colab_Example.ipynb)\n\n- [\u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/6/63/Databricks_Logo.png\" height=\"40\"\u003e Scala notebook on Databricks](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2061385495597958/2910895789597333/442806354506758/latest.html)  \n  \n- [\u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/6/63/Databricks_Logo.png\" height=\"40\"\u003e Python notebook on Databricks](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2061385495597958/2910895789597316/442806354506758/latest.html)  \n  \n- [\u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Jupyter_logo.svg/250px-Jupyter_logo.svg.png\" height=\"50\"\u003e Local Python/Jupyter Notebook](examples/SparkMeasure_Jupyter_Python_getting_started.ipynb)\n\n### Examples of sparkMeasure on the CLI \n  - Run locally or on hosted resources\n    - [![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/LucaCanali/sparkMeasure)\n\n#### Python CLI\n  ```\n  # Python CLI\n  # pip install pyspark\n  pip install sparkmeasure\n  pyspark --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25\n\n  # Import sparkMeasure\n  from sparkmeasure import StageMetrics\n  stagemetrics = StageMetrics(spark)\n  # Simple one-liner to run a Spark SQL query and measure its performance\n  stagemetrics.runandmeasure(globals(), 
'spark.sql(\"select count(*) from range(1000) cross join range(1000) cross join range(1000)\").show()')\n  \n  # Alternatively, you can use the begin() and end() methods to measure performance\n  # Start measuring\n  stagemetrics.begin()\n\n  spark.sql(\"select count(*) from range(1000) cross join range(1000) cross join range(1000)\").show()\n\n  # Set a stop point for measuring metrics delta values\n  stagemetrics.end()\n  # Print the metrics report\n  stagemetrics.print_report()\n  stagemetrics.print_memory_report()\n\n  # get metrics as a dictionary\n  metrics = stagemetrics.aggregate_stage_metrics()\n  ```\n#### Scala CLI\n  ```\n  spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25\n\n  val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)\n  stageMetrics.runAndMeasure(spark.sql(\"select count(*) from range(1000) cross join range(1000) cross join range(1000)\").show())\n  ```\n\nThe output should look like this:\n```\n+----------+\n|  count(1)|\n+----------+\n|1000000000|\n+----------+\n\nTime taken: 3833 ms\n\nScheduling mode = FIFO\nSpark Context default degree of parallelism = 8\n\nAggregated Spark stage metrics:\nnumStages =\u003e 3\nnumTasks =\u003e 17\nelapsedTime =\u003e 1112 (1 s)\nstageDuration =\u003e 864 (0.9 s)\nexecutorRunTime =\u003e 3358 (3 s)\nexecutorCpuTime =\u003e 2168 (2 s)\nexecutorDeserializeTime =\u003e 892 (0.9 s)\nexecutorDeserializeCpuTime =\u003e 251 (0.3 s)\nresultSerializationTime =\u003e 72 (72 ms)\njvmGCTime =\u003e 0 (0 ms)\nshuffleFetchWaitTime =\u003e 0 (0 ms)\nshuffleWriteTime =\u003e 36 (36 ms)\nresultSize =\u003e 16295 (15.9 KB)\ndiskBytesSpilled =\u003e 0 (0 Bytes)\nmemoryBytesSpilled =\u003e 0 (0 Bytes)\npeakExecutionMemory =\u003e 0\nrecordsRead =\u003e 2000\nbytesRead =\u003e 0 (0 Bytes)\nrecordsWritten =\u003e 0\nbytesWritten =\u003e 0 (0 Bytes)\nshuffleRecordsRead =\u003e 8\nshuffleTotalBlocksFetched =\u003e 8\nshuffleLocalBlocksFetched =\u003e 8\nshuffleRemoteBlocksFetched =\u003e 
0\nshuffleTotalBytesRead =\u003e 472 (472 Bytes)\nshuffleLocalBytesRead =\u003e 472 (472 Bytes)\nshuffleRemoteBytesRead =\u003e 0 (0 Bytes)\nshuffleRemoteBytesReadToDisk =\u003e 0 (0 Bytes)\nshuffleBytesWritten =\u003e 472 (472 Bytes)\nshuffleRecordsWritten =\u003e 8\n\nAverage number of active tasks =\u003e 3.0\n\nStages and their duration:\nStage 0 duration =\u003e 355 (0.4 s)\nStage 1 duration =\u003e 411 (0.4 s)\nStage 3 duration =\u003e 98 (98 ms)\n```\n\n### Memory report\nStage metrics collection mode has an optional memory report command:\n```\n(scala)\u003e stageMetrics.printMemoryReport\n(python)\u003e stagemetrics.print_memory_report()\n\nAdditional stage-level executor metrics (memory usage info updated at each heartbeat):\n\nStage 0 JVMHeapMemory maxVal bytes =\u003e 322888344 (307.9 MB)\nStage 0 OnHeapExecutionMemory maxVal bytes =\u003e 0 (0 Bytes)\nStage 1 JVMHeapMemory maxVal bytes =\u003e 322888344 (307.9 MB)\nStage 1 OnHeapExecutionMemory maxVal bytes =\u003e 0 (0 Bytes)\nStage 3 JVMHeapMemory maxVal bytes =\u003e 322888344 (307.9 MB)\nStage 3 OnHeapExecutionMemory maxVal bytes =\u003e 0 (0 Bytes)\n```\nNotes:\n- this report makes use of per-stage memory (executor metrics) data which is sent by the\n  executors at each heartbeat to the driver; there could be a small delay, of the order of\n  a few seconds, between the end of the job and the time the last metrics value is received.\n\n### CLI example for Task Metrics:\nThis is similar but slightly different from the example above, as it collects metrics at the Task-level rather than the Stage-level\n  ```\n  # Scala CLI\n  spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25\n\n  val taskMetrics = ch.cern.sparkmeasure.TaskMetrics(spark)\n  taskMetrics.runAndMeasure(spark.sql(\"select count(*) from range(1000) cross join range(1000) cross join range(1000)\").show())\n  ```\n  ```\n  # Python CLI\n  # pip install pyspark\n  pip install sparkmeasure\n  pyspark --packages 
ch.cern.sparkmeasure:spark-measure_2.13:0.25\n\n  from sparkmeasure import TaskMetrics\n  taskmetrics = TaskMetrics(spark)\n  taskmetrics.runandmeasure(globals(), 'spark.sql(\"select count(*) from range(1000) cross join range(1000) cross join range(1000)\").show()')\n  ```\n---\n\n## Setting Up SparkMeasure with Spark\n\n### Version Compatibility for SparkMeasure\n\n| Spark Version  | Recommended SparkMeasure Version | Scala Version       |\n| -------------- |----------------------------------|---------------------|\n| Spark 4.x      | 0.25 (latest)                    | Scala 2.13          |\n| Spark 3.x      | 0.25 (latest)                    | Scala 2.12 and 2.13 |\n| Spark 2.4, 2.3 | 0.19                             | Scala 2.11          |\n| Spark 2.2, 2.1 | 0.16                             | Scala 2.11          |\n\n### 📥 Downloading SparkMeasure\n\nTo get SparkMeasure, choose one of the following options:\n\n1. **Stable Releases:**\n\n    * Available on [Maven Central Repository](https://mvnrepository.com/artifact/ch.cern.sparkmeasure/spark-measure).\n\n2. **Specific Versions:**\n\n    * Download JAR files from the [sparkMeasure release notes](https://github.com/LucaCanali/sparkMeasure/releases/tag/v0.25).\n\n3. **Latest Development Builds:**\n\n    * Access the latest development builds from [GitHub Actions](https://github.com/LucaCanali/sparkMeasure/actions).\n\n4. 
**Build from Source:**\n\n    * Clone the repository and use sbt to build: `sbt +package`.\n\n### Setup Examples\n\n#### Spark 4 with Scala 2.13\n\n* **Scala:** `spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25`\n* **Python:**\n\n  ```bash\n  pyspark --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25\n  pip install sparkmeasure\n  ```\n\n#### Spark 3 with Scala 2.12\n\n* **Scala:** `spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25`\n* **Python:**\n\n  ```bash\n  pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.25\n  pip install sparkmeasure\n  ```\n### Including sparkMeasure in your Spark environment\n\nChoose your preferred method:\n\n* Use the `--packages` option:\n\n  ```bash\n  --packages ch.cern.sparkmeasure:spark-measure_2.13:0.25\n  ```\n* Directly reference the JAR file:\n\n  ```bash\n  --jars /path/to/spark-measure_2.13-0.25.jar\n  --jars https://github.com/LucaCanali/sparkMeasure/releases/download/v0.25/spark-measure_2.13-0.25.jar\n  --conf spark.driver.extraClassPath=/path/to/spark-measure_2.13-0.25.jar\n  ```\n\n---\n## Notes on Spark Metrics\nSpark is instrumented with several metrics, collected at task execution; they are described in the documentation:  \n- [Spark Task Metrics docs](https://spark.apache.org/docs/latest/monitoring.html#executor-task-metrics)\n\nSome of the key metrics when looking at a sparkMeasure report are:\n- **elapsedTime:** the time taken by the stage or task to complete (in millisec)\n- **executorRunTime:** the time the executors spent running the task (in millisec). Note this time is cumulative across all tasks executed by the executor.\n- **executorCpuTime:** the CPU time the executors spent running the task (in millisec). 
Note this time is cumulative across all tasks executed by the executor.\n- **jvmGCTime:** the time the executors spent in garbage collection (in millisec).\n- shuffle metrics: several metrics with details on the I/O and time spent on shuffle\n- I/O metrics: details on the I/O (reads and writes). Note, currently there are no time-based metrics for I/O operations.\n\nTo learn more about the metrics, I advise you to set up your lab environment and run some tests to see the metrics in action.\nA good place to start is [TPCDS PySpark](https://github.com/LucaCanali/Miscellaneous/tree/master/Performance_Testing/TPCDS_PySpark) - a tool you can use to run TPC-DS with PySpark, instrumented with sparkMeasure\n\n---\n## Documentation, API, and examples \nSparkMeasure is one tool for many different use cases, languages, and environments:\n  * [![API Documentation](https://img.shields.io/badge/API-Documentation-brightgreen)](docs/Reference_SparkMeasure_API_and_Configs.md) \n    - [SparkMeasure's API and configurations](docs/Reference_SparkMeasure_API_and_Configs.md)\n\n\n  * **Interactive mode**   \n    Use sparkMeasure to collect and analyze Spark workload metrics in interactive mode when\n    working with shell or notebook environments, such as `spark-shell` (Scala), `PySpark` (Python), or\n    Jupyter notebooks.  \n    - **[SparkMeasure on PySpark and Jupyter notebooks](docs/Python_shell_and_Jupyter.md)**\n    - **[SparkMeasure on Scala shell and notebooks](docs/Scala_shell_and_notebooks.md)**\n   \n    \n  * **Batch and code instrumentation**  \n    Instrument your code with the sparkMeasure API, for collecting, saving,\n    and analyzing Spark workload metrics data. 
Examples and how-to guides:\n     - **[Instrument Spark-Python code](docs/Instrument_Python_code.md)**\n     - **[Instrument Spark-Scala code](docs/Instrument_Scala_code.md)**\n     - See also [Spark_CPU_memory_load_testkit](https://github.com/LucaCanali/Miscellaneous/tree/master/Performance_Testing/Spark_CPU_memory_load_testkit)\n       as an example of how to use sparkMeasure to instrument Spark code for performance testing.\n       \n\n  * **Flight Recorder mode**  \n    SparkMeasure in flight recorder mode will collect metrics transparently, without any need for you \n    to change your code. \n    * Metrics can be saved to a file, locally, or to a Hadoop-compliant filesystem\n    * or you can write metrics in near-realtime to the following sinks: InfluxDB, Apache Kafka, Prometheus Pushgateway\n    * More details:\n      - **[Flight Recorder mode with file sink](docs/Flight_recorder_mode_FileSink.md)**\n      - **[Flight Recorder mode with InfluxDB sink](docs/Flight_recorder_mode_InfluxDBSink.md)**\n      - **[Flight Recorder mode with Apache Kafka sink](docs/Flight_recorder_mode_KafkaSink.md)**\n      - **[Flight Recorder mode with Prometheus Pushgateway sink](docs/Flight_recorder_mode_PrometheusPushgatewaySink.md)**\n  \n\n  * **Limitations and known issues**\n    * **Support for Spark Connect**  \n    SparkMeasure cannot yet provide full integration with [Spark Connect](https://spark.apache.org/docs/latest/spark-connect-overview.html)\n    because it needs direct access to the `SparkContext` and its listener interface. You can run sparkMeasure in **Flight Recorder** mode\n    on the Spark Connect driver to capture metrics for the entire application, but per-client (Spark Connect client-side)\n    metrics are not reported.\n    * **Single-threaded applications**  \n    The sparkMeasure APIs for generating reports using `StageMetrics` and `TaskMetrics` are best suited\n    for a single-threaded driver environment. 
These APIs capture metrics by calculating deltas between\n    snapshots taken at the start and end of an execution. If multiple Spark actions run concurrently on\n    the Spark driver, it may result in double-counting of metric values.\n  \n\n  * **Additional documentation and examples**\n    - Presentations at Spark/Data+AI Summit:\n      - [Performance Troubleshooting Using Apache Spark Metrics](https://databricks.com/session_eu19/performance-troubleshooting-using-apache-spark-metrics)\n      - [Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Methodologies](http://canali.web.cern.ch/docs/Spark_Summit_2017EU_Performance_Luca_Canali_CERN.pdf)\n    - Blog articles:\n      - [2018: SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads](https://db-blog.web.cern.ch/blog/luca-canali/2018-08-sparkmeasure-tool-performance-troubleshooting-apache-spark-workloads),\n      - [2017: SparkMeasure blog post](http://db-blog.web.cern.ch/blog/luca-canali/2017-03-measuring-apache-spark-workload-metrics-performance-troubleshooting)\n  - [TODO list and known issues](docs/TODO_and_issues.md)\n  - [TPCDS-PySpark](https://github.com/LucaCanali/Miscellaneous/tree/master/Performance_Testing/TPCDS_PySpark) \n    a tool for running the TPCDS benchmark workload with PySpark and instrumented with sparkMeasure\n\n---\n## Architecture diagram  \n![sparkMeasure architecture diagram](docs/sparkMeasure_architecture_diagram.png)\n\n---\n## Main concepts underlying sparkMeasure implementation \n* The tool is based on the Spark Listener interface. Listeners transport Spark executor \n  [Task Metrics](https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_TaskMetrics.md)\n  data from the executor to the driver.\n  They are a standard part of Spark instrumentation, used by the Spark Web UI and History Server for example.     
\n* The tool is built on multiple modules implemented as classes\n    * metrics collection and processing can be at the Stage-level or Task-level. The user chooses which mode to use with the API. \n    * metrics can be buffered into memory for real-time reporting, or they can be dumped to an external\n    system in the \"flight recorder mode\".\n    * supported external systems are File Systems supported by the Hadoop API, InfluxDB, Apache Kafka, Prometheus Pushgateway.\n* Metrics are flattened and collected into local memory structures in the driver (ListBuffer of a custom case class).\n  * sparkMeasure in flight recorder mode, when using the InfluxDB, Apache Kafka, or Prometheus Pushgateway sink, does not buffer,\n  but rather writes the collected metrics directly to the sink\n* Metrics processing:\n  * metrics can be aggregated into a report showing the cumulative values for each metric\n  * aggregated metrics can also be returned as a Scala Map or Python dictionary\n  * metrics can be converted into a Spark DataFrame for custom querying  \n* Metrics data and reports can be saved for offline analysis.\n\n### FAQ:  \n  - Why measure performance with workload metrics instrumentation rather than just using execution time measurements?\n    - When measuring just the jobs' elapsed time, you treat your workload as \"a black box\" and most often this does\n      not allow you to understand the root causes of performance regression.   \n      With workload metrics you can (attempt to) go further in understanding and perform root cause analysis,\n      bottleneck identification, and resource usage measurement. \n\n  - What are Apache Spark task metrics and what can I use them for?\n     - Apache Spark measures several details of each task execution, including run time, CPU time,\n     information on garbage collection time, shuffle metrics, and task I/O. 
\n     See also Spark documentation for a description of the \n     [Spark Task Metrics](https://spark.apache.org/docs/latest/monitoring.html#executor-task-metrics)\n\n  - How is sparkMeasure different from Web UI/Spark History Server and EventLog?\n     - sparkMeasure uses the same ListenerBus infrastructure used to collect data for the Web UI and Spark EventLog.\n       - Spark collects metrics and other execution details and exposes them via the Web UI.\n       - Notably, Task execution metrics are also available through the [REST API](https://spark.apache.org/docs/latest/monitoring.html#rest-api)\n       - In addition, Spark writes all details of the task execution in the EventLog file \n       (see the configs `spark.eventLog.enabled` and `spark.eventLog.dir`)\n       - The EventLog is used by the Spark History Server and other tools and programs that can read and parse\n        the EventLog file(s) for workload analysis and performance troubleshooting; see a [proof-of-concept example of reading the EventLog with Spark SQL](https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_EventLog.md)  \n     - There are key differences that motivate this development: \n        - sparkMeasure can collect data at the stage completion-level, which is more lightweight than measuring\n        all the tasks, in case you only need to compute aggregated performance metrics. When needed, \n        sparkMeasure can also collect data at the task granularity level.\n        - sparkMeasure has an API that makes it simple to add instrumentation/performance measurements\n         in notebooks and in application code for Scala, Java, and Python. 
\n        - sparkMeasure collects data in a flat structure, which makes it natural to use Spark SQL for \n        workload data analysis.\n        - sparkMeasure can sink metrics data into external systems (Filesystem, InfluxDB, Apache Kafka, Prometheus Pushgateway)\n\n  - What are known limitations and gotchas?\n    - sparkMeasure does not collect all the data available in the EventLog\n    - See also the [TODO and issues doc](docs/TODO_and_issues.md)\n    - The currently available Spark task metrics can give you precious quantitative information on \n      resources used by the executors, however they do not allow you to fully perform time-based analysis of\n      the workload performance; notably, they do not expose the time spent doing I/O or network traffic.\n    - Metrics are collected on the driver, which could become a bottleneck. This is an issue affecting tools \n      based on Spark ListenerBus instrumentation, such as the Spark WebUI.\n      In addition, note that sparkMeasure in the current version buffers all data in the driver memory.\n      The notable exception is when using the Flight recorder mode with InfluxDB or \n      Apache Kafka or Prometheus Pushgateway sink; in this case metrics are directly sent to InfluxDB/Kafka/Prometheus Pushgateway.\n    - Task metrics values are collected by sparkMeasure only for successfully executed tasks. Note that \n      resources used by failed tasks are not collected in the current version. The notable exception is\n      with the Flight recorder mode with InfluxDB or Apache Kafka or Prometheus Pushgateway sink.\n    - sparkMeasure collects and processes data in order of stage and/or task completion. This means that\n      the metrics data is not available in real-time, but rather with a delay that depends on the workload\n      and the size of the data. 
Moreover, performance data of jobs executing at the same time can be mixed.\n      This can be a noticeable issue if you run workloads with many concurrent jobs.\n    - Task metrics are collected by Spark executors running on the JVM; resources utilized outside the\n      JVM are currently not directly accounted for (notably the resources used when running Python code\n      inside the python.daemon in the case of Python UDFs with PySpark).\n\n  - When should I use Stage-level metrics and when should I use Task-level metrics?\n     - Use stage metrics whenever possible as they are much more lightweight. Collect metrics at\n     the task granularity if you need the extra information, for example if you want to study \n     the effects of skew, long tails, and task stragglers.\n\n  - How can I save/sink the collected metrics?\n     - You can print metrics data and reports to standard output or save them to files, using\n      a locally mounted filesystem or a Hadoop-compliant filesystem (including HDFS).\n     Additionally, you can sink metrics to external systems (such as Prometheus Pushgateway). \n     The Flight Recorder mode can sink to InfluxDB, Apache Kafka or Prometheus Pushgateway. \n\n  - How can I process metrics data?\n     - You can use Spark to read the saved metrics data and perform further post-processing and analysis.\n     See also the [Notes on metrics analysis](docs/Notes_on_metrics_analysis.md).\n\n  - How can I contribute to sparkMeasure?\n    - SparkMeasure has already profited from users submitting PR contributions. Additional contributions are welcome. \n    See the [TODO_and_issues list](docs/TODO_and_issues.md) for a list of known issues and ideas on what \n    you can contribute.  
\n","funding_links":[],"categories":["Scala","大数据"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLucaCanali%2FsparkMeasure","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLucaCanali%2FsparkMeasure","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLucaCanali%2FsparkMeasure/lists"}