{"id":18810345,"url":"https://github.com/absaoss/spot","last_synced_at":"2025-04-13T20:30:56.807Z","repository":{"id":43148712,"uuid":"273510510","full_name":"AbsaOSS/spot","owner":"AbsaOSS","description":"Aggregate and analyze Spark history, export to elasticsearch, visualize and monitor with Kibana.","archived":false,"fork":false,"pushed_at":"2024-07-31T14:28:43.000Z","size":415,"stargazers_count":5,"open_issues_count":20,"forks_count":0,"subscribers_count":7,"default_branch":"develop","last_synced_at":"2024-08-01T18:16:40.339Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-06-19T14:15:37.000Z","updated_at":"2023-04-10T11:37:29.000Z","dependencies_parsed_at":"2023-12-03T21:24:24.793Z","dependency_job_id":"e23bdd32-0d4d-43e5-9a13-8c241abcee3f","html_url":"https://github.com/AbsaOSS/spot","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload.github.c
om/AbsaOSS/spot/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223603201,"owners_count":17172059,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:19:51.840Z","updated_at":"2024-11-07T23:19:52.436Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Spot logo](https://user-images.githubusercontent.com/8556576/149575510-48f57d83-a482-454a-bb14-709bfa7e6fb1.png)\n\n\u003c!-- toc --\u003e\n- [What is Spot?](#what-is-spot)\n- [Monitoring examples: How Spot is used to tune Spark apps](#monitoring-examples)\n- [Modules](#modules)\n    - [Crawler](#crawler)\n    - [Regression](#regression)\n    - [Setter](#setter)\n    - [Enceladus](#enceladus)\n- [Deployment](#deployment)\n\u003c!-- tocstop --\u003e\n\n## What is Spot?\nSpot is a set of tools for monitoring and performance tuning of [Spark](https://github.com/apache/spark) applications.\nThe main idea is to continuously apply statistical analysis on repeating (production) runs of the same applications.\nThis enables comparison of target metrics (e.g. time, cluster load, cloud cost) between different code versions and configurations.\nFurthermore, ML models and optimization techniques can be applied to configure new application runs automatically [Future].\n\nOne of the primary use cases considered is ETL (Extract Transform Load) in batch mode.\n[Enceladus](https://github.com/AbsaOSS/enceladus) is an example of one such project. 
Such an application runs repeatedly\n(e.g. thousands of runs per hour) on new data instances which vary greatly in size and processing complexity. For this reason,\na uniform setup would not be optimal for the entire spectrum of runs.\nIn contrast, the statistical approach allows for the categorization of cases and an automatic setup of configurations for new runs.\n\nSpot relies on metadata available in Spark History and therefore does not require additional instrumentation of Spark apps.\nThis enables collection of statistics of production runs without compromising their performance.\n\nSpot consists of the following modules:\n\n|     Module     |          Short description          |\n|----------------|-------------------------------------|\n| Crawler        | The crawler performs collection and initial processing of Spark history data. The output is stored in Elasticsearch and can be visualized with Kibana for monitoring. |\n| Regression     |(Future) The regression models use the stored data in order to interpolate time vs. config values. |\n| Setter         |(Future) The Setter module suggests config values for new runs of Spark apps based on the regression model.|\n| Enceladus      |The Enceladus module provides integration capabilities for Spot usage with [Enceladus](https://github.com/AbsaOSS/enceladus).|\n| YARN           | The module contains its own crawler which provides data collection from [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) API. The data can be visualized with provided Kibana dashboards. (Future) The YARN data is merged with data from other sources (Spark, Enceladus) for a more complete analysis.|\n| Kibana         | A collection of Kibana dashboards and alerts which provide visualization and monitoring for the Spot data. 
|\n\nA detailed description of each module can be found in section [Modules](#modules).\n\nThe diagram below shows current Spot architecture.\n\n![Spot architecture](https://user-images.githubusercontent.com/8556576/87431759-5e64c100-c5e7-11ea-84bb-ae1e2403c84a.png)\n\n## Monitoring examples\nIn this section we provide examples of plots and analysis which demonstrate how Spot is applied to monitor and tune\nthe performance of Spark jobs.\n\n#### Example: Cluster usage over time\n![Cluster usage](https://user-images.githubusercontent.com/8556576/88381248-5efb1580-cda6-11ea-8eb1-80524b4f167a.png)\nThis plot shows how many CPU cores were allocated for Spark apps by each user over time. Similar plots can be obtained for memory used by executors,\nthe amount of shuffled data and so on. The series can also be split by application name or other metadata.\nKibana time series, used in this example, does not account for the duration of allocation. This is planned to be addressed using custom plots in the future.\n\n#### Example: Characteristics of a particular Spark application\nWhen an application is running repeatedly, statistics of runs can be used to focus code optimization towards the most\ncritical and common cases. Such statistics can also be used to compare app versions.\nThe next two plots show histograms of run duration in milliseconds (attempt.duration) and size of input data in bytes\n(attempts.aggs.stages.inputBytes.max). 
Filters on max values are applied to both plots in order to keep a reasonable scale.\n![Time histogram](https://user-images.githubusercontent.com/8556576/88382148-23614b00-cda8-11ea-9965-654b1b0bf691.png)\n![Size histogram](https://user-images.githubusercontent.com/8556576/88382162-2e1be000-cda8-11ea-93e3-d9cc47a27f15.png)\nThe next figure shows statistics of configurations used in different runs of the same app.\n![Configurations](https://user-images.githubusercontent.com/8556576/88383534-01b59300-cdab-11ea-8080-f8fc6c454a9d.png)\nWhen too many executors are allocated to a relatively small job or partitioning is not working properly\n(e.g. unsplittable data formats), some of the executors remain idle for the entire run. In such cases the resource\nallocation can be safely decreased in order to reduce the cluster load. The next histogram illustrates such a case.\n![Zero tasks](https://user-images.githubusercontent.com/8556576/88386609-0aa96300-cdb1-11ea-9cf5-be970e53eec6.png)\n\n#### Example: Dynamic VS Fixed resource allocation\n![Dynamic Resource Allocation](https://user-images.githubusercontent.com/8556576/88194526-3a385e00-cc3f-11ea-817b-b72254f16cf9.png)\nThe plot above shows the relationship between run duration, input size and total CPU core allocation for Enceladus runs on a particular dataset.\nThe left sub-plot corresponds to a fixed resource allocation which was the default. Due to great variation of input size\nin data pipelines, fixed allocation often leads to either: 1) extended time in case of under-allocation or 2) wasted\nresources in case of over-allocation. 
The right sub-plot demonstrates how\n[Dynamic Resource Allocation](http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation),\nset as a new default, solves this issue.\nHere the number of cores is adjusted based on the input size, and as a result the total job duration stabilizes and efficiency improves.\n\n#### Example: Small files issue\n![Small files issue](https://user-images.githubusercontent.com/8556576/88194561-41f80280-cc3f-11ea-97ed-75657585392f.png)\nWhen shuffle operations are present, Spark creates 200 partitions by default regardless of the data size. Excessive\nfragmentation of small files compromises HDFS performance. The presented plot, produced by Spot, shows how the number\nof output partitions depends on the input size with old/new configurations. As can be seen in the plot,\n[Adaptive Execution](https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html)\ncreates a reasonable number of partitions proportional to data size. Based on such analysis, enabled by Spot,\nit was set as a new default for Enceladus.\n\n#### Example: Parallelism\nHere we demonstrate application of selected metrics from [parallel algorithms theory](https://en.wikipedia.org/wiki/Analysis_of_parallel_algorithms)\nto [Spark execution model](https://spark.apache.org/docs/latest/cluster-overview.html#cluster-mode-overview).\n\nThe diagram below shows the execution timeline of a Spark app. Here, for simplicity of the demonstration, we assume each\nexecutor has a single CPU core. The duration of the run on _m_ executors is denoted _T(m)_. The allocation time of each\nexecutor is presented by a dotted orange rectangle. Tasks on the executors are shown as green rectangles. The tasks\nare organized in stages which may overlap. Tasks in each stage are executed in parallel.\nThe parts of the driver program which do not overlap with stages are pictured as red rectangles. 
In our analysis,\nwe assume these parts make up the _sequential part_ of the program which has a fixed duration.\nThis includes the code which is not parallelizable (on executors):\nstartup and scheduling overheads, Spark's query optimizations, external API calls, custom driver code, etc. In other words,\nthis is the part of the program during which there are no tasks running on the executors.\nThe rest of the run duration corresponds to the _parallel part_, i.e. when tasks can be executed.\n![Spark parallelism](https://user-images.githubusercontent.com/8556576/88536230-bc43d080-d00b-11ea-8841-f7a9b925ef6b.png)\nTotal _allocated core time_ is the sum of the products of allocation time per executor and the number of cores allocated to that executor, as defined by this formula:\n\n\u003cimg src=\"https://user-images.githubusercontent.com/8556576/89024127-e99ec000-d324-11ea-8f38-0ce072024e0e.gif\"/\u003e\n\n\nKnowing the duration of the sequential part and the total duration of all of the tasks, we can also estimate the duration of a (hypothetical)\nrun on a single executor. 
The next plot shows an example of how Spot visualizes the described metrics.\nHere, the values averaged over multiple runs are shown for two types of Enceladus apps.\n\n![Parallelism per job](https://user-images.githubusercontent.com/8556576/88376482-ca8cb500-cd9d-11ea-9692-78b659f8b2f9.png)\n\nThe efficiency and speedup are estimated using the following formulas:\n\n\u003cimg src=\"https://user-images.githubusercontent.com/8556576/88538274-578a7500-d00f-11ea-9b91-bc1391504f97.png\" width=\"350px\" /\u003e\n\nPlease note that in this analysis we focus on parallelism on executors; the possible parallelism of the driver part\non multiple driver cores requires a separate investigation.\n\nThe next two histograms display the efficiency and speedup of multiple runs for a sample Spark app with different\n inputs and configurations.\n![Efficiency hist](https://user-images.githubusercontent.com/8556576/88551797-93c7d080-d023-11ea-876c-de6ff173dbc4.png)\n![Speedup hist](https://user-images.githubusercontent.com/8556576/88552001-cffb3100-d023-11ea-8c85-fd8b97e8359e.png)\nFurther analysis of such metrics may include dependencies on particular configuration values.\n\n## Modules\n\n### Crawler\nThe Crawler module aggregates [Spark history](https://spark.apache.org/docs/latest/monitoring.html#rest-api) data and\nstores it in [Elasticsearch](https://github.com/elastic/elasticsearch) for further analysis by tools such as [Kibana](https://github.com/elastic/kibana).\nThe Spark History data are merged from several [APIs](https://spark.apache.org/docs/latest/monitoring.html#rest-api)\n(attempts, executors, stages) into a single raw JSON document.\n\nInformation from external services (currently: Menas) is added for supported\napplications (currently: [Enceladus](https://github.com/AbsaOSS/enceladus)). The raw documents are stored in a separate\nElasticsearch collection. Aggregations for each document are stored in a separate\ncollection. 
The aggregations are performed in the following way: custom aggregations\n(e.g. min, max, mean, non-zero) are calculated for each value (e.g. completed tasks)\nacross elements of each array in the original raw document (e.g. executors). Custom\ncalculated values are added, e.g. total CPU allocation, estimated efficiency and speedup.\nSome of the records can be inconsistent due to external services (e.g. Spark History Server error)\nand raise exceptions during processing. Such exceptions are handled and corresponding records\nare stored in a separate collection along with error messages.\n\n\n### Regression\n(Future) The regression models use the stored data in order to interpolate time vs. config values.\n\n### Setter\n(Future) The Setter module suggests config values for new runs of Spark apps based on the regression model.\n\n### Enceladus\nThe Enceladus module provides integration capabilities for Spot usage with [Enceladus](https://github.com/AbsaOSS/enceladus).\n\n### YARN\n\nThe module contains its own crawler which provides data collection from [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) API. The data can be visualized with provided Kibana dashboards. (Future) The YARN data is merged with data from other sources (Spark, Enceladus) for a more complete analysis.\n\n\n### Kibana\nA collection of Kibana dashboards and alerts which provide visualization and monitoring for the Spot data.\n\n## Deployment\n- Install Python **3.7.16**\n- Clone code to a location of your choice\n- Install required modules (see requirements.txt) `pip3 install --user -r requirements.txt`\n- Add project root directory to PYTHONPATH e.g. 
`export PYTHONPATH=\"${PYTHONPATH}:/path/to/spot\"` if PYTHONPATH is already defined, otherwise `export PYTHONPATH=\"$(which python3):/path/to/spot\"`\n- Check access to external services:\n    - Elasticsearch and Kibana (OR OpenSearch and OpenSearch Dashboards)\n    - Spark History (2.4 and later recommended)\n    - (Optional) [Menas](https://github.com/AbsaOSS/enceladus) (2.1.0 and later recommended) Requires username and password\n    - (Optional) [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)\n- Create configuration: in /spot/config copy config.ini.template to config.ini and set parameters from the above step\n    - For a new deployment set new index names which do not exist in elasticsearch.\n    In order to be compatible with the provided [Kibana objects](spot/kibana/) the indexes should match the following patterns:\n    (optional) raw_index=spot\\_raw\\_\\\u003ccluster_name\\\u003e\\_\\\u003cid\\\u003e\n    agg_index=spot\\_agg\\_\\\u003ccluster_name\\\u003e\\_\\\u003cid\\\u003e\n    err_index=spot\\_err\\_\\\u003ccluster_name\\\u003e\\_\\\u003cid\\\u003e\n- Configure logging: in /spot/config copy logging_confg.template to logging_confg.ini and adjust the parameters (see [Logging](https://docs.python.org/2/library/logging.config.html#configuration-file-format))\n\n### Multicluster configuration\nIt is possible to monitor multiple clusters (each with its own Spark History server) with Spot.\nFor this scenario a separate Spot crawler process needs to be running for each Spark History server (and optionally Menas).\nEach process writes to its own set of indexes within the same elasticsearch instance.\nIf the index names follow the defined pattern (spot\\_\\\u003craw/agg/err\\\u003e\\_\\\u003ccluster_name\\\u003e\\_\\\u003cid\\\u003e)\nthe data can be visualized in Kibana using the setup provided in [Kibana directory](spot/kibana/).\nThere the data can be filtered by history_host.keyword if required.\n\n### Run 
Crawler\n`cd spot/crawler`\n\n`python3 crawler.py [options]`\n\n|    Option     |      Default     |                    Description                 |\n|---------------|------------------|------------------------------------------------|\n|--min_end_date | None             |Optional. Minimal completion date of the Spark job in the format YYYY-MM-DDThh:mm:ss. Crawler processes Spark jobs completed after the latest of a) the max completion date among already processed jobs (stored in the database) b) this option. In the first run, when there are no processed jobs in the database and this option is not specified, the crawler starts with the earliest completed job in Spark History.|\n\nThis will start the main loop of the crawler. It gets new completed apps, processes and stores them in the database. When all the new apps are processed the crawler sleeps `sleep_seconds` (see config.ini) before the next iteration. To exit the loop, kill the process.\n\n\n### Import Kibana Demo Dashboard\n[Kibana directory](spot/kibana/) contains objects which can be\n[imported to Kibana](https://www.elastic.co/guide/en/kibana/current/managing-saved-objects.html#:~:text=Importedit,already%20in%20Kibana%20are%20overwritten.).\nFor example, there is a [demo dashboard](spot/kibana/dashboards/spot_demo.ndjson) demonstrating basic statistics of Spark applications.\n\n### Configure Alerts\nTo trigger an [alert in Kibana](https://www.elastic.co/guide/en/kibana/master/alerting-getting-started.html)\nwhen a critical error occurs in Spot (e.g. 
Spark History server is in an incorrect state)\nthe [example queries](spot/kibana/alerting/internal_errors/spot_severe_internal_errors.txt) can be used.\n\nThe Kibana alerts can be configured to [use an AWS SNS topic as a destination](https://aws.amazon.com/blogs/big-data/setting-alerts-in-amazon-elasticsearch-service/), which can then be configured to send notifications via email, etc.\n In addition, an [encrypted SNS topic](https://aws.amazon.com/blogs/compute/encrypting-messages-published-to-amazon-sns-with-aws-kms/) can be used (recommended)\n which requires additional configuration of an IAM role, as documented in the referenced tutorial.\n An example of [generating an alert message](spot/kibana/alerting/internal_errors/spot_severe_internal_errors_message.mustache) used together with the example query is provided.\n\n## YARN integration\nSpot can import and visualize monitoring metrics from the YARN API.\nThe import is performed in a separate [yarn_crawler.py](spot/yarn/yarn_crawler.py) process.\nThis process should be run on a host where it can access the YARN API and Elasticsearch.\nIt uses the same configuration `config.ini` as the main `crawler.py` process, where some of the configurations are shared and more are added for YARN specifically.\nThe relevant parameters are:\n - `yarn_api_base_url = http://localhost:8088/ws/v1` base url to access YARN API\n - `yarn_sleep_seconds = 60` sleep time between API calls\n - Elasticsearch indexes:\n   - `yarn_clust_index = spot_yarn_cluster_\u003ccluster_name\u003e_\u003cid\u003e` stores general cluster statistics sampled at each iteration\n   - `yarn_apps_index = spot_yarn_apps_\u003ccluster_name\u003e_\u003cid\u003e` stores details of completed applications\n   - `yarn_scheduler_index = spot_yarn_scheduler_\u003ccluster_name\u003e_\u003cid\u003e` stores statistics sampled from the scheduler. 
It contains documents of multiple types (which can be filtered by the `spot.doc_type` field) for queues, partitions and users\n   - `err_index` is shared with the main crawler config. It stores exception messages that appear during yarn_crawler run\n - `skip_exceptions` parameter is shared with the main crawler\n - Elasticsearch configuration (URL and authentication) is shared with the main crawler\n\n[Kibana directory](spot/kibana/) contains dashboards which visualize the data collected from YARN.\nDescription of available metrics can be found in [YARN documentation](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html).\n\nIt is planned to enrich Spark job metadata with the YARN metadata in the future.\nFor instance, it would add exact details which are not available from Spark History alone, e.g. vCoresSeconds and memorySeconds.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fspot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspot/lists"}