{"id":13906357,"url":"https://github.com/qubole/sparklens","last_synced_at":"2025-04-12T23:40:47.502Z","repository":{"id":45634906,"uuid":"125554232","full_name":"qubole/sparklens","owner":"qubole","description":"Qubole Sparklens tool for performance tuning Apache Spark","archived":false,"fork":false,"pushed_at":"2024-06-26T16:08:19.000Z","size":179,"stargazers_count":574,"open_issues_count":51,"forks_count":141,"subscribers_count":28,"default_branch":"master","last_synced_at":"2025-04-12T23:40:02.771Z","etag":null,"topics":["cluster","performance","performance-analysis","performance-metrics","performance-tuning","performance-visualization","scala","scheduler","scheduling","simulation","spark","spark-applications","spark-job","spark-ml","spark-mllib","spark-sql","sparkjava"],"latest_commit_sha":null,"homepage":"http://sparklens.qubole.com","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qubole.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-16T18:20:20.000Z","updated_at":"2025-04-08T23:44:24.000Z","dependencies_parsed_at":"2024-12-07T03:15:30.647Z","dependency_job_id":null,"html_url":"https://github.com/qubole/sparklens","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qubole%2Fsparklens","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qubole%2Fsparklens/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qubole%2Fsparklens
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qubole%2Fsparklens/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/qubole","download_url":"https://codeload.github.com/qubole/sparklens/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248647254,"owners_count":21139081,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cluster","performance","performance-analysis","performance-metrics","performance-tuning","performance-visualization","scala","scheduler","scheduling","simulation","spark","spark-applications","spark-job","spark-ml","spark-mllib","spark-sql","sparkjava"],"created_at":"2024-08-06T23:01:34.111Z","updated_at":"2025-04-12T23:40:47.470Z","avatar_url":"https://github.com/qubole.png","language":"Scala","funding_links":[],"categories":["Scala"],"sub_categories":[],"readme":"[![Gitter](https://badges.gitter.im/qubole-sparklens/community.svg)](https://gitter.im/qubole-sparklens/community?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge)\n\n# README #\n\nSparklens is a profiling tool for Spark with a built-in Spark scheduler simulator. Its primary goal is to make it easy \nto understand the scalability limits of Spark applications. It helps in understanding how efficiently a given \nSpark application is using the compute resources provided to it. Maybe your application will run faster with more \nexecutors, and maybe it won't. Sparklens can answer this question by looking at a single run of your application. 
\n\nIt helps you narrow down to the few stages (or driver, skew, or lack of tasks) that are limiting your application \nfrom scaling out and provides contextual information about what could be going wrong with these stages. Primarily \nit helps you approach Spark application tuning as a well-defined method/process instead of something you learn by \ntrial and error, saving both developer and compute time. \n\n## Sparklens Reporting as a Service\n\nhttp://sparklens.qubole.com is a reporting service built on top of Sparklens. This service was built to lower the pain of sharing and discussing Sparklens \noutput. Users can upload the Sparklens JSON file to this service and retrieve a globally shareable \nlink. The link delivers the Sparklens report in an easy-to-consume HTML format with intuitive \ncharts and animations. It is also useful to have a link for easy reference for yourself, in case \nsome code changes result in lower utilization or make the application slower.\n\n## What does it report?\n\n* Estimated completion time and estimated cluster utilisation with different numbers of executors\n \n ```\n Executor count    31  ( 10%) estimated time 87m 29s and estimated cluster utilization 92.73%\n Executor count    62  ( 20%) estimated time 47m 03s and estimated cluster utilization 86.19%\n Executor count   155  ( 50%) estimated time 22m 51s and estimated cluster utilization 71.01%\n Executor count   248  ( 80%) estimated time 16m 43s and estimated cluster utilization 60.65%\n Executor count   310  (100%) estimated time 14m 49s and estimated cluster utilization 54.73%\n```\nGiven a single run of a Spark application, Sparklens can estimate how your application will perform \nwith any number of executors. This helps you understand the ROI on adding executors. \n\n* Job/stage timeline which shows how the parallel stages were scheduled within a job. This makes it easy to visualise \nthe DAG with stage dependencies at the job level. 
\n\n```\n07:05:27:666 JOB 151 started : duration 01m 39s \n[    668 ||||||||||||||||||||||                                                          ]\n[    669 |||||||||||||||||||||||||||                                                     ]\n[    673                                                                                 ]\n[    674                         ||||                                                    ]\n[    675                            |||||||                                              ]\n[    676                                   ||||||||||||||                                ]\n[    677                         |||||||                                                 ]\n[    678                                                 |                               ]\n[    679                                                  |||||||||||||||||||||||||||||||]\n```\n\n* Per-stage metrics such as Input, Output, Shuffle Input and Shuffle Output. **OneCoreComputeHours** \navailable and used per stage help you discover inefficient stages. 
\n\n```\nTotal tasks in all stages 189446\nPer Stage  Utilization\nStage-ID   Wall    Task      Task     IO%    Input     Output    ----Shuffle-----    -WallClockTime-    --OneCoreComputeHours---   MaxTaskMem\n          Clock%  Runtime%   Count                               Input  |  Output    Measured | Ideal   Available| Used%|Wasted%                                  \n       0    0.00    0.00         2    0.0  254.5 KB    0.0 KB    0.0 KB    0.0 KB    00m 04s   00m 00s    05h 21m    0.0  100.0    0.0 KB \n       1    0.00    0.01        10    0.0  631.1 MB    0.0 KB    0.0 KB    0.0 KB    00m 07s   00m 00s    08h 18m    0.2   99.8    0.0 KB \n       2    0.00    0.40      1098    0.0    2.1 GB    0.0 KB    0.0 KB    5.7 GB    00m 14s   00m 00s    16h 25m    3.2   96.8    0.0 KB \n       3    0.00    0.09       200    0.0    0.0 KB    0.0 KB    5.7 GB    2.3 GB    00m 03s   00m 00s    04h 35m    2.6   97.4    0.0 KB \n       4    0.00    0.03       200    0.0    0.0 KB    0.0 KB    2.3 GB    0.0 KB    00m 01s   00m 00s    01h 13m    2.9   97.1    0.0 KB \n       7    0.00    0.03       200    0.0    0.0 KB    0.0 KB    2.3 GB    2.7 GB    00m 02s   00m 00s    02h 27m    1.7   98.3    0.0 KB \n       8    0.00    0.03        38    0.0    0.0 KB    0.0 KB    2.7 GB    2.7 GB    00m 05s   00m 00s    06h 20m    0.6   99.4    0.0 KB \n```\n\nInternally, Sparklens has the concept of an analyzer which is a generic component for emitting interesting events. \nThe following analyzers are currently available:\n\n1. AppTimelineAnalyzer\n2. EfficiencyStatisticsAnalyzer\n3. ExecutorTimelineAnalyzer\n4. ExecutorWallclockAnalyzer\n5. HostTimelineAnalyzer\n6. JobOverlapAnalyzer\n7. SimpleAppAnalyzer\n8. StageOverlapAnalyzer\n9. StageSkewAnalyzer\n\nWe are hoping that Spark experts the world over will help us with ideas or contributions to extend this set. Similarly, \nSpark users can help us in finding what is missing here by raising challenging tuning questions.   
\n\n## How to use Sparklens?\n\n#### 1. Using the Sparklens package while running your app #### \n\nNote: Apart from the console-based report, you can also get a UI-based report similar to \n[this](http://sparklens.qubole.com/report_view/1b3868a49388e7ab6a16) in your email. You have to pass\n `--conf spark.sparklens.report.email=\u003cemail\u003e` along with the other relevant confs mentioned below.\n This functionality is available in Sparklens 0.3.2 and above.  \n\nUse the following arguments to `spark-submit` or `spark-shell`:\n```\n--packages qubole:sparklens:0.3.2-s_2.11\n--conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener\n```\n\n#### 2. Run from Sparklens offline data ####\n\nYou can choose to run Sparklens at a later time instead of inside the app. Run your app as above \nwith additional configuration parameters:\n```\n--packages qubole:sparklens:0.3.2-s_2.11\n--conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener\n--conf spark.sparklens.reporting.disabled=true\n```\n\nThis will not run reporting, but instead create a Sparklens JSON file for the application which is \nstored in the **spark.sparklens.data.dir** directory (by default, **/tmp/sparklens/**). Note that this will be stored on HDFS by default. To save this file to S3, set **spark.sparklens.data.dir** to an S3 path. This data file can now be used to run Sparklens reporting independently, using the `spark-submit` command as follows:\n\n`./bin/spark-submit --packages qubole:sparklens:0.3.2-s_2.11 --class com.qubole.sparklens.app.ReporterApp qubole-dummy-arg \u003cfilename\u003e`\n\n`\u003cfilename\u003e` should be replaced by the full path of the Sparklens JSON file. If the file is on S3, use the full S3 path. For files on the local file system, use the file:// prefix with the local file location. HDFS is supported as well. \n\nYou can also upload a Sparklens JSON data file to http://sparklens.qubole.com to see this report as an HTML page. \n\n#### 3. 
Run from Spark event-history file ####\n\nYou can also run Sparklens on a previously run Spark app using an event history file (similar to \nrunning via `sparklens-json-file` above), with another option specifying that the file is an \nevent history file. This file can be in any of the formats that event history files support, i.e. **text, snappy, lz4 \nor lzf**. Note the extra `source=history` parameter in this example:\n\n`./bin/spark-submit --packages qubole:sparklens:0.3.2-s_2.11 --class com.qubole.sparklens.app.ReporterApp qubole-dummy-arg \u003cfilename\u003e source=history`\n\nIt is also possible to convert an event history file to a Sparklens JSON file using the following command:\n\n`./bin/spark-submit --packages qubole:sparklens:0.3.2-s_2.11 --class com.qubole.sparklens.app.EventHistoryToSparklensJson qubole-dummy-arg \u003csrcDir\u003e \u003ctargetDir\u003e`\n\nEventHistoryToSparklensJson is designed to work on the local file system only. Please make sure that the source and target directories are on the local file system.\n\n#### 4. Check out the code and use the normal sbt commands: #### \n\n```\nsbt compile \nsbt package \nsbt clean \n```\nYou will find the Sparklens jar in the `target/scala-2.11` directory. Make sure the Scala and Java versions correspond to those required by your Spark cluster. We have tested it with Java 7/8, \nScala 2.11.8 and Spark versions 2.0.0 and onwards. \n\nOnce you have the Sparklens JAR available, add the following options to your `spark-submit` command line:\n```\n--jars /path/to/sparklens_2.11-0.3.2.jar \n--conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener\n```\nYou could also add this to your cluster's **spark-defaults.conf** so that it is automatically available for all applications.\n\n\n## Working with Notebooks\nIt is possible to use Sparklens in your development cycle using notebooks. Sparklens keeps lots of information in memory. 
\nTo make it work with notebooks, it minimizes memory use by keeping only a limited history of jobs executed \nin Spark.\n\n## How to use Sparklens with Python notebooks (e.g. Zeppelin)\n\n1) Add this as the first cell\n\n```\nQNL = sc._jvm.com.qubole.sparklens.QuboleNotebookListener.registerAndGet(sc._jsc.sc())\nimport time\n\ndef profileIt(callableCode, *args):\n    # Purge old jobs and stages if the listener holds too much data\n    if (QNL.estimateSize() \u003e QNL.getMaxDataSize()):\n        QNL.purgeJobsAndStages()\n    # Timestamps are in milliseconds since the epoch\n    startTime = int(round(time.time() * 1000))\n    result = callableCode(*args)\n    endTime = int(round(time.time() * 1000))\n    # Wait for the listener's configured time before collecting stats\n    time.sleep(QNL.getWaiTimeInSeconds())\n    print(QNL.getStats(startTime, endTime))\n    return result\n```\n2) Wrap your code in a Python function, say `myFunc`\n3) `profileIt(myFunc)`\n\nAs you can see, this is not the only way to use it from Python. The core function is:\n     **QNL.getStats(startTime, endTime)**\n\nAnother way to use this tool, so that we don’t need to worry about objects going out of scope, is:\n\nCreate the QNL object as part of the first paragraph. \nThen, for every piece of code that requires profiling:\n\n```\nif (QNL.estimateSize() \u003e QNL.getMaxDataSize()):\n  QNL.purgeJobsAndStages()\nstartTime = int(round(time.time() * 1000))\n\n# \u003c-- Your Python code here --\u003e\n\nendTime = int(round(time.time() * 1000))\ntime.sleep(QNL.getWaiTimeInSeconds())\nprint(QNL.getStats(startTime, endTime))\n```\n\n`QNL.purgeJobsAndStages()` is responsible for making sure that the tool doesn’t use too much memory. \nIt removes historical information, throwing away data about old stages to keep the memory usage \nby the tool modest.\n\n## How to use Sparklens with Scala notebooks (e.g. 
Zeppelin)\n\n\n1) Add this as the first cell\n\n```\nimport com.qubole.sparklens.QuboleNotebookListener\nval QNL = new QuboleNotebookListener(sc.getConf)\nsc.addSparkListener(QNL)\n```\n2) Anywhere you need to profile the code:\n\n```\nQNL.profileIt {\n    //Your code here\n}\n```\n\nIt is important to realize that `QNL.profileIt` takes a block of code as input. Hence any variables declared in this\npart are not accessible after the method returns. Of course it can refer to other code/variables in scope. \n\nThe way to go about using this tool with notebooks is to have only one cell in the profiling scope. The moment \nyou are happy with the results, just remove the profiling wrapper and execute the same cell again. This will ensure \nthat your variables come back in scope and are accessible to the next cell. Also note that the output of the tool \nin notebooks is a little different from what you would see on the command line. This is just to make the information concise. \nWe will be making this part configurable.\n\n### Working with Streaming Applications ###\nFor using Sparklens with Spark Streaming applications, check out our new project [Streaminglens](https://github.com/qubole/streaminglens).\n\n## More information?\n* [Introduction to Sparklens](https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/)\n* [Video from meetup: Concepts behind Sparklens](https://www.youtube.com/watch?v=0a2U4_6zsCc)\n* [Slides from meetup](https://lnkd.in/fCsrKXj)\n* [Video from Fifth Elephant Conference](https://www.youtube.com/watch?v=SOFztF-3GGk)\n* [Video from Spark AI Summit London 2018](https://www.youtube.com/watch?v=KS5vRZPLo6c)\n\n## Release Notes\n- [03/20/2018] Version 0.1.1 - Sparklens Core\n- [04/06/2018] Version 0.1.2 - Package name fixes\n- [08/07/2018] Version 0.2.0 - Support for offline reporting\n- [01/10/2019] Version 0.2.1 - Stability fixes\n- [05/10/2019] Version 0.3.0 - Support for handling parallel Jobs\n- [05/10/2019] Version 0.3.1 - Fixed JSON parsing 
issue with Spark 2.4.0 and above\n- [05/06/2020] Version 0.3.2 - Support for generating email-based reports using sparklens.qubole.com\n\n## Contributing\nWe haven't given this much thought. Just raise a PR and if you don't hear from us, shoot an email to \n[help@qubole.com](mailto:help@qubole.com) to get our attention. \n\n## Reporting bugs or feature requests\nPlease use the GitHub issues for the Sparklens project to report issues or raise feature requests. If you can code,\neven better: raise a PR.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqubole%2Fsparklens","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqubole%2Fsparklens","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqubole%2Fsparklens/lists"}