{"id":15347638,"url":"https://github.com/michaelmior/spark-log-analysis","last_synced_at":"2025-07-21T11:35:17.076Z","repository":{"id":139402720,"uuid":"105778820","full_name":"michaelmior/spark-log-analysis","owner":"michaelmior","description":null,"archived":false,"fork":false,"pushed_at":"2017-10-11T17:59:47.000Z","size":5,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-01T02:45:40.788Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michaelmior.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-10-04T14:32:34.000Z","updated_at":"2018-04-23T22:47:48.000Z","dependencies_parsed_at":null,"dependency_job_id":"16114de3-b983-4a53-8e19-339315902592","html_url":"https://github.com/michaelmior/spark-log-analysis","commit_stats":{"total_commits":4,"total_committers":1,"mean_commits":4.0,"dds":0.0,"last_synced_commit":"ab3d4f6ec22b53060ecb5f40b80e0cb43ba8cbf7"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/michaelmior/spark-log-analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelmior%2Fspark-log-analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelmior%2Fspark-log-analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelmior%2Fspark-log-analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/michaelmior%2Fspark-log-analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michaelmior","download_url":"https://codeload.github.com/michaelmior/spark-log-analysis/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelmior%2Fspark-log-analysis/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261967132,"owners_count":23237663,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-01T11:36:52.695Z","updated_at":"2025-06-25T23:05:55.316Z","avatar_url":"https://github.com/michaelmior.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark log analysis\n\nAll of the scripts below are designed to analyze Spark event logs and print out various statistics.\nTo produce a log from your Spark application, the following configuration options need to be set:\n\n    spark.eventLog.enabled true\n    spark.eventLog.dir hdfs://namenode/shared/spark-logs\n\nEach script takes a log file name as an argument.\nBelow is a list of all the scripts and the data they produce as output.\n\n## extract-cache-levels.py\n\nProduces JSON output in a format suitable for feeding back into Spark to modify the storage level of each partition of an RDD during each stage of execution.\nNote that this script requires a [modified version of Spark](https://github.com/michaelmior/spark/tree/track-rdd-size).\n\n## graph-job.py\n\nProduces a graph of the application in 
[DOT](http://www.graphviz.org/content/dot-language) format which can be fed to Graphviz to produce a visualization of the Spark application.\n\n## rdd-sizes.py\n\nNote that this script requires a log produced by a [modified version of Spark](https://github.com/michaelmior/spark/tree/track-rdd-size).\n\n| Column | Description |\n| --- | --- |\n| RDD ID | ID of the RDD whose partition size was estimated |\n| Partition | partition corresponding to the size estimate |\n| Estimated Size | average estimated size for the partition |\n\n## rdd-summary.py\n\n| Column | Description |\n| --- | --- |\n| RDD ID | ID of the RDD |\n| Name | name for the RDD defined in the application |\n| Storage Level | persistence level for this RDD (does not account for changes) |\n| Callsite | information on where this RDD was created |\n\n## uses-caching.py\n\n| Column | Description |\n| --- | --- |\n| Job ID | ID of a particular job |\n| Runtime | total runtime of the job |\n| Cached Partitions | partitions read from the cache |\n| Cacheable Partitions | partitions which could have been cached because they were previously computed |\n| Annotated Partitions | partitions which were annotated for caching and so should be in the cache (unless evicted) |\n| Total Partitions | total number of partitions for all RDDs in the job |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaelmior%2Fspark-log-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichaelmior%2Fspark-log-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaelmior%2Fspark-log-analysis/lists"}