{"id":22821965,"url":"https://github.com/darule0/sparkdiff","last_synced_at":"2026-05-05T05:33:56.223Z","repository":{"id":216172428,"uuid":"422673163","full_name":"darule0/sparkdiff","owner":"darule0","description":"A rudimentary command line utility for contrasting Apache Spark event logs.","archived":false,"fork":false,"pushed_at":"2024-01-08T19:06:12.000Z","size":720,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-30T23:41:15.408Z","etag":null,"topics":["apache-spark","compare-files","diff","difference","diffing","spark","spark-sql","spark-streaming","sparksql"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/darule0.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-29T18:15:52.000Z","updated_at":"2024-05-04T01:46:24.000Z","dependencies_parsed_at":null,"dependency_job_id":"d34eacf5-57de-4399-a6c9-b199ad2ac284","html_url":"https://github.com/darule0/sparkdiff","commit_stats":null,"previous_names":["darule0/sparkdiff"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/darule0/sparkdiff","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darule0%2Fsparkdiff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darule0%2Fsparkdiff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darule0%2Fsparkdiff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/darule0%2Fsparkdiff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/darule0","download_url":"https://codeload.github.com/darule0/sparkdiff/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/darule0%2Fsparkdiff/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32637151,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-04T10:08:07.713Z","status":"online","status_checked_at":"2026-05-05T02:00:06.033Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","compare-files","diff","difference","diffing","spark","spark-sql","spark-streaming","sparksql"],"created_at":"2024-12-12T16:09:58.936Z","updated_at":"2026-05-05T05:33:56.217Z","avatar_url":"https://github.com/darule0.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sparkdiff \u003cevent log 1\u003e \u003cevent log 2\u003e\nA rudimentary command line utility for contrasting Apache Spark event logs.\n\n## Motivation\nI have been troubleshooting Apache Spark application issues full-time since around 2015. When a spark application slows down or stops working, I try to find out more information such as: Did the inputs change? 
Did the configuration change?\n\nSpark logs from two runs of the same application cannot be usefully contrasted with a general purpose diff tool, because it would detect thousands of changes which are not useful for troubleshooting.\n\nI have decided to automate this part of my job in the form of a bash script which examines spark logs and identifies the differences which I find useful when troubleshooting spark application performance and functionality problems.\n\n## Description\nsparkdiff is a Linux command line utility which contrasts spark logs from two runs of a spark application and displays the log entries which have changed and which I find helpful when troubleshooting spark application performance and/or functionality problems.\n\nFor example, if a spark application has been running without problems for years and then suddenly slows down or stops working, I pass in the logs from a known working run and the logs from the run which had problems. With a little luck, the sparkdiff output helps guide me towards the root cause and solution.\n\n## Online Installation w/ CI\n```console\n\nmkdir ~/bin\nchmod u+rx ~/bin\nwget -O ~/bin/sparkdiff https://github.com/darule0/sparkdiff/blob/main/sparkdiff?raw=true\nchmod u+rx ~/bin/sparkdiff\nsource ~/.profile\n\n```\n\n\n\n## Offline Installation w/o CI\n```console\n\nsudo mkdir /opt/sparkdiff\nsudo chmod o+rx /opt/sparkdiff\nsudo git clone https://github.com/darule0/sparkdiff.git /opt/sparkdiff\nsudo chmod o+rx /opt/sparkdiff/sparkdiff.sh\nsudo ln -s /opt/sparkdiff/sparkdiff.sh /usr/bin/sparkdiff\n\n```\n\n## How to obtain event logs for a spark application run?\nEach time a spark application is run, the console output will include an application id.\nThe application id can be used to locate the event logs in HDFS. 
In the spark configuration,\nspark.eventLog.dir specifies where in HDFS event logs are stored.\n```console\nhdfs dfs -get /user/spark/applicationHistory/*\u003capplication id\u003e*\n```\n\n## Tutorial\n```console\n\n# install sparkdiff w/ CI\nmkdir ~/bin\nchmod u+rx ~/bin\nwget -O ~/bin/sparkdiff https://github.com/darule0/sparkdiff/blob/main/sparkdiff?raw=true\nchmod u+rx ~/bin/sparkdiff\n\n# display sparkdiff usage\nsparkdiff\n\n# contrast event logs from two runs of the same spark application\nsparkdiff event_log_1 event_log_2\n\n```\n\n![alt text](https://raw.githubusercontent.com/darule0/sparkdiff/main/sparkdiff.png)\n\n## Directories Used\n| directory | purpose |\n| :--- | :--- |\n| $HOME/.sparkdiff.dd4b66ed-a43d-48ec-8e32-1b901bc8ea8e | The latest sparkdiff is automatically downloaded here when using Online Installation w/ CI. |\n| $HOME/.sparkdiff | Intermediate data for sparkdiff processing. |\n\n## Event Log Parsing Logic\n| array configurations which may be considered - all occurrences |\n| :--- |\n| App Attempt ID |\n\n| scalar configurations which may be considered - first occurrence |\n| :--- |\n| Java Home |\n| Java Version |\n| Scala Version |\n| Maximum Onheap Memory |\n| spark.acls.enable |\n| spark.app.id |\n| spark.app.name |\n| spark.blacklist.application.maxFailedExecutorsPerNode |\n| spark.blacklist.application.maxFailedTasksPerExecutor |\n| spark.blacklist.enabled |\n| spark.blacklist.killBlacklistedExecutors |\n| spark.blacklist.stage.maxFailedExecutorsPerNode |\n| spark.blacklist.stage.maxFailedTasksPerExecutor |\n| spark.blacklist.task.maxTaskAttemptsPerExecutor |\n| spark.blacklist.task.maxTaskAttemptsPerNode |\n| spark.blacklist.timeout |\n| spark.driver.extraJavaOptions |\n| spark.driver.extraLibraryPath |\n| spark.driver.maxResultSize |\n| spark.driver.memory |\n| spark.dynamicAllocation.enabled |\n| spark.dynamicAllocation.maxExecutors |\n| spark.eventLog.dir |\n| spark.eventLog.enabled |\n| spark.executor.cores |\n| 
spark.executor.extraJavaOptions |\n| spark.executor.extraLibraryPath |\n| spark.executor.heartbeatInterval |\n| spark.executor.id |\n| spark.yarn.executor.memoryOverhead |\n| spark.yarn.driver.memoryOverhead |\n| spark.yarn.am.memoryOverhead |\n| spark.executor.instances |\n| spark.dynamicAllocation.minExecutors |\n| spark.dynamicAllocation.initialExecutors |\n| spark.dynamicAllocation.schedulerBacklogTimeout |\n| spark.yarn.scheduler.heartbeat.interval-ms |\n| spark.streaming.backpressure.enabled |\n| spark.streaming.blockInterval |\n| spark.streaming.backpressure.initialRate |\n| spark.streaming.receiver.maxRate |\n| spark.streaming.kafka.maxRatePerPartition |\n| spark.executor.memory |\n| spark.executorEnv.HADOOP_NODE_JDK_HOME |\n| spark.executorEnv.IFCONTENTMASTER_HOME |\n| spark.executorEnv.IMF_CPP_RESOURCE_PATH |\n| spark.executorEnv.INFA_HADOOP_DIST_DIR |\n| spark.executorEnv.INFA_JAVA_BIN |\n| spark.executorEnv.INFA_MAPRED_OSGI_CONFIG |\n| eclipse.stateSaveDelayInterval |\n| spark.executorEnv.INFA_PLUGINS_HOME |\n| spark.executorEnv.INFA_RESOURCES |\n| spark.executorEnv.JAVA_HOME |\n| spark.executorEnv.LD_LIBRARY_PATH |\n| spark.executorEnv.NLS_LANG |\n| spark.executorEnv.ODBCINI |\n| spark.executorEnv.ODBC_HOME |\n| spark.executorEnv.ORACLE_HOME |\n| spark.executorEnv.PATH |\n| spark.executorEnv.TNS_ADMIN |\n| spark.executorEnv.USE_DISTINCT_OSGI_DIR_PER_PROXY_USER |\n| spark.hadoop.avro.mapred.ignore.inputs.without.extension |\n| spark.hadoop.fs.file.impl.disable.cache |\n| spark.hadoop.fs.hdfs.impl.disable.cache |\n| spark.hadoop.fs.s3.impl.disable.cache |\n| spark.hadoop.fs.s3a.impl.disable.cache |\n| spark.hadoop.fs.s3n.impl.disable.cache |\n| spark.hbase.connector.security.credentials.enabled |\n| spark.infa.context.taskname |\n| spark.infa.jobrecoveryenabled |\n| spark.infa.port |\n| spark.kryoserializer.buffer.max |\n| spark.master |\n| spark.network.timeout |\n| 
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS |\n| spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.RM_HA_URLS |\n| spark.scheduler.maxRegisteredResourcesWaitingTime |\n| spark.scheduler.minRegisteredResourcesRatio |\n| spark.scheduler.mode |\n| spark.serializer |\n| spark.shuffle.consolidateFiles |\n| spark.shuffle.service.enabled |\n| spark.shuffle.service.port |\n| spark.sql.autoBroadcastJoinThreshold |\n| spark.sql.broadcastTimeout |\n| spark.sql.catalogImplementation |\n| spark.sql.constraintPropagation.enabled |\n| spark.sql.crossJoin.enabled |\n| spark.sql.hive.metastore.sharedPrefixes |\n| spark.sql.retainGroupColumns |\n| spark.sql.shuffle.partitions |\n| spark.sql.statistics.fallBackToHdfs |\n| spark.sql.statistics.partitionPruner |\n| spark.submit.deployMode |\n| spark.ui.filters |\n| spark.ui.port |\n| spark.ui.view.acls.groups |\n| spark.yarn.appMasterEnv.HADOOP_NODE_JDK_HOME |\n| spark.yarn.appMasterEnv.IFCONTENTMASTER_HOME |\n| spark.yarn.appMasterEnv.IMF_CPP_RESOURCE_PATH |\n| spark.yarn.appMasterEnv.INFA_HADOOP_DIST_DIR |\n| spark.yarn.appMasterEnv.INFA_HADOOP_SPARK_LIB |\n| spark.yarn.appMasterEnv.INFA_HOME |\n| spark.yarn.appMasterEnv.INFA_JAVA_BIN |\n| spark.yarn.appMasterEnv.INFA_MAPRED_OSGI_CONFIG |\n| spark.yarn.appMasterEnv.INFA_PLUGINS_HOME |\n| spark.yarn.appMasterEnv.INFA_RESOURCES |\n| spark.yarn.appMasterEnv.INFA_SPARK_APP_CLASS_NAME |\n| spark.yarn.appMasterEnv.INFA_SPARK_CACHE_LIFETIME |\n| spark.yarn.appMasterEnv.INFA_SPARK_CACHE_SIZE |\n| spark.yarn.appMasterEnv.INFA_SPARK_DIST_LIB |\n| spark.yarn.appMasterEnv.INFA_SPARK_ENABLE_HIVE |\n| spark.yarn.appMasterEnv.INFA_SPARK_SCALA_VERSION |\n| spark.yarn.appMasterEnv.JAVA_HOME |\n| spark.yarn.appMasterEnv.NLS_LANG |\n| spark.yarn.appMasterEnv.ODBCINI |\n| spark.yarn.appMasterEnv.ODBC_HOME |\n| spark.yarn.appMasterEnv.ORACLE_HOME |\n| spark.yarn.appMasterEnv.TNS_ADMIN |\n| 
spark.yarn.appMasterEnv.USE_DISTINCT_OSGI_DIR_PER_PROXY_USER |\n| spark.yarn.principal |\n| spark.yarn.maxAppAttempts |\n| spark.yarn.proxy-user |\n| spark.yarn.queue |\n| spark.yarn.security.credentials.hbase.enabled |\n| spark.yarn.security.tokens.hbase.enabled |\n| spark.yarn.stagingDir |\n| spark.yarn.submit.waitAppCompletion |\n| awt.toolkit |\n| file.encoding |\n| file.encoding.pkg |\n| file.separator |\n| java.awt.graphicsenv |\n| java.awt.printerjob |\n| java.class.version |\n| java.endorsed.dirs |\n| java.ext.dirs |\n| java.home |\n| java.runtime.name |\n| java.runtime.version |\n| java.security.egd |\n| java.specification.name |\n| java.specification.vendor |\n| java.specification.version |\n| java.vendor |\n| java.vendor.url |\n| java.vendor.url.bug |\n| java.version |\n| java.vm.info |\n| java.vm.name |\n| java.vm.specification.name |\n| java.vm.specification.vendor |\n| java.vm.specification.version |\n| java.vm.vendor |\n| java.vm.version |\n| jceks.key.serialFilter |\n| jetty.git.hash |\n| line.separator |\n| log4j.configuration |\n| os.arch |\n| os.name |\n| os.version |\n| path.separator |\n| sun.arch.data.model |\n| sun.boot.library.path |\n| sun.cpu.endian |\n| sun.cpu.isalist |\n| sun.io.unicode.encoding |\n| sun.java.launcher |\n| sun.jnu.encoding |\n| sun.management.compiler |\n| sun.nio.ch.bugLevel |\n| sun.os.patch.level |\n| user.country |\n| user.home |\n| user.language |\n| user.name |\n| user.timezone |\n\n| array configurations which may be md5 hashed and considered - first occurrence |\n| :--- |\n| spark.yarn.appMasterEnv.INFA_MAPRED_CLASSPATH |\n| spark.yarn.appMasterEnv.LD_LIBRARY_PATH |\n| spark.yarn.appMasterEnv.PATH |\n| java.library.path |\n| sun.boot.class.path |\n\n| array configurations which may be tokenized and considered - first occurrence |\n| :--- |\n| spark.executor.extraClassPath |\n| spark.driver.extraClassPath |\n\n| inputs which may be considered - all occurrences summed |\n| :--- |\n| scrap = Last 159 characters of events 
that contain Input Metrics |\n| Input Metrics scrap -\u003e Bytes Read -\u003e BytesRead.integer |\n| Input Metrics scrap -\u003e Bytes Read -\u003e BytesRead.iec (human readable) |\n| Input Metrics scrap -\u003e Records Read -\u003e RecordsRead.integer |\n\n| configurations which may be excluded |\n| :--- |\n| spark.driver.host |\n| spark.driver.port |\n| spark.executorEnv.SESS_TEMP_WORKING_DIR |\n| spark.infa.context.executionid |\n| spark.infa.context.wfrunid |\n| spark.infa.host |\n| spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES |\n| spark.yarn.app.container.log.dir |\n| spark.yarn.app.id |\n| spark.yarn.appMasterEnv.SESS_TEMP_WORKING_DIR |\n| spark.yarn.dist.archives |\n| spark.yarn.dist.files |\n| spark.yarn.keytab |\n| java.io.tmpdir |\n| sun.java.command |\n| user.dir |\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdarule0%2Fsparkdiff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdarule0%2Fsparkdiff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdarule0%2Fsparkdiff/lists"}