https://github.com/michaelmior/spark-log-analysis
- Host: GitHub
- URL: https://github.com/michaelmior/spark-log-analysis
- Owner: michaelmior
- Created: 2017-10-04T14:32:34.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-10-11T17:59:47.000Z (over 7 years ago)
- Last Synced: 2024-12-31T12:47:08.458Z (4 months ago)
- Language: Python
- Size: 4.88 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Spark log analysis
All of the scripts below are designed to analyze Spark event logs and print out various statistics.
To produce a log from your Spark application, the following configuration options need to be set:

    spark.eventLog.enabled true
    spark.eventLog.dir hdfs://namenode/shared/spark-logs

Each script takes a log file name as an argument.
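
As a rough illustration of what such a log contains: stock Spark writes the event log as one JSON object per line, each carrying an `Event` field that names the listener event. The sketch below counts events by type. It is not one of the scripts in this repository, and the function name `count_events` is made up for the example.

```python
import json
from collections import Counter


def count_events(log_path):
    """Count how often each event type appears in a Spark event log.

    Assumes the log is uncompressed, newline-delimited JSON in which
    every record has an "Event" field (the stock Spark layout).
    """
    counts = Counter()
    with open(log_path) as log_file:
        for line in log_file:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Records without an "Event" field are grouped under a placeholder.
            counts[record.get("Event", "<unknown>")] += 1
    return counts
```

Calling `count_events` with the path to a log file returns a `Counter`, so `most_common()` lists the most frequent event types first.
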
Below is a list of all the scripts and the data they produce as output.

## extract-cache-levels.py
Produces JSON output in a format suitable for feeding back into Spark to modify the storage level of each partition of an RDD during each stage of execution.
Note that this script requires a [modified version of Spark](https://github.com/michaelmior/spark/tree/track-rdd-size).

## graph-job.py
Produces a graph of the application in [DOT](http://www.graphviz.org/content/dot-language) format, which can be fed to Graphviz to produce a visualization.
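
For readers unfamiliar with DOT, the following sketch shows the general shape of a directed graph in that format. It does not reproduce what `graph-job.py` actually emits: the helper `to_dot`, the graph name, and the node labels are all invented for illustration.

```python
def to_dot(edges, name="app"):
    """Render (parent, child) pairs as a DOT digraph.

    The edge list is hypothetical; the real script derives its
    graph from the event log.
    """
    lines = [f"digraph {name} {{"]
    for parent, child in edges:
        # Quote node names so labels containing spaces stay valid DOT.
        lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Made-up dependency chain between three RDDs.
example_edges = [("RDD 0", "RDD 1"), ("RDD 1", "RDD 2")]
print(to_dot(example_edges))
```

Saved to a file such as `app.dot` (a placeholder name), DOT text of this kind can be rendered with Graphviz, for example with `dot -Tpng app.dot -o app.png`.
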
## rdd-sizes.py
Note that this script requires a log produced by a [modified version of Spark](https://github.com/michaelmior/spark/tree/track-rdd-size).
| Column | Description |
| --- | --- |
| RDD ID | ID of the RDD whose partition size is estimated |
| Partition | partition corresponding to the size estimate |
| Estimated Size | average estimated size for the partition |

## rdd-summary.py
| Column | Description |
| --- | --- |
| RDD ID | ID of the RDD |
| Name | name for the RDD defined in the application |
| Storage Level | persistence level for this RDD (does not account for changes) |
| Callsite | information on where this RDD was created |

## uses-caching.py
| Column | Description |
| --- | --- |
| Job ID | ID of a particular job |
| Runtime | total runtime of the job |
| Cached Partitions | partitions read from the cache |
| Cacheable Partitions | partitions which could have been cached because they were previously computed |
| Annotated Partitions | partitions which were annotated for caching that should be in the cache (unless evicted) |
| Total Partitions | total number of partitions for all RDDs in the job |
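
One way to read these columns together is to compare the partitions actually read from the cache against those that could have been. The sketch below computes that ratio for a single row. It is an illustrative derivation and not part of `uses-caching.py`: the helper name and the row values are made up, and it assumes that cached partitions are a subset of cacheable ones.

```python
def cache_hit_ratio(cached, cacheable):
    """Fraction of cacheable partitions that were read from the cache.

    Returns None when no partition was cacheable, since the ratio
    is undefined in that case.
    """
    if cacheable == 0:
        return None
    return cached / cacheable


# Hypothetical row using the column names from the table above.
row = {"Job ID": 3, "Cached Partitions": 6, "Cacheable Partitions": 8}
ratio = cache_hit_ratio(row["Cached Partitions"], row["Cacheable Partitions"])
print(f"job {row['Job ID']}: {ratio:.0%} of cacheable partitions read from cache")
```

A low ratio for a long-running job would suggest that previously computed partitions are being recomputed rather than reused.
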