https://github.com/metonymic-smokey/javagc

# Object Analyzer

A pipeline to extract and analyze object lifetimes from a Java program.

The Object Analyzer runs a given Java program in a modified JVM ([AntTracks JVM](./ant-tracks-jvm/)) that collects profiling information for every allocated object, along with details of every garbage collection event. The JVM writes this information to highly compressed trace files, which are read by the [Analyzer](./ant-tracks-analyzer) and converted to Parquet files. The processed files are then used by the [analysis scripts](./analysis) to generate a few visualizations.

More details can be found in the paper titled ["Analysis of Garbage Collection Patterns to Extend Microbenchmarks for Big Data Workloads"](https://dl.acm.org/doi/10.1145/3491204.3527473).

## Usage

### Requirements

- Docker (tested on v20.10)
- bash (tested on v5.0)
- A Java 8 program, or a compiled JAR targeted for Java 8 (Java class file version 52.0 and below).
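To check the last requirement before running the pipeline, you can inspect the class file headers inside the JAR. The following is a minimal sketch, not part of this repository: the function names are made up for illustration, and it assumes that classes under `META-INF/versions/` (multi-release JAR entries, which a Java 8 JVM does not load) can be ignored.

```python
import struct
import zipfile

JAVA_8_MAJOR = 52  # class file major version corresponding to Java 8


def max_class_major_version(jar):
    """Return the highest class file major version in a JAR, or None if no classes are found.

    `jar` may be a path or a binary file-like object.
    """
    highest = None
    with zipfile.ZipFile(jar) as zf:
        for name in zf.namelist():
            # Assumption: multi-release entries are ignored by a Java 8 JVM, so skip them.
            if not name.endswith(".class") or name.startswith("META-INF/versions/"):
                continue
            with zf.open(name) as f:
                header = f.read(8)
            if len(header) < 8:
                continue
            # Class file layout: u4 magic, u2 minor_version, u2 major_version (big-endian).
            magic, _minor, major = struct.unpack(">IHH", header)
            if magic != 0xCAFEBABE:
                continue
            highest = major if highest is None else max(highest, major)
    return highest


def is_java8_compatible(jar):
    """True if the JAR contains classes and none is newer than class file version 52."""
    major = max_class_major_version(jar)
    return major is not None and major <= JAVA_8_MAJOR
```

For example, `is_java8_compatible("sample-program/build/libs/sample-program.jar")` should return `True` for a JAR built with Java 8 compatibility.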

### Running on a JAR file

If you have a JAR file that you want to analyze, use the `run.sh` script as:
```bash
./run.sh [...]
```

The outputs will be saved to a subdirectory in `./outputs/` - look for the last line in the script's output to get the full path.

The output directory contains:
- `output/data___lifetimes_._parquet`: a subdirectory with the graphs generated by the analysis.
- `data`: a subdirectory with the raw object-level data as a CSV file (this can be quite large; delete it if not required) and as a Parquet file.
- `trace_files`: a subdirectory with the trace, symbols, and class definitions files generated by the modified AntTracks JVM.
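Since the raw CSV can be large, a small cleanup helper may be handy once the Parquet file is all you need. This is a sketch under assumptions, not a script shipped with the repository: it assumes the CSV files have a `.csv` extension and sit directly inside the `data` subdirectory, and the function name is hypothetical.

```python
from pathlib import Path


def remove_raw_csvs(output_dir, dry_run=True):
    """Delete CSV files in `<output_dir>/data` and return the number of bytes they occupy.

    With dry_run=True (the default) nothing is deleted; only the size is reported.
    """
    data_dir = Path(output_dir) / "data"
    freed = 0
    # Assumption: raw data files end in ".csv" and are not nested in subdirectories.
    for csv_path in sorted(data_dir.glob("*.csv")):
        freed += csv_path.stat().st_size
        if not dry_run:
            csv_path.unlink()
    return freed
```

Call it first with the default `dry_run=True` to see how much space would be reclaimed, then with `dry_run=False` to actually delete the files. The Parquet file in the same directory is left untouched.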

### Running on generated trace files

If you have already generated a trace file (along with its symbols and class definitions files) by running the AntTracks JVM separately, you can use the `on_traces.sh` script as:
```
./on_traces.sh
```

Note: the directory containing the trace file must also contain the symbols and class definitions files with the same suffix.
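The note above can be checked up front with a small helper. This is only an illustrative sketch: the README does not state the exact file names AntTracks writes, so the name prefixes are passed in as parameters, and the prefixes used in the usage line below are hypothetical placeholders.

```python
from pathlib import Path


def missing_companions(trace_file, trace_prefix, companion_prefixes):
    """Return the names of companion files that are missing next to a trace file.

    The suffix is whatever follows `trace_prefix` in the trace file's name; each
    companion is expected in the same directory as `<prefix><suffix>`.
    """
    trace_path = Path(trace_file)
    name = trace_path.name
    if not name.startswith(trace_prefix):
        raise ValueError(f"{name!r} does not start with {trace_prefix!r}")
    suffix = name[len(trace_prefix):]
    return [
        prefix + suffix
        for prefix in companion_prefixes
        if not (trace_path.parent / (prefix + suffix)).is_file()
    ]
```

For instance, `missing_companions("traces/trace_1", "trace", ["symbols", "class_definitions"])` returns an empty list when both companion files exist in `traces/` with the `_1` suffix. Substitute the prefixes that your AntTracks JVM actually produces.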

The outputs will be saved to a subdirectory in `./outputs/` - look for the last line in the script's output to get the full path.

The output directory structure is similar to the one described in [Running on a JAR file](#running-on-a-jar-file), except that the `trace_files` subdirectory is not generated.

## Directory structure

- [`ant-tracks-jvm`](./ant-tracks-jvm/): the modified AntTracks JVM. Note: this is modified slightly from the original AntTracks JVM to support applications that run multiple JVMs concurrently. The Object Analyzer pipeline assumes that this modified AntTracks JVM is used.
- [`ant-tracks-analyzer`](./ant-tracks-analyzer/): a Java CLI application that re-uses the source code of the original AntTracks Analyzer to extract object data and lifetimes in a processable format.
- [`analysis`](./analysis): Python scripts used to generate Parquet files and visualizations from the processed CSVs generated by the Analyzer.
- [`custom-benchmarks`](./custom-benchmarks): a set of JMH-based Java micro-benchmarks made to replicate some patterns observed in Big Data benchmarks.
- [`IonutBench`](./IonutBench/): a JMH implementation of some of Ionut Balosin's [Garbage collectors benchmarks](https://ionutbalosin.com/2019/12/jvm-garbage-collectors-benchmarks-report-19-12/).
- [`sample-program`](./sample-program): a sample Java 11 benchmark program that is compiled with Java 8 compatibility. The generated JAR file (`./gradlew jar`) can be analyzed using `./run.sh $PWD/sample-program/build/libs/sample-program.jar 1000 10000` (`1000 10000` are arguments to the program).
- [`vmtrace`](./vmtrace): a JVMTI agent that tracks the allocations of all objects. It is currently not used in the pipeline because it cannot detect the death/collection of objects.

## Limitations

- Only Java 8 applications (or compiled JARs targeted for Java 8 compatibility) are supported. See the [sample program](./sample-program) for an example of how to configure Gradle to target Java 8 even if a higher Java version is used for compilation.
  - This is a limitation of the AntTracks JVM, which is a modified Java 8 JVM.
- The analysis scripts are agnostic of the data source and only expect the data in a particular format. If it's possible to get the same data through another source that supports newer JVMs (perhaps something like [vmtrace](./vmtrace)), the same lifetime analysis can be performed.

## Citing

If you find this work useful, please cite our paper: [Analysis of Garbage Collection Patterns to Extend Microbenchmarks for Big Data Workloads](https://dl.acm.org/doi/10.1145/3491204.3527473). A BibTeX entry is given below:
```bibtex
@inproceedings{10.1145/3491204.3527473,
author = {Sarnayak, Samyak S. and Ahuja, Aditi and Kesavarapu, Pranav and Naik, Aayush and Kumar V., Santhosh and Kalambur, Subramaniam},
title = {Analysis of Garbage Collection Patterns to Extend Microbenchmarks for Big Data Workloads},
year = {2022},
isbn = {9781450391597},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3491204.3527473},
doi = {10.1145/3491204.3527473},
abstract = {Java uses automatic memory allocation where the user does not have to explicitly free used memory. This is done by the garbage collector. Garbage Collection (GC) can take up a significant amount of time, especially in Big Data applications running large workloads where garbage collection can take up to 50 percent of the application's run time. Although benchmarks have been designed to trace garbage collection events, these are not specifically suited for Big Data workloads, due to their unique memory usage patterns. We have developed a free and open source pipeline to extract and analyze object-level details from any Java program including benchmarks and Big Data applications such as Hadoop. The data contains information such as lifetime, class and allocation site of every object allocated by the program. Through the analysis of this data, we propose a small set of benchmarks designed to emulate some of the patterns observed in Big Data applications. These benchmarks also allow us to experiment and compare some Java programming patterns.},
booktitle = {Companion of the 2022 ACM/SPEC International Conference on Performance Engineering},
pages = {121–128},
numpages = {8},
keywords = {big data, java, java virtual machine, garbage collection, hadoop},
location = {Beijing, China},
series = {ICPE '22}
}
```

## License

GPLv2