{"id":19176997,"url":"https://github.com/joomcode/trace-analysis","last_synced_at":"2025-05-07T20:07:48.163Z","repository":{"id":37448575,"uuid":"485397711","full_name":"joomcode/trace-analysis","owner":"joomcode","description":"Library for performance bottleneck detection and optimization efficiency prediction","archived":false,"fork":false,"pushed_at":"2022-07-27T07:42:18.000Z","size":209,"stargazers_count":38,"open_issues_count":0,"forks_count":2,"subscribers_count":52,"default_branch":"main","last_synced_at":"2025-05-07T20:07:35.314Z","etag":null,"topics":["jaeger","opentracing","optimization","performance","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joomcode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-04-25T14:08:36.000Z","updated_at":"2025-02-07T10:39:07.000Z","dependencies_parsed_at":"2022-08-02T12:07:09.161Z","dependency_job_id":null,"html_url":"https://github.com/joomcode/trace-analysis","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joomcode%2Ftrace-analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joomcode%2Ftrace-analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joomcode%2Ftrace-analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joomcode%2Ftrace-analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joomcode","download_url":"https://codeload.github.com/joomcode/tr
ace-analysis/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252949271,"owners_count":21830151,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jaeger","opentracing","optimization","performance","spark"],"created_at":"2024-11-09T10:31:33.726Z","updated_at":"2025-05-07T20:07:48.136Z","avatar_url":"https://github.com/joomcode.png","language":"Scala","readme":"# trace-analysis\n\n## Highlights\n`trace-analysis` is a library for performance bottleneck detection and optimization efficiency prediction.\n\nGiven a dataframe of [OpenTracing](https://github.com/opentracing/specification/blob/master/specification.md)-compatible spans, trace-analysis calculates the latency distribution among all\nencountered spans.\n\ntrace-analysis also allows you to simulate optimization effects on historical traces, so you can estimate optimization potential\nbefore implementation.\n\nOptimization simulation is a key feature of this library, but as it requires sophisticated tree-processing algorithms\n(see [Optimization Analysis Explained](#optimization-analysis-explained)), we have to deal with recursive data structures during Spark job execution.\n\n## Releases\nThe latest release is available on Maven Central. 
Currently, we support\n[Scala 2.12](https://search.maven.org/artifact/com.joom.tracing/trace-analysis_2.12)\n```\nimplementation group: 'com.joom.tracing', name: 'trace-analysis_2.12', version: '0.1.0'\n```\n\nand [Scala 2.13](https://search.maven.org/artifact/com.joom.tracing/trace-analysis_2.13).\n```\nimplementation group: 'com.joom.tracing', name: 'trace-analysis_2.13', version: '0.1.0'\n```\n\n## Introduction\n\nThe most common way to analyze Jaeger traces is through the Jaeger UI.\nBut this approach has several issues:\n- Traces may become very long, and it is difficult to detect latency dominators without special tooling.\n- At higher percentiles (p95, p99) there may be many operations with high latency (database/cache queries, third-party service requests, etc.),\n  but it is not clear how to estimate the total impact of each such operation on the latency percentile.\n- Jaeger UI does not let us analyze subtraces. We have to open the whole trace and then look for the operations of interest.\n- There is no tooling to estimate the effect of an optimization before implementing it.\n\nThis library addresses all the issues mentioned above by analyzing large corpora of Jaeger traces with Spark.\n\nUsage of this library can be separated into two steps.\n1. Hot spot analysis. At this stage we investigate which operations take most of the time.\n2. Potential optimization analysis. At this stage we simulate potential optimizations and estimate their effect\n   on historical traces.\n\n## Hot Spot Analysis\n\nFirst of all, we read the corpus of traces we want to analyze. 
For example, it may be all the traces for the\n`HTTP GET /dispatch` request.\n\nAs a test, we can read the span dump from `lib/src/test/resources/test_data.json` (spans created with [hotrod](https://github.com/jaegertracing/jaeger/blob/main/examples/hotrod/README.md))\ninto the variable `spanDF`.\n```scala\nimport org.apache.spark.sql.SparkSession\nimport com.joom.trace.analysis.spark.spanSchema\n\nval spark = SparkSession.builder()\n  .master(\"local[1]\")\n  .appName(\"Test\")\n  .getOrCreate()\n\nval spanDF = spark\n  .read\n  .schema(spanSchema)\n  .json(\"lib/src/test/resources/test_data.json\")\n```\n\nThen we define queries for particular operations inside this trace corpus. A query may target the root operation (`HTTP GET /dispatch`),\nor we may want to see the latency distribution inside heavy subtraces. Let's assume that we want to check two operations:\n- the root operation `HTTP GET /dispatch`;\n- the heavy subtrace `FindDriverIDs`.\n\nSo we create a `TraceAnalysisQuery`\n```scala\nval query = HotSpotAnalysis.TraceAnalysisQuery(\n   TraceSelector(\"HTTP GET /dispatch\"),\n   Seq(\n      OperationAnalysisQuery(\"HTTP GET /dispatch\"),\n      OperationAnalysisQuery(\"FindDriverIDs\")\n   )\n)\n```\n\nThe only thing left is to calculate the span duration distribution\n```scala\nval durations = HotSpotAnalysis.getSpanDurations(spanDF, Seq(query))\n```\n\nHere `durations` is a map of dataframes with the schema\n```\n(\n  \"operation_name\": name of the operation,\n  \"duration\": total operation duration [microseconds],\n  \"count\": total operation count\n)\n```\n\n![](resources/hot_spot_result.png)\n\nFrom here we can, for example, investigate the longest or most frequent operations.\n\n## Optimization Analysis\n\nNow that we have seen how latency is distributed among spans, we may want to optimize a particular heavy operation,\nfor example `FindDriverIDs`. But it may take weeks or even months to refactor the existing code base. 
So it would be very convenient\nif we could estimate the impact of an optimization before implementing it.\n\nEnter `OptimizationAnalysis`!\n\nBut before getting our hands dirty we should grasp the basic idea behind this optimization potential estimation.\n\n### Optimization Analysis Explained\n\nEvery OpenTracing trace is a tree-like structure with spans in the nodes.\nEach span has a name, a start time, an end time and (except for root spans) a reference to its parent.\nSo the basic idea is to artificially change the duration of spans matching some condition and recompute the resulting trace duration.\n\nBut when calculating the updated trace duration we should take into account the order of span execution (sequential/parallel).\nThere are two extreme cases:\n- Sequential execution: the total trace duration is reduced by the same absolute value as the optimized span.\n  \n  ![](resources/optimization_seq.png)\n- Parallel execution (non-critical path): the total trace duration is unchanged because our optimization does not affect the critical path.\n  \n  ![](resources/optimization_parallel.png)\n\nTo handle all these cases correctly we need information about span execution order; that's why in the code we deal with\ntrees, not with Spark Rows.\n\nNow that we have a basic understanding of the optimization analysis algorithm (it is mostly implemented inside the `SpanModifier` class),\nlet's apply it to our dataset.\n\n### Optimization Analysis Applied\n\nAs before, we need to load historical traces. 
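The sequential and parallel extremes described above can be checked on a toy example before running the real analysis. This is a standalone sketch: `Span` here is a hypothetical case class for illustration, not one of the library's types.

```scala
// Toy model of the two extreme cases: shrinking a span only shortens the
// trace when that span lies on the critical path.
// (`Span` is a hypothetical type, not part of trace-analysis.)
case class Span(name: String, durationUs: Long)

// Sequential execution: children run one after another, so the parent's
// duration is the sum of the children's durations.
def sequentialDuration(children: Seq[Span]): Long = children.map(_.durationUs).sum

// Parallel execution: children run concurrently, so the parent's duration
// is the duration of the slowest child, i.e. the critical path.
def parallelDuration(children: Seq[Span]): Long = children.map(_.durationUs).max

// Simulate a 50% optimization of the span named `target`.
def halve(children: Seq[Span], target: String): Seq[Span] =
  children.map(c => if (c.name == target) c.copy(durationUs = c.durationUs / 2) else c)

val children  = Seq(Span("FindDriverIDs", 200), Span("GetDriver", 300))
val optimized = halve(children, "FindDriverIDs")

println(sequentialDuration(children))   // 500
println(sequentialDuration(optimized))  // 400: the full 100 us saved shows up
println(parallelDuration(children))     // 300
println(parallelDuration(optimized))    // 300: FindDriverIDs was off the critical path
```

Real traces mix both patterns at every level of the tree, which is why the library walks the span tree instead of applying a single closed-form formula.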
We preprocess them with the `getTraceDataset` method and store the result in the variable `traceDS`.\n```scala\nval traceDS = getTraceDataset(spanDF)\n```\n\nSuppose we want to simulate the effect of a 50% latency decrease for the `FindDriverIDs` operation.\n```scala\nval optimization = FractionOptimization(\"FindDriverIDs\", 0.5)\n```\n\nIn particular, we are interested in p50 and p90\n```scala\nval percentiles = Seq(Percentile(\"p50\", 0.5), Percentile(\"p90\", 0.9))\n```\n\nNow we calculate the optimized durations\n```scala\nval optimizedDurations = OptimizationAnalysis.calculateOptimizedTracesDurations(\n  traceDS,\n  Seq(optimization),\n  percentiles\n)\n```\n\nHere `optimizedDurations` is a dataframe with the schema\n```\n(\n   \"optimization_name\": name of the optimization,\n   \"percentile\": percentile of the request latency,\n   \"duration\": latency [microseconds]\n)\n```\n\nThere is a special optimization named `\"none\"` which indicates the non-optimized latency.\n\n![](resources/optimization_analysis_result.png)\n\nNow we have an estimate of our optimization potential!","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoomcode%2Ftrace-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoomcode%2Ftrace-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoomcode%2Ftrace-analysis/lists"}