{"id":18335827,"url":"https://github.com/epickrram/perf-workshop","last_synced_at":"2025-07-18T08:04:59.044Z","repository":{"id":33032645,"uuid":"36668281","full_name":"epickrram/perf-workshop","owner":"epickrram","description":"Tutorial on reducing Linux scheduler jitter","archived":false,"fork":false,"pushed_at":"2018-08-20T13:35:31.000Z","size":521,"stargazers_count":124,"open_issues_count":0,"forks_count":23,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-06T04:34:41.131Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://epickrram.blogspot.co.uk/2015/09/reducing-system-jitter.html","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epickrram.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-06-01T15:02:11.000Z","updated_at":"2024-12-13T14:37:03.000Z","dependencies_parsed_at":"2022-08-17T21:15:31.537Z","dependency_job_id":null,"html_url":"https://github.com/epickrram/perf-workshop","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/epickrram/perf-workshop","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epickrram%2Fperf-workshop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epickrram%2Fperf-workshop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epickrram%2Fperf-workshop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epickrram%2Fperf-workshop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/epickrram","download_url":"https://codeload
.github.com/epickrram/perf-workshop/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epickrram%2Fperf-workshop/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265724711,"owners_count":23817860,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T20:05:04.429Z","updated_at":"2025-07-18T08:04:59.020Z","avatar_url":"https://github.com/epickrram.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"System Jitter Utility\n=====================\n\nA test program for exploring causes of jitter.\n\n\nThe application\n===============\n\n![Application diagram](doc/application.png)\n\nThe application consists of 3 threads:\n\n1. The producer thread - responsible for reading data from a memory-mapped file, inserting a timestamp, and publishing messages onto a queue (an instance of the [Disruptor](https://github.com/LMAX-Exchange/disruptor)).\n2. The accumulator (logic in the above diagram) thread - records a timestamp when it pulls a message from the queue, and stores the queue transit latency in a histogram.\n3. The journaller thread - records a timestamp when it pulls a message from the queue, and writes an entry to a journal containing the queue transit latency.\n\nAll timestamps are generated by calling `System.nanoTime()`.\n\nThe producer thread will busy-spin for ten microseconds between each publication. 
Consumer threads are busy-waiting on the head of the queue for messages to arrive from the producer.\n\nThe application is garbage-free, and [guaranteed safepoints](https://epickrram.blogspot.co.uk/2015/08/jvm-guaranteed-safepoints.html) are disabled, so there should be no jitter introduced by the JVM itself.\n\nFour latencies are recorded:\n\n1. Queue transit time for accumulator thread\n2. Queue transit time for journaller thread\n3. Inter-message time for accumulator thread\n4. Inter-message time for journaller thread\n\nOn system exit, full histograms of these values are generated for post-processing (placed in `/tmp/` by default).\n\n\nRequirements\n============\n\n1. JDK 8+\n\n\nTools to install\n================\n\nInstall the following tools in order to work through the exercises:\n\n1. gnuplot\n2. perf\n3. hwloc\n4. trace-cmd\n5. powertop\n\n\nUsing\n=====\n\n1. Clone this git repository\n2. Build the library: `./gradlew bundleJar`\n3. Run it: `cd src/main/shell \u0026\u0026 bash ./run_test.sh BASELINE`\n\n\nOutput\n======\n\nThe `run_test.sh` script will run the application, using 'BASELINE' as a label. 
At exit, the application will print out a number of latency histograms.\n\nBelow is an excerpt of the output containing the histogram of latencies recorded between the producer thread and the accumulator thread.\n\n\n    == Accumulator Message Transit Latency (ns) ==\n    mean                   60879\n    min                       76\n    50.00%                   168\n    90.00%                   256\n    99.00%               2228239\n    99.90%               8126495\n    99.99%              10485823\n    99.999%             11534399\n    99.9999%            11534399\n    max                 11534399\n    count                3595101\n\n\nSo for this run, roughly 3.6 million messages were passed through the queue, the mean latency was around 60 microseconds, \nmin latency was 76 nanoseconds, and the max latency was over 11 milliseconds.\n\nThese numbers can be plotted on a chart using the following command, executed from `src/main/shell`:\n\n`bash ./chart_accumulator_message_transit_latency.sh`\n\nand viewed with the following command:\n\n`gnuplot ./accumulator_message_transit_latency.cmd`\n\nproducing something that looks like this chart:\n\n![Baseline chart](doc/baseline-chart.png)\n\n\nWhy so slow?\n============\n\nFrom these first results, we can see that at the 99th percentile, inter-thread latency was over 2 milliseconds, \nmeaning that 1 in 100 messages took 2ms or longer to transit between two threads.\n\nSince no other work is being done by this program, the workload is constant, and there are no runtime pauses, \nwhere is this jitter coming from?\n\nBelow is a series of steps working through some causes of system jitter on a modern Linux kernel \n(my laptop is running Fedora 22 on kernel 4.0.4).  
Most of these techniques have been tested on a 3.18 kernel; \nolder versions may not have the same features/capabilities.\n\n\nCPU speed\n=========\n\nModern CPUs (especially on laptops) are designed to be power efficient; this means that the OS will typically try \nto scale down the clock rate when there is no activity. On Intel CPUs, this is partially handled using power-states,\nwhich allow the OS to reduce CPU frequency, meaning less power draw and less thermal overhead.\n\nOn current kernels, this is handled by the CPU scaling governor. You can check your current setting by looking in the file:\n\n`/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`\n\nOn my laptop, this is set to `powersave` mode. To see available governors:\n\n`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors`\n\nwhich tells me that I have two choices:\n\n1. performance\n2. powersave\n\nBefore making a change, though, let's make sure that powersave is actually causing us issues.\n\nTo do this, we can use [perf_events](https://perf.wiki.kernel.org/index.php/Main_Page) \nto monitor the CPU's P-state while the application is running:\n \n`perf record -e \"power:cpu_frequency\" -a`\n \nThis command will sample the `cpu_frequency` trace point written to by the Intel cpufreq driver on all CPUs. 
This information comes\nfrom an MSR on the chip which holds the FSB speed.\n\nHit Ctrl+c once the application has finished running, then use `perf script` to view the available output.\n\nFiltering entries to include only those samples taken when `java` was executing shows some variation in the reported frequency:\n\n    java  2804 [003]  3327.796741: power:cpu_frequency: state=1500000 cpu_id=3\n    java  2804 [003]  3328.089969: power:cpu_frequency: state=3000000 cpu_id=3\n    java  2804 [003]  3328.139009: power:cpu_frequency: state=2500000 cpu_id=3\n    java  2804 [003]  3328.204063: power:cpu_frequency: state=1000000 cpu_id=3\n\n\nSet the scaling governor to performance mode to reduce this:\n\n`sudo bash ./set_cpu_governor.sh performance`\n\nand re-run the test while using `perf` to record `cpu_frequency` events. If the change has taken effect, \nthere should be no events output by `perf script`.\n\n\nRunning the test again with the performance governor enabled produces better results for inter-thread latency:\n\n    == Accumulator Message Transit Latency (ns) ==\n    mean                   23882\n    min                       84\n    50.00%                   152\n    90.00%                   208\n    99.00%                589827\n    99.90%               4456479\n    99.99%               7340063\n    99.999%              7864351\n    99.9999%             8126495\n    max                  8126495\n    count                3595101\n\n\nThough there is still a max latency of 8ms, it has been reduced from the previous value of 11ms.\n\nThe effect is clearly visible when added to the chart. 
To add the new data, go through the steps followed earlier:\n\n`bash ./chart_accumulator_message_transit_latency.sh`\n\n`gnuplot ./accumulator_message_transit_latency.cmd`\n\n![Performance governor chart](doc/performance-chart.png)\n\n\nProcess migration\n=================\n\nAnother possible cause of jitter is the OS scheduler moving processes around as different tasks \nbecome runnable. The important threads in the application are at the mercy of the scheduler, which can, at any time,\ndecide to run another process on the current CPU. When this happens, the running thread's context will be saved, and it will\nbe shifted back into the scheduler's run-queue (or possibly migrated to another CPU entirely).\n\nTo find out whether this is happening to the threads in our application, we can turn to `perf` again and sample trace events\nemitted by the scheduler. First, record the PIDs of the two important threads from the application (producer and accumulator):\n\n    Starting replay at Thu Sep 24 14:17:31 BST 2015\n    Accumulator thread has pid: 11372\n    Journaller thread has pid: 11371\n    Producer thread has pid: 11370\n    Warm-up complete at Thu Sep 24 14:17:35 BST 2015\n    Pausing for 10 seconds...\n\nOnce warm-up has completed, record the scheduler events for the specific PIDs of interest:\n\n`perf record -e \"sched:sched_stat_runtime\" -t 11370 -t 11372`\n\nThis command will record the events that the scheduler emits when it updates a task's runtime statistics. The recording session will exit once those processes complete. Running `perf script` again will show the captured events:\n\n`java 11372 [001]  3055.140623: sched:sched_stat_runtime: comm=java pid=11372 runtime=1000825 [ns] vruntime=81510486145 [ns]`\n\nThe line above shows, among other things, which CPU the process was executing on when stats were updated. In this case, the process was running on CPU 001. 
A bit of sorting and counting will show exactly how the process was moved around the available CPUs during its lifetime:\n\n`perf script | grep \"java 11372\" | awk '{print $3}' | sort | uniq -c`\n\n    16071 [000]\n    10858 [001]\n     5778 [002]\n     7230 [003]\n\n\nSo this thread mostly ran on CPUs 0 and 1, but also spent some time on CPUs 2 and 3. Moving the process around requires a context switch and causes cache invalidation effects. While these are unlikely to be the sources of maximum latency, in order to start improving the worst case, it will be necessary to stop migration of these processes.\n\nThe application allows the user to select a target CPU for any of the three processing threads via a config file `/tmp/perf-workshop.properties`. Edit the file, and select two different CPUs for the producer and accumulator threads:\n\n    perf.workshop.affinity.producer=1\n    perf.workshop.affinity.accumulator=3\n\n\nRe-running the test shows a large improvement:\n\n\n![Pinned threads chart](doc/pinned-thread-chart.png)\n\n\nThis result implies that forcing each thread to run on a single CPU can help reduce inter-thread latency. Whether this is down to the scheduler making better decisions about where to run other processes, or simply to there being less context switching, is not clear.\n\nOne thing to look out for is the fact that we have not stopped the scheduler from running other tasks on those CPUs. 
We are still seeing multi-millisecond delays in message passing, and this could be down to other processes being run on the CPU that the application thread has been restricted to.\n\nReturning to `perf` and this time capturing all `sched_stat_runtime` events for a specific CPU (in this case 1) will show what other processes are being scheduled while the application is running:\n\n`perf record -e \"sched:sched_stat_runtime\" -C 1`\n\nStripping out everything but the process name, and counting occurrences in the event trace shows that while the java application was running most of the time, there are plenty of other processes that were scheduled during the application's execution time:\n\n    45514 java\n       60 kworker/1:2\n       26 irq/39-DLL0665:\n       24 rngd\n       15 rcu_sched\n        9 gmain\n        8 goa-daemon\n        7 chrome\n        6 ksoftirqd/1\n        5 rtkit-daemon\n\n\nCPU Isolation\n=============\n\nAt this point, it's time to remove the target CPUs from the OS's scheduling domain. This can be done with the `isolcpus` boot parameter (i.e. add `isolcpus=1,3` to `grub.conf`), or by using the `cset` command from the `cpuset` package.\n\nIn this case, I'm using `isolcpus` to stop the scheduler from running other userland processes on CPUs 1 \u0026 3. 
The difference in inter-thread latency is dramatic:\n\n\n    == Accumulator Message Transit Latency (ns) ==\n    mean                     144\n    min                       84\n    50.00%                   144\n    90.00%                   160\n    99.00%                   208\n    99.90%                   512\n    99.99%                  2432\n    99.999%                 3584\n    99.9999%               11776\n    max                    14848\n    count                3595101\n\n\nThe difference is so great that it's necessary to use a log scale for the y-axis of the chart.\n\n\n![Isolated CPUs](doc/isolcpus-chart-log-scale.png)\n\n\nNote that the difference will not be so great on a server-class machine with lots of spare processing power. The effect here is magnified by the fact that the OS has only 4 CPUs (on my laptop) to work with and is running a desktop distribution of Linux, so there is much more scheduling pressure than would be present on a server-class machine.\n\nUsing `perf` once again to confirm that other processes are not running on the reserved CPUs shows that there is still some contention to deal with:\n\n    81130 java\n        2 ksoftirqd/1\n       43 kworker/1:0\n        1 kworker/1:1H\n        2 kworker/3:1\n        1 kworker/3:1H\n       11 swapper\n\nThe processes starting with 'k' are kernel threads that deal with housekeeping tasks on behalf of the OS; 'swapper' is the Linux idle process, which is scheduled whenever there is no work to be executed on a CPU.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepickrram%2Fperf-workshop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepickrram%2Fperf-workshop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepickrram%2Fperf-workshop/lists"}