{"id":23902684,"url":"https://github.com/mtumilowicz/java17-mesi-false-sharing-processor-optimisations-workshop","last_synced_at":"2026-06-16T08:31:41.941Z","repository":{"id":228656474,"uuid":"758215011","full_name":"mtumilowicz/java17-mesi-false-sharing-processor-optimisations-workshop","owner":"mtumilowicz","description":"Introduction to cache coherence: false sharing, MESI protocol and vectorization","archived":false,"fork":false,"pushed_at":"2024-11-16T20:43:29.000Z","size":438,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-23T10:44:20.224Z","etag":null,"topics":["cache-coherence","cache-coherency","cache-invalidation","cache-line","cache-line-padding","false-sharing","jmh","jmh-benchmarks","mesi","mesi-protocol","multi-core-architectures","processor-architecture","simd","simd-instructions","simd-programming","vectorization","workshop","workshop-materials","writeback","writethrough"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mtumilowicz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-15T21:09:21.000Z","updated_at":"2024-11-16T20:43:32.000Z","dependencies_parsed_at":"2024-03-19T21:24:33.913Z","dependency_job_id":"a635ba7b-ce59-46f2-96a6-0575ead3e219","html_url":"https://github.com/mtumilowicz/java17-mesi-false-sharing-processor-optimisations-workshop","commit_stats":null,"previous_names":["mtumilowicz/java17-mesi-false-sharing-processor-optimisations-workshop"],"tags_coun
t":0,"template":false,"template_full_name":null,"purl":"pkg:github/mtumilowicz/java17-mesi-false-sharing-processor-optimisations-workshop","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtumilowicz%2Fjava17-mesi-false-sharing-processor-optimisations-workshop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtumilowicz%2Fjava17-mesi-false-sharing-processor-optimisations-workshop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtumilowicz%2Fjava17-mesi-false-sharing-processor-optimisations-workshop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtumilowicz%2Fjava17-mesi-false-sharing-processor-optimisations-workshop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mtumilowicz","download_url":"https://codeload.github.com/mtumilowicz/java17-mesi-false-sharing-processor-optimisations-workshop/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtumilowicz%2Fjava17-mesi-false-sharing-processor-optimisations-workshop/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34398405,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cache-coherence","c
ache-coherency","cache-invalidation","cache-line","cache-line-padding","false-sharing","jmh","jmh-benchmarks","mesi","mesi-protocol","multi-core-architectures","processor-architecture","simd","simd-instructions","simd-programming","vectorization","workshop","workshop-materials","writeback","writethrough"],"created_at":"2025-01-04T22:49:50.546Z","updated_at":"2026-06-16T08:31:41.919Z","avatar_url":"https://github.com/mtumilowicz.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://app.travis-ci.com/mtumilowicz/java17-mesi-false-sharing-processor-optimisations-workshop.svg?branch=main)](https://app.travis-ci.com/mtumilowicz/java17-mesi-false-sharing-processor-optimisations-workshop)\n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n\n# java17-mesi-false-sharing-processor-optimisations-workshop\n\n* references\n    * https://jenkov.com/tutorials/java-concurrency/false-sharing.html\n    * https://www.baeldung.com/java-false-sharing-contended\n    * [Cache Issues -- False Sharing -- Mike Bailey, Oregon State University](https://www.youtube.com/watch?v=dznxqe1Uk3E)\n    * [JDD2019: Who ate my RAM?, Jarek Pałka](https://www.youtube.com/watch?v=bJAg_23ixmY)\n    * [JDD 2018: new java.io.File(\"jdd.json\"); is this really that simple? 
by Jarek Pałka](https://www.youtube.com/watch?v=cJBfQRXMBII)\n    * [2023 - Krzysztof Ślusarski - Java vs CPU](https://www.youtube.com/watch?v=D96mSWuU-xc)\n    * https://techexpertise.medium.com/cache-coherence-problem-and-approaches-a18cdd48ee0e\n    * https://medium.com/@mallela.chandra76/cache-coherence-in-a-system-d9ba906b45f7\n    * https://www.brainkart.com/article/The-MESI-protocol_7651/\n    * https://www.geeksforgeeks.org/cache-coherence-protocols-in-multiprocessor-system/\n    * https://stackoverflow.com/questions/49983405/what-is-the-benefit-of-the-moesi-cache-coherency-protocol-over-mesi\n    * https://stackoverflow.com/questions/21126034/msi-mesi-how-can-we-get-read-miss-in-shared-state\n    * https://stackoverflow.com/questions/10058243/mesi-cache-protocol\n    * http://www.edwardbosworth.com/My5155_Slides/Chapter13/CacheCoherency.htm\n    * https://fgiesen.wordpress.com/2014/07/07/cache-coherency/\n    * https://github.com/melix/jmh-gradle-plugin\n    * https://chat.openai.com/\n    * [308. 
WJUG - Maciej Przepióra - \"Java Memory Model for Mere Mortals\" [EN]](https://www.youtube.com/watch?v=GEVGL36rLLU)\n\n## preface\n* goals of these workshops\n    * understanding modern cache architecture in a multi-core environment\n    * discussing cache write policies\n    * explaining cache coherence problems\n        * with a presentation of the best-known cache coherence protocol: MESI\n    * exemplifying cache performance hits\n        * false sharing\n        * loop order\n    * showing processor optimisations\n        * vectorization\n* workshop plan\n    * false sharing example\n    * benchmarks\n        * to trigger: `gradlew jmh`\n        * vectorization\n        * loop order\n\n## prerequisite\n* access to a cache by a processor involves one of two processes: read and write\n    * each process can have two results\n        * cache hit = processor accesses its private cache and finds the addressed item already in the cache\n        * otherwise - cache miss\n\n## cache coherence\n* in modern CPUs (almost) all memory accesses go through the cache hierarchy\n    * CPU core’s load/store (and instruction fetch) units normally can’t even access memory directly\n        * physically impossible\n          \n            ![alt text](img/architecture_overview.png)\n        * they talk to their L1 caches\n            * at this point, there are generally more cache levels involved\n            * L1 cache doesn’t talk to memory directly anymore, it talks to an L2 cache – which in turn talks to memory\n                * or maybe to an L3 cache\n        * caches are organized into “lines”, corresponding to aligned blocks of memory\n            * 32 bytes (older ARMs, 90s/early 2000s x86s/PowerPCs)\n            * 64 bytes (newer ARMs and x86s)\n            * 128 bytes (newer Power ISA machines)\n* cache coherence\n    * concern that arises in multi-core systems with distributed (per-core) caches\n    * when multiple processors are operating on the same or nearby memory locations, they may end up sharing the 
same cache line\n        * unit of granularity of a cache entry is 64 bytes (512 bits)\n            * even if you read/write 1 byte you're writing 64 bytes\n        * it’s essential to keep those overlapping caches consistent with each other\n            * benefits of multithreading can disappear if the threads are competing for the same cache line\n            * note that the problem really is that we have multiple caches, not that we have multiple cores\n                * we could solve the entire problem by sharing all caches between all cores (L1)\n                    * each cycle, the L1 picks one lucky core that gets to do a memory operation this cycle, and runs it\n                        * problem: cores now spend most of their time waiting in line for their next turn at a L1 request\n                            * processors do a lot of those, at least one for every load/store instruction\n                                * slow\n                            * solution: next best thing is to have multiple caches and then make them behave as if there was only one cache\n                                * this is what cache coherency protocols are for\n        * there are quite a few protocols to maintain the cache coherency between CPU cores\n        * problem is not unique to parallel processing systems\n            * strong resemblance to the \"lost update\" problem\n        * general approach\n            * getting read access to a cache line involves talking to the other cores\n                * might cause them to perform memory transactions\n            * writing to a cache line is a multi-step process\n                * before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents\n                    * \"Read For Ownership\" request\n            * each line in a cache is identified and referenced by a cache tag (block number)\n                * allows the determination of the 
primary memory address associated with each element in the cache\n            * each individual cache must monitor the traffic in cache tags\n                * corresponds to the blocks being read from and written to the shared primary memory\n                * done by a snooping cache (or snoopy cache, after the Peanuts comic strip)\n                    * basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores\n                        ![alt text](img/snoop_tags.png)\n                    * caches don’t just interact with the bus when they want to do a memory transaction themselves\n                        * instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing\n                        * if one cache wants to read from or write to memory on behalf of its core, all the other cores notice\n                            * that allows them to keep their caches synchronized\n                            * one core writes to a memory location =\u003e other cores know that their copies of the corresponding cache line are now stale and hence invalid\n                                * problem: write-back model\n                                    * it’s not enough to broadcast just the writes to memory when they happen\n                                        * physical write-back to memory can happen a long time after the core executed the corresponding store\n                                            * for the intervening time, the other cores and their caches might themselves try to write to the same location, causing a conflict\n                                    * if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy\n                    * memory itself is a shared resource\n                        * memory access needs to be arbitrated\n                            * only one 
cache gets to read data from, or write back to, memory in any given cycle\n                * caches do not respond to bus events immediately\n                    * reason: cache is busy doing other things (sending data to the core for example)\n                        * a bus message might not get processed in the cycle it arrives\n                    * invalidation queue\n                        * place where a bus message triggering a cache line invalidation sits for a while until the cache has time to process it\n                     \n        * why do you need the `volatile` keyword in Java if you have cache coherency?\n           * cache coherency ensures that all threads eventually see the change\n              * no guarantee about when this happens\n           * `volatile` also works at a higher level of abstraction\n              * introduces happens-before relationships around reads and writes\n* cache write policies\n    * write back\n        * write operations are usually made only to the cache\n        * main memory is only updated when the corresponding cache line is flushed from the cache\n        * results in inconsistency\n            * example: if two caches contain the same line, and the line is updated in one cache, the other cache will unknowingly have an invalid value\n        * one fundamental implementation\n            * MESI\n                * each cache line can be in one of these four distinct states: Modified, Exclusive, Shared, or Invalid\n                * key feature: delayed flush to main memory\n                    * example: when no one reads the data there is no need to write to main memory\n                        * better to write only to cache as it is much faster\n    * write through\n        * all write operations are made to main memory as well as to the cache\n        * ensures that main memory is always valid\n        * has consistency issues\n            * occur unless other caches monitor the memory traffic or receive some direct notification of the update\n 
       * two fundamental implementations\n            1. with update protocol\n                * after a write to main memory, a message with the updated data is broadcast to all processor modules in the system\n                    * each processor updates the contents of the affected cache block if this block is present in its cache\n            1. with invalidation of copies\n                * after a write to main memory, an invalidation request is broadcast through the system\n                    * all copies in other caches are invalidated\n    * write-through caches are simpler, but write-back has some advantages\n        * it can filter repeated writes to the same location\n            * if most of the cache line has changed by the time of a write-back =\u003e the cache can issue one large memory transaction instead of several small ones\n                * more efficient\n\n\n## MESI protocol\n* formal mechanism for controlling cache coherency using snooping techniques\n* most widely used cache coherence protocol\n* each line in an individual processor's cache can exist in one of the following four states\n    * (M)odified\n        * result of a successful write hit on a cache line\n            * its value is different from the value in main memory\n        * indicates that the cache line is present in the current cache only and is dirty\n            * it must be in the I state for all other cores\n        * modified line can be kept by a processor only as long as it's the only processor that has this copy\n            * on another processor's cache miss for this line, the cache still needs to write the data back to memory if the line is in the modified state\n                * processor must signal \"Dirty\" and write the data back to the shared primary memory\n                    * causing the other processor to abandon its memory fetch\n            * it is necessary to first write the line to memory and then read it from there\n                * reading cores can't directly read from the cache of the writing core\n                    * it's more expensive\n  
                  * example\n                        1. suppose both A \u0026 B share that line and B got it directly from A's cache\n                        1. C needs that line for a write\n                        1. both A \u0026 B will have to keep snooping that line\n                            * the number of shared copies grows\n                            * greater impact on performance due to the snooping done by all the 'shared' processors\n            * example\n                1. let's say processor A holds that modified line\n                1. processor B is trying to read that same cache line (modified by A) from main memory\n                1. A snoops the read attempt for that line\n                    * content in the main memory is stale now (because A modified the content)\n                1. A has to write it back to main memory\n                    * in order to allow processor B (and others) to read that line\n    * (E)xclusive\n        * its value matches the main memory value\n        * no other cache holds a copy of this line\n        * main purpose: prevent the unnecessary broadcast of a Cache Invalidate signal on a write hit\n            * reduces traffic on a shared bus\n    * (S)hared\n        * multiple caches may hold the line\n        * main memory is up to date\n        * if a core does not have exclusive access to a cache line when it wants to write, it first needs to send an \"I want exclusive access\" request to the bus\n            * this tells all other cores to invalidate their copies of that cache line, if they have any\n    * (I)nvalid\n        * cache line does not contain valid data\n        * the core will reload that cache line the next time it needs to look at a value in it\n            * even if that value is perfectly current =\u003e false sharing\n* requires 2 bits per cache line to hold the state\n    * 4 values: M, E, S, I\n* transition between the states is controlled by memory accesses and bus snooping activity\n    * 
suppose a requesting processor is processing a write hit on its cache\n        * by definition, any valid copy of the line in the caches of other processors must be in the Shared state\n        * what happens depends on the state of the cache line in the requesting processor\n            1. Modified\n                * protocol does not specify a bus action for the processor; it simply writes the data, as the line is already dirty and held by this cache only\n            1. Shared\n                * processor writes the data\n                * marks the cache line as Modified\n                * broadcasts a Cache Invalidate signal to other processors\n            1. Exclusive\n                * processor writes the data and marks the cache line as Modified\n        * example\n\n            |Step   |cache line A   |cache line B   |\n            |---    |---            |---            |\n            |1      |Exclusive      |-              |\n            |2      |Shared         |Shared         |\n            |3      |Invalid        |Modified       |\n            |4      |Shared         |Shared         |\n\n            1. Core A reads a value\n            1. Core B reads a value from the same area of memory\n            1. Core B writes to that value\n            1. Core A tries to read a value from that same part of memory\n                * Core B's cache line is forced back to memory\n                * Core A's cache line is re-loaded from memory\n    * simulation\n        1. exclusive\n            ![alt text](img/pt1_exclusive.png)\n            * CPU 1\n                * is the first (and only) processor to request block A from the shared memory\n                * issues a BR (Bus Read) for the block and gets its copy\n                    * neither CPU 2 nor CPU 3 responds to the BR\n                * cache line containing block A is marked Exclusive\n                    * subsequent reads to this block access the cached entry and not the shared memory\n        1. 
shared\n            ![alt text](img/pt2_shared.png)\n            * CPU 2 requests the same block A\n            * snoop cache on CPU 1 notes the request and CPU 1 broadcasts Shared, announcing that it has a copy of the block\n                * CPU 3 does not respond to the BR\n            * both copies of the block are marked as shared\n                * indicates that the block is in two or more caches for reading\n                * copy in the shared primary memory is up to date\n        1. modified\n            ![alt text](img/pt3_modified.png)\n            * CPU 2 writes to the cache line it is holding in its cache\n                * issues a BU (Bus Upgrade) broadcast, marks the cache line as Modified, and writes the data to the line\n                * CPU 1 responds to the BU by marking the copy in its cache line as Invalid\n                * CPU 3 does not respond to the BU\n            * informally, CPU 2 can be said to \"own the cache line\"\n        1. dirty\n            ![alt text](img/pt4_dirty.png)\n            * CPU 3 attempts to read block A from primary memory\n            * CPU 1, the cache line holding that block has been marked as Invalid\n                * CPU 1 does not respond to the BR (Bus Read) request\n            * CPU 2 has the cache line marked as Modified\n                * asserts the signal Dirty on the bus, writes the data in the cache line back to the shared memory, and marks the line Shared\n                * informally, CPU 2 asks CPU 3 to wait while it writes back the contents of its modified cache line to the shared primary memory\n            * CPU 3 waits and then gets a correct copy\n            * The cache line in each of CPU 2 and CPU 3 is marked as Shared\n\n## false sharing\n* true sharing\n    * CPUs are writing to the same variables stored within the same cache line\n* CPUs are writing to independent variables stored within the same cache line\n    * independent = each CPU doesn't really rely on the values written by 
the other CPU\n    * steps\n        1. first thread modifies its variable\n            * cache line is invalidated in all other CPU caches\n        1. other CPUs reload the content of the invalidated cache line\n            * even if they don't need the variable that was modified by the first thread\n* essence of problems with concurrent programming\n    * more processors, more power, more electricity consumption but slower than a single thread\n* solution: padding\n    * a 64-byte cache line usually has 16 slots of 4 bytes (e.g. 16 `int` values)\n        * if you use padding you could move a value to the next cache line\n    * JVM\n        * don't use volatile\n            * write to main memory will be delayed up to the very end\n            * assuming that threads are modifying non-overlapping sets of variables it's OK\n            * example\n                ```\n                private static final int THREADS = 6;\n                private static int[] results = new int[THREADS]; // each thread has its own place to accumulate results\n                ```\n                however, if we modify `results` using volatile machinery, we have false sharing\n                ```\n                VH.setVolatile(results, offset, results[offset] + 1);\n                ```\n        * change data structures so the independent variables are no longer stored within the same cache line\n            * `jdk.internal.vm.annotation.Contended`\n                * annotation to prevent false sharing\n                * introduced in Java 8 in the `sun.misc` package\n                    * moved to `jdk.internal.vm.annotation` in Java 9\n                * by default adds 128 bytes of padding\n                    * cache line size in many modern processors is around 64/128 bytes\n                    * configurable through the `-XX:ContendedPaddingWidth` flag\n                * annotated field =\u003e JVM will add padding around it\n                * annotated class =\u003e JVM will add the same padding before all the fields\n                * `-XX:-RestrictContended`\n 
                   * lifts the restriction that limits `@Contended` to JDK-internal classes, so the annotation also takes effect in application code\n                * use cases\n                    * `ConcurrentHashMap`\n                        * https://github.com/openjdk/jdk/blob/f29d1d172b82a3481f665999669daed74455ae55/src/java.base/share/classes/java/util/concurrent/ConcurrentHashMap.java#L2565\n                    * `ForkJoinPool`\n                        * https://github.com/openjdk/jdk/blob/1e8806fd08aef29029878a1c80d6ed39fdbfe182/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java#L774\n* note that false sharing doesn't cause incorrect results - just a performance hit\n    * updates may still be lost if the writes themselves are not atomic, but that is a separate correctness issue\n\n## processor optimisations\n* SIMD (Single Instruction, Multiple Data)\n    * example: add four pairs of numbers together at once, rather than adding them sequentially\n    * instructions supported by modern processors\n    * allow a single instruction to operate on multiple data elements simultaneously\n        * can significantly improve performance by exploiting data-level parallelism\n    * data should be aligned and contiguous in memory\n* vectorization\n    * involves identifying portions of code that can be executed in parallel using SIMD instructions\n    * typically involves operations like arithmetic operations, array computations, and data processing loops\n    * example: assembler on Intel platform\n        * `vpaddd` (one of the `vpaddb`/`vpaddw`/`vpaddd`/`vpaddq` family)\n            * Vector Packed Add (of doublewords)\n            * used for adding packed integer values stored in SIMD registers (packed floating-point addition uses `vaddps`/`vaddpd`)\n            * vs `add` - typical instructions used for scalar operations\n        * `vmovdqu`\n            * Vector Move Unaligned (double quadword)\n            * used for moving data between memory and SIMD registers\n            * vs `mov` - typical instructions used for scalar 
operations\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmtumilowicz%2Fjava17-mesi-false-sharing-processor-optimisations-workshop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmtumilowicz%2Fjava17-mesi-false-sharing-processor-optimisations-workshop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmtumilowicz%2Fjava17-mesi-false-sharing-processor-optimisations-workshop/lists"}