{"id":18017038,"url":"https://github.com/achille-roussel/kway-go","last_synced_at":"2025-07-29T08:08:49.466Z","repository":{"id":213643556,"uuid":"734587708","full_name":"achille-roussel/kway-go","owner":"achille-roussel","description":"K-way merge with Go 1.23 range functions","archived":false,"fork":false,"pushed_at":"2024-12-11T20:58:37.000Z","size":45,"stargazers_count":39,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-08T02:28:14.369Z","etag":null,"topics":["go","golang","kwaymerge","merge","performance","rangefunc"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/achille-roussel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-12-22T04:30:08.000Z","updated_at":"2025-03-05T09:10:44.000Z","dependencies_parsed_at":"2024-02-02T21:47:09.577Z","dependency_job_id":null,"html_url":"https://github.com/achille-roussel/kway-go","commit_stats":null,"previous_names":["achille-roussel/kway-go"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/achille-roussel/kway-go","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/achille-roussel%2Fkway-go","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/achille-roussel%2Fkway-go/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/achille-roussel%2Fkway-go/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/achille-roussel%2Fkway-go/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/achille-roussel","download_url":"https://codeload.github.com/achille-roussel/kway-go/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/achille-roussel%2Fkway-go/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267652766,"owners_count":24122098,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["go","golang","kwaymerge","merge","performance","rangefunc"],"created_at":"2024-10-30T04:19:58.663Z","updated_at":"2025-07-29T08:08:49.384Z","avatar_url":"https://github.com/achille-roussel.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# kway-go [![Go Reference](https://pkg.go.dev/badge/github.com/achille-roussel/kway-go.svg)](https://pkg.go.dev/github.com/achille-roussel/kway-go)\nK-way merge with Go 1.23 range functions\n\n[bboreham]: https://github.com/bboreham\n[godoc]: https://pkg.go.dev/github.com/achille-roussel/kway-go@v0.2.0#pkg-examples\n[gophercon]: https://www.gophercon.com/agenda/session/1160355\n\n## Installation\n\nThis package is intended to be used as a library and installed with:\n```sh\ngo get github.com/achille-roussel/kway-go\n```\n\n## Usage\n\nThe package contains variations of the K-way merge algorithm for different\nforms of iterator 
sequences:\n\n* **Merge** and **MergeFunc** operate on sequences that yield single\n  values. **Merge** must be used on ordered values, while **MergeFunc**\n  accepts a comparison function as its first argument to customize the\n  ordering logic.\n\n* **MergeSlice** and **MergeSliceFunc** are similar functions but operate on\n  sequences that yield slices of values. These are intended for applications\n  with higher throughput requirements that use batching or read values from\n  paging APIs.\n\nThe sequences being merged must each be ordered using the same comparison logic\nas the one used for the merge, or the algorithm will not be able to produce an\nordered sequence of values.\n\nThe following code snippet illustrates how to merge three ordered sequences\ninto one:\n```go\nfor v, err := range kway.Merge(seq0, seq1, seq2) {\n    ...\n}\n```\n\nMore examples are available in the [Go doc][godoc].\n\n### Error Handling\n\nThe merge functions report errors seen from the input sequences, but the\npresence of errors does not interrupt the merge operations. When an error\noccurs, it is immediately bubbled up to the program, but if more values are\navailable in the input sequences, the program can continue consuming them after\nhandling the error. This model delegates the decision of how to handle errors to\nthe application, allowing it to carry on or abort depending on the error value or\ntype, for example:\n\n```go\nfor v, err := range kway.Merge(sequences...) 
{\n    if err != nil {\n        // handle the error, the program may choose to break out of the loop\n        // or carry on to read the next value.\n        ...\n    } else {\n        // a value is available, process it\n        ...\n    }\n}\n```\n\n## Implementation\n\nThe K-way merge algorithm was inspired by the talk from\n[Bryan Boreham][bboreham] at [Gophercon 2023][gophercon], which described\nhow using a loser-tree instead of a min-heap improved the performance of Loki's\nmerge of log records.\n\nThe `kway-go` package also adds a specialization for cases where the program\nis merging exactly two sequences, since this can be implemented as a simple\nunion of two sets, which has a much lower compute and memory footprint.\n\n## Performance\n\nK-way merge is often used in stream processing or database engines to merge\ndistributed query results into a single ordered result set. In those\napplications, the performance of the underlying algorithms tends to matter: for\nexample, when performing compaction of sorted records, the merge algorithm is\non the critical path and often where most of the compute is being spent. In that\nregard, there are efficiency requirements that the implementation must fulfil to\nbe a useful solution to those problems.\n\n\u003e :bulb: While exploring the performance characteristics of the algorithm, it is\n\u003e important to keep in mind that absolute numbers are only useful in the context\n\u003e where they were collected, since measurements depend on the hardware executing\n\u003e the code, and the data being processed. 
We should use relative performance of\n\u003e different benchmarks within a given context as a hint to find opportunities\n\u003e for optimizations in production applications, not as universal truths.\n\nThe current implementation has already been optimized to maximize throughput by\namortizing as much of the baseline cost as possible and ensuring that CPU time is\nspent on the important parts of the algorithm.\n\nAs part of this optimization work, it became apparent that while the Go runtime\nimplementation of coroutines underneath `iter.Pull2` has a much lower compute\nfootprint than using channels, it still has a significant overhead when reading\nvalues in tight loops of the merge algorithm.\n\nThis graph shows a preview of the results; the full analysis is described in the\nfollowing sections:\n\n![image](https://github.com/achille-roussel/kway-go/assets/865510/730da27c-e639-4cfe-878a-9cc5c9287e37)\n\n\n### Establishing a performance baseline\n\nTo explore performance, let's first establish a baseline. 
We use the throughput\nof merging a single sequence, which is simply reading all the values it yields,\nas the comparison point:\n```\nMerge1  592898557  1.843 ns/op  0 comp/op   542741115 merge/s\n```\nThis benchmark shows that on this test machine, the highest theoretical\nthroughput we can achieve is **~540M merge/s** for one sequence,\n**~270M merge/s** when merging two sequences, etc...\n\n### Performance analysis of the K-way merge algorithm\n\nNow comparing the performance of merging two and three sequences:\n```\nMerge2   47742177  24.78 ns/op  0.8125 comp/op  40359389 merge/s\nMerge3   27540648  42.23 ns/op  1.864 comp/op   23682342 merge/s\n```\nWe observe a significant drop in throughput in comparison with iterating over\na single sequence, with the benchmark now performing **~7x slower** than the\ntheoretical throughput limit.\n\nThe K-way merge algorithm has a complexity of *O(n∙log(k))*, and there is also\na baseline cost for the added code implementing the merge operations, but almost\nan order of magnitude difference seems unexpected.\n\nTo understand what is happening, we can look into a CPU profile:\n```\nDuration: 3.46s, Total samples = 2.44s (70.45%)\nShowing nodes accounting for 2.40s, 98.36% of 2.44s total\nDropped 9 nodes (cum \u003c= 0.01s)\n flat  flat%   sum%    cum   cum%\n0.30s 12.30% 12.30%  0.72s 29.51%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].merge2[go.shape.int].func3\n0.25s 10.25% 22.54%  0.34s 13.93%  github.com/achille-roussel/kway-go.(*tree[go.shape.int]).next\n0.21s  8.61% 31.15%  0.76s 31.15%  github.com/achille-roussel/kway-go.sequence.func1\n0.17s  6.97% 38.11%  0.26s 10.66%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].unbuffer[go.shape.int].func6.1\n0.15s  6.15% 44.26%  0.25s 10.25%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func1.1\n0.15s  6.15% 50.41%  0.21s  8.61%  
github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func4.1\n0.14s  5.74% 56.15%  0.23s  9.43%  iter.Pull2[go.shape.[]go.shape.int,go.shape.interface { Error string }].func2\n0.13s  5.33% 61.48%  0.13s  5.33%  runtime/internal/atomic.(*Uint32).CompareAndSwap (inline)\n0.11s  4.51% 65.98%  0.18s  7.38%  iter.Pull2[go.shape.[]go.shape.int,go.shape.interface { Error string }].func1.1\n0.10s  4.10% 70.08%  0.27s 11.07%  runtime.coroswitch_m\n0.09s  3.69% 73.77%  0.09s  3.69%  github.com/achille-roussel/kway-go.benchmark[go.shape.int].func2\n0.09s  3.69% 77.46%  0.09s  3.69%  runtime.coroswitch\n0.08s  3.28% 80.74%  0.11s  4.51%  gogo\n0.07s  2.87% 83.61%  0.09s  3.69%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func2.1\n0.06s  2.46% 86.07%  0.06s  2.46%  runtime.mapaccess1_fast64\n0.05s  2.05% 88.11%  0.09s  3.69%  github.com/achille-roussel/kway-go.benchmark[go.shape.int].func1\n0.04s  1.64% 89.75%  0.04s  1.64%  cmp.Compare[go.shape.int] (inline)\n0.04s  1.64% 91.39%  0.04s  1.64%  internal/race.Acquire\n0.04s  1.64% 93.03%  0.04s  1.64%  runtime.(*guintptr).cas (inline)\n0.04s  1.64% 94.67%  0.32s 13.11%  runtime.mcall\n0.04s  1.64% 96.31%  0.04s  1.64%  runtime.save_g\n0.02s  0.82% 97.13%  0.02s  0.82%  internal/race.Release\n0.01s  0.41% 97.54%  0.43s 17.62%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].merge[go.shape.int].func5\n0.01s  0.41% 97.95%  0.04s  1.64%  github.com/achille-roussel/kway-go.nextNonEmptyValues[go.shape.int]\n0.01s  0.41% 98.36%  0.08s  3.28%  runtime/pprof.(*profMap).lookup\n    0     0% 98.36%  0.72s 29.51%  github.com/achille-roussel/kway-go.BenchmarkMerge2\n    0     0% 98.36%  0.43s 17.62%  github.com/achille-roussel/kway-go.BenchmarkMerge3\n    0     0% 98.36%  0.35s 14.34%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func1\n    0     0% 98.36%  0.14s  5.74%  
github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func2\n    0     0% 98.36%  0.27s 11.07%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func4\n    0     0% 98.36%  1.15s 47.13%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].unbuffer[go.shape.int].func6\n    0     0% 98.36%  1.15s 47.13%  github.com/achille-roussel/kway-go.benchmark[go.shape.int]\n    0     0% 98.36%  0.76s 31.15%  iter.Pull2[go.shape.[]go.shape.int,go.shape.interface { Error string }].func1\n    0     0% 98.36%  0.76s 31.15%  runtime.corostart\n```\nAs we can see here, a significant amount of time seems to be spent in the Go\nruntime code managing coroutines. While it might be possible to optimize the\nruntime, there is a lower bound on how much it can be reduced.\n\nIt is also unlikely that the Go compiler could help here, since there are no real\nopportunities for inlining or other optimizations.\n\n### Performance optimization of the K-way merge algorithm\n\nWe have a very high baseline cost for each operation. Under the\nhypothesis that it is driven by the coroutine context switches implemented in the\nruntime, the only thing we can do to improve performance is to perform fewer of them.\n\nThis is a typical baseline cost amortization problem: we want to call the\n`next` function returned by `iter.Pull2` less often, which can be done by\nintroducing buffering. Instead of pulling values one at a time, we can\nefficiently buffer N values from each sequence in memory, by transposing\nthe `iter.Seq2[T, error]` sequences into `iter.Seq2[[]T, error]`. 
The call\nto `next` then only needs to happen when we exhaust the buffer, which ends up\namortizing its cost.\n\nWith an internal buffer size of **128** values per sequence:\n```\nMerge2  190103247  6.133 ns/op  0.8333 comp/op  163045156 merge/s\nMerge3  95485022  12.74 ns/op   1.864 comp/op    78492807 merge/s\n```\nNow we made the algorithm **3-4x faster**, and have performance in the range of\n**1.5 to 2.5x** the theoretical throughput limit.\n\nIt is interesting to note that the CPU profile didn't seem to indicate that 75%\nof the time was spent in the runtime, but reducing the time spent in that code\npath has had a non-linear impact on performance. Likely some other CPU\ninstruction pipeline and caching shenanigans are at play here, possibly impacted\nby the atomic compare-and-swap operations in coroutine switches.\n\nAs expected, the CPU profile now shows that almost no time is spent in the\nruntime:\n```\nDuration: 3.17s, Total samples = 2.35s (74.08%)\nShowing nodes accounting for 2.28s, 97.02% of 2.35s total\nDropped 22 nodes (cum \u003c= 0.01s)\n flat  flat%   sum%    cum   cum%\n0.45s 19.15% 19.15%  0.56s 23.83%  github.com/achille-roussel/kway-go.(*tree[go.shape.int]).next\n0.43s 18.30% 37.45%  0.43s 18.30%  github.com/achille-roussel/kway-go.benchmark[go.shape.int].func2\n0.37s 15.74% 53.19%  0.97s 41.28%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].merge2[go.shape.int].func3\n0.23s  9.79% 62.98%  0.24s 10.21%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func1.1\n0.22s  9.36% 72.34%  0.65s 27.66%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].unbuffer[go.shape.int].func6.1\n0.13s  5.53% 77.87%  0.13s  5.53%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func4.1\n0.12s  5.11% 82.98%  0.21s  8.94%  github.com/achille-roussel/kway-go.benchmark[go.shape.int].func1\n0.10s  4.26% 87.23%  0.52s 22.13%  github.com/achille-roussel/kway-go.sequence.func1\n0.09s  
3.83% 91.06%  0.09s  3.83%  cmp.Compare[go.shape.int] (inline)\n0.05s  2.13% 93.19%  0.05s  2.13%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func2.1\n0.03s  1.28% 94.47%  0.06s  2.55%  runtime/pprof.(*profMap).lookup\n0.02s  0.85% 95.32%  0.02s  0.85%  github.com/achille-roussel/kway-go.parent (inline)\n0.02s  0.85% 96.17%  0.02s  0.85%  runtime.asyncPreempt\n0.02s  0.85% 97.02%  0.02s  0.85%  runtime.mapaccess1_fast64\n    0     0% 97.02%  0.97s 41.28%  github.com/achille-roussel/kway-go.BenchmarkMerge2\n    0     0% 97.02%  0.76s 32.34%  github.com/achille-roussel/kway-go.BenchmarkMerge3\n    0     0% 97.02%  0.31s 13.19%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func1\n    0     0% 97.02%  0.08s  3.40%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func2\n    0     0% 97.02%  0.13s  5.53%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].buffer[go.shape.int].func4\n    0     0% 97.02%  0.76s 32.34%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].merge[go.shape.int].func5\n    0     0% 97.02%  1.73s 73.62%  github.com/achille-roussel/kway-go.MergeFunc[go.shape.int].unbuffer[go.shape.int].func6\n    0     0% 97.02%  1.73s 73.62%  github.com/achille-roussel/kway-go.benchmark[go.shape.int]\n    0     0% 97.02%  0.52s 22.13%  iter.Pull2[go.shape.[]go.shape.int,go.shape.interface { Error string }].func1\n    0     0% 97.02%  0.52s 22.13%  runtime.corostart\n```\n\n### Further optimizations using batch processing\n\nThere is a final performance frontier we can cross. While we are buffering\nvalues internally, the input and output sequences remain `iter.Seq2[T, error]`,\nwhich yield values one by one. 
Oftentimes in data systems, APIs have pagination\ncapabilities, or stream processors work on batches of values for the same reason\nwe added buffering: it reduces the baseline cost of crossing system boundaries.\n\nIf the input sequences are already slices of values, and the output sequence\nproduces slices of values, we can reduce the internal memory footprint (no need\nto allocate memory to buffer the inputs), while also further amortizing the cost\nof function calls to yield values in and out of the merge algorithm.\n\nApplications that fall into those categories can unlock further performance by\nusing `MergeSlice` instead of `Merge`, which works on `iter.Seq2[[]T, error]`\nend-to-end.\n\nWhat is interesting with this approach is that in cases where the processing of\ninputs and outputs can be batched, this model **can even beat the theoretical\nthroughput limit**. For example, in the benchmarks we've used, the body of the\nloop consuming merged values simply counts the results. When consuming slices\nthere is no need to iterate over the slices and increment the counter by one\neach time, we can batch the operation by incrementing the counter by the length\nof the slice, achieving much higher throughput than predicted by the baseline:\n```\nMergeSlice2  477720793  2.273 ns/op  0.6688 comp/op  439971259 merge/s\nMergeSlice3  150406080  7.945 ns/op  1.667 comp/op   125861613 merge/s\n```\n\n\u003e :warning: Keep in mind that to minimize the footprint, `MergeSlice` reuses\n\u003e its output buffer, which means that the application cannot retain it beyond\n\u003e the body of the loop ranging over the merge function. 
This can lead to subtle\n\u003e bugs that can be difficult to track, `Merge` should always be preferred unless\n\u003e there is clear evidence that the increased maintenance cost is worth the\n\u003e performance benefits.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fachille-roussel%2Fkway-go","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fachille-roussel%2Fkway-go","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fachille-roussel%2Fkway-go/lists"}