{"id":27057760,"url":"https://github.com/mrange/pushstream6","last_synced_at":"2025-04-05T11:33:09.546Z","repository":{"id":46054366,"uuid":"428156745","full_name":"mrange/PushStream6","owner":"mrange","description":"Push Stream for F#6","archived":false,"fork":false,"pushed_at":"2021-12-27T21:00:15.000Z","size":55,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-02T02:09:05.506Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"F#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrange.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-11-15T07:04:33.000Z","updated_at":"2022-05-06T11:20:49.000Z","dependencies_parsed_at":"2022-08-30T21:20:27.351Z","dependency_job_id":null,"html_url":"https://github.com/mrange/PushStream6","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrange%2FPushStream6","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrange%2FPushStream6/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrange%2FPushStream6/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrange%2FPushStream6/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrange","download_url":"https://codeload.github.com/mrange/PushStream6/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247331879,"owners_count":20921846,"icon_url":"ht
tps://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-05T11:33:04.598Z","updated_at":"2025-04-05T11:33:09.537Z","avatar_url":"https://github.com/mrange.png","language":"F#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# F# Advent 2021  Dec 08 - Fast data pipelines with F#6\n\n_Thanks to [Sergey Tihon](https://www.linkedin.com/in/sergeytihon/) for running [F# Weekly](https://sergeytihon.com/2021/10/18/f-advent-calendar-2021/) and [F# Advent](https://sergeytihon.com/2021/10/18/f-advent-calendar-2021/)._\n\n_Thanks to [manofstick](https://gist.github.com/manofstick) for trying out the code and providing invaluable feedback. [Cistern.ValueLinq](https://github.com/manofstick/Cistern.ValueLinq) is very impressive._\n\n## TLDR; F#6 enables data pipelines with up to 15x less overhead than LINQ\n\nThere were many interesting improvements in F#6 but one in particular caught my eye: the attribute `InlineIfLambda`.\n\nThe purpose of `InlineIfLambda` is to instruct the compiler to inline the lambda argument if possible. 
One reason to do so is potentially improved performance.\n\n## Example from FSharp.Core 6\n\nLooking at the [Array.fs](https://github.com/dotnet/fsharp/blob/main/src/fsharp/FSharp.Core/array.fs) in the F# repository we see that the attribute is used in several places such as in `Array.iter`:\n\n```fsharp\nlet inline iter ([\u003cInlineIfLambda\u003e] action) (array: 'T[]) =\n    checkNonNull \"array\" array\n    for i = 0 to array.Length-1 do\n        action array.[i]\n```\n\nWithout `InlineIfLambda`, `Array.iter` would still be inlined, but invoking `action` would be a virtual call, incurring overhead that _sometimes_ matters.\n\n```fsharp\n// This is an example of what we could write to use Array.iter\nlet mutable sum = 0\nmyArray |\u003e Array.iter (fun v -\u003e sum \u003c- sum + v)\n```\n\nF#5 also does inlining but it's based on a complexity analysis that we have little control over.\n\n```fsharp\n// What above evaluates to in F#5\nlet sum = ref 0\nlet action v = sum := !sum + v\n\n// Array.iter inlined\ncheckNonNull \"array\" myArray\nfor i = 0 to myArray.Length-1 do\n  // But the action is not inlined\n  action myArray.[i]\n```\n\nSo in F#5 the above code may or may not be inlined, depending on what the complexity analysis decides.\n\n```fsharp\n// What above evaluates to in F#6\nlet mutable sum = 0\ncheckNonNull \"array\" myArray\nfor i = 0 to myArray.Length-1 do\n  sum \u003c- sum + myArray.[i]\n```\n\nThis avoids virtual calls as well as allocating a `ref` cell and a lambda.\n\n## Arrays vs Seq\n\nArrays are great but one drawback is that for each step in a pipeline we would create an intermediate array which needs to be garbage collected.\n\n```fsharp\n// Creates an array\n[|0..10000|]\n// Creates a mapped array of ints\n|\u003e Array.map    ((+) 1)\n// Creates a filtered array of ints\n|\u003e Array.filter (fun v -\u003e (v \u0026\u0026\u0026 1) = 0)\n// Creates a mapped array of longs\n|\u003e Array.map    int64\n// Creates a sum\n|\u003e Array.fold   (+) 
0L\n```\n\nOne way around this is using `seq`:\n\n```fsharp\nseq { 0..10000 }\n|\u003e Seq.map    ((+) 1)\n|\u003e Seq.filter (fun v -\u003e (v \u0026\u0026\u0026 1) = 0)\n|\u003e Seq.map    int64\n|\u003e Seq.fold   (+) 0L\n```\n\nIt turns out that the `seq` pipeline above is about 3x slower than the `Array` pipeline even though it doesn't allocate (that much) unnecessary memory.\n\nCan we do better?\n\n## Building a push stream with `InlineIfLambda`\n\nIn a dream world it would be great if we could have a data pipeline with very little overhead for both memory and CPU. Let's try to see what we can do with `InlineIfLambda`.\n\n`seq`, which is an alias for `IEnumerable\u003c_\u003e`, is a so-called pull pipeline. The consumer pulls values through the pipeline by calling `MoveNext` and `Current` until `MoveNext` returns `false`.\n\nAnother approach is to let the producer of data push data through the pipeline. This kind of pipeline tends to be simpler to implement and more performant.\n\nWe call it `PushStream` and, as it is essentially nested lambdas, `InlineIfLambda` could help improve performance.\n\n```fsharp\ntype PushStream\u003c'T\u003e = ('T -\u003e bool) -\u003e bool\n```\n\nA `PushStream` is a function that accepts a receiver function `'T-\u003ebool` and calls the receiver function until there are no more values or the receiver function returns `false`, indicating it wants no more values. 
`PushStream` returns `true` if the producer values were fully consumed and `false` if the consumption is stopped before reaching the end of the producer.\n\nA `PushStream` module could look something like this:\n\n```fsharp\n// 'T PushStream is an alternative syntax for PushStream\u003c'T\u003e\ntype 'T PushStream = ('T -\u003e bool) -\u003e bool\n\nmodule PushStream =\n  // Generates a range of ints in b..e\n  //  Note the use of [\u003cInlineIfLambda\u003e] to inline the receiver function r\n  let inline ofRange b e : int PushStream = fun ([\u003cInlineIfLambda\u003e] r) -\u003e\n      // This is easy to implement in that we loop over the range b..e and\n      //  call the receiver function r until either it returns false\n      //  or we reach the end of the range\n      //  Thanks to InlineIfLambda r should be inlined\n      let mutable i = b\n      while i \u003c= e \u0026\u0026 r i do\n        i \u003c- i + 1\n      i \u003e e\n\n  // Filters a PushStream using a filter function\n  //  Note the use of [\u003cInlineIfLambda\u003e] to inline both the filter function f and the PushStream function ps\n  let inline filter ([\u003cInlineIfLambda\u003e] f) ([\u003cInlineIfLambda\u003e] ps : _ PushStream) : _ PushStream = fun ([\u003cInlineIfLambda\u003e] r) -\u003e\n    // ps is the previous push stream which we invoke with our receiver lambda\n    //  Our receiver lambda checks if each received value passes filter function f\n    //  If it does we pass the value to r, otherwise we return true to continue\n    //  f, ps and r are lambdas that should be inlined due to InlineIfLambda\n    ps (fun v -\u003e if f v then r v else true)\n\n  // Maps a PushStream using a mapping function\n  let inline map ([\u003cInlineIfLambda\u003e] f) ([\u003cInlineIfLambda\u003e] ps : _ PushStream)  : _ PushStream = fun ([\u003cInlineIfLambda\u003e] r) -\u003e\n    // ps is the previous push stream which we invoke with our receiver lambda\n    //  Our receiver lambda maps each received 
value with map function f and\n    //  passes the mapped value to r\n    //  f, ps and r are lambdas that should be inlined due to InlineIfLambda\n    ps (fun v -\u003e r (f v))\n\n  // Folds a PushStream using a folder function f and an initial value z\n  let inline fold ([\u003cInlineIfLambda\u003e] f) z ([\u003cInlineIfLambda\u003e] ps : _ PushStream) =\n    let mutable s = z\n    // ps is the previous push stream which we invoke with our receiver lambda\n    //  Our receiver lambda folds the state and value with folder function f\n    //  Returns true to continue looping\n    //  f and ps are lambdas that should be inlined due to InlineIfLambda\n    //  This also means that s should not need to be a ref cell which avoids\n    //  some memory pressure\n    ps (fun v -\u003e s \u003c- f s v; true) |\u003e ignore\n    s\n\n  // It turns out that if we pipe using |\u003e the F# compiler doesn't inline\n  //  the lambdas as we would like it to.\n  //  So define a more restrictive version of |\u003e that applies function f\n  //  to a function v\n  //  As both f and v are restricted to lambdas we can apply InlineIfLambda\n  let inline (|\u003e\u003e) ([\u003cInlineIfLambda\u003e] v : _ -\u003e _) ([\u003cInlineIfLambda\u003e] f : _ -\u003e _) = f v\n```\n\nThe previous pipeline with the `PushStream` definition above:\n\n```fsharp\nopen PushStream\nofRange     0 10000\n|\u003e\u003e map     ((+) 1)\n|\u003e\u003e filter  (fun v -\u003e (v \u0026\u0026\u0026 1) = 0)\n|\u003e\u003e map     int64\n|\u003e\u003e fold    (+) 0L\n```\n\nLooks pretty good but how does it perform?\n\n## Comparing performance with different data pipelines\n\nFirst let's define a baseline to compare all performance against: a simple for loop that computes the same result as the pipeline above.\n\n```fsharp\nlet mutable s = 0L\nfor i = 0 to 10000 do\n  let i = i + 1\n  if (i \u0026\u0026\u0026 1) = 0 then\n    s \u003c- s + 
int64 i\ns\n```\n\nThen we define a bunch of benchmarks and compare them using [Benchmark.NET](https://benchmarkdotnet.org/).\n\n```fsharp\nopen PushStream\n\ntype [\u003cStruct\u003e] V2 = V2 of int*int\n\n[\u003cMemoryDiagnoser\u003e]\n[\u003cRyuJitX64Job\u003e]\ntype PushStream6Benchmark() =\n  class\n\n    [\u003cBenchmark\u003e]\n    member x.Baseline() =\n      // The baseline performance\n      //  We expect this to do the best\n      let mutable s = 0L\n      for i = 0 to 10000 do\n        let i = i + 1\n        if (i \u0026\u0026\u0026 1) = 0 then\n          s \u003c- s + int64 i\n      s\n\n    [\u003cBenchmark\u003e]\n    member x.Linq() =\n      // LINQ performance\n      Enumerable.Range(0,10001).Select((+) 1).Where(fun v -\u003e (v \u0026\u0026\u0026 1) = 0).Select(int64).Sum()\n\n    [\u003cBenchmark\u003e]\n    member x.Array () =\n      // Array performance\n      Array.init 10000 id\n      |\u003e Array.map    ((+) 1)\n      |\u003e Array.filter (fun v -\u003e (v \u0026\u0026\u0026 1) = 0)\n      |\u003e Array.map    int64\n      |\u003e Array.fold   (+) 0L\n\n    [\u003cBenchmark\u003e]\n    member x.Seq () =\n      // Seq performance\n      seq { 0..10000 }\n      |\u003e Seq.map    ((+) 1)\n      |\u003e Seq.filter (fun v -\u003e (v \u0026\u0026\u0026 1) = 0)\n      |\u003e Seq.map    int64\n      |\u003e Seq.fold   (+) 0L\n\n    [\u003cBenchmark\u003e]\n    member x.PushStream () =\n      // PushStream using |\u003e\n      ofRange   0 10000\n      |\u003e map    ((+) 1)\n      |\u003e filter (fun v -\u003e (v \u0026\u0026\u0026 1) = 0)\n      |\u003e map    int64\n      |\u003e fold   (+) 0L\n\n    [\u003cBenchmark\u003e]\n    member x.FasterPushStream () =\n      // PushStream using |\u003e\u003e as it turns out that\n      //  |\u003e prevents inlining of lambdas\n      ofRange     0 10000\n      |\u003e\u003e map     ((+) 1)\n      |\u003e\u003e filter  (fun v -\u003e (v \u0026\u0026\u0026 1) = 0)\n      |\u003e\u003e map     int64\n      
|\u003e\u003e fold    (+) 0L\n\n    [\u003cBenchmark\u003e]\n    member x.PushStreamV2 () =\n      ofRange   0 10000\n      |\u003e map    (fun v -\u003e V2 (v, 0))\n      |\u003e map    (fun (V2 (v, w)) -\u003e V2 (v + 1, w))\n      |\u003e filter (fun (V2 (v, _)) -\u003e (v \u0026\u0026\u0026 1) = 0)\n      |\u003e map    (fun (V2 (v, _)) -\u003e int64 v)\n      |\u003e fold   (+) 0L\n\n    [\u003cBenchmark\u003e]\n    member x.FasterPushStreamV2 () =\n      // Mor\n      ofRange     0 10000\n      |\u003e\u003e map     (fun v -\u003e V2 (v, 0))\n      |\u003e\u003e map     (fun (V2 (v, w)) -\u003e V2 (v + 1, w))\n      |\u003e\u003e filter  (fun (V2 (v, _)) -\u003e (v \u0026\u0026\u0026 1) = 0)\n      |\u003e\u003e map     (fun (V2 (v, _)) -\u003e int64 v)\n      |\u003e\u003e fold    (+) 0L\n  end\n\nBenchmarkRunner.Run\u003cPushStream6Benchmark\u003e() |\u003e ignore\n```\n\n## Results\n\n### Range 0..10000\n\nOn my admittedly aging machine `Benchmark.NET` reports these performance numbers.\n\n```\nBenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1348 (21H2)\nIntel Core i5-3570K CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores\n.NET SDK=6.0.100\n  [Host]    : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT DEBUG\n  RyuJitX64 : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT\n\nJob=RyuJitX64  Jit=RyuJit  Platform=X64\n\n|             Method |       Mean |     Error |    StdDev | Ratio | RatioSD |   Gen 0 | Allocated |\n|------------------- |-----------:|----------:|----------:|------:|--------:|--------:|----------:|\n|           Baseline |   6.807 us | 0.0617 us | 0.0577 us |  1.00 |    0.00 |       - |         - |\n|               Linq | 148.106 us | 0.5009 us | 0.4685 us | 21.76 |    0.19 |       - |     400 B |\n|              Array |  53.630 us | 0.1744 us | 0.1631 us |  7.88 |    0.08 | 44.7388 | 141,368 B |\n|                Seq | 290.103 us | 0.5075 us | 0.4499 us | 42.59 |    0.34 |       - |     480 B |\n|         PushStream |  34.214 us | 0.0966 us | 0.0904 
us |  5.03 |    0.04 |       - |     168 B |\n|   FasterPushStream |   9.011 us | 0.0231 us | 0.0216 us |  1.32 |    0.01 |       - |         - |\n|       PushStreamV2 | 151.564 us | 0.3724 us | 0.3301 us | 22.25 |    0.18 |       - |     216 B |\n| FasterPushStreamV2 |   9.012 us | 0.0385 us | 0.0360 us |  1.32 |    0.01 |       - |         - |\n```\n\nThe imperative `Baseline` does the best, as we expect.\n\n`Linq`, `Array` and `Seq` add significant overhead over the `Baseline`. This is because the lambda functions are all very cheap, which makes any overhead caused by the pipeline clearly visible.\n\nIt doesn't necessarily mean that your code would benefit from a rewrite to an imperative style over using `Seq`. If the lambda functions are expensive or the pipeline processing is a small part of your application, using `Seq` is fine.\n\n`Array` allocates a significant amount of memory that has to be garbage collected.\n\nWe see that `PushStream` does pretty well but what's really interesting is `FasterPushStream`, where `InlineIfLambda` is properly applied thanks to the operator `|\u003e\u003e`.\n\nThe performance of `FasterPushStream` is comparable to the `Baseline` and it also doesn't allocate any memory.\n\n### Range 0..10\n\nThe above benchmark used an input range `0..10000` but what happens if we change to a shorter range `0..10`:\n\n```\nBenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1379 (21H2)\nIntel Core i5-3570K CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores\n.NET SDK=6.0.100\n  [Host]    : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT DEBUG\n  RyuJitX64 : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT\n\nJob=RyuJitX64  Jit=RyuJit  Platform=X64\n\n|             Method |       Mean |     Error |    StdDev | Ratio | RatioSD |  Gen 0 | Allocated |\n|------------------- |-----------:|----------:|----------:|------:|--------:|-------:|----------:|\n|           Baseline |   8.746 ns | 0.0287 ns | 0.0269 ns |  1.00 |    0.00 |      - |         - |\n|               Linq | 
290.427 ns | 0.6953 ns | 0.6164 ns | 33.22 |    0.13 | 0.1273 |     400 B |\n|              Array | 104.258 ns | 0.7117 ns | 0.6658 ns | 11.92 |    0.09 | 0.0764 |     240 B |\n|                Seq | 481.877 ns | 1.6078 ns | 1.4252 ns | 55.11 |    0.22 | 0.1526 |     480 B |\n|         PushStream |  74.861 ns | 0.2804 ns | 0.2623 ns |  8.56 |    0.03 | 0.0535 |     168 B |\n|   FasterPushStream |  10.761 ns | 0.0302 ns | 0.0267 ns |  1.23 |    0.00 |      - |         - |\n|       PushStreamV2 | 206.379 ns | 0.4882 ns | 0.4567 ns | 23.60 |    0.10 | 0.0687 |     216 B |\n| FasterPushStreamV2 |  10.736 ns | 0.0175 ns | 0.0164 ns |  1.23 |    0.00 |      - |         - |\n```\n\n`FasterPushStream` does even better because it doesn't need to set up a pipeline. The memory overhead of `Array` goes from clearly the biggest to about average, as the overhead of the pipeline created for `Linq` and `Seq` is comparable to that of the intermediate arrays created by `Array`.\n\n## Explaining `PushStreamV2`\n\n`PushStreamV2` was added to expose the cost of F# tail calls. Tail calls in F# are annotated with the `tail.` prefix to tell the jitter that the stack frame doesn't have to be preserved.\n\nIn .NET5 this caused a significant slowdown when dealing with types that don't fit in a CPU register, due to the runtime eliminating the stack frame on each call. 
With .NET6, `PushStreamV2` still does worse but not horribly so, thanks to improvements in the jitter that mean a stack frame is never created and thus doesn't need to be eliminated.\n\nWhat's exciting is that `FasterPushStreamV2` performs just as well as `FasterPushStream` thanks to inlining.\n\nSee appendix for more details.\n\n## But...\n\nWhile I think it's very exciting that we can write performant data pipelines in F#, there are two issues that make `PushStream` finicky to use.\n\n### `|\u003e` doesn't inline lambdas\n\nThe difference between benchmarks `PushStream` and `FasterPushStream` is that `PushStream` uses `|\u003e`, which is the normal piping operator in F#.\n\nWhen `|\u003e` is used together with `PushStream` no inlining of lambdas happens, which means increased CPU and memory overhead.\n\n[Perhaps](https://github.com/dotnet/fsharp/issues/12388) F# should be changed to support inlining even if `|\u003e` is used?\n\nThe workaround of defining an operator `|\u003e\u003e` that has the `InlineIfLambda` attribute works, but it is easy for a programmer to make mistakes as no warnings are produced by the compiler.\n\n```fsharp\n  let inline (|\u003e\u003e) ([\u003cInlineIfLambda\u003e] v : _ -\u003e _) ([\u003cInlineIfLambda\u003e] f : _ -\u003e _) = f v\n```\n\n### Inlining fails for hard to know reasons\n\n[manofstick](https://gist.github.com/manofstick) gave me important feedback on this blog post and noted that the inlining failed in the following situation:\n\n```fsharp\n// This doesn't inline\nofArray [|0..10|] |\u003e\u003e fold (+) 0\n```\n\nWith a simple tweak the inlining comes back:\n\n```fsharp\nlet vs = [|0..10|]\nofArray vs |\u003e\u003e fold (+) 0\n```\n\nThis happens for a couple of scenarios and it's hard to understand why the compiler wouldn't inline the first example but chooses to inline the second.\n\nThis makes the `PushStream` performance hard to predict as no warnings are produced by the compiler.\n\nIt's possible for a programmer to verify the 
generated code using tools like `dnSpy` and tweak the code until inlining is restored. It doesn't feel like the optimal experience though.\n\n[Perhaps](https://github.com/dotnet/fsharp/issues/12416) when inlining is applied or not should be more predictable.\n\n## Conclusion\n\nTo me `InlineIfLambda` is the most exciting F#6 feature as it allows us to create abstractions with little overhead where before we had to rewrite the code to an imperative style.\n\nThis makes me wonder if the presence of `inline` and `InlineIfLambda` makes F# the best `.NET` language to write performant code in.\n\nFull source code available at [GitHub](https://github.com/mrange/PushStream6).\n\nMerry Christmas\n\nMårten\n\n\n## Appendix : `PumpStream`\n\nOne drawback with `PushStream` is that it can't implement `seq\u003c_\u003e`: once a producer starts running it runs to completion, and there's no way to suspend the producer between values.\n\nAn alternative is `PumpStream`. A `PumpStream` returns a function that the consumer calls each time it wants a value. A pump operation might yield no value (as filter operations drop values), so several pump operations might be needed to produce a value.\n\nA simple version of `PumpStream` could look like this:\n\n```fsharp\n// TODO: Support disposing sources\ntype 'T PumpStream = ('T -\u003e bool)-\u003e(unit -\u003e bool)\n\nmodule PumpStream =\n  open System\n  open System.Collections.Generic\n\n  // PumpStream of ints in range b..e\n  let inline ofRange b e : int PumpStream = fun ([\u003cInlineIfLambda\u003e] r) -\u003e\n    let mutable i = b\n    fun () -\u003e\n      if i \u003c= e \u0026\u0026 r i then\n        i \u003c- i + 1\n        true\n      else\n        false\n\n  // Filters a PumpStream using a filter function\n  let inline filter ([\u003cInlineIfLambda\u003e] f) ([\u003cInlineIfLambda\u003e] ps : _ PumpStream) : _ PumpStream = fun ([\u003cInlineIfLambda\u003e] r) -\u003e\n    ps (fun v -\u003e if f v then r v else true)\n\n  // Maps a 
PumpStream using a mapping function\n  let inline map ([\u003cInlineIfLambda\u003e] f) ([\u003cInlineIfLambda\u003e] ps : _ PumpStream) : _ PumpStream = fun ([\u003cInlineIfLambda\u003e] r) -\u003e\n    ps (fun v -\u003e r (f v))\n\n  // Folds a PumpStream using a folder function f and an initial value z\n  let inline fold ([\u003cInlineIfLambda\u003e] f) z ([\u003cInlineIfLambda\u003e] ps : _ PumpStream) =\n    let mutable s = z\n    let p = ps (fun v -\u003e s \u003c- f s v; true)\n    while p () do ()\n    s\n\n  // Implements seq\u003c_\u003e over a PumpStream\n  let inline toSeq ([\u003cInlineIfLambda\u003e] ps : 'T PumpStream) =\n    { new IEnumerable\u003c'T\u003e with\n      override x.GetEnumerator () : IEnumerator\u003c'T\u003e           =\n        let mutable current = ValueNone\n        let p = ps (fun v -\u003e current \u003c- ValueSome v; true)\n        { new IEnumerator\u003c'T\u003e with\n          // TODO: Implement Dispose\n          member x.Dispose ()     = ()\n          member x.Reset ()       = raise (NotSupportedException ())\n          member x.Current : 'T   = current.Value\n          member x.Current : obj  = current.Value\n          member x.MoveNext ()    =\n            current \u003c- ValueNone\n            while p () \u0026\u0026 current.IsNone do ()\n            current.IsSome\n        }\n      override x.GetEnumerator () : Collections.IEnumerator   =\n        x.GetEnumerator ()\n    }\n\n  // It turns out that if we pipe using |\u003e the F# compiler doesn't inline\n  //  the lambdas as we would like it to\n  //  So define a more restrictive version of |\u003e that applies function f to a function v\n  //  As both f and v are restricted to lambdas we can apply InlineIfLambda\n  let inline (|\u003e\u003e) ([\u003cInlineIfLambda\u003e] v : _ -\u003e _) ([\u003cInlineIfLambda\u003e] f : _ -\u003e _) = f v\n```\n\n### Performance\n\nThe `PumpStream` is more complex than the `PushStream`, so we expect it to perform worse, but how much worse?\n\n```\n| 
            Method | Job |       Mean |     Error |    StdDev | Ratio | RatioSD |   Gen 0 | Allocated |\n|------------------- |---- |-----------:|----------:|----------:|------:|--------:|--------:|----------:|\n|           Baseline | PGO |   6.666 μs | 0.1036 μs | 0.0969 μs |  1.00 |    0.00 |       - |         - |\n|               Linq | PGO |  88.306 μs | 1.3255 μs | 1.2398 μs | 13.25 |    0.25 |  0.1221 |     400 B |\n|              Array | PGO |  41.591 μs | 0.1992 μs | 0.1664 μs |  6.25 |    0.09 | 44.7388 | 141,368 B |\n|                Seq | PGO | 142.978 μs | 0.2711 μs | 0.2403 μs | 21.48 |    0.32 |       - |     480 B |\n|   FasterPushStream | PGO |   8.786 μs | 0.0531 μs | 0.0497 μs |  1.32 |    0.02 |       - |         - |\n| FasterPushStreamV2 | PGO |   8.776 μs | 0.0387 μs | 0.0362 μs |  1.32 |    0.02 |       - |         - |\n|   FasterPumpStream | PGO |  17.901 μs | 0.0537 μs | 0.0503 μs |  2.69 |    0.04 |       - |      80 B |\n| FasterPumpStreamV2 | PGO |  17.778 μs | 0.0545 μs | 0.0510 μs |  2.67 |    0.04 |       - |      80 B |\n```\n\nThis run uses the new .NET6 feature Profile Guided Optimization (PGO). It improves `Linq` and `Seq` performance quite significantly. It also shows that `PumpStream`, while slower than `PushStream`, takes about 2.7x the time of the baseline plus 80 B of allocations, compared to about 1.3x the baseline for `PushStream`.\n\n## Appendix : Decompiling `PushStream`\n\nUsing [dnSpy](https://github.com/dnSpy/dnSpy) we can decompile the compiled IL code into C# to learn what's going on in more detail.\n\n### Baseline decompiled\n\n```csharp\npublic long Baseline()\n{\n  long s = 0L;\n  for (int i = 0; i \u003c 10001; i++)\n  {\n    int j = i + 1;\n    if ((j \u0026 1) == 0)\n    {\n      s += (long)j;\n    }\n  }\n  return s;\n}\n```\n\nThe baseline, not very surprisingly, becomes a quite efficient loop.\n\n### FasterPushStream decompiled\n\n```csharp\n[Benchmark]\npublic long FasterPushStream()\n{\n  long num = 
0L;\n  int num2 = 0;\n  for (;;)\n  {\n    bool flag;\n    if (num2 \u003c= 10000)\n    {\n      int num3 = num2;\n      int num4 = 1 + num3;\n      if ((num4 \u0026 1) == 0)\n      {\n        long num5 = (long)num4;\n        num += num5;\n        flag = true;\n      }\n      else\n      {\n        flag = true;\n      }\n    }\n    else\n    {\n      flag = false;\n    }\n    if (!flag)\n    {\n      break;\n    }\n    num2++;\n  }\n  bool flag2 = num2 \u003e 10000;\n  return num;\n}\n```\n\nWhile it is a bit more code, one can see that thanks to `inline` and `InlineIfLambda` everything is inlined into something that looks like decently efficient code.\n\nWe can also spot a reason why `FasterPushStream` does a bit worse than `Baseline`: the `PushStream` includes a short-circuiting mechanism that allows the receiver to say it doesn't want to receive more values. This is to allow implementing `tryHead` and similar operations efficiently.\n\n### PushStream decompiled\n\n```csharp\npublic long PushStream()\n{\n  FSharpFunc\u003cFSharpFunc\u003cint, bool\u003e, bool\u003e _instance = Program.PushStream@55.@_instance;\n  FSharpFunc\u003cFSharpFunc\u003cint, bool\u003e, bool\u003e arg = new Program.PushStream@56-2(@_instance);\n  FSharpFunc\u003cFSharpFunc\u003clong, bool\u003e, bool\u003e fsharpFunc = new Program.PushStream@57-4(arg);\n  FSharpRef\u003clong\u003e fsharpRef = new FSharpRef\u003clong\u003e(0L);\n  bool flag = fsharpFunc.Invoke(new Program.PushStream@58-6(fsharpRef));\n  return fsharpRef.contents;\n}\n```\n\nWhen using `|\u003e`, F# doesn't inline the lambdas and a pipeline is set up. 
This leads to objects being created and virtual calls for each step in the pipeline.\n\nThe pipeline does surprisingly well but one big problem with this approach is that it might need to fall back to slow tail calls, which gives a significant performance drop.\n\nThere are workarounds to prevent the `tail.` prefix from being emitted but that hurts performance when a fast tail call could be used.\n\nInlining solves this issue as the tail calls are eliminated.\n\n## Appendix : Disassembling\n\nTo learn more about what's actually going on, we can disassemble the jitted code.\n\n### Baseline\n\nThe baseline disassembled:\n\n```asm\n.loop:\n  ; Increment loop variable (smart enough to pre increment + 1)\n  inc     edx\n  mov     ecx,edx\n  ; if (i \u0026\u0026\u0026 1) = 0 then\n  test    cl,1\n  jne     .next\n  ; s \u003c- s + int64 i\n  movsxd  rcx,ecx\n  add     rax,rcx\n.next:\n  ; Are we done?\n  cmp     edx,2711h\n  jl      .loop\n```\n\nHere the jitter was smart enough to fold the `+ 1` into the loop variable (note the loop bound `2711h`, which is 10001), so a single `inc` serves as both the loop increment and the `i + 1`. In addition, checking the loop condition at the end of the loop saves a `jmp`.\n\n### FasterPushStreamV2 inlined\n\nLet's look at `FasterPushStreamV2`:\n\n```asm\n.loop:\n  ; Are we done?\n  cmp     edx,2710h\n  jg      .we_are_done\n  ; (fun (V2 (v, w)) -\u003e V2 (v + 1, w))\n  lea     ecx,[rdx+1]\n  ; filter  (fun (V2 (v, _)) -\u003e (v \u0026\u0026\u0026 1) = 0)\n  test    cl,1\n  jne     .next\n  ; map (fun (V2 (v, _)) -\u003e int64 v)\n  movsxd  rcx,ecx\n  ; fold (+) 0L\n  add     rax,rcx\n.next:\n  ; Increment loop variable\n  inc     edx\n  jmp     .loop\n```\n\nThis looks pretty amazing. 
The `V2` struct and all virtual calls are completely gone.\n\nThe jitter also eliminated the short-circuiting condition as the producer is always fully consumed.\n\nThe extra overhead seems to come from the pre-increment optimization not being applied here and from the end-of-loop condition being handled differently.\n\nStill not bad.\n\n### FasterPushStreamV2 not inlined\n\nBy looking at the IL code one can see that the F# compiler has added a `tail.` prefix on the calls to the receiver.\n\n```asm\n; This tells the jitter that next call is a tail call\nIL_000C: tail.\n; Tail call virtual\nIL_000E: callvirt  instance !1 class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2\u003cclass [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2\u003cint32, bool\u003e, bool\u003e::Invoke(!0)\nIL_0013: ret\n```\n\nWhen invoking the next receiver in the `PushStream`, the F# compiler emits the `tail.` prefix.\n\nThis leads to the following jitted code:\n\n```asm\n; ofRange 0 10000\n.loop:\n  ; Are we done?\n  cmp     edi,2710h\n  jg      .we_are_done\n  ; Setup virtual call to map receiver\n  mov     rcx,rsi\n  mov     edx,edi\n  mov     rax,qword ptr [rsi]\n  mov     rax,qword ptr [rax+40h]\n  ; Call the map receiver (no tail call)\n  call    qword ptr [rax+20h]\n  ; Does the receiver think we should stop?\n  test    eax,eax\n  je      .we_are_done\n  ; Increment loop variable\n  inc     edi\n  jmp     .loop\n\n; map (fun v -\u003e V2 (v, 0))\n  ; Save value of rax\n  push    rax\n  ; Load address to map receiver\n  mov     rcx,qword ptr [rcx+8]\n  ; Clear eax\n  xor     eax,eax\n  ; Save V2 on stack\n  mov     dword ptr [rsp],edx\n  mov     dword ptr [rsp+4],eax\n  mov     rdx,qword ptr [rsp]\n  ; Setup virtual call to map receiver\n  mov     rax,qword ptr [rcx]\n  mov     rax,qword ptr [rax+40h]\n  mov     rax,qword ptr [rax+20h]\n  ; Restore stack\n  add     rsp,8\n  ; tail call to map receiver\n  jmp     rax\n\n; (fun (V2 (v, w)) -\u003e V2 (v + 1, w))\n  ; Save value of rax\n  push    
rax\n  ; Save rdx V2 (wonder why it stores a 64bit word?)\n  mov     qword ptr [rsp+18h],rdx\n  ; Load address to filter receiver\n  mov     rcx,qword ptr [rcx+8]\n  ; Loads V2 (v, _)\n  mov     edx,dword ptr [rsp+18h]\n  ; v + 1\n  inc     edx\n  ; Load V2 (_, w)\n  mov     eax,dword ptr [rsp+1Ch]\n  ; It seems the whole round trip to the stack for V2 was unnecessary\n  ; Store V2 on stack\n  mov     dword ptr [rsp],edx\n  mov     dword ptr [rsp+4],eax\n  mov     rdx,qword ptr [rsp]\n  ; Setup virtual call to map receiver\n  mov     rax,qword ptr [rcx]\n  mov     rax,qword ptr [rax+40h]\n  mov     rax,qword ptr [rax+20h]\n  add     rsp,8\n  ; tail call to filter receiver\n  jmp     rax\n\n\n; filter  (fun (V2 (v, _)) -\u003e (v \u0026\u0026\u0026 1) = 0)\n  mov     qword ptr [rsp+10h],rdx\n  ; Test to see V2(v, _) the number is odd (V2 is on the stack)\n  test    byte ptr [rsp+10h],1\n  jne     .bail_out\n  ; No it's even\n  ; Load address to filter receiver\n  mov     rcx,qword ptr [rcx+8]\n  mov     rdx,qword ptr [rsp+10h]\n  ; Setup virtual call to map receiver\n  mov     rax,qword ptr [rcx]\n  mov     rax,qword ptr [rax+40h]\n  mov     rax,qword ptr [rax+20h]\n  ; tail call to map receiver\n  jmp     rax\n.bail_out:\n  ; No it was odd\n  ; Set eax to 1 (true) to continue looping\n  mov     eax,1\n  ; Return to ofRange loop\n  ret\n\n; map (fun (V2 (v, _)) -\u003e int64 v)\n  mov     qword ptr [rsp+10h],rdx\n  ; Load address to fold receiver\n  mov     rcx,qword ptr [rcx+8]\n  ; Load V2(v,_)\n  mov     edx,dword ptr [rsp+10h]\n  ; Extend to 64bit\n  movsxd  rdx,edx\n  ; Setup virtual call to fold receiver\n  mov     rax,qword ptr [rcx]\n  mov     rax,qword ptr [rax+40h]\n  mov     rax,qword ptr [rax+20h]\n  ; tail call to fold receiver\n  jmp     rax\n\n; fold (+) 0L\n  ; Load fold state\n  mov     rax,qword ptr [rcx+8]\n  ; Move state?\n  mov     rdx,rax\n  ; Add V2(v,_) to state\n  add     rdx,qword ptr [rax+8]\n  ; Save state\n  mov     qword ptr 
[rcx+8],rdx\n  ; Set eax to 1 (true) to continue looping\n  mov     eax,1\n  ; Return to ofRange loop\n  ret\n```\n\nThat is a lot more jitted code, which explains the reduced performance. I am actually surprised it performs as well as it does, but I suppose CPUs recognize common patterns for making virtual calls and optimize for them.\n\nWhat's great though is that we see that the tail calls are implemented as plain `jmp` instructions and that they are much more efficient than in `.NET5`, which did tail calls through a helper function.\n\nIt's slower than when inlined, but improvements have still been made to the jitter.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrange%2Fpushstream6","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrange%2Fpushstream6","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrange%2Fpushstream6/lists"}