{"id":18026651,"url":"https://github.com/jackmott/groupbyperformance","last_synced_at":"2025-03-27T01:31:18.570Z","repository":{"id":77502402,"uuid":"241204924","full_name":"jackmott/GroupByPerformance","owner":"jackmott","description":"Looking at the performance of GroupBy","archived":false,"fork":false,"pushed_at":"2020-02-24T05:01:20.000Z","size":42,"stargazers_count":23,"open_issues_count":1,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-22T21:06:47.395Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jackmott.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-17T20:48:34.000Z","updated_at":"2024-01-29T19:03:11.000Z","dependencies_parsed_at":null,"dependency_job_id":"7e6bba6f-2ab2-4fd1-b1e5-6b94a25d427c","html_url":"https://github.com/jackmott/GroupByPerformance","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackmott%2FGroupByPerformance","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackmott%2FGroupByPerformance/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackmott%2FGroupByPerformance/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jackmott%2FGroupByPerformance/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jackmott","download_url":"https://code
load.github.com/jackmott/GroupByPerformance/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245764674,"owners_count":20668457,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-30T08:07:34.402Z","updated_at":"2025-03-27T01:31:18.561Z","avatar_url":"https://github.com/jackmott.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `GroupBy` Performance\nLooking at the performance of `GroupBy` in various scenarios\n\nThis project is all .NET Core 3.1, previous editions may differ due to newer\n.NET Core LINQ optimizations.\n\nFor each use case I will test two scenarios, each using a toy `Bill` class standing in for a typical thing one might group on.\nScenario 1 will have ~10,000 keys each with ~10 bills, scenario 2 will have ~10 keys each with ~10,000 bills.\n\n\n## Aggregating by group\nOne thing you might use group by for is to compute some aggregate value by grouping. For instance with our bill, we might want to know the sum of the totals for each bill type. 
`GroupBy` provides a very concise way to do this:\n\n```c#\nreturn bills.GroupBy(b =\u003e b.BillType).Select(grouping =\u003e grouping.Sum(b =\u003e b.Total)).ToList();\n```\nThis will return a `List` of the sums of the totals for each bill type.\n\nAvoiding `GroupBy`, another way we could compute the sums is like so:\n```c#\n var sums = new Dictionary\u003cint, double\u003e();\n foreach (var bill in bills)\n {\n     double total;\n     if (!sums.TryGetValue(bill.BillType, out total))\n     {\n         sums.Add(bill.BillType, bill.Total);\n     }\n     else\n     {\n         sums[bill.BillType] = total + bill.Total;\n     }\n }\n return sums.Values.ToList();\n```\nThis is a lot more code, and we might expect it to perform worse, since we are creating a `Dictionary` to hold a bunch of intermediate data before we aggregate. But it turns out that `GroupBy` does the same thing. Let's look at a BenchmarkDotNet comparison:\n\n|             Method | GroupCount |      Mean |     Error |    StdDev |    Gen 0 |    Gen 1 |    Gen 2 |  Allocated |\n|------------------- |----------- |----------:|----------:|----------:|---------:|---------:|---------:|-----------:|\n|       GroupBy_Sums |         10 |  3.323 ms | 0.0200 ms | 0.0187 ms | 367.1875 | 292.9688 | 210.9375 | 2566.69 KB |\n|    Dictionary_Sums |         10 |  1.553 ms | 0.0054 ms | 0.0048 ms |        - |        - |        - |    1.16 KB |\n|       GroupBy_Sums |      10000 | 12.396 ms | 0.1093 ms | 0.1022 ms | 578.1250 | 296.8750 |  78.1250 | 4586.58 KB |\n|    Dictionary_Sums |      10000 |  2.181 ms | 0.0178 ms | 0.0158 ms | 121.0938 | 101.5625 |  85.9375 |  998.24 KB |\n\nDoing the aggregation as you go, if possible, saves a ton of time and GC pressure: roughly 2x faster with 10 groups and over 5x faster with 10,000 groups, with a fraction of the allocations. 
This is because `GroupBy` can't perform this operation in a totally lazy manner: it has to create a lookup table behind the scenes, one that holds a collection of `Bill`s for each key rather than just the running sum.\n\n## Building up a cache for later use\n\nAnother scenario we often use `GroupBy` for is to create caches/lookup tables for use later in our application.  For instance, a website may pull down data from a database on startup and create an in-memory cache to avoid hammering the database for frequently queried but rarely changed data. With this use case, there are two concerns: how long it takes to create the cache, and how well the cache performs when it is used. We will take a look at both. In our case we want some kind of lookup table that lets us get a collection of `Bill`s when we provide the `BillType`. We could do that by using `GroupBy`, `ToLookup`, or by creating a `Dictionary` by hand:\n```c#\npublic Dictionary\u003cint, IEnumerable\u003cBill\u003e\u003e GroupBy_Cache()\n{\n    return bills.GroupBy(b =\u003e b.BillType).ToDictionary(g =\u003e g.Key, g =\u003e (IEnumerable\u003cBill\u003e)g);\n}\n\npublic ILookup\u003cint,Bill\u003e Lookup_Cache()\n{\n    return bills.ToLookup(b =\u003e b.BillType);\n}\n\npublic Dictionary\u003cint,List\u003cBill\u003e\u003e Dictionary_Cache()\n{\n    var dict = new Dictionary\u003cint, List\u003cBill\u003e\u003e();\n    foreach (var bill in bills)\n    {\n        List\u003cBill\u003e billList;\n        if (!dict.TryGetValue(bill.BillType, out billList))\n        {\n            billList = new List\u003cBill\u003e();\n            dict.Add(bill.BillType, billList);\n        }\n        billList.Add(bill);\n    }\n    return dict;\n}\n```\n\nHere is how long these approaches take to build:\n\n\n|           Method | GroupCount |      Mean |     Error |    StdDev |    Gen 0 |    Gen 1 |    Gen 2 | Allocated |\n|----------------- |----------- 
|----------:|----------:|----------:|---------:|---------:|---------:|----------:|\n|    GroupBy_Cache |         10 |  2.582 ms | 0.0093 ms | 0.0087 ms | 343.7500 | 265.6250 | 187.5000 |   2.51 MB |\n|     Lookup_Cache |         10 |  2.507 ms | 0.0107 ms | 0.0089 ms | 343.7500 | 265.6250 | 187.5000 |    2.5 MB |\n| Dictionary_Cache |         10 |  2.173 ms | 0.0092 ms | 0.0086 ms | 312.5000 | 226.5625 | 156.2500 |    2.5 MB |\n|    GroupBy_Cache |      10000 | 12.101 ms | 0.2383 ms | 0.3340 ms | 609.3750 | 328.1250 | 203.1250 |   4.75 MB |\n|     Lookup_Cache |      10000 | 11.864 ms | 0.0617 ms | 0.0577 ms | 515.6250 | 218.7500 |  62.5000 |   3.85 MB |\n| Dictionary_Cache |      10000 |  8.708 ms | 0.1029 ms | 0.0962 ms | 500.0000 | 312.5000 | 140.6250 |   3.58 MB |\n\nIt is worth noting that the `Dictionary` case could be optimized further in cases when you know, or approximately know, how many groupings will be in the data ahead of time, by setting the initial `Dictionary` capacity.\n\n## Using the cache\n\nThe above startup times may all be roughly equivalent for many use cases, especially since they will usually be one time, or rare operations. But what about using the cache, which will be done repeatedly, and where performance is more likely to be important?  
Here is a test of each cache that just grabs a single BillType from each cache, and iterates over all the bills in that group to sum them up:\n\n```c#\n [Benchmark]\n public double GroupByCache_Use()\n {          \n     var bills = groupByCache[5];\n     double total = 0.0;\n     foreach (var bill in bills)\n     {\n         total += bill.Total;\n     }\n     return total;\n }\n\n [Benchmark]\n public double LookupCache_Use()\n {                        \n     var bills = lookupCache[5];\n     double total = 0.0;\n     foreach (var bill in bills)\n     {\n         total += bill.Total;\n     }                    \n     return total;\n }\n\n [Benchmark]\n public double DictionaryCache_Use()\n {\n     var bills = dictionaryCache[5];\n     double total = 0.0;\n     for(int j = 0; j \u003c bills.Count;j++)\n     {\n         total += bills[j].Total;\n     }\n     return total;                    \n }\n```\n\nWith the following results:\n\n|              Method | GroupCount |         Mean |     Error |    StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |\n|-------------------- |----------- |-------------:|----------:|----------:|-------:|------:|------:|----------:|\n|    GroupByCache_Use |         10 | 58,547.40 ns | 46.101 ns | 38.496 ns |      - |     - |     - |      40 B |\n|     LookupCache_Use |         10 | 58,321.58 ns | 58.868 ns | 52.185 ns |      - |     - |     - |      41 B |\n| DictionaryCache_Use |         10 | 11,767.20 ns | 32.860 ns | 29.130 ns |      - |     - |     - |         - |\n|    GroupByCache_Use |      10000 |     75.96 ns |  0.375 ns |  0.332 ns | 0.0048 |     - |     - |      40 B |\n|     LookupCache_Use |      10000 |     70.00 ns |  1.129 ns |  1.056 ns | 0.0048 |     - |     - |      40 B |\n| DictionaryCache_Use |      10000 |     14.73 ns |  0.068 ns |  0.064 ns |      - |     - |     - |         - |\n\nThe differences here are due almost entirely to being able to leverage that the `Dictionary` cache has a `List` inside of it, allowing us 
to avoid the overhead and allocations that come with iterating over an `IEnumerable`. If you were to use `.Sum()` in all cases, for instance, instead of a foreach or for loop to compute the sums, all 3 caches would perform very nearly the same.  It should also be noted that the time to retrieve a group from the cache is very nearly the same in all 3 cases. Only the time and allocations to iterate over the groups differ.\n\n## Using GroupBy to produce Lists for each group\n\nIf you want to create a cache that is as efficient as possible when being used, you can still use `GroupBy`, like so:\n\n```c#\npublic Dictionary\u003cint, List\u003cBill\u003e\u003e GroupByList_Cache()\n{\n    return bills.GroupBy(b =\u003e b.BillType).ToDictionary(g =\u003e g.Key,g =\u003e g.ToList());\n}\n```\n\nUnfortunately, while this cache will be fast to use, it takes much longer and allocates far more to create:\n\n|            Method | GroupCount |      Mean |     Error |    StdDev |    Gen 0 |    Gen 1 |    Gen 2 | Allocated |\n|------------------ |----------- |----------:|----------:|----------:|---------:|---------:|---------:|----------:|\n| GroupByList_Cache |         10 |  3.142 ms | 0.0266 ms | 0.0249 ms | 460.9375 | 359.3750 | 210.9375 |   3.27 MB |\n|  Dictionary_Cache |         10 |  2.088 ms | 0.0091 ms | 0.0081 ms | 328.1250 | 250.0000 | 171.8750 |    2.5 MB |\n| GroupByList_Cache |      10000 | 14.683 ms | 0.2589 ms | 0.2422 ms | 781.2500 | 468.7500 | 187.5000 |   6.04 MB |\n|  Dictionary_Cache |      10000 |  8.577 ms | 0.1040 ms | 0.0973 ms | 468.7500 | 281.2500 | 140.6250 |   3.58 MB |\n\n\n## A real life example from work (and a large open source project)\n\nAt work we had some code running in a web framework on every single request that was a bit slow: `GroupBy` was being used to extract just the first element out of each group into a `Dictionary`. 
Here is an approximation of the original code and the optimized version:\n\n```c#\n [Benchmark]\n public Dictionary\u003cint, Bill\u003e GroupbyFirst()\n {\n     return bills.GroupBy(b =\u003e b.BillType).ToDictionary(g =\u003e g.Key, g =\u003e g.First());\n }\n [Benchmark]\n public Dictionary\u003cint, Bill\u003e DictionaryFirst()\n {\n     var dict = new Dictionary\u003cint, Bill\u003e();\n     foreach (var bill in bills)\n     {\n         if (!dict.ContainsKey(bill.BillType))\n         {\n             dict.Add(bill.BillType, bill);\n         }\n     }\n     return dict;\n }\n```\n\nAnd the results:\n\n|          Method | GroupCount |      Mean |     Error |    StdDev |    Gen 0 |    Gen 1 |    Gen 2 |  Allocated |\n|---------------- |----------- |----------:|----------:|----------:|---------:|---------:|---------:|-----------:|\n|    GroupbyFirst |         10 |  2.481 ms | 0.0068 ms | 0.0064 ms | 367.1875 | 285.1563 | 210.9375 | 2565.98 KB |\n| DictionaryFirst |         10 |  1.140 ms | 0.0023 ms | 0.0019 ms |        - |        - |        - |    1.01 KB |\n|    GroupbyFirst |      10000 | 11.337 ms | 0.0967 ms | 0.0858 ms | 625.0000 | 343.7500 | 140.6250 | 4859.43 KB |\n| DictionaryFirst |      10000 |  1.539 ms | 0.0056 ms | 0.0046 ms | 101.5625 |  83.9844 |  76.1719 |  920.06 KB |\n\n\n## Have your cake and eat it too?\n\nThe brevity that `GroupBy` affords is really nice.  
In many cases you can get that same brevity with better performance by writing your own generic extension methods to turn enumerables into grouped dictionaries:\n\n```c#\nusing System;\nusing System.Collections.Generic;\n\n// Extension methods must live in a static class; the class name here is illustrative.\npublic static class GroupByExtensions\n{\n    public static Dictionary\u003cK, List\u003cV\u003e\u003e GroupByDictionary\u003cK, V\u003e(this IEnumerable\u003cV\u003e items, Func\u003cV, K\u003e keySelector)\n    {\n        var dictionary = new Dictionary\u003cK, List\u003cV\u003e\u003e();\n        foreach (var item in items)\n        {\n            List\u003cV\u003e grouping;\n            var key = keySelector(item);\n            if (!dictionary.TryGetValue(key, out grouping))\n            {\n                grouping = new List\u003cV\u003e(1);\n                dictionary.Add(key, grouping);\n            }\n            grouping.Add(item);\n        }\n        return dictionary;\n    }\n\n    public static Dictionary\u003cK, List\u003cV\u003e\u003e GroupByDictionary\u003cU,K, V\u003e(this IEnumerable\u003cU\u003e items, Func\u003cU, K\u003e keySelector, Func\u003cU,V\u003e valueSelector)\n    {\n        var dictionary = new Dictionary\u003cK, List\u003cV\u003e\u003e();\n        foreach (var item in items)\n        {\n            List\u003cV\u003e grouping;\n            var key = keySelector(item);\n            if (!dictionary.TryGetValue(key, out grouping))\n            {\n                grouping = new List\u003cV\u003e(1);\n                dictionary.Add(key, grouping);\n            }\n            grouping.Add(valueSelector(item));\n        }\n        return dictionary;\n    }\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjackmott%2Fgroupbyperformance","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjackmott%2Fgroupbyperformance","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjackmott%2Fgroupbyperformance/lists"}