{"id":19392075,"url":"https://github.com/davidraab/word-duplicates-benchmark","last_synced_at":"2026-04-22T09:01:25.560Z","repository":{"id":197203283,"uuid":"524153415","full_name":"DavidRaab/word-duplicates-benchmark","owner":"DavidRaab","description":"Benchmark over multiple solution to pick duplicates from a text file","archived":false,"fork":false,"pushed_at":"2022-08-22T14:49:29.000Z","size":18,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-24T21:12:33.442Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"F#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DavidRaab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-08-12T16:21:35.000Z","updated_at":"2022-08-12T16:22:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"f1e1a3a8-0108-4aa7-ac7e-d7ab990a7a76","html_url":"https://github.com/DavidRaab/word-duplicates-benchmark","commit_stats":null,"previous_names":["davidraab/word-duplicates-benchmark"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DavidRaab/word-duplicates-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidRaab%2Fword-duplicates-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidRaab%2Fword-duplicates-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidRaab%2Fword-duplicates-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidRaab%2Fword-duplicates-benchmark/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DavidRaab","download_url":"https://codeload.github.com/DavidRaab/word-duplicates-benchmark/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidRaab%2Fword-duplicates-benchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32128704,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-22T08:34:57.708Z","status":"ssl_error","status_checked_at":"2026-04-22T08:34:55.583Z","response_time":58,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T10:30:20.444Z","updated_at":"2026-04-22T09:01:25.526Z","avatar_url":"https://github.com/DavidRaab.png","language":"F#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Duplicates\n\nIn an online forum I had a discussion about the *fastest* way to get\nonly the duplicated words of a text file. I said that a dictionary would\nbe the fastest version. At least, that's how I am used to solving this problem\nin Perl.\n\nAll those \"Enterprise\" developers said a Dictionary would be slower,\nbecause of hashing blablablabla...\n\nSo I created this Benchmark and added a lot of different solutions to this\nproblem.\n\n\u003e Remember the task is not to get unique words. 
It's about **only** getting the\n\u003e words in a text file that appear **more** than **once**!\n\n## Running\n\nRunning this program produces output like this on my machine.\n\n    All Equal (should be true): true\n    Benchmarking...\n    Map ListComp      1000:  121.9/s\n    Map fold          1000:  136.4/s\n    Map chain         1000:  148.9/s\n    CountBy           2000:  210.7/s\n    CountBy Choose    2000:  192.3/s\n    CountBy List      2000:  179.5/s\n    ResizeArray       1500:  143.8/s\n    addCombine        1000:  148.5/s\n    Dictionary        2000:  205.0/s\n    CountBy LC        2000:  181.0/s\n\n    Full Mutable Versions\n    Mutable Array     2000:  203.5/s\n    Scan Array        1000:  124.0/s\n    Scan Array Full   1000:  123.2/s\n    Array Only        2000:  195.9/s\n    \n## Interpretation of Results\n\n1. **CountBy** is the fastest version. It just uses the built-in **Seq.countBy**\nfunction, which, by the way, uses a **Dictionary** under the hood.\n\n2. **Dictionary** is the version I would write if **CountBy** did not exist, or\nin other words, if I had to implement it myself, which was the task to begin with.\nIt uses a mutable dictionary for the interim data and then transforms it into\na list with a List Comprehension.\n\n3. **Map ListComp**, **Map fold**, **Map chain** are still the same algorithm, but\nuse a **Map** instead of a **Dictionary**.\n\n4. **CountBy Choose** and **CountBy List** are minor changes to **CountBy**.\n\n5. **ResizeArray** was one of the *more intelligent* solutions someone suggested\ninstead of using a **Dictionary**. Split the words, sort them, then iterate through\nthe words. Keep track of the previous word and of the last word added to the result.\nWe add a word if we haven't added it already and it is the same as the previous word\n(so it appears at least twice).\n\n6. **addCombine** is basically the same as **Map fold** but uses the **addCombine**\nhelper function.\n\n7. 
**CountBy LC** is the same as **CountBy** but uses a List Comprehension instead\nof Function Piping.\n\n## *The Full Mutable* versions all use only Mutable Data-Structures.\n\nAll versions above return an immutable List.\n\n1. **Mutable Array** is the same as **CountBy** or **Dictionary**: it counts the\nwords with a **Dictionary**, then picks only the duplicates and pushes them into\na **ResizeArray**.\n\n2. **Scan Array** is the silly idea of not using a **Dictionary** and instead re-scanning\nthe word **List** over and over again, stopping as soon as two occurrences are\nfound. Then we add the word to a **HashSet**. This way we avoid adding the same word\nto the result over and over again.\n\n3. **Scan Array Full** is the same as **Scan Array** but scans the whole array\nwithout short-circuiting.\n\n4. **Array Only** is the same as **CountBy** but does all operations on\nan **Array** and returns an **Array** instead of a **List**.\n\n## Final Verdict\n\nIf you execute the benchmark yourself you will get similar results. Some\noperations are faster or slower from one invocation to the next. It probably has to\ndo with Garbage Collection running.\n\nBut overall you get the idea that using a **Dictionary** is not slow. I don't\nget why .Net people avoid it so often. A Dictionary or Hash (Perl) is one\nof the most used data-structures in Perl, but also in JavaScript, Python and so on.\n\nPicking the right algorithm is more important than worrying that hashing a key\nmight be slow. If it were slow, there would be no point in ever using a\n**Dictionary** at all.\n\nThe benchmark also shows that immutability is in general not the biggest\nperformance factor. The Full Mutable versions have no advantage over returning\nan immutable list. But this could also be because List creation in F# is\nvery optimized.\n\nUsing **Map** as interim data is noticeably slower. But this case is a good\nexample of when you can use mutable data in a language like F#. 
As we never return\nthe **Map** data-structure we create in the **Map ...** versions, we can use\na **Dictionary** safely. The function can still be considered immutable.\n\n## Regex Performance\n\nOn performance: this task is slow in .Net. It has to do with its Regex Engine.\nWhen I do the **Mutable Array** algorithm in Perl it is around two times faster.\n(Look into `dups.pl`)\n\nYou also get a similar improvement when you change the `splitWords` function into\na function that just splits a string on a whitespace character. Just switch the\n`splitIntoWords` and `splitIntoWords'` functions with each other.\n\nNow the task we were given is not exactly solved, as we get extra punctuation\nand other garbage.\n\nBut it shows the impact of the chosen Algorithm better than benchmarking\nthe .Net Regex Engine does.\n\n    All Equal (should be true): false\n    Benchmarking...\n    Map ListComp      1000:  350.3/s\n    Map fold          1000:  356.3/s\n    Map chain         1000:  354.3/s\n    CountBy           2000: 1019.6/s\n    CountBy Choose    2000: 1019.0/s\n    CountBy List      2000:  939.6/s\n    ResizeArray       1500:  395.9/s\n    addCombine        1000:  355.5/s\n    Dictionary        2000: 1319.6/s\n    CountBy LC        2000: 1029.4/s\n\n    Full Mutable Versions\n    Mutable Array     2000: 1376.5/s\n    Scan Array        1000:  232.2/s\n    Scan Array Full   1000:  232.4/s\n    Array Only        2000: 1129.0/s\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidraab%2Fword-duplicates-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidraab%2Fword-duplicates-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidraab%2Fword-duplicates-benchmark/lists"}