{"id":13586970,"url":"https://github.com/koraa/huniq","last_synced_at":"2025-04-04T06:08:29.773Z","repository":{"id":41900090,"uuid":"234530227","full_name":"koraa/huniq","owner":"koraa","description":"Filter out duplicates on the command line. Replacement for `sort | uniq` optimized for speed (10x faster) when sorting is not needed.","archived":false,"fork":false,"pushed_at":"2024-01-26T08:41:49.000Z","size":55,"stargazers_count":227,"open_issues_count":9,"forks_count":11,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-04-14T09:38:39.568Z","etag":null,"topics":["cli","rust","tools"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/koraa.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-17T10:55:35.000Z","updated_at":"2024-05-22T11:56:21.188Z","dependencies_parsed_at":"2024-05-22T11:56:19.351Z","dependency_job_id":"b7dbe4cc-765e-4022-8fe5-18325f6bdc1d","html_url":"https://github.com/koraa/huniq","commit_stats":{"total_commits":60,"total_committers":9,"mean_commits":6.666666666666667,"dds":0.3833333333333333,"last_synced_commit":"1d3c47eafb83147ea83594c64ba62b4fbbe3d617"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koraa%2Fhuniq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koraa%2Fhuniq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koraa%2Fhuniq/releases","manifests_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/koraa%2Fhuniq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/koraa","download_url":"https://codeload.github.com/koraa/huniq/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247128751,"owners_count":20888235,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","rust","tools"],"created_at":"2024-08-01T15:05:56.523Z","updated_at":"2025-04-04T06:08:29.751Z","avatar_url":"https://github.com/koraa.png","language":"Rust","readme":"# huniq version 2\n\nCommand line utility to remove duplicates from the given input.\nNote that huniq does not sort the input, it just removes duplicates.\n\n```\nSYNOPSIS: huniq -h # Shows help\nSYNOPSIS: huniq [-c|--count] [-0|--null|-d DELIM|--delim DELIM]\n```\n\n```\n$ echo -e \"foo\\nbar\\nfoo\\nbaz\" | huniq\nfoo\nbar\nbaz\n\n$ echo -e \"foo\\nbar\\nfoo\\nbaz\" | huniq -c\n1 baz\n2 foo\n1 bar\n```\n\n`huniq` replaces `sort | uniq` (or `sort -u` with gnu sort) and `huniq -c` replaces `sort | uniq -c`, assuming the data is sorted just so it can be passed to `uniq`. If having sorted output is desired, `sort | uniq` should still be used.\n\nThe order of the output is stable when in normal mode, but it is not stable when in -c/count mode.\n\n## Installation\n\n```\n$ cargo install huniq\n```\n\n## Motivation\n\nSorting is slow. 
By using hash tables/hash sets instead of sorting\nthe input, huniq is generally faster than `sort -u` or `sort | uniq -c` when testing with gnu sort/gnu uniq.\n\n## Version History\n\nVersion 1 can be found [here](https://github.com/SoftwearDevelopment/huniq).\n\nChanges made in version 2:\n\n* The -d/-0 flags were added so you can specify custom delimiters\n* Completely rewritten in rust.\n* Version two is (depending on which benchmark you look at below) between 3.5x and 6.5x faster than version 1\n\n## Build\n\n```sh\ncargo build --release\n```\n\nTo run the tests, execute:\n\n```sh\nbash ./test.sh\n```\n\n## Benchmark\n\nYou can use `bash ./benchmark.sh` to execute the benchmarks. They will execute until you manually abort them (e.g. by pressing Ctrl-C).\n\nThe benchmarks work by repeatedly feeding the implementations with data\nfrom /usr/share/dict/* and measuring memory usage and time needed to process\nthe data with the unix `time` tool.\n\nFor the `uniq` algorithm, the results are posted below: We can see that the\nrust implementation blows pretty much anything else out of the water in terms\nof performance. Use sort only if you really need a coffee break, because you\nwon't get it with huniq! 
It beats the C++ implementation by a factor\nof between 6.5 (for very few duplicates) and 3.5 (around 98% duplicates).\nCompared to `sort -u`: huniq is around 30 times faster.\n\nIf memory efficiency is what you are looking for, use datamash which is not as fast as huniq\nbut uses the least memory (by a factor of around 3); failing that use `sort|uniq` which is a\nlot slower but uses just very slightly more memory than datamash.\n\n```\nrepetitions  implementation  seconds  memory/kb\n1            huniq2-rust        0.26      29524\n1            huniq1-c++         1.67      26188\n1            awk                1.63     321936\n1            datamash           1.78       9644\n1            shell              7.30       9736\n2            huniq2-rust        0.84      29592\n2            huniq1-c++         3.28      26180\n2            awk                3.71     322012\n2            datamash           4.60       9636\n2            shell             16.68       9740\n5            huniq2-rust        2.02      29648\n5            huniq1-c++         6.21      26184\n5            awk                7.69     322012\n5            datamash           9.10       9992\n5            shell             44.71      10184\n10           huniq2-rust        3.40      29676\n10           huniq1-c++        12.84      26172\n10           awk               16.73     321940\n10           datamash          24.44      10032\n10           shell             93.75      10036\n50           huniq2-rust       14.68      29612\n50           huniq1-c++        55.32      26200\n50           awk               74.91     321940\n50           datamash         103.54      10936\n50           shell            453.94      10956\n100          huniq2-rust       43.65      29492\n100          huniq1-c++       154.99      26180\n100          awk              239.66     321956\n100          datamash         285.94      12148\n100          shell           1062.07      12208\n```\n\nFor the counting `huniq -c` 
implementation, the speed advantage\nwas less pronounced: Here the rust implementation is between 25%\nand 50% faster than the C++ implementation and between 5x and 10x\nfaster than `sort | uniq -c`.\n\nThe increased memory usage of the rust implementation is much worse though:\nThe rust implementation needs about 2.2x more memory than the C++ implementation\nand between 10x and 12x more memory than `sort | uniq`.\n\n```\nrepetitions  implementation seconds  memory/kb\n1            huniq2-rust       1.47     132096\n1            huniq1-c++        1.85      60196\n1            awk               2.79     362940\n1            datamash          2.28       9636\n1            shell             7.71      11716\n2            huniq2-rust       2.32     132052\n2            huniq1-c++        2.98      60156\n2            awk               4.65     363016\n2            datamash          5.27       9732\n2            shell            16.37      11680\n5            huniq2-rust       4.98     132092\n5            huniq1-c++        7.54      60128\n5            awk               9.37     363016\n5            datamash         11.22       9964\n5            shell            44.77      11948\n10           huniq2-rust       8.81     132048\n10           huniq1-c++       13.55      60196\n10           awk              16.19     363032\n10           datamash         25.12       9908\n10           shell            90.01      11976\n50           huniq2-rust      45.89     132092\n50           huniq1-c++       74.04      60104\n50           awk              85.43     362956\n50           datamash        141.90      10996\n50           shell           454.42      12876\n100          huniq2-rust      90.80     132080\n100          huniq1-c++      150.41      60196\n100          awk             163.13     363008\n100          datamash        322.70      12212\n100          shell           933.67      14100\n```\n\n## Future direction\n\nFeature-wise, huniq is pretty much complete, but the 
performance and memory usage should be improved in the future.\n\nThis first of all involves a better benchmarking setup which will probably consist\nof an extra rust application that will use RNGs to generate test data for huniq and\ntake parameters like the number of elements to create, the rate of duplicates (0-1)\nthe length of strings to output and so on…\n\nThen based on the improved benchmarking capabilities, some optimizations should be tried\nlike short string optimization, arena allocation, different hash functions, using\nmemory optimized hash tables, using an identity function for the `uniq` function\n(we already feed it with hashes, so a second round of hashing is not necessary).\n\n## License\n\nCopyright © (C) 2020, Karolin Varner. All rights reserved.\n\nRedistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:\n\n    Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.\n    Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.\n    Neither the name of the Karolin Varner nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL Softwear, BV BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n","funding_links":[],"categories":["Rust","Text Processing","\u003ca name=\"text-processing\"\u003e\u003c/a\u003eText processing","Included Software","Other","Applications"],"sub_categories":["Acknowledgments","System tools"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoraa%2Fhuniq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkoraa%2Fhuniq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoraa%2Fhuniq/lists"}