{"id":18700878,"url":"https://github.com/rabestro/unique-ip-addresses","last_synced_at":"2025-04-12T08:23:25.477Z","repository":{"id":37097220,"uuid":"500518097","full_name":"rabestro/unique-ip-addresses","owner":"rabestro","description":"The optimal solution to the problem of counting unique IPv4 addresses in a huge text file.","archived":false,"fork":false,"pushed_at":"2024-07-31T06:21:50.000Z","size":180,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-26T03:33:12.202Z","etag":null,"topics":["array-set","bit-array","bit-manipulation","bit-mask","bit-set","bitset","console-application","intstream","ipv4","jmh","jmh-benchmarks","solution","unique-network"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rabestro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-06T16:52:26.000Z","updated_at":"2024-07-31T06:21:53.000Z","dependencies_parsed_at":"2024-11-07T11:43:35.509Z","dependency_job_id":null,"html_url":"https://github.com/rabestro/unique-ip-addresses","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rabestro%2Funique-ip-addresses","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rabestro%2Funique-ip-addresses/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rabestro%2Funique-ip-addresses/releases","manifests_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rabestro%2Funique-ip-addresses/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rabestro","download_url":"https://codeload.github.com/rabestro/unique-ip-addresses/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248537991,"owners_count":21120922,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["array-set","bit-array","bit-manipulation","bit-mask","bit-set","bitset","console-application","intstream","ipv4","jmh","jmh-benchmarks","solution","unique-network"],"created_at":"2024-11-07T11:39:39.757Z","updated_at":"2025-04-12T08:23:25.442Z","avatar_url":"https://github.com/rabestro.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IPv4 addresses\n\nA simple task to count the number of unique IPv4 addresses in a huge file.\n\n## Task Description\n\nYou are given a plain text file with IPv4 addresses in dotted-decimal notation, one address per line, like this:\n```\n145.67.23.4\n8.34.5.23\n89.54.3.124\n89.54.3.124\n3.45.71.5\n...\n```\nThe file size is unbounded and can reach tens or hundreds of gigabytes. Your task is to count the number of unique addresses in this file, using as little memory and time as possible. You can download a sample file [here](https://ecwid-vgv-storage.s3.eu-central-1.amazonaws.com/ip_addresses.zip). The file is zipped and must be unzipped before processing. 
Note that the unzipped file is about 120 GB.\n\n## How to build and run the project\n\nTo build the application, use this command (Linux/macOS):\n\n```shell\n./gradlew assemble\n```\n\nTo run the application with Gradle, use the command:\n\n```shell\n./gradlew run -q --console=plain --args=\"ip_addresses\"\n```\n\nAlternatively, you can run the program using the command:\n\n```shell\njava -jar build/libs/ipcounter.jar ip_addresses\n```\n\nIn both commands, replace `ip_addresses` with the full path to the text file containing the IPv4 addresses.\n\n## Run tests\n\nUnit tests, integration tests, and the application's functional tests are written with the Spock Framework.\n\nRun the tests with the command:\n\n```shell\n./gradlew test\n```\n\n## Benchmarks\n\n```shell\n./gradlew jmh\n```\n\n### Benchmark results for containers\n\n```text\nBenchmark                               (amount)  Mode  Cnt   Score   Error  Units\nContainerBenchmark.dualBitSetContainer        1B  avgt    5  20.188 ± 1.211   s/op\nContainerBenchmark.dualBitSetContainer        1M  avgt    5   0.053 ± 0.002   s/op\nContainerBenchmark.dualBitSetContainer        1K  avgt    5   0.034 ± 0.001   s/op\nContainerBenchmark.longArrayContainer         1B  avgt    5  13.943 ± 0.336   s/op\nContainerBenchmark.longArrayContainer         1M  avgt    5   0.093 ± 0.005   s/op\nContainerBenchmark.longArrayContainer         1K  avgt    5   0.080 ± 0.002   s/op\n```\n\n## Solution description\n\n### Naive approach\n\nThe first naive attempt to solve the problem:\n\n```shell\nsort -u ips.txt | wc -l\n```\n\nAfter a long time spent processing the huge text file, I got an error message:\n\n```text\nsort: write failed: /tmp/sortcQjXmj: No space left on device\n0\n```\n\n### The naive approach in Java\n\nThe same approach, but implemented as a Java program, looks like this:\n\n```java\nvar path = Path.of(\"ips.txt\");\n\ntry (var lines = Files.lines(path)) {\n    // distinct() has to remember every unique line it has seen, so memory grows with the input\n    var unique = 
lines.distinct().count();\n    System.out.println(unique);\n}\n```\n\n### The Solution\n\nTo read the text file, we use the `Files.lines()` method, which returns a stream of strings, \neach the textual representation of one IPv4 address. Each address is then converted \nfrom its textual representation to a 32-bit `int` (4 bytes). The resulting numbers are added \nto a special container that uses one bit to mark the presence of each possible value. Since an `int` has $2^{32}$ possible values, \nthe container needs $2^{32}$ bits, that is, $2^{32} / 8 = 2^{29}$ bytes, or 512 MB. \nThis size is fixed and does not depend on the total number of IPv4 addresses in the file.\n\nThe project implements one version of the converter and several variants of the container, \neach optimized for a different amount of data. In particular, `LongArrayContainer` allocates \nthe full 512 MB up front and is the best choice for very large inputs, whereas `BitSetContainer` \nallocates memory dynamically, only for the IP address segments that actually occur. Thus, if the addresses \ncome from only a few segments, it allocates just the memory those segments need. \n`DualBitSetContainer` is a special, slightly more optimized \ncase of the more general `BitSetContainer`.\n\nThe following code snippet illustrates the idea behind this solution.\n\n```java\nprivate static long countUnique(Stream\u003cString\u003e ipAddresses) {\n    return ipAddresses\n                .mapToInt(new IPv4Converter())\n                .collect(LongArrayContainer::new, IntContainer::add, IntContainer::addAll)\n                .countUnique();\n}\n```\n\nHere `IntContainer` is the container interface, and `LongArrayContainer` is one of its implementations.\n\nThe converter implementation is highly optimized and assumes that every IP address in the input is well-formed. \nThe converter performs no validation and returns an arbitrary number for a malformed address. 
\nIf validation is required, you can either add a filter to the string stream \nor implement a converter that also checks each address for correctness.\n\nThe `src/jmh` directory contains benchmarks that compare several converter implementations. \nAt the moment, the solution seems to be close to optimal.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frabestro%2Funique-ip-addresses","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frabestro%2Funique-ip-addresses","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frabestro%2Funique-ip-addresses/lists"}