{"id":16689541,"url":"https://github.com/sigpwned/delta4j","last_synced_at":"2026-05-20T15:40:01.708Z","repository":{"id":232885803,"uuid":"784940905","full_name":"sigpwned/delta4j","owner":"sigpwned","description":"Elements for building concurrent and distributed data processing applications","archived":false,"fork":false,"pushed_at":"2024-04-11T20:52:09.000Z","size":83,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-31T01:59:04.326Z","etag":null,"topics":["concurrent-programming","distributed-computing","java","probabilistic-data-structures","statistics","text"],"latest_commit_sha":null,"homepage":"https://github.com/sigpwned/delta4j","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sigpwned.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-10T21:42:03.000Z","updated_at":"2024-07-27T03:18:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"7a862a84-6aba-461a-8b81-05caeab3962c","html_url":"https://github.com/sigpwned/delta4j","commit_stats":null,"previous_names":["sigpwned/delta4j"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/sigpwned/delta4j","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sigpwned%2Fdelta4j","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sigpwned%2Fdelta4j/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sigpwned%2Fdelta4j/releases","manifests_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/sigpwned%2Fdelta4j/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sigpwned","download_url":"https://codeload.github.com/sigpwned/delta4j/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sigpwned%2Fdelta4j/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33265095,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-20T15:12:43.734Z","status":"ssl_error","status_checked_at":"2026-05-20T15:12:42.300Z","response_time":356,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["concurrent-programming","distributed-computing","java","probabilistic-data-structures","statistics","text"],"created_at":"2024-10-12T15:48:30.626Z","updated_at":"2026-05-20T15:40:01.678Z","avatar_url":"https://github.com/sigpwned.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# delta4j\n\ndelta4j is a lightweight Java library designed to support the development of concurrent and\ndistributed data processing applications. Focusing on performance and simplicity, delta4j includes a\nrange of software elements such as Bloom Filters, optimized text processing structures, and various\nstatistical distribution implementations. 
It is crafted for minimal dependencies -- just SLF4J --\nto make integrating it into existing projects as simple as possible.\n\n## Features\n\n* Simple: A straightforward API for easy integration and usage.\n* Modern: Built with Java 11 and later versions in mind.\n* Lightweight: A small library with little overhead and minimal dependencies (just SLF4J).\n* Probabilistic Data Structures: High-cardinality probabilistic data structures designed for\n  processing larger-than-memory datasets.\n* Statistical Distributions: Implementations of common statistical distributions with Sketch support\n  and tight integrations with Java functional programming constructs such as Stream and Supplier.\n* Optimized Text Processing: Handcrafted mutable, immutable, and unmodifiable text processing\n  structures for efficient text manipulation.\n* Concurrency and Distribution: Concurrency is a primary concern, with all data structures\n  supporting divide-and-conquer fitting for vertical scaling and optional serialization for\n  horizontal scaling.\n\n## Modules\n\n* core: The core module contains the core functionality of delta4j, including probabilistic data\n  structures, statistical distributions, and text processing utilities.\n* jackson: The jackson module provides support for serializing and deserializing delta4j data\n  structures using the Jackson JSON library.\n\n## Quick Start\n\nThis section provides a brief introduction to getting delta4j set up in your project.\n\n### BloomFilter\n\nCreating a Bloom filter is easy! 
Simply provide the number of elements you expect to add to it and\nthe desired probability of false positives:\n\n    BloomFilter\u003cString\u003e bloomFilter=BloomFilter.of(1000, 0.001);\n    bloomFilter.add(\"hello\");\n    if(bloomFilter.mightContain(\"hello\")) {\n        // Always executes\n    }\n    if(bloomFilter.mightContain(\"world\")) {\n        // 99.9% chance this does not run\n    }\n\nIf you'd like to create a new BloomFilter and fit it to a large dataset, you can use streams to\nperform this operation either sequentially or in parallel. For example, the code below creates a new\nBloomFilter from the lines of a (potentially very large) file.\n\n    long count;\n    try (Stream\u003cString\u003e lines=Files.lines(path)) {\n        count = lines.count();\n    }\n    BloomFilter\u003cString\u003e bloomFilter;\n    try (Stream\u003cString\u003e lines=Files.lines(path)) {\n        bloomFilter = lines.collect(BloomFilter.toBloomFilter(count, 0.001));\n    }\n\n### Distributions\n\ndelta4j provides several commonly used distributions along with methods to create them in a parallel\nor distributed fashion. Each distribution supports the following features:\n\n* `sample(Random)`: Returns a single sample from the distribution. The result type differs from one\n  distribution type to another, but is always categorical (parametric), continuous (`double`), or\n  discrete (`long`).\n* `Sketch`: A sketch is a lightweight, mutable, and serializable object that can be used to\n  fit a distribution incrementally in a divide-and-conquer fashion. Each distribution type (e.g.,\n  `FancyDistribution`) has an inner `Sketch` type (e.g., `FancyDistribution.Sketch`).\n* `fit(Stream)`: Fits a distribution to a stream of data. This method is designed to be used in a\n  parallel or distributed fashion, depending on the stream provided.\n* `toXxxDistribution()`: Returns a `Collector` that can be used to fit a distribution to a stream of\n  data. 
This method is designed to be used in a parallel or distributed fashion, depending on the\n  stream provided. Only defined for categorical distributions because the standard library does not\n  define corresponding `Collector` types for `DoubleStream` and `LongStream`.\n* `of()`: Returns a distribution with the given parameters. When creating a distribution directly\n  from parameters, this method is preferred (as opposed to using an explicit constructor) because\n  some distributions may have special or common instances that are precomputed.\n\nFor example, here is how to use the `GaussianDistribution` class:\n\n    GaussianDistribution gaussianDistribution=GaussianDistribution.of(0.0, 1.0);\n    double sample=gaussianDistribution.sample(ThreadLocalRandom.current());\n\n### Text Views\n\nThe `StringView` class is a lightweight, mutable, and serializable implementation of `CharSequence` that\ncan be used to create substring views of a larger string without incurring the cost of copying the\nunderlying character array. Both mutable (`MutableStringView`) and immutable (`ImmutableStringView`)\nversions are provided. 
For example, this code uses `StringView` to create a map of all the unique\none- to four-letter substrings in a stream of strings:\n\n    // Compute the frequency of all one- to four-letter substrings in a file\n    Map\u003cCharSequence,Long\u003e bpes=Files.lines(path).flatMap(line -\u003e \n      IntStream.rangeClosed(1, 4)\n        .filter(len -\u003e len \u003c= line.length())\n        .mapToObj(len -\u003e \n          IntStream.rangeClosed(0, line.length() - len)\n            .mapToObj(i -\u003e StringView.mutableOf(line, i, i+len)))\n        .flatMap(Function.identity()))\n      .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));\n\nThe `CharArrayView` class is a lightweight, mutable, and serializable implementation of `CharSequence`\nthat can be used to create substring views of a larger character array without incurring the cost of\ncopying the underlying character array. Both mutable (`MutableCharArrayView`) and immutable\n(`ImmutableCharArrayView`) versions are provided. A similar example applies to `CharArrayView`.\n\n### Jackson Serialization\n\nThe `delta4j-jackson` module provides support for serializing and deserializing delta4j data\nstructures using the Jackson JSON library. To use this module, add it to your POM file and register\nthe module with your `ObjectMapper` instance:\n\n    ObjectMapper mapper=new ObjectMapper();\n    mapper.registerModule(new Delta4jModule());\n\n## Installation\n\nTo use delta4j in your project, add the following dependency to your `pom.xml` for Maven:\n\n```xml\n\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.sigpwned\u003c/groupId\u003e\n  \u003cartifactId\u003edelta4j\u003c/artifactId\u003e\n  \u003cversion\u003e0.0.0-b0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n## Contributing\n\nWe welcome contributions! If you would like to help make delta4j better, please follow our\ncontributing guidelines. 
You can submit bug reports, feature requests, and pull requests through our\nGitHub issue tracker.\n\n## License\n\ndelta4j is open-source software licensed under the Apache License, Version 2.0. See the LICENSE file\nfor more details.\n\n## Colophon\n\nThis library was originally designed as a set of `Collector` implementations to assist in parallel\ndata processing using Java 8 streams. A delta always appears at the end of a large stream. Also, a\n\"delta\" is a small, incremental update to an existing dataset, and this library is designed not only\nto represent them efficiently, but also to process them at massive scale.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsigpwned%2Fdelta4j","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsigpwned%2Fdelta4j","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsigpwned%2Fdelta4j/lists"}