{"id":16908111,"url":"https://github.com/cjrh/aggonydb","last_synced_at":"2026-04-19T17:11:19.214Z","repository":{"id":70769770,"uuid":"481880132","full_name":"cjrh/aggonydb","owner":"cjrh","description":"Aggony DB is a one-trick-pony database that can perform rapid aggregation of many-fields low-cardinality big data","archived":false,"fork":false,"pushed_at":"2022-04-25T05:08:16.000Z","size":247,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-08-26T12:18:53.441Z","etag":null,"topics":["aggregation","datasketches","probabilistic-data-structures"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cjrh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-04-15T07:46:24.000Z","updated_at":"2025-02-19T20:50:20.000Z","dependencies_parsed_at":null,"dependency_job_id":"2f46fb0e-8f74-41ff-8783-0c8d1690dd48","html_url":"https://github.com/cjrh/aggonydb","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cjrh/aggonydb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cjrh%2Faggonydb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cjrh%2Faggonydb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cjrh%2Faggonydb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cjrh%2Faggonydb/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/cjrh","download_url":"https://codeload.github.com/cjrh/aggonydb/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cjrh%2Faggonydb/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32014831,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"online","status_checked_at":"2026-04-19T02:00:07.110Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aggregation","datasketches","probabilistic-data-structures"],"created_at":"2024-10-13T18:50:12.523Z","updated_at":"2026-04-19T17:11:19.194Z","avatar_url":"https://github.com/cjrh.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# aggonydb\nAggony DB is a one-trick-pony database that can perform rapid aggregation of\nmany-fields low-cardinality data\n\n## ...but what is it tho?\n\nSay you have spreadsheet-like data:\n\n| key | name                    | age | city     | ..1000's of columns |\n|-----|-------------------------|-----|----------|---------------------|\n| 1   | caleb                   | 40  | Brisbane | ...                 |\n| 2   | gina                    | 40  | Brisbane |                     |\n| 3   | mike                    | 30  | London   |                     |\n| 4   | gina                    | 30  | London   |                     |\n| ... 
| ...another million rows | ... | ...      |                     |\n\nThen you want to analyze it, specifically with *filtering + aggregation*. For example,\nyou want to answer the following questions:\n\n- _How many people called `gina` are in `London`?_\n- _How many people called `caleb` are `30`?_\n\nIn essence, you're applying filters to the data and counting how\nmany records match.\n\nThis repo provides a proof-of-concept implementation of an HTTP\nAPI for:\n- storing data into a PostgreSQL database\n- querying that database for filtered aggregations\n\nSee the Demo further down.\n\n## This is what DBs are good at. What's the problem?\n\nIt turns out that using conventional techniques like storing all\nthe data in a relational database becomes slow when the size of\nthe data increases. The speed issue is not easily solved\nwith indexes. Because _any_ field can participate in a filter\noperation, one would have to create an index for every field,\nand the problem of combinatorial explosion still remains because\nthe database engine still needs to perform large costly joins\non all the filtered fields to arrive at the _intersection_.\n\nThis repo doesn't invent anything: all it does is wrap an HTTP\nAPI around the [Apache DataSketches](https://datasketches.apache.org/)\nlibrary.\n[They describe the filtering problem](https://datasketches.apache.org/docs/Background/TheChallenge.html)\nin much more detail.\n\n## Ok so what are you proposing?\n\nHa, I'm not proposing anything. This repo is an experiment to play\naround with using probabilistic data structures to see whether\nthey provide an acceptable solution to the filtering-aggregation\nproblem described above.\n\nThe Apache DataSketches project provides efficient implementations\nof several \"sketches\", which are the probabilistic data structures\nused for these cardinality (unique counting) approximations. 
\n\nThe specific one used in this demo project is the\n[Theta Sketch](https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html).\nAnother sketch called _HyperLogLog_ (HLL) got pretty popular a few years ago\nand is also a cardinality estimator. HLL has low estimate error for\ncounts and unions, but estimates of _intersection_ have much larger\nerror bounds because HLL doesn't natively support an intersection\nset operation; you have to calculate it manually with the\n[inclusion-exclusion principle](https://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle),\nwhich can result in a large estimate error if one of the sets has\na much smaller cardinality than the others.  This scenario will\nbe common in our \"filter a spreadsheet\" example.\n\nInstead, we use the Theta Sketch. There are several benefits of the\nTheta Sketch over HLL, but the primary reason is that the error of the\ncardinality estimate for set intersections can be several orders of\nmagnitude lower. The following graph, from [this page](https://datasketches.apache.org/docs/Theta/ThetaAccuracyPlots.html), shows how the\nrelative error for the inclusion-exclusion method grows, compared\nto the Theta Sketch native implementation, as the similarity of the\ntwo sets being intersected decreases (the intersection gets smaller\nand smaller towards the right).\n\n![Theta Sketch accuracy](64KSketchVsIEerror.png)\n\nThese are my reasons for using the Theta Sketch:\n- intersection is natively implemented, with much smaller relative error\n- other set operations, like `AnotB`, are also natively implemented\n- unlike HLL, can combine sketches of different accuracy: quoting from the\n  [Accuracy](https://datasketches.apache.org/docs/Theta/AccuracyOfDifferentKUnions.html) page: _One of the benefits of the Theta\n  Sketch algorithms is that they support the union of sketches that have\n  been created with different values of k or Nominal Entries_\n- the implementation in datasketches gives error bounds on estimates\n- 
the error bounds are unbiased above/below (HLL errors are biased)\n- just like HLLs, unions introduce no new error regardless of how many\n  are unioned together\n\n\n## Demo\n\nThe first step is to run the Docker container to start the PostgreSQL DB.\nYou need a checkout of https://github.com/apache/datasketches-postgresql\nalongside a checkout of this repo.  Then, in the folder for this repo:\n\n```shell\n$ docker-compose up -d\n```\n\nThen, start the HTTP server:\n\n```shell\n$ cargo run\n```\n\nThen, let's add some data (using [httpie](https://httpie.io/)):\n\n```shell\n$ http post localhost:8080/add/demo \\\n    fields[name]=gina \\\n    fields[age]:=40 \\\n    fields[height]:=1.7 \\\n    fields[adult]:=true \\\n    fields[stars]:=5.0 \\\n    domain_key:=1\nok\n\n$ http post localhost:8080/add/demo \\\n    fields[name]=gina \\\n    fields[age]:=40 \\\n    fields[height]:=1.7 \\\n    fields[adult]:=true \\\n    fields[stars]:=5.0 \\\n    domain_key:=2\nok\n\n$ http post localhost:8080/add/demo \\\n    fields[name]=gina \\\n    fields[age]:=40 \\\n    fields[height]:=1.7 \\\n    fields[adult]:=true \\\n    fields[stars]:=5.0 \\\n    domain_key:=3\nok\n\n```\n\nNotes:\n- The \"demo\" dataset name (in the URL, `/add/{dataset}`) is created on the\n  fly.\n- The given field names and values do not need to exist beforehand either;\n  they're created on the fly.\n- The `domain_key` is the disambiguator. The three additions made above mean\n  3 separate events. If you're seeding this data from another database, it\n  might make sense to use the primary key, for example, depending on what\n  you're adding.\n- Note that all JSON types are allowed in the request POST body. 
(Internally,\n  they're all saved as string values but that doesn't affect anything)\n\nNow let's do a count:\n\n```shell\n$ http post localhost:8080/filter/demo items:='[{\"field_name\": \"name\", \"value\": \"gina\"}]'\n\n{\n    \"estimate\": 3.0,\n    \"lower_bound\": 3.0,\n    \"upper_bound\": 3.0\n}\n\n```\n\nNotes:\n- The request POST body must have an `items` field, which is an array of\n  objects. Each object must have a `field_name` attribute and a `value`\n  attribute. This sequence defines the intersection of these fields and values\n  over all the submitted data.\n- The `estimate` result field is the cardinality estimate of the intersection\n  (in this case we only have one filter).\n- The `lower_bound` and `upper_bound` fields define the accuracy range, and\n  will be greatly appreciated by consumers of the estimates.\n\n## TODO:\n\n- [ ] more examples in README showing multiple filters\n- [ ] more test cases for empty sets\n- [ ] make tests actually runnable in github ci (will have to run the PG container)\n- [ ] make a \"batch\" endpoint for receiving larger chunks of data\n- [ ] make an endpoint for receiving serialized theta sketches. The Python\n  datasketches library can produce serialized theta sketches that are byte\n  compatible with what is stored here.\n- [ ] contemplate an additional API that can save sketches alongside a date\n  field. It could be very useful for producing trends.\n- [ ] contemplate extracting a domain_key out of the supplied `add` fields,\n  rather than requiring it to be specified separately. Perhaps the name of\n  the field can be specified separately.  Not sure how I feel about this,\n  more magic probably helps nobody in the long run.\n- [ ] investigate doing more sketch stuff in rust space rather than in DB. 
More\n  of a scaling thing.\n- [ ] investigate using postgres partitions for easily \"ageing out\" old sketches.\n  I read a blog about something like this but I can't remember where that was\n  right now.\n- [ ] Add links and reading material to this README.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcjrh%2Faggonydb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcjrh%2Faggonydb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcjrh%2Faggonydb/lists"}