{"id":20414415,"url":"https://github.com/pykeen/bloom-filterer-benchmark","last_synced_at":"2025-10-06T02:58:56.260Z","repository":{"id":104276154,"uuid":"361169578","full_name":"pykeen/bloom-filterer-benchmark","owner":"pykeen","description":"🪑 Benchmark the bloom filterer at https://pykeen.github.io/bloom-filterer-benchmark/","archived":false,"fork":false,"pushed_at":"2021-04-25T10:33:49.000Z","size":11319,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-05T03:29:57.194Z","etag":null,"topics":["benchmarking","bloom-filter","knowledge-graph-embedding-models","knowledge-graph-embeddings","knowledge-graphs","machine-learning","negative-sampling","pykeen"],"latest_commit_sha":null,"homepage":"https://pykeen.github.io/bloom-filterer-benchmark/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pykeen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-24T13:31:14.000Z","updated_at":"2021-04-25T17:09:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"bad9b1b2-66e9-40f1-9f96-44590986cf8b","html_url":"https://github.com/pykeen/bloom-filterer-benchmark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pykeen/bloom-filterer-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pykeen%2Fbloom-filterer-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pykeen%2Fbloom-filterer-benchmark/tags","release
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pykeen%2Fbloom-filterer-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pykeen%2Fbloom-filterer-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pykeen","download_url":"https://codeload.github.com/pykeen/bloom-filterer-benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pykeen%2Fbloom-filterer-benchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278551509,"owners_count":26005389,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-06T02:00:05.630Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarking","bloom-filter","knowledge-graph-embedding-models","knowledge-graph-embeddings","knowledge-graphs","machine-learning","negative-sampling","pykeen"],"created_at":"2024-11-15T06:09:50.035Z","updated_at":"2025-10-06T02:58:56.232Z","avatar_url":"https://github.com/pykeen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# bloom-filterer-benchmark\n\nNegative sampling is necessary during training of knowledge graph embedding models because knowledge graphs typically\nonly have positive examples. 
Typically, negative examples are derived from positive ones by corruption: replacing an \nentry of a positive triple with a random one. Unfortunately, this corruption technique can produce false \nnegatives that are actually already in the knowledge graph. Sometimes this isn't a big deal, depending on the \ndataset, model, and learning task. PyKEEN provides the ability to filter false negatives using an exact algorithm, \nbut it's quite slow.\n\nAn alternative filterer based on an approximate existence index structure called the\n[bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) was introduced in\n[PyKEEN #401](https://github.com/pykeen/pykeen/pull/401) by Max Berrendorf ([@mberr](https://github.com/mberr)).\nWe then ran this benchmarking study to show that it substantially reduces lookup time while maintaining a low\nerror rate.\n\nThe code and artifacts are available on [GitHub](https://github.com/pykeen/bloom-filterer-benchmark). \nThe benchmark can be rerun with `python benchmark.py --force`. A tutorial on how to enable the filtering of negative \nsamples as well as specifics on the exact filterer and the bloom filterer are available on\n[Read the Docs](https://pykeen.readthedocs.io/en/latest/reference/negative_sampling.html).\n\n## Benchmarking\n\nBenchmarking over several datasets of varying size suggests that the relationship between the bloom filter's\n`error_rate` parameter and the actual error observed on either the testing or validation sets does not depend\nstrongly on dataset size.\n\n\u003cimg src=\"charts/errors.svg\" /\u003e\n\nAs expected, the time for checking the triples decreases with an increased nominal error rate.\n\n\u003cimg src=\"charts/lookup_times.svg\" /\u003e\n\nBloom filters for datasets with a larger number of triples take longer to create. 
The time to create a bloom filter also decreases as the\nnominal error rate increases.\n\n\u003cimg src=\"charts/creation_times.svg\" /\u003e\n\nThe size of the bloom filter increases with a larger number of training triples, but also varies exponentially with the\nerror rate. The relationship is `log(size) ~ log(triples) + log(error rate)`.\n\n\u003cimg src=\"charts/sizes.svg\" /\u003e\n\n## Comparison\n\nThe bloom filterer is compared to the exact implementation using Python sets. Several error rates\nfor the bloom filter are shown simultaneously to illustrate the tradeoff available at each.\n\nThe setup times show the bloom filterer is faster for larger datasets, though this only has to be done once\nfor any given training run. The times for both implementations are sub-second, so they can be considered negligible.\n\n\u003cimg src=\"charts/comparison/setup.svg\" /\u003e\n\nThe Python-based implementation performs well on smaller datasets, but the bloom filterer clearly wins for larger\nones. The lookup operation is repeated during training, so these times do add up.\n\n\u003cimg src=\"charts/comparison/lookup_times.svg\" /\u003e\n\nThe error rates make most sense to show on a log scale, but since the exact implementation has a constant error\nrate of zero, the log scale becomes an issue. Therefore, the error rates were all adjusted by adding\n`1 / number of triples`, which is the minimum possible nonzero error for any given lookup task. In this chart, you can\nthink of the \"adjusted error rate\" reported for the exact algorithm as the lower bound. Note that several datapoints\nare shown for the bloom filter, which correspond to different \"desired error rates\" that are parameterizable\nin the code.\n\n\u003cimg src=\"charts/comparison/errors.svg\" /\u003e\n\nThe following chart shows the tradeoff between the implementations (with all error rates of the bloom filter shown)\nacross all datasets. 
Again, think of the adjusted error rate reported for the exact algorithm as a lower bound.\n\n\u003cimg src=\"charts/comparison/errors_2d.svg\" /\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpykeen%2Fbloom-filterer-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpykeen%2Fbloom-filterer-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpykeen%2Fbloom-filterer-benchmark/lists"}