{"id":16871254,"url":"https://github.com/mynameisfiber/fuggetaboutit","last_synced_at":"2025-03-22T07:31:07.553Z","repository":{"id":57432550,"uuid":"13305205","full_name":"mynameisfiber/fuggetaboutit","owner":"mynameisfiber","description":"implementations of a counting bloom, a timing bloom and a scaling timing bloom... made for working with streams!","archived":false,"fork":false,"pushed_at":"2017-02-01T14:57:31.000Z","size":303,"stargazers_count":42,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-18T09:21:29.296Z","etag":null,"topics":["bloom-filters","counting-bloom-filter","datastructures","decay","fast","python","scaling","stream","timing-bloom","timing-bloom-filter"],"latest_commit_sha":null,"homepage":"http://micha.codes/fuggetaboutit/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mynameisfiber.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-10-03T17:51:27.000Z","updated_at":"2023-01-26T03:35:28.000Z","dependencies_parsed_at":"2022-09-17T03:50:50.288Z","dependency_job_id":null,"html_url":"https://github.com/mynameisfiber/fuggetaboutit","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mynameisfiber%2Ffuggetaboutit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mynameisfiber%2Ffuggetaboutit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mynameisfiber%2Ffuggetaboutit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
mynameisfiber%2Ffuggetaboutit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mynameisfiber","download_url":"https://codeload.github.com/mynameisfiber/fuggetaboutit/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244925053,"owners_count":20532873,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bloom-filters","counting-bloom-filter","datastructures","decay","fast","python","scaling","stream","timing-bloom","timing-bloom-filter"],"created_at":"2024-10-13T15:06:52.722Z","updated_at":"2025-03-22T07:31:03.797Z","avatar_url":"https://github.com/mynameisfiber.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fugget About It\n[![Build Status](https://secure.travis-ci.org/mynameisfiber/fuggetaboutit.png?branch=master)](http://travis-ci.org/mynameisfiber/fuggetaboutit)\n[![PyPI version](https://badge.fury.io/py/fuggetaboutit.svg)](https://badge.fury.io/py/fuggetaboutit)\n\n\u003e auto-scaling probabilistic time windowed set inclusion datastructure\n\n[_docs_](http://micha.codes/fuggetaboutit)\n\n### what is?\n\nWhat does this mean?  Well... it means you can have a rolling window view on\nunique items in a stream (using the `TimingBloomFilter` object) and also have\nit rescale itself when the number of unique items increases beyond what you had\nanticipated (using the `ScalingTimingBloomFilter`).  
And, since this is built\non bloom filters, the number of bits per entry is generally EXCEEDINGLY small,\nletting you keep track of many items using a small amount of resources while\nstill having very tight bounds on error.\n\nSo, let's say you have a stream coming in 24 hours a day, 7 days a week.  This\nstream contains phone numbers and you want to ask the question \"Have I seen\nthis phone number in the past day?\".  This could be answered with the following\ncode stub:\n\n```\nfrom fuggetaboutit import TimingBloomFilter\n\n# Remember up to 1,000,000 unique items for 24 hours (decay_time is in seconds)\ncache = TimingBloomFilter(capacity=1000000, decay_time=24*60*60).start()\n\ndef handle_message(phone_number):\n    if phone_number in cache:\n        print(\"I have seen this before: %s\" % phone_number)\n    cache.add(phone_number)\n```\n\nAssuming you have a `tornado.ioloop` running, this will automatically forget\nold values for you and only print if the phone number has been seen *in the\nlast 24 hours*.  (NOTE: If you do not have an IOLoop running, don't worry...\njust call the `TimingBloomFilter.decay()` method every half a decay interval, or\nevery 12 hours in this example).\n\nNow, this example assumed you had a priori knowledge about how many unique phone\nnumbers you would expect -- we told fuggetaboutit that we would have at most\n1000000 unique phone numbers.  What happens if we don't know this number\nbeforehand or we know that this value varies wildly?  In this case, we can use\nthe `ScalingTimingBloomFilter`:\n\n```\nfrom fuggetaboutit import ScalingTimingBloomFilter\n\n# capacity is only a baseline here; the filter grows past it as needed\ncache = ScalingTimingBloomFilter(capacity=1000000, decay_time=24*60*60).start()\n\ndef handle_message(phone_number):\n    if phone_number in cache:\n        print(\"I have seen this before: %s\" % phone_number)\n    cache.add(phone_number)\n```\n\nThis will automatically build new bloom filters as needed, and delete unused\nones.  In this case, the capacity is simply a baseline capacity and we can\neasily grow beyond it.\n\n### speed\n\nDid we mention that this thing is fast?  
It's all built on numpy ndarrays and\nuses a C extension module to optimize all of the important bits.  On a 2011\nMacBook Air, I get:\n\n```\n$ python -m fuggetaboutit.benchmark\nBenchmarking blooms with size 100000\n(baseline timing of keygeneration: 9.84e-06s, already subtracted from results)\n.-------------------------------------------------------------------------------.\n|                                    | bench_add | bench_contains | bench_decay |\n|===============================================================================|\n|                Timing Bloom Filter | 1.09e-05s | 8.1764627e-06s | 1.9898e-03s |\n|        Scaling Timing Bloom Filter | 1.57e-05s | 1.6510360e-05s | 2.3653e-03s |\n| Scaled Scaling Timing Bloom Filter | 2.41e-05s | 1.9161074e-05s | 1.5937e-02s |\n'-------------------------------------------------------------------------------'\n```\n\nFor these benchmarks, the first and second entries are empty\n`TimingBloomFilter` and `ScalingTimingBloomFilter` objects with capacity\n100000.  The same is the case for the last entry; however, we also added 150000\nentries before the test so that the bloom is in a scaled state.\n\n### todo\n\n**MOAR SPEED**\n\n\n### References\n\nFuggetaboutit was inspired by the following papers:\n\n* Paulo Sérgio Almeida, Carlos Baquero, Nuno Preguiça, David Hutchison;\n  [\"Scalable Bloom Filters\"](http://asc.di.fct.unl.pt/~nmp/pubs/ref--04.pdf)\n* Jonathan L. Dautrich, Chinya V. 
Ravishankar; [\"Inferential Time-Decaying\n  Bloom Filters\"\n  ](http://www.edbt.org/Proceedings/2013-Genova/papers/edbt/a23-dautrich.pdf)\n* Adam Kirsch, Michael Mitzenmacher; [\"Less Hashing, Same Performance: Building\n  a Better Bloom\n  Filter\"](http://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmynameisfiber%2Ffuggetaboutit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmynameisfiber%2Ffuggetaboutit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmynameisfiber%2Ffuggetaboutit/lists"}