{"id":13488604,"url":"https://github.com/Python3WebSpider/ScrapyRedisBloomFilter","last_synced_at":"2025-03-28T01:36:46.002Z","repository":{"id":57464710,"uuid":"100011563","full_name":"Python3WebSpider/ScrapyRedisBloomFilter","owner":"Python3WebSpider","description":"Scrapy Redis Bloom Filter","archived":false,"fork":false,"pushed_at":"2021-07-25T18:17:09.000Z","size":35,"stargazers_count":173,"open_issues_count":7,"forks_count":52,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-09-17T01:53:13.969Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Python3WebSpider.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-08-11T08:43:16.000Z","updated_at":"2024-08-03T19:05:07.000Z","dependencies_parsed_at":"2022-08-31T03:10:12.303Z","dependency_job_id":null,"html_url":"https://github.com/Python3WebSpider/ScrapyRedisBloomFilter","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Python3WebSpider%2FScrapyRedisBloomFilter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Python3WebSpider%2FScrapyRedisBloomFilter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Python3WebSpider%2FScrapyRedisBloomFilter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Python3WebSpider%2FScrapyRedisBloomFilter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Python3WebSpider","download_url":"https://codeload.gith
ub.com/Python3WebSpider/ScrapyRedisBloomFilter/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222333976,"owners_count":16968058,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T18:01:18.742Z","updated_at":"2025-03-28T01:36:45.996Z","avatar_url":"https://github.com/Python3WebSpider.png","language":"Python","readme":"# Scrapy-Redis-BloomFilter\n\nThis package adds Bloom filter support to Scrapy-Redis.\n\n## Installation\n\nYou can install this package with pip:\n\n```\npip install scrapy-redis-bloomfilter\n```\n\nDependency:\n\n- Scrapy-Redis \u003e= 0.6.8\n\n## Usage\n\nAdd these settings to `settings.py`:\n\n```python\n# Use this Scheduler if your scrapy_redis version is \u003c= 0.7.1\nSCHEDULER = \"scrapy_redis_bloomfilter.scheduler.Scheduler\"\n\n# Ensure all spiders share the same duplicates filter through Redis\nDUPEFILTER_CLASS = \"scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter\"\n\n# Redis URL\nREDIS_URL = 'redis://localhost:6379'\n\n# Number of hash functions to use; defaults to 6\nBLOOMFILTER_HASH_NUMBER = 6\n\n# Size of the Bloom filter in Redis, as a power of two in bits; 30 means 2^30 bits = 128 MB. Defaults to 30\nBLOOMFILTER_BIT = 10\n\n# Persist\nSCHEDULER_PERSIST = True\n```\n\n## Test\n\nThis project includes a test. To run it:\n\n```\ngit clone https://github.com/Python3WebSpider/ScrapyRedisBloomFilter.git\ncd ScrapyRedisBloomFilter/test\nscrapy crawl test\n```\n\nNote: change REDIS_URL in settings.py to match your Redis instance.\n\nThe spider looks like this:\n\n```python\nfrom scrapy import Request, 
Spider\n\nclass TestSpider(Spider):\n    name = 'test'\n    base_url = 'https://www.baidu.com/s?wd='\n\n    def start_requests(self):\n        for i in range(10):\n            url = self.base_url + str(i)\n            yield Request(url, callback=self.parse)\n\n        # These 100 Requests include 10 duplicates of the Requests above\n        for i in range(100):\n            url = self.base_url + str(i)\n            yield Request(url, callback=self.parse)\n\n    def parse(self, response):\n        self.logger.debug('Response of ' + response.url)\n```\n\nThe result looks like this:\n\n```python\n{'bloomfilter/filtered': 10, # Number of Requests filtered by the Bloom filter\n 'downloader/request_bytes': 34021,\n 'downloader/request_count': 100,\n 'downloader/request_method_count/GET': 100,\n 'downloader/response_bytes': 72943,\n 'downloader/response_count': 100,\n 'downloader/response_status_count/200': 100,\n 'finish_reason': 'finished',\n 'finish_time': datetime.datetime(2017, 8, 11, 9, 34, 30, 419597),\n 'log_count/DEBUG': 202,\n 'log_count/INFO': 7,\n 'memusage/max': 54153216,\n 'memusage/startup': 54153216,\n 'response_received_count': 100,\n 'scheduler/dequeued/redis': 100,\n 'scheduler/enqueued/redis': 100,\n 'start_time': datetime.datetime(2017, 8, 11, 9, 34, 26, 495018)}\n```\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPython3WebSpider%2FScrapyRedisBloomFilter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FPython3WebSpider%2FScrapyRedisBloomFilter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPython3WebSpider%2FScrapyRedisBloomFilter/lists"}