Scrapy Redis Bloom Filter
https://github.com/Python3WebSpider/ScrapyRedisBloomFilter
- Host: GitHub
- URL: https://github.com/Python3WebSpider/ScrapyRedisBloomFilter
- Owner: Python3WebSpider
- Created: 2017-08-11T08:43:16.000Z
- Default Branch: master
- Last Pushed: 2021-07-25T18:17:09.000Z
- Last Synced: 2024-09-17T01:53:13.969Z
- Language: Python
- Size: 34.2 KB
- Stars: 173
- Watchers: 3
- Forks: 52
- Open Issues: 7
Metadata Files:
- Readme: README.md
README
# Scrapy-Redis-BloomFilter
This package adds Bloom filter based duplicate filtering to Scrapy-Redis.
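Scrapy-Redis's stock `RFPDupeFilter` stores every request fingerprint in a Redis set, so its memory use grows with the number of requests; a Bloom filter instead tests membership against a fixed-size bitmap, accepting a small false-positive rate in exchange. Here is a minimal sketch of that idea (the class and key names are illustrative, not this package's actual code):

```python
# Minimal sketch of a Bloom filter over a Redis bitmap (illustrative,
# not this package's implementation). Fingerprints hash to k bit
# offsets; a value "exists" only if all k bits are already set.
import redis
from hashlib import md5


class RedisBloomFilter:
    def __init__(self, server, key, bit=30, hash_number=6):
        self.server = server            # redis.StrictRedis instance
        self.key = key                  # Redis key holding the bitmap
        self.m = 1 << bit               # total number of bits, e.g. 2^30
        self.hash_number = hash_number  # number of hash functions

    def _offsets(self, value):
        # Salting the digest with a seed is one simple way to derive
        # hash_number "independent" hash functions from md5.
        for seed in range(self.hash_number):
            digest = md5((str(seed) + value).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.m

    def exists(self, value):
        # False positives are possible; false negatives are not.
        return all(self.server.getbit(self.key, off)
                   for off in self._offsets(value))

    def insert(self, value):
        for off in self._offsets(value):
            self.server.setbit(self.key, off, 1)


server = redis.StrictRedis(host='localhost', port=6379)
bf = RedisBloomFilter(server, 'test:bloomfilter')
if not bf.exists('some-request-fingerprint'):
    bf.insert('some-request-fingerprint')
```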
## Installation
You can easily install this package with pip:
```
pip install scrapy-redis-bloomfilter
```

Dependency:
- Scrapy-Redis >= 0.6.8
## Usage
Add these settings to `settings.py`:
```python
# Use this scheduler if your scrapy_redis version is <= 0.7.1
SCHEDULER = "scrapy_redis_bloomfilter.scheduler.Scheduler"

# Ensure all spiders share the same duplicate filter through Redis
DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"

# Redis URL
REDIS_URL = 'redis://localhost:6379'

# Number of hash functions to use, defaults to 6
BLOOMFILTER_HASH_NUMBER = 6

# Number of bits the Bloom filter uses in Redis:
# 30 means 2^30 bits = 128 MB, defaults to 30
BLOOMFILTER_BIT = 10

# Persist the Redis queue and dedupe data when the spider closes
SCHEDULER_PERSIST = True
```
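How large to make the filter depends on how many request fingerprints you expect and how many false positives you can tolerate; note that the `BLOOMFILTER_BIT = 10` above (2^10 bits) is only sized for a small demo. The function below is an illustrative back-of-the-envelope check using the standard Bloom filter estimate, not part of this package:

```python
# Back-of-the-envelope Bloom filter sizing (illustrative, not part of
# this package), using the standard estimate p ~= (1 - e^(-k*n/m))^k.
from math import exp

def false_positive_rate(bit, hash_number, n):
    m = 1 << bit        # total bits, e.g. 2^30 for BLOOMFILTER_BIT = 30
    k = hash_number     # BLOOMFILTER_HASH_NUMBER
    return (1 - exp(-k * n / m)) ** k

# 2^30 bits (128 MB) with 6 hash functions and 100 million fingerprints:
print(false_positive_rate(30, 6, 100_000_000))  # ~0.006, i.e. under 1%
```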
## Test

The repository includes a test project. To run it:
```
git clone https://github.com/Python3WebSpider/ScrapyRedisBloomFilter.git
cd ScrapyRedisBloomFilter/test
scrapy crawl test
```

Note: change `REDIS_URL` in `settings.py` to point at your own Redis server first.
The test spider looks like this:
```python
from scrapy import Request, Spider


class TestSpider(Spider):
    name = 'test'
    base_url = 'https://www.baidu.com/s?wd='

    def start_requests(self):
        for i in range(10):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

        # These 100 Requests repeat the 10 URLs above,
        # so 10 of them are duplicates
        for i in range(100):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.debug('Response of ' + response.url)
```

The crawl stats look like this:
```python
{'bloomfilter/filtered': 10,  # number of Requests filtered out by the Bloom filter
'downloader/request_bytes': 34021,
'downloader/request_count': 100,
'downloader/request_method_count/GET': 100,
'downloader/response_bytes': 72943,
'downloader/response_count': 100,
'downloader/response_status_count/200': 100,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 11, 9, 34, 30, 419597),
'log_count/DEBUG': 202,
'log_count/INFO': 7,
'memusage/max': 54153216,
'memusage/startup': 54153216,
'response_received_count': 100,
'scheduler/dequeued/redis': 100,
'scheduler/enqueued/redis': 100,
'start_time': datetime.datetime(2017, 8, 11, 9, 34, 26, 495018)}
```
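In this run `start_requests` yields 110 Requests in total; the 10 whose URLs repeat the first loop are dropped by the Bloom filter (`bloomfilter/filtered: 10`), leaving the 100 that are actually downloaded (`downloader/request_count: 100`).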