https://github.com/kingname/crawlerutility

Simplify the development of your webcrawler

python3 requests scrapy webcrawler

Last synced: 12 months ago

Host: GitHub
URL: https://github.com/kingname/crawlerutility
Owner: kingname
License: mit
Created: 2018-04-03T05:09:49.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2018-04-03T08:05:43.000Z (about 8 years ago)
Last Synced: 2025-04-13T20:13:39.996Z (about 1 year ago)
Topics: python3, requests, scrapy, webcrawler
Language: Python
Size: 15.6 KB
Stars: 8
Watchers: 4
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

README

# CrawlerUtility

Simplify the development of your webcrawler

# Usage

## Install

```
pip install --upgrade git+https://github.com/kingname/CrawlerUtility.git
```

## Common Utility

You can use this module without installing any third-party packages.

### ChromeHeaders2Dict

```
from CrawlerUtility import ChromeHeaders2Dict

# Request headers copied as plain text from Chrome DevTools
chrome_headers = '''
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
    Accept-Encoding: gzip, deflate, br
    Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
    Connection: keep-alive
    Cookie: BAIDUID=E40AF2FAEC8CB0F382A3A8F5F59AC44D:FG=1; BIDUPSID=E40AF2FAEC8CB0F382A3A8F5F59AC44D; PSTM=1513916193; BDRCVFR[C0-VKBuJmg_]=mk3SLVN4HKm; BD_CK_SAM=1; pgv_pvi=8525405184; pgv_si=s3529928704; FP_UID=5eea85cb6e65c4d7a9f0f7b9d23ff3cb; BDRCVFR[w2jhEs_Zudc]=I67x6TjHwwYf0; BD_UPN=123253; shifen[62291884541_98248]=1520672084; BCLID=11171094791344044520; BDSFRCVID=LNCsJeC62ZBf13rACvOD-ViSJHNR0mTTH6aoKULoKvtI-AUyiIRrEG0PqU8g0Ku-sN62ogKK0mOTHvbP; H_BDCLCKID_SF=tbKq_DLXf-bSK4b1-4QD2DCShUFsWU6m-2Q-5KL-yqothDO4Lfb-XU3D3xrgBfvwLJRL-UbdJJjoOU5shUR-5McDLJo8axcN-eTxoUJhQCnJhhvGqJbFj6DebPRiJPr9Qgbq3ftLK-oj-D-mD55P; PSINO=7; MCITY=-131%3A; BDUSS=Vdic1Z6WHhEaGhvSW1KflhWUVYwcFRhemI0RjhDdjVmcGF1bktaVkNWQnppZDVhQUFBQUFBJCQAAAAAAAAAAAEAAACoVyMi1MLC5F-zpLCyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHP8tlpz~LZac; BD_HOME=1; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; H_PS_PSSID=1993_1436_21094_18560_22157; sugstore=1; BDSVRTM=0
    DNT: 1
    Host: www.baidu.com
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
    '''

headers_dict = ChromeHeaders2Dict(chrome_headers)
print(headers_dict)
```
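To make the idea behind this conversion concrete, here is a minimal sketch of how header text copied from Chrome could be parsed into a dict. The function `parse_chrome_headers` is a hypothetical stand-in written for illustration; it is not the actual implementation of `ChromeHeaders2Dict`, whose internals may differ.

```python
def parse_chrome_headers(raw: str) -> dict:
    """Turn "Name: value" lines into a dict (illustrative, not the library's code)."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, HTTP/2 pseudo-headers such as ":authority",
        # and anything that does not look like "Name: value".
        if not line or line.startswith(':') or ':' not in line:
            continue
        # Split on the first colon only, because values such as URLs contain colons.
        name, value = line.split(':', 1)
        headers[name.strip()] = value.strip()
    return headers


sample = '''
    Accept-Encoding: gzip, deflate, br
    Host: www.baidu.com
    Referer: https://www.baidu.com/
'''
parsed = parse_chrome_headers(sample)
print(parsed)
```

A dict in this shape can be passed as the `headers` argument of `requests.get`.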

## Scrapy Utility

If you want to use this module, you must install `Scrapy` first.

### AbuyunProxyMiddleware

Modify `settings.py` of your Scrapy project:

```
DOWNLOADER_MIDDLEWARES = {
    'CrawlerUtility.scrapy_utility.ScrapyUtility.AbuyunProxyMiddleware': 548,
}

ABUYUN_PROXY_SERVER = 'http://http-dyn.abuyun.com:9020'  # required
ABUYUN_PROXY_USER = 'DWE2341LFOWQC4'  # required
ABUYUN_PROXY_PASSWORD = '94SLIC1304C'  # required

# Names of the spiders that go through the proxy. If unset, every spider uses the proxy.
SPIDER_BEHIND_PROXY = ['BaiduSpider', 'QQSpider']

# URLs that match any pattern in this list do not use the proxy.
SKIP_PROXY_KEYWORD = ['http://google.com', 'safeurl.com/aaa']
```
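The following sketch illustrates how settings like these might be applied to each request. It is not the middleware's actual code: the helper names `build_proxy_auth` and `should_use_proxy` are hypothetical, and two assumptions are made here, namely that the proxy accepts HTTP Basic authentication and that `SKIP_PROXY_KEYWORD` entries are matched as plain substrings of the URL.

```python
import base64


def build_proxy_auth(user: str, password: str) -> str:
    """Build a Proxy-Authorization header value (assumes HTTP Basic auth)."""
    token = base64.b64encode(f'{user}:{password}'.encode('utf-8')).decode('ascii')
    return f'Basic {token}'


def should_use_proxy(spider_name, url, spiders_behind_proxy=None, skip_keywords=()):
    """Decide whether a request goes through the proxy (hypothetical helper)."""
    # If a spider list is configured, only the listed spiders use the proxy;
    # with no list, every spider does.
    if spiders_behind_proxy and spider_name not in spiders_behind_proxy:
        return False
    # Assumption: a keyword matches when it appears anywhere in the URL.
    return not any(keyword in url for keyword in skip_keywords)


print(build_proxy_auth('user', 'pass'))
print(should_use_proxy('BaiduSpider', 'http://example.com/page',
                       ['BaiduSpider', 'QQSpider'], ['safeurl.com/aaa']))
```

In a Scrapy downloader middleware, a decision like this would typically lead to setting `request.meta['proxy']` and the `Proxy-Authorization` header inside `process_request`.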

### LogRequestUrlMiddleware

By default, Scrapy logs only the response URL. But what if the request follows one or more redirects? And sometimes you make 10 POST requests but Scrapy shows only 5 responses, so you cannot tell whether only 5 requests were actually sent or whether the other responses are missing or blocked.

This module solves the problem by logging the request URL.

To use this module, change the `settings.py` of your Scrapy project. Note that this is a `Spider Middleware`, NOT a `Downloader Middleware`:

```
SPIDER_MIDDLEWARES = {
    'CrawlerUtility.scrapy_utility.ScrapyUtility.LogRequestUrlMiddleware': 548,
}

# Logging request URLs makes the log considerably larger, so use the
# following settings to limit which requests are logged.
SPIDER_SHOW_REQUESTS_URL = ['test']  # spiders whose request URLs should be logged
PATTERN_SHOW_REQUESTS_URL = ['httpbin', 'kingname']  # only URLs that match one of these patterns are logged
```
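The filtering described by these two settings can be sketched as follows. This is an illustration only: `should_log_request` is a hypothetical helper, not part of CrawlerUtility, and it assumes that patterns are matched as plain substrings and that a request is logged only when both the spider name and the URL qualify.

```python
def should_log_request(spider_name, url, spiders, patterns):
    """Return True if this request URL should be logged (hypothetical helper)."""
    # Only spiders listed in SPIDER_SHOW_REQUESTS_URL are considered.
    if spider_name not in spiders:
        return False
    # Assumption: a pattern matches when it appears anywhere in the URL.
    return any(pattern in url for pattern in patterns)


spiders = ['test']
patterns = ['httpbin', 'kingname']
for url in ['http://httpbin.org/post', 'http://example.com/']:
    print(url, should_log_request('test', url, spiders, patterns))
```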
