{"id":15665893,"url":"https://github.com/kingname/crawlerutility","last_synced_at":"2025-07-11T20:04:00.630Z","repository":{"id":202930395,"uuid":"127854062","full_name":"kingname/CrawlerUtility","owner":"kingname","description":"Simplify the development of your webcrawler","archived":false,"fork":false,"pushed_at":"2018-04-03T08:05:43.000Z","size":16,"stargazers_count":8,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-13T20:13:39.996Z","etag":null,"topics":["python3","requests","scrapy","webcrawler"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kingname.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-04-03T05:09:49.000Z","updated_at":"2021-11-02T07:16:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"adb5c1b3-3acc-4369-addf-bcd6eff03335","html_url":"https://github.com/kingname/CrawlerUtility","commit_stats":null,"previous_names":["kingname/crawlerutility"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kingname/CrawlerUtility","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kingname%2FCrawlerUtility","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kingname%2FCrawlerUtility/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kingname%2FCrawlerUtility/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kingname%2FCrawlerUtility/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kingname","downl
oad_url":"https://codeload.github.com/kingname/CrawlerUtility/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kingname%2FCrawlerUtility/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264890087,"owners_count":23678833,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python3","requests","scrapy","webcrawler"],"created_at":"2024-10-03T13:56:27.971Z","updated_at":"2025-07-11T20:04:00.609Z","avatar_url":"https://github.com/kingname.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CrawlerUtility\n\nSimplify the development of your webcrawler\n\n# Usage\n\n## Install\n\n```\npip install --upgrade git+https://github.com/kingname/CrawlerUtility.git\n```\n\n## Common Utility\n\nYou can use this module without installing any third-party packages.\n\n### ChromeHeaders2Dict\n\n```\nfrom CrawlerUtility import ChromeHeaders2Dict\n\nchrome_headers = '''\n    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\n    Accept-Encoding: gzip, deflate, br\n    Accept-Language: zh-CN,zh;q=0.9,en;q=0.8\n    Connection: keep-alive\n    Cookie: BAIDUID=E40AF2FAEC8CB0F382A3A8F5F59AC44D:FG=1; BIDUPSID=E40AF2FAEC8CB0F382A3A8F5F59AC44D; PSTM=1513916193; BDRCVFR[C0-VKBuJmg_]=mk3SLVN4HKm; BD_CK_SAM=1; pgv_pvi=8525405184; pgv_si=s3529928704; FP_UID=5eea85cb6e65c4d7a9f0f7b9d23ff3cb; BDRCVFR[w2jhEs_Zudc]=I67x6TjHwwYf0; BD_UPN=123253; shifen[62291884541_98248]=1520672084; BCLID=11171094791344044520; 
BDSFRCVID=LNCsJeC62ZBf13rACvOD-ViSJHNR0mTTH6aoKULoKvtI-AUyiIRrEG0PqU8g0Ku-sN62ogKK0mOTHvbP; H_BDCLCKID_SF=tbKq_DLXf-bSK4b1-4QD2DCShUFsWU6m-2Q-5KL-yqothDO4Lfb-XU3D3xrgBfvwLJRL-UbdJJjoOU5shUR-5McDLJo8axcN-eTxoUJhQCnJhhvGqJbFj6DebPRiJPr9Qgbq3ftLK-oj-D-mD55P; PSINO=7; MCITY=-131%3A; BDUSS=Vdic1Z6WHhEaGhvSW1KflhWUVYwcFRhemI0RjhDdjVmcGF1bktaVkNWQnppZDVhQUFBQUFBJCQAAAAAAAAAAAEAAACoVyMi1MLC5F-zpLCyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHP8tlpz~LZac; BD_HOME=1; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; H_PS_PSSID=1993_1436_21094_18560_22157; sugstore=1; BDSVRTM=0\n    DNT: 1\n    Host: www.baidu.com\n    Upgrade-Insecure-Requests: 1\n    User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36\n    '''\nheaders_dict = ChromeHeaders2Dict(chrome_headers)\nprint(headers_dict)\n```\n\n## Scrapy Utility\n\nTo use this module, you must install `Scrapy` first.\n\n### AbuyunProxyMiddleware\n\nModify the `settings.py` of your Scrapy project:\n\n```\nDOWNLOADER_MIDDLEWARES = {\n   'CrawlerUtility.scrapy_utility.ScrapyUtility.AbuyunProxyMiddleware': 548,\n}\n\nABUYUN_PROXY_SERVER = 'http://http-dyn.abuyun.com:9020' # must be set\nABUYUN_PROXY_USER = 'DWE2341LFOWQC4' # must be set\nABUYUN_PROXY_PASSWORD = '94SLIC1304C' # must be set\nSPIDER_BEHIND_PROXY = ['BaiduSpider', 'QQSpider'] # list of spider names; if not set, all spiders will be behind the proxy\nSKIP_PROXY_KEYWORD = ['http://google.com', 'safeurl.com/aaa'] # URLs that match any pattern in this list will not use the proxy\n```\n\n### LogRequestUrlMiddleware\n\nBy default, Scrapy logs only the response URL. But what if the request follows one or more redirects? 
Sometimes you send 10 POST requests\nbut Scrapy shows only 5 responses, and you cannot tell whether only 5 requests were actually sent or whether the other responses are missing or blocked.\nThis module solves that problem by logging the request URL.\n\nTo use this module, change the `settings.py` of your Scrapy project. Note that this is a `Spider Middleware`,\nNOT a `Downloader Middleware`.\n\n```\nSPIDER_MIDDLEWARES = {\n   'CrawlerUtility.scrapy_utility.ScrapyUtility.LogRequestUrlMiddleware': 548,\n}\n\n# Logging request URLs can greatly increase log volume, so use the following settings to limit which requests are logged.\n\nSPIDER_SHOW_REQUESTS_URL = ['test'] # names of the spiders whose request URLs should be logged\nPATTERN_SHOW_REQUESTS_URL = ['httpbin', 'kingname'] # only URLs that match one of these patterns are logged\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkingname%2Fcrawlerutility","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkingname%2Fcrawlerutility","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkingname%2Fcrawlerutility/lists"}