{"id":18402193,"url":"https://github.com/gerapy/gerapyproxy","last_synced_at":"2025-04-07T07:32:04.092Z","repository":{"id":57433918,"uuid":"279847782","full_name":"Gerapy/GerapyProxy","owner":"Gerapy","description":"A package for supporting proxy in Scrapy \u0026 Gerapy","archived":false,"fork":false,"pushed_at":"2020-07-15T16:36:53.000Z","size":19,"stargazers_count":11,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-22T14:34:48.987Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Gerapy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-07-15T11:16:33.000Z","updated_at":"2024-11-21T07:42:09.000Z","dependencies_parsed_at":"2022-09-01T19:10:43.182Z","dependency_job_id":null,"html_url":"https://github.com/Gerapy/GerapyProxy","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapyProxy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapyProxy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapyProxy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapyProxy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Gerapy","download_url":"https://codeload.github.com/Gerapy/GerapyProxy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247612256,
"owners_count":20966701,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T02:41:34.124Z","updated_at":"2025-04-07T07:32:03.840Z","avatar_url":"https://github.com/Gerapy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Gerapy Proxy\n\nThis is a package for supporting proxy with async mechanism in Scrapy, also this\npackage is a module in [Gerapy](https://github.com/Gerapy/Gerapy).\n\n## Installation\n\n```shell script\npip3 install gerapy-proxy\n```\n\n## Usage\n\nIf you have a ProxyPool which can provide a random proxy for every request, you can use this package\nto integrate proxy into your Scrapy/Gerapy Project.\n\nFor example, there is a [ProxyPool API](https://proxypool.scrape.center/random) which can return a random proxy \nper time, we can configure `GERAPY_PROXY_POOL_URL` setting provided by this package to enable proxy for every Scrapy Request.\n\nTo use this package, firstly install it and then enable it in DownloadMiddleware:\n\n```python\nDOWNLOADER_MIDDLEWARES = {\n    'gerapy_proxy.middlewares.ProxyPoolMiddleware': 543,\n}\n```\n\nand add proxy url in settings:\n\n```shell script\nGERAPY_PROXY_POOL_URL = 'https://proxypool.scrape.center/random'\n```\n\nThis ProxyPool is configured based on this [ProxyPool](https://github.com/Python3WebSpider/ProxyPool) repo, you can\nalso build your own ProxyPool service.\n\nNow, you've finished it.\n\nThe `ProxyPoolMiddleware` will firstly fetch a proxy from `GERAPY_PROXY_POOL_URL` and set `meta.proxy` attribute\nto Scrapy Reqeust.\n\n## Configuration\n\n### 
Basic Auth\n\nIf your ProxyPool requires Basic Auth, enable it with these settings:\n\n```python\nGERAPY_PROXY_POOL_AUTH = True\nGERAPY_PROXY_POOL_USERNAME = \u003cusername\u003e\nGERAPY_PROXY_POOL_PASSWORD = \u003cpassword\u003e\n```\n\n### Min Retry Times\n\nTo enable the proxy only after a Request has been retried a certain number of times, configure this setting:\n\n```python\nGERAPY_PROXY_POOL_MIN_RETRY_TIMES = 2\n```\n\nThe proxy is then used only when the retry count of the Request is greater than or equal to 2.\n\n### Random Enabled\n\nTo enable the proxy randomly, configure the probability of enabling it:\n\n```python\nGERAPY_PROXY_POOL_RANDOM_ENABLE_RATE = 0.8\n```\n\nThe proxy is then enabled with a probability of 80%. If you set it to 1, the proxy is always enabled.\n\n### Fetch Timeout\n\nYou can also configure the maximum time allowed for fetching a proxy from the ProxyPool:\n\n```python\nGERAPY_PROXY_POOL_TIMEOUT = 5\n```\n\nWith this setting, if the ProxyPool does not return a result within 5 seconds, no proxy is used.\n\n### ProxyPool Response Parser\n\nYour ProxyPool may not return plain text in the same format as [this one](https://github.com/Python3WebSpider/ProxyPool).\nIn that case, you can define a parser to extract the proxy from its response.\n\nFor example, if your ProxyPool returns this for every request:\n\n```json\n{\n  \"host\": \"111.222.223.224\",\n  \"port\": 3128\n}\n```\n\nYou can define a function like:\n\n```python\nimport json\ndef parse_result(text):\n    data = json.loads(text)\n    return f'{data.get(\"host\")}:{data.get(\"port\")}'\n  \nGERAPY_PROXY_EXTRACT_FUNC = parse_result \n```\n\nThe middleware then receives the proxy in the correct `host:port` format.\n\n## Example\n\nFor more detail, please see [example](./example).\n\nYou can also run it directly with Docker:\n\n```\ndocker run germey/gerapy-proxy-example\n```\n\nOutputs:\n\n```shell script\n2020-07-15 19:17:34 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)\n2020-07-15 19:17:34 [scrapy.utils.log] 
INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May  6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit\n2020-07-15 19:17:34 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor\n2020-07-15 19:17:34 [scrapy.crawler] INFO: Overridden settings:\n{'BOT_NAME': 'example',\n 'CONCURRENT_REQUESTS': 3,\n 'DOWNLOAD_TIMEOUT': 10,\n 'NEWSPIDER_MODULE': 'example.spiders',\n 'RETRY_TIMES': 10,\n 'SPIDER_MODULES': ['example.spiders']}\n2020-07-15 19:17:34 [scrapy.extensions.telnet] INFO: Telnet Password: 33299ca0ce64f215\n2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled extensions:\n['scrapy.extensions.corestats.CoreStats',\n 'scrapy.extensions.telnet.TelnetConsole',\n 'scrapy.extensions.memusage.MemoryUsage',\n 'scrapy.extensions.logstats.LogStats']\n2020-07-15 19:17:34 [asyncio] DEBUG: Using selector: KqueueSelector\n2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled downloader middlewares:\n['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',\n 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',\n 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',\n 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',\n 'gerapy_proxy.middlewares.ProxyPoolMiddleware',\n 'scrapy.downloadermiddlewares.retry.RetryMiddleware',\n 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',\n 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',\n 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',\n 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',\n 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',\n 'scrapy.downloadermiddlewares.stats.DownloaderStats']\n2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled spider 
middlewares:\n['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',\n 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',\n 'scrapy.spidermiddlewares.referer.RefererMiddleware',\n 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',\n 'scrapy.spidermiddlewares.depth.DepthMiddleware']\n2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled item pipelines:\n[]\n2020-07-15 19:17:34 [scrapy.core.engine] INFO: Spider opened\n2020-07-15 19:17:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)\n2020-07-15 19:17:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023\n2020-07-15 19:17:34 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool\n2020-07-15 19:17:34 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}\n2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool\n2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}\n2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool\n2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}\n2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy 113.124.94.189:9999\n2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy 84.53.238.49:23500\n2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy 217.150.77.31:53281\n2020-07-15 19:17:40 [scrapy.core.engine] DEBUG: Crawled (200) \u003cPOST https://httpbin.org/delay/3\u003e (referer: None)\n2020-07-15 19:17:40 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool\n2020-07-15 19:17:40 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}\n2020-07-15 19:17:40 
[example.spiders.httpbin] INFO: got request from 113.124.94.189 successfully, current page 1\n2020-07-15 19:17:40 [gerapy_proxy.middlewares] DEBUG: get proxy 144.52.244.3:9999\n2020-07-15 19:17:45 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying \u003cPOST https://httpbin.org/delay/3\u003e (failed 1 times): User timeout caused connection failure: Getting https://httpbin.org/delay/3 took longer than 10.0 seconds..\n2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool\n2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}\n2020-07-15 19:17:45 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying \u003cPOST https://httpbin.org/delay/3\u003e (failed 1 times): User timeout caused connection failure: Getting https://httpbin.org/delay/3 took longer than 10.0 seconds..\n2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool\n2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}\n2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy 1.20.101.149:44778\n2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy 105.27.116.46:56792\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgerapy%2Fgerapyproxy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgerapy%2Fgerapyproxy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgerapy%2Fgerapyproxy/lists"}