{"id":18402191,"url":"https://github.com/gerapy/gerapyselenium","last_synced_at":"2025-04-07T07:32:08.346Z","repository":{"id":57433907,"uuid":"293325481","full_name":"Gerapy/GerapySelenium","owner":"Gerapy","description":"Downloader Middleware to support Selenium in Scrapy \u0026 Gerapy","archived":false,"fork":false,"pushed_at":"2020-09-13T09:10:50.000Z","size":47,"stargazers_count":30,"open_issues_count":4,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-29T05:48:58.111Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Gerapy.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-09-06T17:15:38.000Z","updated_at":"2024-05-16T02:43:06.000Z","dependencies_parsed_at":"2022-08-27T22:30:55.196Z","dependency_job_id":null,"html_url":"https://github.com/Gerapy/GerapySelenium","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapySelenium","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapySelenium/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapySelenium/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapySelenium/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Gerapy","download_url":"https://codeload.github.com/Gerapy/GerapySelenium/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","k
ind":"github","repositories_count":223275161,"owners_count":17118147,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T02:41:33.772Z","updated_at":"2024-11-06T02:41:34.601Z","avatar_url":"https://github.com/Gerapy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Gerapy Selenium\n\nThis package adds Selenium support to Scrapy. It is also\na module in [Gerapy](https://github.com/Gerapy/Gerapy).\n\n## Installation\n\n```shell script\npip3 install gerapy-selenium\n```\n\n## Usage\n\nYou can use `SeleniumRequest` to specify a request that is rendered with Selenium.\n\nFor example:\n\n```python\nyield SeleniumRequest(detail_url, callback=self.parse_detail)\n```\n\nYou also need to enable `SeleniumMiddleware` in `DOWNLOADER_MIDDLEWARES`:\n\n```python\nDOWNLOADER_MIDDLEWARES = {\n    'gerapy_selenium.downloadermiddlewares.SeleniumMiddleware': 543,\n}\n```\n\nCongratulations, you've finished all of the required configuration.\n\nIf you run the spider again, Selenium will start and render every\nweb page whose request you configured as a `SeleniumRequest`.\n\n## Settings\n\nGerapySelenium provides some optional settings.\n\n### Concurrency\n\nYou can use Scrapy's own setting to control the concurrency of Selenium,\nfor example:\n\n```python\nCONCURRENT_REQUESTS = 3\n```\n\n### Pretend as Real Browser\n\nSome websites detect WebDriver or headless mode. GerapySelenium can\npretend to be a regular Chromium browser by injecting scripts. 
This is enabled by default.\n\nIf the website does not detect WebDriver, you can disable it to speed up rendering:\n\n```python\nGERAPY_SELENIUM_PRETEND = False\n```\n\nYou can also use the `pretend` attribute of `SeleniumRequest` to override this\nconfiguration.\n\n### Logging Level\n\nBy default, Selenium logs all debug messages, so GerapySelenium\nsets the logging level of Selenium to WARNING.\n\nIf you want to see more logs from Selenium, you can change this setting:\n\n```python\nimport logging\nGERAPY_SELENIUM_LOGGING_LEVEL = logging.DEBUG\n```\n\n### Download Timeout\n\nSelenium may take some time to render the required web page. You can change the timeout with this setting; the default is `30s`:\n\n```python\n# selenium timeout\nGERAPY_SELENIUM_DOWNLOAD_TIMEOUT = 30\n```\n\n### Headless\n\nBy default, Selenium runs in headless mode (the default is `True`).\nYou can set this to `False` as needed:\n\n```python\nGERAPY_SELENIUM_HEADLESS = False\n```\n\n### Window Size\n\nYou can also set the width and height of the Selenium window:\n\n```python\nGERAPY_SELENIUM_WINDOW_WIDTH = 1400\nGERAPY_SELENIUM_WINDOW_HEIGHT = 700\n```\n\nThe defaults are 1400 and 700.\n\n## SeleniumRequest\n\n`SeleniumRequest` provides arguments that can override the global settings above.\n\n* url: request url\n* callback: callback\n* wait_for: wait for some element to load, also supports dict\n* script: script to execute\n* proxy: proxy to use for this request, like `http://x.x.x.x:x`\n* sleep: time to sleep after the page has loaded, overrides `GERAPY_SELENIUM_SLEEP`\n* timeout: load timeout, overrides `GERAPY_SELENIUM_DOWNLOAD_TIMEOUT`\n* pretend: pretend to be a normal browser, overrides `GERAPY_SELENIUM_PRETEND`\n* screenshot: ignored resource types, see\n        https://miyakogi.github.io/selenium/_modules/selenium/page.html#Page.screenshot,\n        overrides `GERAPY_SELENIUM_SCREENSHOT`\n\nFor example, you can configure SeleniumRequest as:\n\n```python\nfrom gerapy_selenium import SeleniumRequest\n\ndef parse(self, response):\n    
yield SeleniumRequest(url, \n        callback=self.parse_detail,\n        wait_for='title',\n        script='() =\u003e { console.log(document) }',\n        sleep=2)\n```\n\nThen Selenium will:\n* wait for title to load\n* execute `console.log(document)` script\n* sleep for 2s\n* return the rendered web page content\n\n## Example\n\nFor more detail, please see [example](./example).\n\nAlso you can directly run with Docker:\n\n```\ndocker run germey/gerapy-selenium-example\n```\n\nOutputs:\n\n```shell script\n2020-07-13 01:49:13 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)\n2020-07-13 01:49:13 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May  6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit\n2020-07-13 01:49:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor\n2020-07-13 01:49:13 [scrapy.crawler] INFO: Overridden settings:\n{'BOT_NAME': 'example',\n 'CONCURRENT_REQUESTS': 3,\n 'NEWSPIDER_MODULE': 'example.spiders',\n 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504],\n 'SPIDER_MODULES': ['example.spiders']}\n2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet Password: 83c276fb41754bd0\n2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled extensions:\n['scrapy.extensions.corestats.CoreStats',\n 'scrapy.extensions.telnet.TelnetConsole',\n 'scrapy.extensions.memusage.MemoryUsage',\n 'scrapy.extensions.logstats.LogStats']\n2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:\n['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',\n 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',\n 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',\n 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',\n 
'gerapy_selenium.downloadermiddlewares.SeleniumMiddleware',\n 'scrapy.downloadermiddlewares.retry.RetryMiddleware',\n 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',\n 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',\n 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',\n 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',\n 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',\n 'scrapy.downloadermiddlewares.stats.DownloaderStats']\n2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:\n['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',\n 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',\n 'scrapy.spidermiddlewares.referer.RefererMiddleware',\n 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',\n 'scrapy.spidermiddlewares.depth.DepthMiddleware']\n2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled item pipelines:\n[]\n2020-07-13 01:49:13 [scrapy.core.engine] INFO: Spider opened\n2020-07-13 01:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)\n2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023\n2020-07-13 01:49:13 [example.spiders.book] INFO: crawling https://dynamic5.scrape.center/page/1\n2020-07-13 01:49:13 [gerapy.selenium] DEBUG: processing request \u003cGET https://dynamic5.scrape.center/page/1\u003e\n2020-07-13 01:49:13 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}\n2020-07-13 01:49:14 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/page/1\n2020-07-13 01:49:19 [gerapy.selenium] DEBUG: waiting for .item .name finished\n2020-07-13 01:49:20 [gerapy.selenium] DEBUG: wait for .item .name finished\n2020-07-13 01:49:20 [gerapy.selenium] DEBUG: 
close selenium\n2020-07-13 01:49:20 [scrapy.core.engine] DEBUG: Crawled (200) \u003cGET https://dynamic5.scrape.center/page/1\u003e (referer: None)\n2020-07-13 01:49:20 [gerapy.selenium] DEBUG: processing request \u003cGET https://dynamic5.scrape.center/detail/26898909\u003e\n2020-07-13 01:49:20 [gerapy.selenium] DEBUG: processing request \u003cGET https://dynamic5.scrape.center/detail/26861389\u003e\n2020-07-13 01:49:20 [gerapy.selenium] DEBUG: processing request \u003cGET https://dynamic5.scrape.center/detail/26855315\u003e\n2020-07-13 01:49:20 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}\n2020-07-13 01:49:20 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}\n2020-07-13 01:49:21 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}\n2020-07-13 01:49:21 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/detail/26855315\n2020-07-13 01:49:21 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/detail/26861389\n2020-07-13 01:49:21 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/detail/26898909\n2020-07-13 01:49:24 [gerapy.selenium] DEBUG: waiting for .item .name finished\n2020-07-13 01:49:24 [gerapy.selenium] DEBUG: wait for .item .name finished\n2020-07-13 01:49:24 [gerapy.selenium] DEBUG: close selenium\n2020-07-13 01:49:24 [scrapy.core.engine] DEBUG: Crawled (200) \u003cGET https://dynamic5.scrape.center/detail/26861389\u003e 
(referer: https://dynamic5.scrape.center/page/1)\n2020-07-13 01:49:24 [gerapy.selenium] DEBUG: processing request \u003cGET https://dynamic5.scrape.center/page/2\u003e\n2020-07-13 01:49:24 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}\n2020-07-13 01:49:25 [scrapy.core.scraper] DEBUG: Scraped from \u003c200 https://dynamic5.scrape.center/detail/26861389\u003e\n{'name': '壁穴ヘブンホール',\n 'score': '5.6',\n 'tags': ['BL漫画', '小基漫', 'BL', '『又腐又基』', 'BLコミック']}\n2020-07-13 01:49:25 [gerapy.selenium] DEBUG: waiting for .item .name finished\n2020-07-13 01:49:25 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/page/2\n2020-07-13 01:49:26 [gerapy.selenium] DEBUG: wait for .item .name finished\n2020-07-13 01:49:26 [gerapy.selenium] DEBUG: close selenium\n2020-07-13 01:49:26 [scrapy.core.engine] DEBUG: Crawled (200) \u003cGET https://dynamic5.scrape.center/detail/26855315\u003e (referer: https://dynamic5.scrape.center/page/1)\n2020-07-13 01:49:26 [gerapy.selenium] DEBUG: processing request \u003cGET https://dynamic5.scrape.center/detail/27047626\u003e\n2020-07-13 01:49:26 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}\n2020-07-13 01:49:26 [scrapy.core.scraper] DEBUG: Scraped from \u003c200 https://dynamic5.scrape.center/detail/26855315\u003e\n{'name': '冒险小虎队', 'score': '9.4', 'tags': ['冒险小虎队', '童年', '冒险', '推理', '小时候读的']}\n2020-07-13 01:49:26 [gerapy.selenium] DEBUG: waiting for .item .name finished\n2020-07-13 01:49:26 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/detail/27047626\n2020-07-13 01:49:27 [gerapy.selenium] DEBUG: wait for .item .name 
finished\n2020-07-13 01:49:27 [gerapy.selenium] DEBUG: close selenium\n...\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgerapy%2Fgerapyselenium","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgerapy%2Fgerapyselenium","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgerapy%2Fgerapyselenium/lists"}