{"id":22903450,"url":"https://github.com/sawyerbutton/nlp-lesson-3-scrapy","last_synced_at":"2025-08-23T02:33:02.099Z","repository":{"id":216803337,"uuid":"742397415","full_name":"sawyerbutton/NLP-Lesson-3-Scrapy","owner":"sawyerbutton","description":"Scrapy Starter project","archived":false,"fork":false,"pushed_at":"2024-01-12T12:51:08.000Z","size":1524,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-09T08:07:59.784Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sawyerbutton.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2024-01-12T11:39:39.000Z","updated_at":"2024-06-13T12:52:59.000Z","dependencies_parsed_at":"2024-01-13T00:13:16.495Z","dependency_job_id":"4d513978-1386-4bed-99eb-b6e4464e0362","html_url":"https://github.com/sawyerbutton/NLP-Lesson-3-Scrapy","commit_stats":null,"previous_names":["sawyerbutton/nlp-lesson-3-scrapy"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sawyerbutton/NLP-Lesson-3-Scrapy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sawyerbutton%2FNLP-Lesson-3-Scrapy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sawyerbutton%2FNLP-Lesson-3-Scrapy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sawyerbutton%2FNLP-Lesson-3-Scrapy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sawyerbutton%2FNLP-Lesson-3-Scrapy/manifests","owner_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sawyerbutton","download_url":"https://codeload.github.com/sawyerbutton/NLP-Lesson-3-Scrapy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sawyerbutton%2FNLP-Lesson-3-Scrapy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271732406,"owners_count":24811316,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-23T02:00:09.327Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-14T02:36:45.579Z","updated_at":"2025-08-23T02:33:02.066Z","avatar_url":"https://github.com/sawyerbutton.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NLP-Lesson-3-Scrapy\nScrapy Starter project\n\n## Basic Setup\n\n```bash\n# Choose your virtual environment (Python venv or Conda)\n# $ python3.6 -m venv venv\n# $ source venv/bin/activate\n# Install dependencies\n$ pip install -r requirements.txt\n```\n\n## Target Website\n\n[A site built specifically for scraping practice](http://quotes.toscrape.com)\n\n![Example image](image.png)\n\nEach quote block on the page contains:\n\n- The quote text\n- The author's name\n- The quote's tags\n\nClicking an author's name opens the author detail page, which contains:\n\n- The author's name\n- The author's birthday\n- The author's biography\n\n## Project Coding Workflow\n\n### Step1\n\nCreate the project.\n\nIn the `spiders` folder, create a new file `quotes-spider.py`, which sends requests to the site.\n\nFrom the Scrapy project directory, run `scrapy crawl quotes` on the command line to get the following output:\n\n```bash\n2024-01-12 20:08:40 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tutorial)\n2024-01-12 
20:08:40 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.9.18 (main, Sep 11 2023, 08:38:23) - [Clang 14.0.6 ], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.7, Platform macOS-10.16-x86_64-i386-64bit\n2024-01-12 20:08:40 [scrapy.addons] INFO: Enabled addons:\n[]\n2024-01-12 20:08:40 [asyncio] DEBUG: Using selector: KqueueSelector\n2024-01-12 20:08:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor\n2024-01-12 20:08:40 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop\n2024-01-12 20:08:40 [scrapy.extensions.telnet] INFO: Telnet Password: c0e271de13f34462\n2024-01-12 20:08:40 [scrapy.middleware] INFO: Enabled extensions:\n['scrapy.extensions.corestats.CoreStats',\n 'scrapy.extensions.telnet.TelnetConsole',\n 'scrapy.extensions.memusage.MemoryUsage',\n 'scrapy.extensions.logstats.LogStats']\n2024-01-12 20:08:40 [scrapy.crawler] INFO: Overridden settings:\n{'BOT_NAME': 'tutorial',\n 'FEED_EXPORT_ENCODING': 'utf-8',\n 'NEWSPIDER_MODULE': 'tutorial.spiders',\n 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',\n 'ROBOTSTXT_OBEY': True,\n 'SPIDER_MODULES': ['tutorial.spiders'],\n 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}\n2024-01-12 20:08:40 [scrapy.middleware] INFO: Enabled downloader middlewares:\n['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',\n 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',\n 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',\n 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',\n 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',\n 'scrapy.downloadermiddlewares.retry.RetryMiddleware',\n 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',\n 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',\n 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',\n 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',\n 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',\n 'scrapy.downloadermiddlewares.stats.DownloaderStats']\n2024-01-12 20:08:40 [scrapy.middleware] INFO: Enabled spider middlewares:\n['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',\n 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',\n 'scrapy.spidermiddlewares.referer.RefererMiddleware',\n 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',\n 'scrapy.spidermiddlewares.depth.DepthMiddleware']\n2024-01-12 20:08:40 [scrapy.middleware] INFO: Enabled item pipelines:\n[]\n2024-01-12 20:08:40 [scrapy.core.engine] INFO: Spider opened\n2024-01-12 20:08:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)\n2024-01-12 20:08:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023\n2024-01-12 20:08:45 [scrapy.core.engine] DEBUG: Crawled (404) \u003cGET http://quotes.toscrape.com/robots.txt\u003e (referer: None)\n2024-01-12 20:08:45 [scrapy.core.engine] DEBUG: Crawled (200) \u003cGET http://quotes.toscrape.com\u003e (referer: None)\n2024-01-12 20:08:46 [quotes] INFO: hello this is my first spider\n2024-01-12 20:08:46 [scrapy.core.engine] INFO: Closing spider (finished)\n2024-01-12 20:08:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:\n{'downloader/request_bytes': 450,\n 'downloader/request_count': 2,\n 'downloader/request_method_count/GET': 2,\n 'downloader/response_bytes': 11504,\n 'downloader/response_count': 2,\n 'downloader/response_status_count/200': 1,\n 'downloader/response_status_count/404': 1,\n 'elapsed_time_seconds': 4.75622,\n 'finish_reason': 'finished',\n 'finish_time': datetime.datetime(2024, 1, 12, 12, 8, 46, 30082, tzinfo=datetime.timezone.utc),\n 'log_count/DEBUG': 5,\n 'log_count/INFO': 11,\n 'memusage/max': 54034432,\n 'memusage/startup': 54034432,\n 'response_received_count': 
2,\n 'robotstxt/request_count': 1,\n 'robotstxt/response_count': 1,\n 'robotstxt/response_status_count/404': 1,\n 'scheduler/dequeued': 1,\n 'scheduler/dequeued/memory': 1,\n 'scheduler/enqueued': 1,\n 'scheduler/enqueued/memory': 1,\n 'start_time': datetime.datetime(2024, 1, 12, 12, 8, 41, 273862, tzinfo=datetime.timezone.utc)}\n2024-01-12 20:08:46 [scrapy.core.engine] INFO: Spider closed (finished)\n```\n\nUse `response.xpath(\"//div[@class='quote']\").get()` to read the HTML of the first quote block on the page (`.getall()` returns every block).\n\nThe calls below extract the fields of a quote block. Note that they run inside the Scrapy shell.\n\n```bash\n$ scrapy shell http://quotes.toscrape.com/\n...\n\u003e\u003e\u003e quotes = response.xpath(\"//div[@class='quote']\")\n\u003e\u003e\u003e quotes[0].css(\".text::text\").getall()\n['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']\n\u003e\u003e\u003e quotes[0].css(\".author::text\").getall()\n['Albert Einstein']\n\u003e\u003e\u003e quotes[0].css(\".tag::text\").getall()\n['change', 'deep-thoughts', 'thinking', 'world']\n```\n\nThe code above uses both XPath and CSS selector syntax, purely for demonstration.\n\nTo store the scraped items in a JSON file, add the `-o` option when running the crawl command:\n\n```bash\nscrapy crawl quotes -o quotes.json\n```\n\nLooking at the page, you can see a `next` button at the bottom for moving to the next page.\n\n![Alt text](image-2.png)\n\nLocate the button with a CSS selector:\n\n```bash\n$ scrapy shell http://quotes.toscrape.com/\n...\n\u003e\u003e\u003e response.css('li.next a::attr(href)').get()\n'/page/2/'\n```\n\n`next_page = response.urljoin(next_page)` builds the full URL.\n\n`yield scrapy.Request(next_page, callback=self.parse)` sends a new request for the next page and reuses the same parse function as the callback to extract content from the new page.\n\n```python\nfor a in response.css('li.next a'):\n    yield response.follow(a, callback=self.parse)\n```\n\n`response.follow` shortens this process: it accepts relative URLs and link selectors directly, so no `urljoin` call is needed.\n\nWe also want more detailed information about each author. The link to the author page can be found in the shell:\n\n```bash\n$ scrapy shell http://quotes.toscrape.com/\n...\n\u003e\u003e\u003e response.css('.author + a::attr(href)').get()\n'/author/Albert-Einstein'\n```\n\nIn the spider code, this can be done inside the loop that extracts each quote: issue another request that goes to the corresponding author page, and add a `parse_author` callback that extracts the author's name, birthday, birthplace, and biography and outputs them to the console.\n\n
```python\n    # Methods of the spider class (class header omitted)\n    def parse(self, response):\n        # self.logger.info('hello this is my first spider')\n        quotes = response.css('div.quote')\n        for quote in quotes:\n\n            yield {\n                'text': quote.css('.text::text').get(),\n                'author': quote.css('.author::text').get(),\n                'tags': quote.css('.tag::text').getall(),\n            }\n\n            author_url = quote.css('.author + a::attr(href)').get()\n            self.logger.info('get author page url')\n            # go to the author page\n            yield response.follow(author_url, callback=self.parse_author)\n\n        for a in response.css('li.next a'):\n            yield response.follow(a, callback=self.parse)\n\n\n    def parse_author(self, response):\n        yield {\n            'author_name': response.css('.author-title::text').get(),\n            'author_birthday': response.css('.author-born-date::text').get(),\n            'author_bornlocation': response.css('.author-born-location::text').get(),\n            'author_bio': response.css('.author-description::text').get(),\n        }\n```\n\n**Question: What is the issue?**\n\n### Step2\n\nTodo","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsawyerbutton%2Fnlp-lesson-3-scrapy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsawyerbutton%2Fnlp-lesson-3-scrapy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsawyerbutton%2Fnlp-lesson-3-scrapy/lists"}