{"id":16619529,"url":"https://github.com/java-edge/scrapy-tutorial","last_synced_at":"2026-04-19T08:31:53.105Z","repository":{"id":94726366,"uuid":"176679353","full_name":"Java-Edge/Scrapy-Tutorial","owner":"Java-Edge","description":null,"archived":false,"fork":false,"pushed_at":"2019-03-25T09:29:49.000Z","size":55,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-11T07:16:21.103Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Java-Edge.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-20T07:34:39.000Z","updated_at":"2020-07-27T05:30:25.000Z","dependencies_parsed_at":null,"dependency_job_id":"aa01bffd-abf6-4c2e-9171-9cb3b7395f74","html_url":"https://github.com/Java-Edge/Scrapy-Tutorial","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Java-Edge/Scrapy-Tutorial","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Java-Edge%2FScrapy-Tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Java-Edge%2FScrapy-Tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Java-Edge%2FScrapy-Tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Java-Edge%2FScrapy-Tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Java-Edge","download_url":"https://codeloa
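d.github.com/Java-Edge/Scrapy-Tutorial/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Java-Edge%2FScrapy-Tutorial/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32000188,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"online","status_checked_at":"2026-04-19T02:00:07.110Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T02:42:11.196Z","updated_at":"2026-04-19T08:31:53.063Z","avatar_url":"https://github.com/Java-Edge.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Building a Search Engine with a Distributed Python Crawler\nWhat era lies ahead? The era of data! Data analysis services, internet finance, data modeling, natural language processing, medical case analysis... more and more work will be built on data, and crawlers are the most important way to acquire data quickly. Compared with other languages, crawlers written in Python are simpler and more efficient.\n\n#### A hands-on path from a single-machine crawler (Scrapy) to a distributed crawler (Scrapy-Redis)\n\n### Seriously, you no longer have any excuse for not learning to write crawlers\n\n![](https://upload-images.jianshu.io/upload_images/16782311-555251c239b2848a.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\n\nStarting from zero, the course explains the basic principles of crawlers and organizes the knowledge they require. Beginning with setting up the development environment and designing the database, it crawls real data from three well-known websites to take you step by step through how Scrapy works, how to use each module, how to develop components, advanced Scrapy development, and anti-crawler strategies.\n\nOnce you have fully mastered Scrapy, it guides you through building a complete search engine website based on Scrapy, Redis, elasticsearch, and django.\n\n![](https://upload-images.jianshu.io/upload_images/16782311-15fcfcb29a5f9315.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\n\n# Our goal: building a search engine with the distributed crawler Scrapy-Redis\n\n![](https://upload-images.jianshu.io/upload_images/16782311-9cee9a6dc4a8834b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)\n\n# Mastering Scrapy step by step\n## Crawler development environment setup and fundamentals\n#### Based on Mac OS\nThroughout the development process, the course also covers many crawler development topics that matter both for understanding web systems and for interviews, including regular expressions, URL deduplication strategies, depth-first and breadth-first traversal algorithms and their implementation, the difference between sessions and cookies, and several ways to implement simulated login.\n\n## Setting up a Scrapy crawler and hands-on single-machine crawler cases\n### Crawling articles from a tech community\nCovers: xpath and css selectors / items design / pipelines, saving data to mysql with twisted\n### Crawling a Q&A website\nCovers: how sessions and cookies work / simulating Zhihu login with scrapy FormRequest and requests / extracting data with item loaders\n### Crawling a job listing website\nCovers: link extractor / extracting URLs with Scrapy Rule / crawling an entire site with CrawlSpider\n\n# Advanced Scrapy\n## Getting past anti-crawling mechanisms\nHow Scrapy works\n\nIP proxies and random user-agent rotation\n\nCAPTCHA recognition via the Yundama (云打码) cloud service\n\n## Advanced Scrapy\nCrawling dynamic websites with selenium and phantomjs\n\nScrapy telnet and Web service\n\nScrapy signals and core API\n\n## Distributed crawling with Scrapy-Redis\nRedis\n\nScrapy-Redis source code analysis\n\nIntegrating Redis-bloomfilter into Scrapy-Redis\n\n# Building the search engine\n- Parsing data and storing it in the database\n\n- Developing the Scrapy-Redis distributed crawler\n\n- Saving data to elasticsearch\n\n- Building the search engine with django\n\n# Environment\n- Language \npython3.5 \n\n- Frameworks \nscrapy1.3 elasticsearch5 \n\n- Frameworks \ndjango1.11 redis \n\n- Development OS\nmac \n\n- Database \nmysql5.7 redis \n\n- IDE \npycharm \n\n- Tools \nvirtualenv navicat\n\n# Reference\n[Focus on Scrapy, the Must-Learn Framework for Distributed Python Crawlers: Building a Search Engine](https://coding.imooc.com/class/92.html)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjava-edge%2Fscrapy-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjava-edge%2Fscrapy-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjava-edge%2Fscrapy-tutorial/lists"}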