{"id":19285953,"url":"https://github.com/zhuozhuocrayon/pythoncrawler","last_synced_at":"2025-04-13T10:57:48.100Z","repository":{"id":99306564,"uuid":"169070616","full_name":"ZhuoZhuoCrayon/pythonCrawler","owner":"ZhuoZhuoCrayon","description":"Python 3 web crawler notes and hands-on source code. Full notes from learning Python crawling, with reference materials, common errors, and about 40 crawling examples with walkthroughs, covering common libraries such as urllib, requests, bs4, jsonpath, re, pytesseract, and PIL.","archived":false,"fork":false,"pushed_at":"2021-01-22T12:17:25.000Z","size":8047,"stargazers_count":230,"open_issues_count":0,"forks_count":80,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-03-27T02:11:41.585Z","etag":null,"topics":["python-crawler","python3"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ZhuoZhuoCrayon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-02-04T11:47:28.000Z","updated_at":"2025-03-24T13:16:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"4a990149-21d6-4c36-b19e-8b663cc70dd1","html_url":"https://github.com/ZhuoZhuoCrayon/pythonCrawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZhuoZhuoCrayon%2FpythonCrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZhuoZhuoCrayon%2FpythonCrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZhuoZhuoCrayon%2FpythonCrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZhuoZhuoCrayon%2Fpyth
onCrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ZhuoZhuoCrayon","download_url":"https://codeload.github.com/ZhuoZhuoCrayon/pythonCrawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248703195,"owners_count":21148117,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python-crawler","python3"],"created_at":"2024-11-09T21:47:27.015Z","updated_at":"2025-04-13T10:57:48.077Z","avatar_url":"https://github.com/ZhuoZhuoCrayon.png","language":"HTML","readme":"# pythonCrawler\r\n[![HitCount](https://hits.b3log.org/ZhuoZhuoCrayon/pythonCrawler.svg)](https://github.com/ZhuoZhuoCrayon/pythonCrawler/)\r\n\u003e## Notice\r\n1. exe_file holds everything these programs crawl; all test and hands-on read/write paths point to exe\_file\r\n2. These crawler notes are based on the Bilibili course [Python Crawler from Beginner to Advanced Practice, 92 episodes (Qianfeng Python advanced course)](https://www.bilibili.com/video/av37027372)\r\n3. They put the course's ideas into practice, correct errors that appear in the course, and extend it further; **this is not a copy of the course's source code**\r\n4. Owing to limited time, the notes and code both live in the .py files, as comments alongside code, analyzing the bugs and difficulties you are likely to hit while studying\r\n5. Given the author's limited ability and how quickly crawling techniques change, the code may contain bugs; if you find one, feel free to contact me with a correction or open a pull request\r\n6. **How to read the update log:**\r\n    - Each number is a chapter; the first .py file under each number explains the basics with a simple practice run\r\n    - .py files numbered x.x are generally hands-on projects\r\n    - For example, 6. xpath-based... covers the basics, so 6.1 is its hands-on project\r\n    - **Every .py file explains the approach, the pitfalls, and the key points**\r\n    - **For convenience, the update log in this md file links directly to the corresponding notes**\r\n7. If the notes help you, please Star the repo, thank you\r\n- - -\r\n\u003e## Update log\r\n1. 
__2019/03-2019/03/12__\r\n    - [1. urllib basics](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/1urllib_base.py)\r\n    - [2. Building POST requests from AJAX behavior, with URL exception handling; examples: crawling Douban, KFC restaurant pages, and Baidu Tieba](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/2ajax.py)\r\n    - [3. Parsing JSON packets in Fiddler, using Baidu Translate as the example](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/fillder.py)\r\n    - [4. Applying Handler processors: setting a proxy IP and a CookieJar; simulated login to Renren](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/4handler.py)\r\n    - [5.1. Extracting page data from the Qiutu (糗图) image board with regular expressions](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/5.1%E6%AD%A3%E5%88%99%E7%88%AC%E5%8F%96%E7%B3%97.py)\r\n    - [5.2. Crawling an inspirational-quotes site (励志网) with regex and building an article collection page](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/5.2%E6%AD%A3%E5%88%99%E7%88%AC%E5%8F%96%E5%8A%B1%E5%BF%97%E7%BD%91%E5%B9%B6%E5%BB%BA%E7%AB%8B%E6%96%87%E7%AB%A0%E9%9B%86%E5%90%88%E9%A1%B5%E9%9D%A2.py)\r\n2. __2019/04-__\r\n    - Hands-on project: [Zhaopin recruitment crawler, general-purpose edition: has already gathered a dataset of 2019 Q1 IT job postings](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/zhilianCrawler.py)\r\n        + urllib, BeautifulSoup, regular expressions, multithreaded crawling, JSON retrieval, CSV file I/O\r\n3. __2019/07/10__\r\n    - [6. xpath-based extraction of information from HTML pages](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/6xpath.py)\r\n        + Example: crawling a jokes site (段子网)\r\n4. __2019/07/11__\r\n    - [6.1. Reading list-formatted data from a file](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/6.1read_list.py)\r\n        + Example: reading objects from a text file\r\n    - [7. Downloading images served via lazy loading](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/7pictureLoad.py)\r\n5. __2019/07/15__\r\n    - [8. Parsing JSON files with jsonpath](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/8jsonpath.py)\r\n        + Example: Zhaopin; replaces the earlier crawler's clumsy regex-based JSON parsing\r\n        + The Bilibili course crawls Taobao comments, but Taobao is now too hard to crawl, **left as a gap for now**\r\n6. __2019/07/16__\r\n    - Chrome driver, for Chrome version 75; in the exeFile directory\r\n7. __2019/07/17__\r\n    - [9. selenium-based browser automation](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/9selenium.py)\r\n        + Example: Baidu keyword search\r\n8. 
__2019/07/19__\r\n    - [9.1. Headless Chrome browsing, image lazy-loading behavior, and a fix for async loading](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/9.1Chrome-headless.py)\r\n        + Example 1: scrolling Douban Movies and parsing the lazy-load changes\r\n        + Example 2: Baidu image search, headless-mode practice\r\n9. __2019/07/20__\r\n    - **Note:**\r\n        + To make each example's test files easier to find, from chapter 10 onward every chapter's test files are saved under exe\_file/x/\r\n        + **x is the chapter number; chapter 10's files, for example, live in exe\_file/10/**\r\n    - [10. Basics of the Requests library](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/10-Requests.py)\r\n        + Examples: Baidu search, Bing Translate, and Renren login, covering POST, cookie, and GET usage\r\n        + Using proxies\r\n    - [10.1. Requests in practice](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/10.1busPath_Crawler.py)\r\n        + Example: crawling every Shenzhen bus route\r\n        + Uses: JSON file I/O, the Requests library, and xpath parsing\r\n        + Dataset: [Shenzhen bus routes JSON file](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/exe_file/10/bus_line.json)\r\n    - [11. Logging in past a CAPTCHA](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/11verification_code.py)\r\n        + Example: logging in to the Gushiwen (古诗文网) site by saving the returned CAPTCHA image locally\r\n        + Uses: the Requests library (a session object to keep cookies), BeautifulSoup\r\n10. __2019/07/21-2019/07/26__\r\n    - [11.1 Introduction to pytesser](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/11.1pytesser.py)\r\n        + Covers basic use of the pytesser and PIL libraries\r\n    - [11.2 jTessBoxEditor: tesseract character-library training mode](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/11.2jTessBoxEditor-tesseract.py)\r\n        + CAPTCHA test script\r\n    - **[Key topic: training a tesseract character library in detail](https://github.com/ZhuoZhuoCrayon/pythonCrawler/tree/master/tesseract%E8%AE%AD%E7%BB%83%E6%A8%A1%E5%9E%8B)**\r\n        + Build a feature character library, then feed misrecognized CAPTCHAs back in for supplementary training; after three rounds of sample expansion the recognition rate exceeds 90%\r\n11. __2019/07/28__\r\n    - [12. Video crawling](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/12video.py)\r\n        + A video-crawling approach based on xpath, JSON, and headless chromedriver\r\n12. 
__2019/07/29-2019/07/31__\r\n    - [13. Multithreading basics roundup](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/13multiThread.py)\r\n    - [13.1 Object-oriented thread construction](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/13.1thread_ood.py)\r\n    - [13.2 Basic operations on the Queue class](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/13.2thread_queue.py)\r\n    - [13.3 Multithreaded crawling of Shenzhen bus routes](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/13.3Mthread_crawler.py)\r\n        + A multithreaded refactor of the program from 10.1\r\n        + Crawling speed rises to 500% of the single-threaded version\r\n13. __2019/03-2019/05__\r\n    - [Hands-on: crawling 58.com rental prices](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/58crawler/58decode.py)\r\n        + Covers an anti-crawling countermeasure and encoding-conversion tricks\r\n    - [Hands-on: crawling Chinese university rankings](https://github.com/ZhuoZhuoCrayon/pythonCrawler/blob/master/chineseUniversityRankCrawler/RankofNuni.py)\r\n        + Uses BeautifulSoup and the requests library\r\n    - [Hands-on: four examples of crawling images from the Meizhuo (美桌网) site](https://github.com/ZhuoZhuoCrayon/pythonCrawler/tree/master/pictureCrawler)\r\n        + Beginner level\r\n        + Practice with multithreading, BeautifulSoup, and requests\r\n---\r\n\u003e## Contributing\r\n\u003eIf you are interested in this project, you are very welcome to tidy up the formatting of the notes and code in the .py files\r\n\u003e\u003e[Copyright notice] These notes are my original work, open-sourced on GitHub. All content is for learning only and must not be used commercially. Feel free to star/download/fork, but please follow the relevant open-source license. Original work is not easy; please do not copy it. When practicing, respect each site's crawler policy; the goal is only to master crawling better. If anything here causes a problem, please contact me and I will remove it. Thank you!\r\n\r\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhuozhuocrayon%2Fpythoncrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzhuozhuocrayon%2Fpythoncrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhuozhuocrayon%2Fpythoncrawler/lists"}