{"id":15673011,"url":"https://github.com/huangcongqing/spider","last_synced_at":"2025-05-06T21:03:33.884Z","repository":{"id":107569659,"uuid":"134965705","full_name":"HuangCongQing/Spider","owner":"HuangCongQing","description":"爬虫python3 (request,BeautifulSoup,xpath,re,Selenium,wordcloud等模块)","archived":false,"fork":false,"pushed_at":"2024-08-24T18:16:49.000Z","size":20865,"stargazers_count":14,"open_issues_count":0,"forks_count":12,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-31T02:51:11.318Z","etag":null,"topics":["bf4","charles","lxml","python3","python3x","re","request","requests","selenium","spider","spiders","xpath"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HuangCongQing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-26T13:54:47.000Z","updated_at":"2024-11-05T01:54:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"59db1ff2-92d1-426e-b209-4d897fd3a641","html_url":"https://github.com/HuangCongQing/Spider","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HuangCongQing%2FSpider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HuangCongQing%2FSpider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HuangCongQing%2FSpider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HuangCongQing%2FSpider/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HuangCongQing","download_url":"https://codeload.github.com/HuangCongQing/Spider/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252769398,"owners_count":21801376,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bf4","charles","lxml","python3","python3x","re","request","requests","selenium","spider","spiders","xpath"],"created_at":"2024-10-03T15:35:14.498Z","updated_at":"2025-05-06T21:03:33.851Z","avatar_url":"https://github.com/HuangCongQing.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spider\n\nA collection of web-scraping techniques in Python 3\n\n**Personal web-scraping notes: https://www.yuque.com/huangzhongqing/spider**\n\n@[双愚](https://github.com/HuangCongQing/Spider), please credit the source if you fork or star\n\n### Notes\n\n* Introduction to web scraping: https://www.yuque.com/docs/share/edb944f3-880a-4a48-a053-df2953be56b4?# *Web Scraping Basics (Summary)*\n* [notes/01数据爬取requests_note](notes/01数据爬取requests_note): data fetching with requests\n* [notes/02数据解析note](notes/02数据解析note): data parsing\n\n### Modules\n\n1. [package/1request](package/1request)\n2. [package/1request-advanced](package/1request-advanced): cookies \u0026 proxies\n3. [package/2BeautifulSoup4](package/2BeautifulSoup4)\n4. [package/3xpath](package/3xpath)\n5. [package/4re正则表达式](package/4re正则表达式): regular expressions\n   1. [re.findall](package/4re正则表达式/re基础/findall.py)\n   2. [re.search](package/4re正则表达式/re基础/search.py)\n6. [package/5selenium](package/5selenium)\n7. 
[package/6wordcloud\u0026jieba](package/6wordcloud\u0026jieba): word clouds\n\n\n| Function | **Package** | **Purpose** |\n| - | - | - |\n| Data fetching | request | Fetches web pages |\n| Data parsing | re | Regular expressions |\n| \u003cbr/\u003e | BeautifulSoup | \u003cbr/\u003e |\n| \u003cbr/\u003e | xpath | Parses documents using XPath syntax |\n| \u003cbr/\u003e | lxml | The lxml library builds on the speed and power of libxml2 and parses documents using XPath syntax; it is more efficient than BeautifulSoup. |\n| Browser simulation | Selenium | An automated testing tool for websites. It supports mainstream GUI browsers such as Chrome, Firefox, and Safari, as well as the headless browser PhantomJS. Simulates clicks. |\n| \u003cbr/\u003e | PhantomJS | Headless browser |\n| \u003cbr/\u003e | pandas | \u003cbr/\u003e |\n| \u003cbr/\u003e | jieba | Chinese word segmentation with jieba |\n| \u003cbr/\u003e | wordcloud | Word cloud package |\n| \u003cbr/\u003e | matplotlib | Plots charts |\n|   | random | \u003cbr/\u003e |\n\n### Common code (output, tables)\n\n* [common.ipynb](common.ipynb)\n\n### Scraping in practice\n\n1. [practice/01复仇者联盟3豆瓣影评爬虫](practice/01复仇者联盟3豆瓣影评爬虫): Douban review scraper for Avengers 3\n2. [practice/02分析豆瓣中最新电影的影评（词云显示）《超时空同居》](practice/02分析豆瓣中最新电影的影评（词云显示）《超时空同居》): analysis of Douban reviews of a recent film, 《超时空同居》, shown as a word cloud\n3. [practice/03王菊微博评论数据抓取jupyter](practice/03王菊微博评论数据抓取jupyter): scraping Weibo comment data about 王菊 (Jupyter)\n4. [practice/04python模拟登录带验证码的网站](practice/04python模拟登录带验证码的网站): simulated login to a CAPTCHA-protected site with Python\n5. [practice/05抓取得到App音频数据](practice/05抓取得到App音频数据): scraping audio data from the 得到 app\n6. [practice/06python爬取公众号文章](practice/06python爬取公众号文章): scraping official-account (公众号) articles with Python\n7. [practice/07通过关键词爬取csdn博客文章](practice/07通过关键词爬取csdn博客文章): scraping CSDN blog posts by keyword\n8. [practice/08百度搜狗百科关键词爬取](practice/08百度搜狗百科关键词爬取): keyword scraping of the Baidu and Sogou encyclopedias\n9. [practice/09大学排行榜榜单爬取](practice/09大学排行榜榜单爬取): scraping university ranking lists\n10. [practice/10bilibili视频爬取下载](practice/10bilibili视频爬取下载): scraping and downloading bilibili videos\n\n### File operations\n\nReading and saving Excel, txt, and other files\n\n1. [文件操作/excel](文件操作/excel)\n2. [文件操作/json](文件操作/json) (todo)\n3. [文件操作/txt](文件操作/txt)\n\n### LICENSE\n\nAll content in this project is released under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuangcongqing%2Fspider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuangcongqing%2Fspider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuangcongqing%2Fspider/lists"}