{"id":20392896,"url":"https://github.com/zhaotianff/csharpcrawler","last_synced_at":"2025-04-09T15:09:55.018Z","repository":{"id":37617097,"uuid":"67911848","full_name":"zhaotianff/CSharpCrawler","owner":"zhaotianff","description":"C# crawler sample programs for anyone who wants to learn the basics of web crawling. More crawler-related topics will be added over time.","archived":false,"fork":false,"pushed_at":"2022-12-08T04:19:12.000Z","size":51946,"stargazers_count":229,"open_issues_count":5,"forks_count":56,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-09T15:09:47.190Z","etag":null,"topics":["crawler","csharp","wpf"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zhaotianff.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-09-11T05:22:10.000Z","updated_at":"2025-04-08T12:23:28.000Z","dependencies_parsed_at":"2023-01-24T10:00:36.206Z","dependency_job_id":null,"html_url":"https://github.com/zhaotianff/CSharpCrawler","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhaotianff%2FCSharpCrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhaotianff%2FCSharpCrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhaotianff%2FCSharpCrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhaotianff%2FCSharpCrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zhaotianff","download_url":"https://codeload.github.com/zhaotianff/CSharpCrawler/tar.gz/refs/heads/master","host":{"name":"GitHub
","url":"https://github.com","kind":"github","repositories_count":248055282,"owners_count":21040157,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","csharp","wpf"],"created_at":"2024-11-15T03:46:31.527Z","updated_at":"2025-04-09T15:09:54.999Z","avatar_url":"https://github.com/zhaotianff.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# C\\# Crawler Project\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://github.com/zhaotianff/CSharpCrawler\" target=\"_blank\"\u003e\n\u003cimg align=\"center\" alt=\"CSharpCrawler\" src=\"CSharpCrawler/crawler.png\" /\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://github.com/zhaotianff/CSharpCrawler/stargazers\" target=\"_blank\"\u003e\n \u003cimg alt=\"GitHub stars\" src=\"https://img.shields.io/github/stars/zhaotianff/CSharpCrawler.svg\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://github.com/zhaotianff/CSharpCrawler/releases\" target=\"_blank\"\u003e\n \u003cimg alt=\"All releases\" src=\"https://img.shields.io/github/downloads/zhaotianff/CSharpCrawler/total.svg\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://github.com/zhaotianff/CSharpCrawler/network/members\" target=\"_blank\"\u003e\n \u003cimg alt=\"Github forks\" src=\"https://img.shields.io/github/forks/zhaotianff/CSharpCrawler.svg\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://github.com/zhaotianff/CSharpCrawler/issues\" target=\"_blank\"\u003e\n \u003cimg alt=\"All issues\" src=\"https://img.shields.io/github/issues/zhaotianff/CSharpCrawler.svg\" 
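/\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\u003ch1 align=\"center\"\u003eCSharpCrawler :spider: \u003c/h1\u003e\n\n### About the Project\nA summary of what I have learned about building crawlers in C#; it is still being updated. This is not a complete crawler application, only a collection of examples.  \n\u003e Why build a crawler project in C#? Simply because I personally like C#. C# has somewhat fewer libraries, but nearly every feature you need can still be implemented.  \n\u003e If you find any mistakes in these notes, please open an issue to point them out so we can learn and improve together ♂（￣▽￣）/  \n\n### Features\n\n* Fundamentals\n  * [Crawler fundamentals](CSharpCrawler/PrerequisiteKnowledge.md)\n  * [How to get around anti-crawling mechanisms](CSharpCrawler/AvoidAnti-CrawlingMechanisms.md)\n  \n* How web page fetching works\n  * Fetching page source with sockets\n\n* Legal and ethical constraints \n  * Robots Exclusion Protocol\n    * [Introduction to the Robots Exclusion Protocol and its syntax rules](https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/RobotsExclusionProtocol.md)\n    * Fetching a site's robots.txt in C#\n    * Parsing robots.txt in C#\n  * [Legal matters](CSharpCrawler/CrawlerLaw.md)\n  \n* Fetching web pages\n  * Using the HttpWebRequest class\n  * Using the HttpClient class\n  * Getting the IP address for a given URL\n  * Getting the response headers for a given URL\n  * Extracting the page encoding from the page source\n   \n* Fetching dynamic web pages\n  * Fetching dynamic pages with [CefSharp](https://github.com/cefsharp/CefSharp)\n  * Fetching dynamic pages with WebBrowser (IE)\n  * Fetching dynamic pages with [Puppeteer](https://github.com/hardkoded/puppeteer-sharp)\n  * Fetching dynamic pages with [Selenium](https://github.com/SeleniumHQ/selenium)\n\n* Calling web APIs\n  * Getting real-time weather\n    * Calling the public API of the China Weather Network (中国天气网) to get the weather\n      \n  * Getting the Bing daily image\n    * Calling the cn bing API to get the Bing daily image\n\n* Getting the page DOM\n  * Getting a page's DOM structure with [HtmlAgilityPack](https://github.com/zzzprojects/html-agility-pack)\n  \n* Selecting elements with CSS selectors and XPath\n  * CSS selectors\n  * XPath\n  \n* [Using regular expressions](https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F.md)\n  * Regular expression basics and basic usage\n  * Grouping constructs in regular expressions\n  * Common matching patterns\n\n* URL crawling (the UI may freeze when there are too many URLs)\n  * Crawling all links on a given page\n  * Crawling all links on child pages down to a specified depth\n  * Crawling links on dynamic pages\n  * Restricting the crawl to child links of the current page\n    \n* Image crawling\n  * Crawling images from the page at a given URL, paging by configuring the URL's page-number rule\n  * Automatically fetching the next page\n\n* File download\n  * Downloading files with the WebClient class\n  * Multithreaded file download\n  * Batch download from a list loaded from a file\n  \n* Multithreaded crawling\n  \n* Using packet capture tools\n  * [Fiddler](https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/AnalysisPacket_Fiddler.md)\n\t\n* Simulating login and fetching post-login content\n  * Using cookies (in progress)\n  * 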
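[Using Selenium](https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/Selenium.md) (in progress)\n\t* Note: the sample program uses EdgeDriver, so Windows 10 is required; to use another browser's driver, modify the code yourself.\n    * Test system: Windows 10 1703, Edge 15.15063.0. If the Edge driver version does not match, update it manually to the matching version. \t\n\n* Bing image search (*for learning and exchange only; do not use for commercial purposes*)\n  * Implementing Bing image search\n  * Paging and optimization (to be updated)\n\n* Storing crawled data\n  * Berkeley DB\n    * [Introduction to Berkeley DB and how to use it](https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/BerkeleyDB介绍.txt)\n    \n  * SQLite\n    * Introduction to SQLite and how to use it\n  \n  * MongoDB\n    * [Introduction to MongoDB and how to use it](https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/MongoDB.md)\n\t \n* Small example: nationwide home-style dish price statistics (*for learning and exchange only; do not use for commercial purposes*)\n  * Getting the list of cities nationwide and their city codes\n  * Crawling home-style dish prices\n  * Generating statistical charts\t  \n\t  \t\n* Small example: general-purpose crawling\n  * E-commerce sites\n  * News sites (to be updated)\n  \n* Saving a web page as an image/PDF\n\t\n### Roadmap\n* Video download\n  * Regular video download\n  * Using ffmpeg\n  * Downloading blob-type videos\n  * Downloading AES-encrypted m3u8 videos\n* How URL encoding/decoding works\n* Using the Charles packet capture tool\n* Analyzing website APIs with packet capture tools\n* Analyzing app APIs with packet capture tools\n* CAPTCHA recognition (text CAPTCHAs, slider CAPTCHAs)\n* Bloom filter algorithm\n* NLP basics\n* Chinese word segmentation\n* Using Lucene.net\n* Implementing a priority queue\n* Basic crawler architecture\n* Distributed crawler architecture\n* Crawling Douban book reviews\n* How to handle crawled data once it reaches a very large scale\n* Using proxies\n\t\n    \n### Development Environment\n~~Visual Studio 2013 + .Net 4.5\u003cbr/\u003e~~\n~~Visual Studio 2015 + .Net 4.5.2\u003cbr/\u003e~~\nVisual Studio 2017 + .Net 4.7.2\n\n**If the Blend SDK is not installed, System.Windows.Interactivity.dll will not be in the GAC, and you need to reference the System.Windows.Interactivity.dll in the bin/x64/Debug directory yourself.**\n\n**The build may report various missing libraries; restoring the NuGet packages lets it compile normally.**\n\n**After updating CEF to version 85.3.130, ChromiumWebBrowser may not be found. The fix is to restore the NuGet packages and then reopen the project.**\n\n**Berkeley DB requires a reference to libdb_dotnet181.dll in the bin/x64/Debug directory; at runtime it also needs libdb_csharp181.dll and libdb181.dll, which are already placed in bin/x64/Debug.**\n\n### Third-Party Components\n* [CefSharp](https://github.com/cefsharp/CefSharp)\n* [HtmlAgilityPack](https://github.com/zzzprojects/html-agility-pack)\n* [Oracle Berkeley DB](https://www.oracle.com/database/technologies/related/berkeleydb.html)\n* [SQLite](https://www.sqlite.org/index.html)\n* [Json.NET](https://github.com/JamesNK/Newtonsoft.Json)\n* [Selenium](https://github.com/SeleniumHQ/selenium)\n* 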
[AngleSharp](https://github.com/AngleSharp/AngleSharp)\n* [Puppeteer Sharp](https://github.com/hardkoded/puppeteer-sharp)\n\n### Screenshots\n\u003cp align=\"center\"\u003e\n \u003cimg align=\"center\" alt=\"start up\" src=\"https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/ScreenShots/1.png\" /\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n \u003cimg align=\"center\" alt=\"start up\" src=\"https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/ScreenShots/2.png\" /\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n \u003cimg align=\"center\" alt=\"start up\" src=\"https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/ScreenShots/3.png\" /\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n \u003cimg align=\"center\" alt=\"file download\" src=\"https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/ScreenShots/4.png\" /\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n \u003cimg align=\"center\" alt=\"file download\" src=\"https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/ScreenShots/5.png\" /\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n \u003cimg align=\"center\" alt=\"file download\" src=\"https://github.com/zhaotianff/CSharpCrawler/blob/master/CSharpCrawler/ScreenShots/6.png\" /\u003e\n\u003c/p\u003e\n\n### Crawler Projects\n* [MSDN-Magazine-To-PDF](https://github.com/zhaotianff/MSDN-Magazine-To-PDF)\n\n### License\n\n[MIT License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhaotianff%2Fcsharpcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzhaotianff%2Fcsharpcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhaotianff%2Fcsharpcrawler/lists"}