{"id":15037544,"url":"https://github.com/dotnetcore/dotnetspider","last_synced_at":"2026-04-02T18:09:47.988Z","repository":{"id":41264336,"uuid":"54357610","full_name":"dotnetcore/DotnetSpider","owner":"dotnetcore","description":"DotnetSpider, a .NET standard web crawling library. It is lightweight, efficient and fast high-level web crawling \u0026 scraping framework","archived":false,"fork":false,"pushed_at":"2024-09-25T09:14:25.000Z","size":59091,"stargazers_count":4073,"open_issues_count":5,"forks_count":1052,"subscribers_count":259,"default_branch":"master","last_synced_at":"2025-05-13T18:15:51.820Z","etag":null,"topics":["crawler","cross-platform","csharp","distributed","dotnetcore"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dotnetcore.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-03-21T03:37:32.000Z","updated_at":"2025-05-13T08:29:58.000Z","dependencies_parsed_at":"2022-07-13T15:29:46.855Z","dependency_job_id":"601b7e97-3b91-47c3-9a69-5d9f0752a6eb","html_url":"https://github.com/dotnetcore/DotnetSpider","commit_stats":{"total_commits":111,"total_committers":11,"mean_commits":"10.090909090909092","dds":0.1711711711711712,"last_synced_commit":"f01fd36c4c8ed5b91e18908d6cc14b68111661cd"},"previous_names":["zlzforever/dotnetspider"],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dotnetcore%2FDotnetSpider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/dotnetcore%2FDotnetSpider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dotnetcore%2FDotnetSpider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dotnetcore%2FDotnetSpider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dotnetcore","download_url":"https://codeload.github.com/dotnetcore/DotnetSpider/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254000895,"owners_count":21997444,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","cross-platform","csharp","distributed","dotnetcore"],"created_at":"2024-09-24T20:34:57.681Z","updated_at":"2026-04-02T18:09:47.918Z","avatar_url":"https://github.com/dotnetcore.png","language":"C#","readme":"# DotnetSpider\n\nDisclaimer: This framework is intended to help developers simplify their development workflow and improve efficiency. Do not use it for anything that violates national law. The authors of this framework bear no responsibility for anything users do with it.\n\n[![Build Status](https://dev.azure.com/zlzforever/DotnetSpider/_apis/build/status/dotnetcore.DotnetSpider?branchName=master)](https://dev.azure.com/zlzforever/DotnetSpider/_build/latest?definitionId=3\u0026branchName=master)\n[![NuGet](https://img.shields.io/nuget/vpre/DotnetSpider.svg)](https://www.nuget.org/packages/DotnetSpider)\n[![Member project of .NET Core Community](https://img.shields.io/badge/member%20project%20of-NCC-9e20c9.svg)](https://github.com/dotnetcore)\n[![GitHub license](https://img.shields.io/github/license/dotnetcore/DotnetSpider.svg)](https://github.com/dotnetcore/DotnetSpider/blob/master/LICENSE.txt)\n\nDotnetSpider is a .NET Standard web crawling library. 
It is a lightweight, efficient, and fast high-level web crawling \u0026 scraping framework.\n\nTo get the latest beta packages, add the MyGet feed:\n\n````html\n\u003cadd key=\"myget.org\" value=\"https://www.myget.org/F/zlzforever/api/v3/index.json\" protocolVersion=\"3\" /\u003e\n````\n\n### DESIGN\n\n![DESIGN IMAGE](https://github.com/dotnetcore/DotnetSpider/blob/master/images/%E6%95%B0%E6%8D%AE%E9%87%87%E9%9B%86%E7%B3%BB%E7%BB%9F.png?raw=true)\n\n### DEVELOPMENT ENVIRONMENT\n\n1. Visual Studio 2017 (15.3 or later) or JetBrains Rider\n2. [.NET Core 2.2 or later](https://www.microsoft.com/net/download/windows)\n3. Docker\n4. MySQL\n\n        docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7\n\n5. Redis (optional)\n\n        docker run --name redis -d -p 6379:6379 --restart always redis\n\n6. SQL Server\n\n        docker run --name sqlserver -d -p 1433:1433 --restart always  -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest\n\n7. PostgreSQL (optional)\n\n        docker run --name postgres -d  -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres\n\n8. MongoDB (optional)\n\n        docker run --name mongo -d -p 27017:27017 --restart always mongo\n\n9. RabbitMQ\n\n        docker run -d --restart always --name rabbimq -p 4369:4369 -p 5671-5672:5671-5672 -p 25672:25672 -p 15671-15672:15671-15672 \\\n               -e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=password \\\n               rabbitmq:3-management\n\n10. Docker remote API for Mac\n\n        docker run -d  --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock\n\n11. 
HBase\n\n        docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase\n\n### MORE DOCUMENTATION\n\nhttps://github.com/dotnetcore/DotnetSpider/wiki\n\n### SAMPLES\n\nSee the DotnetSpider.Sample project in the solution.\n\n### BASIC USAGE\n\n[Basic usage code](https://github.com/dotnetcore/DotnetSpider/blob/master/src/DotnetSpider.Sample/samples/BaseUsageSpider.cs)\n\n### ADDITIONAL USAGE: Configurable Entity Spider\n\n[View the complete code](https://github.com/dotnetcore/DotnetSpider/blob/master/src/DotnetSpider.Sample/samples/EntitySpider.cs)\n\n````csharp\n[DisplayName(\"博客园爬虫\")]\npublic class EntitySpider(\n    IOptions\u003cSpiderOptions\u003e options,\n    DependenceServices services,\n    ILogger\u003cSpider\u003e logger)\n    : Spider(options, services, logger)\n{\n    public static async Task RunAsync()\n    {\n        var builder = Builder.CreateDefaultBuilder\u003cEntitySpider\u003e(options =\u003e\n        {\n            options.Speed = 1;\n        });\n        builder.UseSerilog();\n        builder.IgnoreServerCertificateError();\n        await builder.Build().RunAsync();\n    }\n\n    protected override async Task InitializeAsync(CancellationToken stoppingToken = default)\n    {\n        AddDataFlow\u003cDataParser\u003cCnblogsEntry\u003e\u003e();\n        AddDataFlow(GetDefaultStorage);\n        await AddRequestsAsync(\n            new Request(\n                \"https://news.cnblogs.com/n/page/1\", new Dictionary\u003cstring, object\u003e { { \"网站\", \"博客园\" } }));\n    }\n\n    [Schema(\"cnblogs\", \"news\")]\n    [EntitySelector(Expression = \".//div[@class='news_block']\", Type = SelectorType.XPath)]\n    [GlobalValueSelector(Expression = \".//a[@class='current']\", Name = \"类别\", Type = SelectorType.XPath)]\n    [GlobalValueSelector(Expression = \"//title\", Name = \"Title\", Type = SelectorType.XPath)]\n    [FollowRequestSelector(Expressions = 
[\"//div[@class='pager']\"])]\n    public class CnblogsEntry : EntityBase\u003cCnblogsEntry\u003e\n    {\n        protected override void Configure()\n        {\n            HasIndex(x =\u003e x.Title);\n            HasIndex(x =\u003e new { x.WebSite, x.Guid }, true);\n        }\n\n        public int Id { get; set; }\n\n        [Required]\n        [StringLength(200)]\n        [ValueSelector(Expression = \"类别\", Type = SelectorType.Environment)]\n        public string Category { get; set; }\n\n        [Required]\n        [StringLength(200)]\n        [ValueSelector(Expression = \"网站\", Type = SelectorType.Environment)]\n        public string WebSite { get; set; }\n\n        [StringLength(200)]\n        [ValueSelector(Expression = \"Title\", Type = SelectorType.Environment)]\n        [ReplaceFormatter(NewValue = \"\", OldValue = \" - 博客园\")]\n        public string Title { get; set; }\n\n        [StringLength(40)]\n        [ValueSelector(Expression = \"GUID\", Type = SelectorType.Environment)]\n        public string Guid { get; set; }\n\n        [ValueSelector(Expression = \".//h2[@class='news_entry']/a\")]\n        public string News { get; set; }\n\n        [ValueSelector(Expression = \".//h2[@class='news_entry']/a/@href\")]\n        public string Url { get; set; }\n\n        [ValueSelector(Expression = \".//div[@class='entry_summary']\")]\n        [TrimFormatter]\n        public string PlainText { get; set; }\n\n        [ValueSelector(Expression = \"DATETIME\", Type = SelectorType.Environment)]\n        public DateTime CreationTime { get; set; }\n    }\n}\n\n````\n\n#### Distributed spider\n\n[Read this document](https://github.com/dotnetcore/DotnetSpider/wiki/3-Distributed-Spider)\n\n#### Puppeteer downloader\n\nComing soon\n\n### NOTICE\n\n#### When you use the Redis scheduler, update your Redis config:\n\n    timeout 0\n    tcp-keepalive 60\n\n### Dependencies\n\n| Package | License |\n| --- | --- |\n| Bert.RateLimiters | Apache 2.0 |\n| MessagePack | 
MIT |\n| Newtonsoft.Json | MIT |\n| Dapper | Apache 2.0 |\n| HtmlAgilityPack | MIT |\n| ZCJ.HashedWheelTimer | MIT |\n| murmurhash | Apache 2.0 |\n| Serilog.AspNetCore | Apache 2.0 |\n| Serilog.Sinks.Console | Apache 2.0 |\n| Serilog.Sinks.RollingFile | Apache 2.0 |\n| Serilog.Sinks.PeriodicBatching | Apache 2.0 |\n| MongoDB.Driver | Apache 2.0 |\n| MySqlConnector | MIT |\n| AutoMapper.Extensions.Microsoft.DependencyInjection | MIT |\n| Docker.DotNet | MIT |\n| BuildBundlerMinifier | Apache 2.0 |\n| Pomelo.EntityFrameworkCore.MySql | MIT |\n| Quartz.AspNetCore | Apache 2.0 |\n| Quartz.AspNetCore.MySqlConnector | Apache 2.0 |\n| Npgsql | PostgreSQL License |\n| RabbitMQ.Client | Apache 2.0 |\n| Polly | BSD 3-Clause |\n\n### AREAS FOR IMPROVEMENT\n\n- QQ Group: 477731655\n- Email: zlzforever@163.com\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdotnetcore%2Fdotnetspider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdotnetcore%2Fdotnetspider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdotnetcore%2Fdotnetspider/lists"}