{"id":30325903,"url":"https://github.com/win7user10/laraue.crawling","last_synced_at":"2025-08-17T23:08:40.486Z","repository":{"id":37585764,"uuid":"505996777","full_name":"win7user10/Laraue.Crawling","owner":"win7user10","description":"The set of tools for fast writing crawlers on the .NET","archived":false,"fork":false,"pushed_at":"2024-05-30T19:30:57.000Z","size":194,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-05-30T22:25:04.997Z","etag":null,"topics":["crawler","csharp","csharp-crawler","parser"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/win7user10.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-21T20:38:47.000Z","updated_at":"2024-05-30T19:31:00.000Z","dependencies_parsed_at":"2023-09-27T21:08:03.040Z","dependency_job_id":"27df25b8-f25a-4525-98b2-0db43a1ab057","html_url":"https://github.com/win7user10/Laraue.Crawling","commit_stats":null,"previous_names":[],"tags_count":48,"template":false,"template_full_name":null,"purl":"pkg:github/win7user10/Laraue.Crawling","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/win7user10%2FLaraue.Crawling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/win7user10%2FLaraue.Crawling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/win7user10%2FLaraue.Crawling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/win7user10%2FLarau
e.Crawling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/win7user10","download_url":"https://codeload.github.com/win7user10/Laraue.Crawling/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/win7user10%2FLaraue.Crawling/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270918404,"owners_count":24667679,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-17T02:00:09.016Z","response_time":129,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","csharp","csharp-crawler","parser"],"created_at":"2025-08-17T23:08:38.856Z","updated_at":"2025-08-17T23:08:40.465Z","avatar_url":"https://github.com/win7user10.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Laraue.Crawling packages\n\nA set of tools for quickly writing crawlers on .NET.\n\n\n[![latest version](https://img.shields.io/nuget/v/Laraue.Crawling.Common)](https://www.nuget.org/packages/Laraue.Crawling.Common)\n[![downloads](https://img.shields.io/nuget/dt/Laraue.Crawling.Common)](https://www.nuget.org/packages/Laraue.Crawling.Common)\n\n### Static HTML crawling\n\nStatic crawling means the crawler processes static HTML that does not change.\nYou can build a strongly typed schema by binding each element to its related HTML block.\nThen this schema can be parsed 
via the AngleSharpParser class located in the Laraue.Crawling.Static.AngleSharp library.\n\n#### Build a static HTML schema\n\n```html\n\u003cdiv\u003e\n    \u003cdiv class=\"title\"\u003ePrivate info\u003c/div\u003e\n    \u003cdiv class=\"user\"\u003e\n        \u003cdiv class=\"name\"\u003eAlex\u003c/div\u003e\n        \u003cdiv class=\"age\"\u003e10\u003c/div\u003e\n        \u003cdiv class=\"dogs\"\u003e\n            \u003cdiv class=\"dog\"\u003e\n                \u003cdiv class=\"name\"\u003eJelly\u003c/div\u003e\n                \u003cdiv class=\"age\"\u003e5\u003c/div\u003e\n            \u003c/div\u003e\n            \u003cdiv class=\"dog\"\u003e\n                \u003cdiv class=\"name\"\u003eMarly\u003c/div\u003e\n                \u003cdiv class=\"age\"\u003e7\u003c/div\u003e\n            \u003c/div\u003e\n        \u003c/div\u003e\n    \u003c/div\u003e\n    \u003cdiv class=\"links\"\u003e\n        \u003ca href=\"https://hey1.html\"\u003e\u003c/a\u003e\n        \u003ca href=\"https://hey2.html\"\u003e\u003c/a\u003e\n    \u003c/div\u003e\n\u003c/div\u003e\n```\n\n```csharp\npublic record OnePage(string Title, string[] ImageLinks, User User);\npublic record User(string Name, int Age, Dog[] Dogs);\npublic record Dog(string Name, int Age);\n\n\nvar schema = new AngleSharpSchemaBuilder\u003cOnePage\u003e()\n    .HasProperty(x =\u003e x.Title, \".title\")\n    .HasObjectProperty(x =\u003e x.User, \".user\", userBuilder =\u003e\n    {\n        userBuilder.HasProperty(x =\u003e x.Name, \".name\")\n            .HasProperty(x =\u003e x.Age, \".age\")\n            .HasArrayProperty(x =\u003e x.Dogs, \".dog\", dogsBuilder =\u003e\n            {\n                dogsBuilder.HasProperty(x =\u003e x.Age, \".age\")\n                    .HasProperty(x =\u003e x.Name, \".name\");\n            });\n    })\n    .HasArrayProperty(\n        x =\u003e x.ImageLinks,\n        \".links a\",\n        x =\u003e Task.FromResult(x.GetAttributeValue(\"href\")))\n    .Build();\n```\n\n#### Using 
the static schema to parse the passed html\n\n```csharp\nvar parser = new AngleSharpParser(new NullLoggerFactory());\n\nvar html = await File.ReadAllTextAsync(\"test.html\");\nvar model = await parser.RunAsync(schema, html);\n\nAssert.Equal(\"Private info\", model.Title);\nAssert.Equal(\"Alex\", model.User.Name);\nAssert.Equal(10, model.User.Age);\n\nvar dogs = model.User.Dogs;\nAssert.Equal(2, dogs.Length);\n\nvar dog1 = dogs[0];\nAssert.Equal(5, dog1.Age);\nAssert.Equal(\"Jelly\", dog1.Name);\n\nvar dog2 = dogs[1];\nAssert.Equal(7, dog2.Age);\nAssert.Equal(\"Marly\", dog2.Name);\n\nvar links = model.ImageLinks;\nAssert.Equal(2, links.Length);\nAssert.Equal(\"https://hey1.html\", links[0]);\nAssert.Equal(\"https://hey2.html\", links[1]);\n```\n\n#### Element schema\n\nSometimes full schema binding is unnecessary because only one value is required. In that case, the element schema class can be used.\n\n```csharp\nvar dogNamesSchema = new AngleSharpElementSchema\u003cstring[]\u003e(builder =\u003e builder.UseSelector(\".dog .name\"));\n\nvar parser = new AngleSharpParser(new NullLoggerFactory());\nvar html = await File.ReadAllTextAsync(\"test.html\");\n// Pass the element schema defined above, not the full page schema.\nvar dogNames = await parser.RunAsync(dogNamesSchema, html);\n\nAssert.Equal(2, dogNames.Length);\nAssert.Equal(\"Jelly\", dogNames[0]);\nAssert.Equal(\"Marly\", dogNames[1]);\n```\n\n### Dynamic HTML crawling\n\nThe Laraue.Crawling.Dynamic.PuppeterSharp package is intended to parse schemas using the PuppeteerSharp library.\nLet's rewrite the static schema in the dynamic format:\n\n```csharp\npublic record OnePage(string Title, string[] ImageLinks, User User);\npublic record User(string Name, int Age, Dog[] Dogs);\npublic record Dog(string Name, int Age);\n\n\nvar schema = new PuppeterSharpSchemaBuilder\u003cOnePage\u003e()\n    .HasProperty(x =\u003e x.Title, \".title\")\n    .HasObjectProperty(x =\u003e x.User, \".user\", userBuilder =\u003e\n    {\n        userBuilder.HasProperty(x =\u003e x.Name, \".name\")\n            .HasProperty(x =\u003e 
x.Age, \".age\")\n            .HasArrayProperty(x =\u003e x.Dogs, \".dog\", dogsBuilder =\u003e\n            {\n                dogsBuilder.HasProperty(x =\u003e x.Age, \".age\")\n                    .HasProperty(x =\u003e x.Name, \".name\");\n            });\n    })\n    .HasArrayProperty(\n        x =\u003e x.ImageLinks,\n        \".links a\",\n        async handle =\u003e await handle.GetAttributeValueAsync(\"href\"))\n    .Build();\n```\n\nThe main difference is that all functions now interact with the ElementHandle class from the PuppeteerSharp library.\nThe crawling can be executed this way:\n\n```csharp\n// Download a browser build if one is not already available locally.\nawait new BrowserFetcher().DownloadAsync();\nawait using var browser = await Puppeteer.LaunchAsync(new LaunchOptions());\nvar page = await browser.NewPageAsync();\nvar response = await page.GoToAsync(link);\n// Parse the schema starting from the page body element.\nvar model = await _parser.RunAsync(schema, await page.QuerySelectorAsync(\"body\"));\n```\n\n### Extended features\n\nSometimes binding an HTML element to a property is not enough. For example, one string may need to\nbe divided into three elements.\n\n```html\n\u003cp class=\"info\"\u003e\n    Bob Martin 37\n\u003c/p\u003e\n```\n\n```csharp\nrecord User(string Name, string Surname, int Age);\nvar schema = new PuppeterSharpSchemaBuilder\u003cUser\u003e()\n    .BindManually(async (element, modelBinder) =\u003e {\n        // Use a distinct name so the lambda parameter is not redeclared.\n        var infoElement = await element.QuerySelectorAsync(\".info\");\n        if (infoElement is null) return;\n        var elementText = await infoElement.GetInnerTextAsync();\n        var stringParts = elementText.Split(' ');\n        if (stringParts.Length != 3) return;\n        modelBinder.BindProperty(x =\u003e x.Name, stringParts[0]);\n        modelBinder.BindProperty(x =\u003e x.Surname, stringParts[1]);\n        modelBinder.BindProperty(x =\u003e x.Age, int.Parse(stringParts[2]));\n    })\n    .Build();\n```\n\n### XML static crawling\n\n```xml\n\u003call\u003e\n    \u003cnote\u003e\n        \u003cto id=\"15\"\u003eTove\u003c/to\u003e\n        \u003cbody\u003eDon't forget me this 
weekend!\u003c/body\u003e\n    \u003c/note\u003e\n    \u003cnote\u003e\n        \u003cto id=\"16\"\u003eMax\u003c/to\u003e\n        \u003cbody\u003eHi!\u003c/body\u003e\n    \u003c/note\u003e\n\u003c/all\u003e\n```\n\nUse the XmlSchemaBuilder class to build the schema.\n\n```csharp\nvar schema = new XmlSchemaBuilder\u003cXmlContent\u003e()\n    .HasArrayProperty\u003cNote\u003e(x =\u003e x.Notes, \"//note\", builder =\u003e\n    {\n        builder.HasProperty(y =\u003e y.Body, b =\u003e b.UseSelector(\"body\"));\n        builder.HasProperty(y =\u003e y.Id, b =\u003e b\n            .UseSelector(\"to\")\n            .GetInnerTextFromAttribute(\"id\"));\n    })\n    .Build();\n```\n\nThen parse the XML document with the schema:\n\n```csharp\nvar parser = new XmlParser(new NullLoggerFactory());\nvar xmlDocument = new XmlDocument();\n// xml holds the XML text shown above.\nxmlDocument.LoadXml(xml);\n\nvar result = await parser.RunAsync(schema, xmlDocument);\n\nAssert.NotEmpty(result!.Notes);\nvar notes = result.Notes.ToArray();\n\nAssert.Equal(\"Don't forget me this weekend!\", notes[0].Body);\nAssert.Equal(15, notes[0].Id);\n\nAssert.Equal(\"Hi!\", notes[1].Body);\nAssert.Equal(16, notes[1].Id);\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwin7user10%2Flaraue.crawling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwin7user10%2Flaraue.crawling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwin7user10%2Flaraue.crawling/lists"}