{"id":15039009,"url":"https://github.com/marcel0024/cococrawler","last_synced_at":"2025-04-10T00:05:31.842Z","repository":{"id":246914020,"uuid":"824663156","full_name":"Marcel0024/CocoCrawler","owner":"Marcel0024","description":"An declarative and easy to use web crawler and scraper in C#","archived":false,"fork":false,"pushed_at":"2024-09-11T21:21:51.000Z","size":85,"stargazers_count":27,"open_issues_count":1,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-10T00:05:19.558Z","etag":null,"topics":["cococrawler","crawler","crawling-tool","csharp","dotnet","dotnetcore","scraper","scraping-tool","webcrawler","webcrawler-csharp","webcrawling","webscraper"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Marcel0024.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-05T16:20:26.000Z","updated_at":"2024-10-04T19:05:51.000Z","dependencies_parsed_at":"2024-09-12T06:38:57.692Z","dependency_job_id":"8d525017-7796-4eb8-a591-b293676fa34e","html_url":"https://github.com/Marcel0024/CocoCrawler","commit_stats":null,"previous_names":["marcel0024/cococrawler"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Marcel0024%2FCocoCrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Marcel0024%2FCocoCrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Marcel0024%2FCocoCrawler/releases","manifests_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Marcel0024%2FCocoCrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Marcel0024","download_url":"https://codeload.github.com/Marcel0024/CocoCrawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248131319,"owners_count":21052819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cococrawler","crawler","crawling-tool","csharp","dotnet","dotnetcore","scraper","scraping-tool","webcrawler","webcrawler-csharp","webcrawling","webscraper"],"created_at":"2024-09-24T20:41:12.909Z","updated_at":"2025-04-10T00:05:31.819Z","avatar_url":"https://github.com/Marcel0024.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CocoCrawler 🥥\n\n[![NuGet](https://img.shields.io/nuget/v/CocoCrawler?logo=nuget\u0026logoColor=fff)](https://www.nuget.org/packages/CocoCrawler)\n[![Build and Publish](https://github.com/Marcel0024/CocoCrawler/actions/workflows/main.yml/badge.svg)](https://github.com/Marcel0024/CocoCrawler/actions/workflows/main.yml)\n\n\n`CocoCrawler` is an easy to use web crawler, scraper and parser in C#. 
By combining `PuppeteerSharp` and `AngleSharp`, it brings the best of both sides and merges them into an easy to use API.\n\nIt provides a simple API to get started:\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(\"https://old.reddit.com/r/csharp\", pageOptions =\u003e pageOptions\n        .ExtractList(containersSelector: \"div.thing.link.self\", [\n            new(\"Title\",\"a.title\"),\n            new(\"Upvotes\", \"div.score.unvoted\"),\n            new(\"Datetime\", \"time\", \"datetime\"),\n            new(\"Total Comments\",\"a.comments\"),\n            new(\"Url\",\"a.title\", \"href\")\n        ])\n        .AddPagination(\"span.next-button \u003e a\")\n        .ConfigurePageActions(options =\u003e // Only for showing the possibilities, not needed for running sample\n        {\n            options.ScrollToEnd();\n            options.Wait(2000);\n            // options.Click(\"span.next-button \u003e a\");\n        })\n        .AddOutputToConsole()\n        .AddOutputToCsvFile(\"results.csv\")\n    )\n    .ConfigureEngine(options =\u003e\n    {\n        options.UseHeadlessMode(false);\n        options.PersistVisitedUrls();\n        options.WithLoggerFactory(loggerFactory);\n        options.WithCookies([\n            new(\"auth-cookie\", \"l;alqpekcoizmdfugnvkjgvsaaprufc\", \"thedomain.com\")\n        ]);\n    })\n    .BuildAsync(cancellationToken);\n\nawait crawlerEngine.RunAsync(cancellationToken);\n```\n\nThis example starts at the page `https://old.reddit.com/r/csharp`, scrapes all the posts, then continues to the next page and scrapes everything again, and so on. 
It outputs everything scraped to the console and a CSV file.\n\nWith this library it's easy to:\n\n* Scrape Single Page Apps\n* Scrape Listings\n* Add pagination\n* As an alternative to scraping a list, open each post, scrape its page, and continue with pagination\n* Scrape multiple pages in parallel\n* Add custom outputs\n* Customize Everything\n\n## Scraping pages\n\nWith each Page (a Page is a single URL job) added, it's possible to add a Task. For each Page, the following Tasks are available:\n\n### `.ExtractObject(...)`\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(\"https://github.com/\", pageOptions =\u003e pageOptions\n        .ExtractObject([\n            new(Name: \"Title\", Selector: \"div.title \u003e a \u003e span\"),\n            new(Name: \"Description\", Selector: \"div.title \u003e a \u003e span\"),\n        ]))\n    .BuildAsync(cancellationToken);\n```\n\nThis scrapes the title and description of the page and outputs them.\n\n### `.ExtractList(...)`\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(\"https://github.com/\", pageOptions =\u003e pageOptions\n        .ExtractList(containersSelector: \"div \u003e div.repos\", [\n            new(Name: \"Title\", Selector: \"div.title \u003e a \u003e span\"),\n            new(Name: \"Description\", Selector: \"div.title \u003e a \u003e span\"),\n        ]))\n    .BuildAsync(cancellationToken);\n```\nExtractList scrapes a list of objects. The `containersSelector` is the selector for the container that holds the objects. 
All selectors after that are relative to the container.\nEach object in the list is individually sent to the output.\n\n\n### `.OpenLinks(...)`\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(\"https://github.com/\", pageOptions =\u003e pageOptions\n        .OpenLinks(linksSelector: \"div.example-link-to-repose\", subPage =\u003e subPage\n            .ExtractObject([\n                new(\"Title\",\"div.sitetable.linklisting a.title\"),\n            ])))\n    .BuildAsync(cancellationToken);\n```\n\nOpenLinks opens each link matched by `linksSelector` and scrapes that page. It's usually combined with `.ExtractObject(...)` and `.AddPagination(...)`. `linksSelector` should match a list of `a` tags. It's also possible to chain multiple `.OpenLinks(...)`.\n\n\n### `.AddPagination(...)`\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(\"https://github.com/\", pageOptions =\u003e pageOptions\n        .ExtractList(containersSelector: \"div \u003e div.repos\", [\n            new(Name: \"Title\", Selector: \"div.title \u003e a \u003e span\"),\n            new(Name: \"Description\", Selector: \"div.title \u003e a \u003e span\"),\n        ])\n        .AddPagination(\"span.next-button \u003e a\"))\n    .BuildAsync(cancellationToken);\n```\n\nAddPagination adds pagination to the page. It expects a selector for the link to the next page. 
This is usually the `Next` button.\n\n\n## Multiple Pages\n\nIt's possible to add multiple pages to scrape with the same Tasks.\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPages([\"https://old.reddit.com/r/csharp\", \"https://old.reddit.com/r/dotnet\"], pageOptions =\u003e pageOptions\n        .OpenLinks(\"div.thing.link.self a.bylink.comments\", subPageOptions =\u003e\n        {\n            subPageOptions.ExtractObject([\n                new(\"Title\",\"div.sitetable.linklisting a.title\"),\n                new(\"Url\",\"div.sitetable.linklisting a.title\", \"href\"),\n                new(\"Upvotes\", \"div.sitetable.linklisting div.score.unvoted\"),\n                new(\"Top comment\", \"div.commentarea div.entry.unvoted div.md\"),\n            ]);\n            subPageOptions.ConfigurePageActions(ops =\u003e\n            {\n                ops.ScrollToEnd();\n                ops.Wait(4000);\n            });\n        })\n        .AddPagination(\"span.next-button \u003e a\"))\n    .BuildAsync(cancellationToken);\n\nawait crawlerEngine.RunAsync(cancellationToken);\n```\nThis example starts at `https://old.reddit.com/r/csharp` and `https://old.reddit.com/r/dotnet`, opens each post, and scrapes the title, URL, upvotes and top comment. On each post page it scrolls to the end and waits 4 seconds before scraping. It then continues with the next pagination page.\n\n\n## PageActions - A way to interact with the browser\n\nPage Actions are a way to interact with the browser. They can be added to each page, for example to click away popups or scroll to the bottom. 
The following actions are available:\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(\"https://github.com/\", pageOptions =\u003e pageOptions\n        .ExtractList(containersSelector: \"div \u003e div.repos\", [\n            new(Name: \"Title\", Selector: \"div.title \u003e a \u003e span\"),\n            new(Name: \"Description\", Selector: \"div.title \u003e a \u003e span\"),\n        ])\n        .ConfigurePageActions(ops =\u003e\n        {\n            ops.ScrollToEnd();\n            ops.Click(\"button#load-more\");\n            ops.Wait(4000);\n        }))\n    .BuildAsync(cancellationToken);\n```\n\n## Outputs\n\nIt's possible to add multiple outputs. The following outputs are available:\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(\"https://github.com/\", pageOptions =\u003e pageOptions\n        .OpenLinks(linksSelector: \"div.example-link-to-repose\", subPage =\u003e subPage\n            .ExtractObject([\n                new(\"Title\",\"div.sitetable.linklisting a.title\"),\n            ]))\n        .AddOutputToConsole()\n        .AddOutputToCsvFile(\"results.csv\"))\n    .BuildAsync(cancellationToken);\n```\n\nYou can add your own output by implementing the `ICrawlOutput` interface.\n\n```csharp\npublic interface ICrawlOutput\n{\n    Task Initialize(CancellationToken cancellationToken);\n    Task WriteAsync(JObject jObject, CancellationToken cancellationToken);\n}\n```\n\n`Initialize` is called once before the engine starts. 
`WriteAsync` is called for each object that is scraped.\n\nAt the Page level, it's possible to add custom outputs:\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(\"\", p =\u003e p.AddOutput(new MyCustomOutput()))\n    .BuildAsync(cancellationToken);\n```\n\n## Configuring the Engine\n\n### Cookies\n\nIt's possible to add cookies to all requests:\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(...)\n    .ConfigureEngine(options =\u003e\n    {\n        options.WithCookies([\n            new(\"auth-cookie\", \"l;alqpekcoizmdfugnvkjgvsaaprufc\", \"thedomain.com\"),\n            new(\"Cookie2\", \"def\", \"localhost\")\n        ]);\n    })\n    .BuildAsync(cancellationToken);\n```\n\n### Setting the User Agent\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(...)\n    .ConfigureEngine(options =\u003e\n    {\n        options.WithUserAgent(\"linux browser - example user agent\");\n    })\n    .BuildAsync(cancellationToken);\n```\nThe default User Agent is that of the Chrome browser.\n\n### Ignoring URLs\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(...)\n    .ConfigureEngine(options =\u003e\n    {\n        options.WithIgnoreUrls([\"https://example.com\", \"https://example2.com\"]);\n    })\n    .BuildAsync(cancellationToken);\n```\n\n### Stopping the engine\n\nThe engine stops when either:\n\n* The total number of pages to crawl is reached.\n* 2 minutes have passed since the last job was added.\n\n### Persisting visited pages\n\nIt's possible to persist visited pages to a file. 
Once persisted, the engine will skip those pages on the next run.\n\n```csharp\nvar crawlerEngine = await new CrawlerEngineBuilder()\n    .AddPage(...)\n    .ConfigureEngine(options =\u003e\n    {\n        options.PersistVisitedUrls();\n    })\n    .BuildAsync(cancellationToken);\n```\n\n### Other notable options\n\nThe engine can be configured with the following options:\n\n* `UseHeadlessMode(bool headless)`: Whether the browser should run headless\n* `WithLoggerFactory(ILoggerFactory loggerFactory)`: The logger factory to use, to enable logging\n* `TotalPagesToCrawl(int total)`: The total number of pages to crawl\n* `WithParallelismDegree(int parallelismDegree)`: The number of browser tabs it can open in parallel\n\n## Extensibility\n\nThe library is designed to be extensible. It's possible to add custom `IParser`, `IScheduler`, `IVisitedUrlTracker` and `ICrawler` implementations.\n\nUsing the engine builder, it's possible to add custom implementations:\n\n```csharp\n.ConfigureEngine(options =\u003e\n{\n    options.WithCrawler(new MyCustomCrawler());\n    options.WithScheduler(new MyCustomScheduler());\n    options.WithParser(new MyCustomParser());\n    options.WithVisitedUrlTracker(new MyCustomVisitedUrlTracker());\n})\n```\n\n| Interfaces           | Description                                                                                                           |\n| -------------------- | --------------------------------------------------------------------------------------------------------------------- |\n| `IParser`            | IParser uses default AngleSharp. If you want to use something other than CSS selectors, override this.                |\n| `IVisitedUrlTracker` | Defaults to an in-memory tracker. It's possible to persist to a file. Those two options are available in the library. |\n| `IScheduler`         | Holds the current Jobs.                                                                                               
|\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarcel0024%2Fcococrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarcel0024%2Fcococrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarcel0024%2Fcococrawler/lists"}