{"id":13590697,"url":"https://github.com/pavlovtech/WebReaper","last_synced_at":"2025-04-08T14:31:34.706Z","repository":{"id":61167329,"uuid":"480985506","full_name":"pavlovtech/WebReaper","owner":"pavlovtech","description":"Web scraper, crawler and parser in C#. Designed as simple, declarative and scalable web scraping solution.","archived":false,"fork":false,"pushed_at":"2024-10-29T17:28:10.000Z","size":39105,"stargazers_count":119,"open_issues_count":5,"forks_count":28,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-04T23:18:32.918Z","etag":null,"topics":["crawler","datamining","parser","parsing","scraper","scraping","scraping-api","scraping-data","scraping-tool","scraping-web","scraping-websites","webcrawler","webscraping"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pavlovtech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-04-12T21:59:25.000Z","updated_at":"2025-03-25T06:30:54.000Z","dependencies_parsed_at":"2024-10-29T18:50:01.287Z","dependency_job_id":null,"html_url":"https://github.com/pavlovtech/WebReaper","commit_stats":{"total_commits":525,"total_committers":4,"mean_commits":131.25,"dds":"0.013333333333333308","last_synced_commit":"a2efb8f67a94b5931f4e9fa8b7978d3f0ed2e656"},"previous_names":["pavlovtech/exoscraper","pavlovtech/exoscan"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pavlovtech%2FWebReaper","t
ags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pavlovtech%2FWebReaper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pavlovtech%2FWebReaper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pavlovtech%2FWebReaper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pavlovtech","download_url":"https://codeload.github.com/pavlovtech/WebReaper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247860662,"owners_count":21008329,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","datamining","parser","parsing","scraper","scraping","scraping-api","scraping-data","scraping-tool","scraping-web","scraping-websites","webcrawler","webscraping"],"created_at":"2024-08-01T16:00:49.586Z","updated_at":"2025-04-08T14:31:34.290Z","avatar_url":"https://github.com/pavlovtech.png","language":"C#","funding_links":[],"categories":["C#","C\\#"],"sub_categories":[],"readme":"![logo](https://user-images.githubusercontent.com/6662454/221978697-3f35564a-f442-46e6-9182-f2604a17e1f6.png)\n\n# WebReaper\n\n[![NuGet](https://img.shields.io/nuget/v/WebReaper)](https://www.nuget.org/packages/WebReaper)\n[![build status](https://github.com/pavlovtech/WebReaper/actions/workflows/CI.yml/badge.svg)](https://github.com/pavlovtech/WebReaper/actions/workflows/CI.yml)\n\n## Overview\n\nWebReaper is a declarative high performance web scraper, crawler and parser in C#. 
Designed as a simple, extensible and\nscalable web scraping solution. Easily crawl any web site, parse the data, and save the structured result to a file, a DB, or\npretty much anywhere you want.\n\nIt provides a simple yet extensible API to make web scraping a breeze.\n\n### 📋 Example:\n\n![ray-so-export](https://user-images.githubusercontent.com/6662454/229387724-82ad04cb-6b90-42b8-ba2a-a3735fb94abe.png)\n\n## Table of contents\n\n- [Install](#install)\n- [Requirements](#requirements)\n- [Features](#features)\n- [Usage examples](#usage-examples)\n- [API overview](#api-overview)\n  * [Parsing Single Page Applications](#parsing-single-page-applications)\n  * [Persist the progress locally](#persist-the-progress-locally)\n  * [Authorization](#authorization)\n  * [How to disable headless mode](#how-to-disable-headless-mode)\n  * [How to clean scraped data from the previous web scrapping run](#how-to-clean-scraped-data-from-the-previous-web-scrapping-run)\n  * [How to clean visited links from the previous web scrapping run](#how-to-clean-visited-links-from-the-previous-web-scrapping-run)\n  * [How to clean job queue from the previous web scraping run](#how-to-clean-job-queue-from-the-previous-web-scraping-run)\n  * [Distributed web scraping with Serverless approach](#distributed-web-scraping-with-serverless-approach)\n  * [Extensibility](#extensibility)\n    + [Adding a new sink to persist your data](#adding-a-new-sink-to-persist-your-data)\n  * [Intrefaces](#intrefaces)\n  * [Main entities](#main-entities)\n- [Repository structure](#repository-structure)\n\n## Install\n\n```\ndotnet add package WebReaper\n```\n\n## Requirements\n\n.NET 8\n\n## Features\n\n* :zap: High crawling speed due to parallelism and asynchrony\n* 🗒 Declarative and easy to use\n* 💾 Saving data to any data storage, such as JSON or CSV files, MongoDB, CosmosDB, Redis, etc.\n* :earth_americas: Scalable: run your web scraper on any cloud VMs, serverless functions, on-prem\n  servers, etc.\n* :octopus: 
Crawling and parsing Single Page Applications with Puppeteer\n* 🖥 Proxy support\n* 🌀 Extensible: replace out-of-the-box implementations with your own\n\n## Usage examples\n\n* Data mining\n* Gathering data for machine learning\n* Online price change monitoring and price comparison\n* News aggregation\n* Product review scraping (to watch the competition)\n* Tracking online presence and reputation\n\n## API overview\n\n### Parsing Single Page Applications\n\nParsing single page applications is simple: just use the *GetWithBrowser* and/or *FollowWithBrowser* methods. In this\ncase Puppeteer is used to load the pages.\n\n```C#\nusing WebReaper.Builders;\n\nvar engine = await new ScraperEngineBuilder()\n    .GetWithBrowser(\"https://www.alexpavlov.dev/blog\")\n    .FollowWithBrowser(\".text-gray-900.transition\")\n    .Parse(new()\n    {\n        new(\"title\", \".text-3xl.font-bold\"),\n        new(\"text\", \".max-w-max.prose.prose-dark\")\n    })\n    .WriteToJsonFile(\"output.json\")\n    .PageCrawlLimit(10)\n    .WithParallelismDegree(30)\n    .LogToConsole()\n    .BuildAsync();\n\nawait engine.RunAsync();\n```\n\nAdditionally, you can run any JavaScript on dynamic pages as they are loaded with a headless browser. 
In order to do that\nyou need to add some page actions such as *.ScrollToEnd()*:\n\n```C#\nusing WebReaper.Core.Builders;\n\nvar engine = await new ScraperEngineBuilder()\n    .GetWithBrowser(\"https://www.reddit.com/r/dotnet/\", actions =\u003e actions\n        .ScrollToEnd()\n        .Build())\n    .Follow(\"a.SQnoC3ObvgnGjWt90zD9Z._2INHSNB8V5eaWp4P0rY_mE\")\n    .Parse(new()\n    {\n        new(\"title\", \"._eYtD2XCVieq6emjKBH3m\"),\n        new(\"text\", \"._3xX726aBn29LDbsDtzr_6E._1Ap4F5maDtT1E1YuCiaO0r.D3IL3FD0RFy_mkKLPwL4\")\n    })\n    .WriteToJsonFile(\"output.json\")\n    .LogToConsole()\n    .BuildAsync();\n\nawait engine.RunAsync();\n\nConsole.ReadLine();\n```\n\nThis can be helpful if the required content is loaded only after some user interactions such as clicks, scrolls, etc.\n\n### Persist the progress locally\n\nIf you want to persist the visited links and job queue locally, so that you can resume crawling where you left off, you\ncan use the *ScheduleWithTextFile* and *TrackVisitedLinksInFile* methods:\n\n```C#\nvar engine = await new ScraperEngineBuilder()\n    .WithLogger(logger)\n    .Get(\"https://rutracker.org/forum/index.php?c=33\")\n    .Follow(\"#cf-33 .forumlink\u003ea\")\n    .Follow(\".forumlink\u003ea\")\n    .Paginate(\"a.torTopic\", \".pg\")\n    .Parse(new()\n    {\n        new(\"name\", \"#topic-title\"),\n        new(\"category\", \"td.nav.t-breadcrumb-top.w100.pad_2\u003ea:nth-child(3)\"),\n        new(\"subcategory\", \"td.nav.t-breadcrumb-top.w100.pad_2\u003ea:nth-child(5)\"),\n        new(\"torrentSize\", \"div.attach_link.guest\u003eul\u003eli:nth-child(2)\"),\n        new(\"torrentLink\", \".magnet-link\", \"href\"),\n        new(\"coverImageUrl\", \".postImg\", \"src\")\n    })\n    .WriteToJsonFile(\"result.json\")\n    .IgnoreUrls(blackList)\n    .ScheduleWithTextFile(\"jobs.txt\", \"progress.txt\")\n    .TrackVisitedLinksInFile(\"links.txt\")\n    .BuildAsync();\n```\n\n### Authorization\n\nIf you need to pass authorization before parsing the web site, you can 
call the SetCookies method on the Scraper, which has to\nfill the CookieContainer with all cookies required for authorization. You are responsible for performing the login operation\nwith your credentials; the Scraper only uses the cookies that you provide.\n\n```C#\nvar engine = await new ScraperEngineBuilder()\n    .WithLogger(logger)\n    .Get(\"https://rutracker.org/forum/index.php?c=33\")\n    .SetCookies(cookies =\u003e\n    {\n        cookies.Add(new Cookie(\"AuthToken\", \"123\"));\n    })\n    ...\n```\n\n### How to disable headless mode\n\nIf you scrape pages with a browser using the GetWithBrowser and FollowWithBrowser methods, the default mode is headless,\nmeaning that you won't see the browser during scraping. However, seeing the browser during scraping may be useful for debugging or\ntroubleshooting. To disable headless mode, use the .HeadlessMode(false) method call.\n\n```C#\n\nvar engine = await new ScraperEngineBuilder()\n    .GetWithBrowser(\"https://www.reddit.com/r/dotnet/\", actions =\u003e actions\n        .ScrollToEnd()\n        .Build())\n    .HeadlessMode(false)\n    ...\n```\n\n### How to clean scraped data from the previous web scrapping run\n\nYou may want to clean the data received during the previous scraping run to start your web scraping from scratch. In this case\nuse dataCleanupOnStart when adding a new sink:\n\n```C#\n\nvar engine = await new ScraperEngineBuilder()\n    .Get(\"https://www.reddit.com/r/dotnet/\")\n    .WriteToJsonFile(\"output.json\", dataCleanupOnStart: true)\n```\n\nThis dataCleanupOnStart parameter is present for all sinks, e.g. 
MongoDbSink, RedisSink, CosmosSink, etc.\n\n### How to clean visited links from the previous web scrapping run\n\nTo clean up the list of visited links, pass true for the dataCleanupOnStart parameter:\n\n```C#\nvar engine = await new ScraperEngineBuilder()\n    .Get(\"https://www.reddit.com/r/dotnet/\")\n    .TrackVisitedLinksInFile(\"visited.txt\", dataCleanupOnStart: true)\n```\n\n### How to clean job queue from the previous web scraping run\n\nThe job queue is a queue of tasks scheduled for the web scraper. To clean up the job queue, pass the dataCleanupOnStart parameter set to true.\n\n```C#\nvar engine = await new ScraperEngineBuilder()\n    .Get(\"https://www.reddit.com/r/dotnet/\")\n    .WithTextFileScheduler(\"jobs.txt\", \"currentJob.txt\", dataCleanupOnStart: true)\n```\n\n### Distributed web scraping with Serverless approach\n\nIn the Examples folder you can find the project called WebReaper.AzureFuncs. It demonstrates the use of WebReaper with\nAzure Functions. It consists of two serverless functions:\n\n#### StartScrapting\n\nFirst, this function uses ScraperConfigBuilder to build the scraper configuration.\n\nSecond, it writes the first web scraping job with startUrl to the Azure Service Bus queue.\n\n#### WebReaperSpider\n\nThis Azure function is triggered by messages sent to the Azure Service Bus queue. 
Each message represents a web scraping job.\n\nFirst, this function builds the spider that is going to execute the job from the queue.\n\nSecond, it executes the job by loading the page, parsing content, saving to the database, etc.\n\nFinally, it iterates through the new jobs and sends them to the job queue.\n\n### Extensibility\n\n#### Adding a new sink to persist your data\n\nOut of the box there are four sinks you can send your parsed data to: ConsoleSink, CsvFileSink, JsonFileSink, and CosmosSink\n(Azure Cosmos database).\n\nYou can easily add your own by implementing the IScraperSink interface:\n\n```C#\npublic interface IScraperSink\n{\n    public Task EmitAsync(ParsedData data);\n}\n```\n\nHere is an example of the Console sink:\n\n```C#\npublic class ConsoleSink : IScraperSink\n{\n    public Task EmitAsync(ParsedData parsedItem)\n    {\n        Console.WriteLine($\"{parsedItem.Data.ToString()}\");\n        return Task.CompletedTask;\n    }\n}\n```\n\nTo add your sink to the Scraper, call the *AddSink* method:\n\n```C#\nvar engine = await new ScraperEngineBuilder()\n    .AddSink(new ConsoleSink())\n    .Get(\"https://rutracker.org/forum/index.php?c=33\")\n    .Follow(\"#cf-33 .forumlink\u003ea\")\n    .Follow(\".forumlink\u003ea\")\n    .Paginate(\"a.torTopic\", \".pg\")\n    .Parse(new() {\n        new(\"name\", \"#topic-title\"),\n    })\n    .BuildAsync();\n```\n\nFor other ways to extend the functionality, see the next section.\n\n### Intrefaces\n\n| Interface           | Description                                                                                                                   |\n|---------------------|-------------------------------------------------------------------------------------------------------------------------------|\n| IScheduler          | Reading and writing from the job queue. 
By default, the in-memory queue is used, but you can provide your own implementation  |\n| IVisitedLinkTracker | Tracker of visited links. A default implementation is an in-memory tracker. You can provide your own for Redis, MongoDB, etc. |\n| IPageLoader         | Loader that takes URL and returns HTML of the page as a string                                                                |\n| IContentParser      | Takes HTML and schema and returns JSON representation (JObject).                                                              |\n| ILinkParser         | Takes HTML as a string and returns page links                                                                                 |\n| IScraperSink        | Represents a data store for writing the results of web scraping. Takes the JObject as parameter                               |\n| ISpider             | A spider that does the crawling, parsing, and saving of the data                                                              |\n\n### Main entities\n\n* Job - a record that represents a job for the spider\n* LinkPathSelector - represents a selector for links to be crawled\n\n## Repository structure\n\n| Project                                   | Description                                                                       |\n|-------------------------------------------|-----------------------------------------------------------------------------------|\n| WebReaper                                 | Library for web scraping                                                          |\n| WebReaper.ScraperWorkerService            | Example of using WebReaper library in a Worker Service .NET project.              
|\n| WebReaper.DistributedScraperWorkerService | Example of using WebReaper library in a distributed way with Azure Service Bus    |\n| WebReaper.AzureFuncs                      | Example of using WebReaper library with serverless approach using Azure Functions |\n| WebReaper.ConsoleApplication              | Example of using WebReaper library in a console application                       |\n\nSee the [LICENSE](LICENSE.txt) file for license rights and limitations (GNU GPLv3).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpavlovtech%2FWebReaper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpavlovtech%2FWebReaper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpavlovtech%2FWebReaper/lists"}