{"id":15036160,"url":"https://github.com/sjdirect/abotx","last_synced_at":"2025-04-09T23:21:11.885Z","repository":{"id":65259949,"uuid":"43165670","full_name":"sjdirect/abotx","owner":"sjdirect","description":"Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project! +1.","archived":false,"fork":false,"pushed_at":"2023-10-02T17:39:06.000Z","size":17791,"stargazers_count":134,"open_issues_count":8,"forks_count":23,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-24T01:11:38.312Z","etag":null,"topics":["abotx","abotx-website","cross-platform","csharp","csharp-library","framework","headless","headless-br","headless-browser","javascript-renderer","netcore","netcore3","netstan","netstandard","netstandard-libraries","netstandard20","spider","spiders","spiders-","web-crawler"],"latest_commit_sha":null,"homepage":"https://abotx.org","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sjdirect.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-09-25T17:47:28.000Z","updated_at":"2025-03-09T12:24:21.000Z","dependencies_parsed_at":"2024-09-24T20:30:26.631Z","dependency_job_id":null,"html_url":"https://github.com/sjdirect/abotx","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sjdirect%2Fabotx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sjdirect%2Fabotx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/sjdirect%2Fabotx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sjdirect%2Fabotx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sjdirect","download_url":"https://codeload.github.com/sjdirect/abotx/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248126362,"owners_count":21051909,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["abotx","abotx-website","cross-platform","csharp","csharp-library","framework","headless","headless-br","headless-browser","javascript-renderer","netcore","netcore3","netstan","netstandard","netstandard-libraries","netstandard20","spider","spiders","spiders-","web-crawler"],"created_at":"2024-09-24T20:30:23.089Z","updated_at":"2025-04-09T23:21:11.864Z","avatar_url":"https://github.com/sjdirect.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AbotX [![Build Status](https://dev.azure.com/sjdirect0945/AbotX/_apis/build/status/AbotX%20CI?branchName=master)](https://dev.azure.com/sjdirect0945/AbotX/_build/latest?definitionId=1\u0026branchName=master) [![NuGet](https://img.shields.io/nuget/v/Abotx.svg)](https://www.nuget.org/packages/Abotx/)\n\n*Please star this project!!*\n\nA powerful C# web crawler that makes advanced crawling features easy to use. AbotX builds upon [Abot C# Web Crawler Framework](https://github.com/sjdirect/abot/blob/master/README.md) by providing a powerful set of wrappers and extensions. 
\n\n## Features\n* Crawl multiple sites concurrently (ParallelCrawlerEngine)\n* Pause/resume live crawls (CrawlerX \u0026 ParallelCrawlerEngine)\n* Render javascript before processing (CrawlerX \u0026 ParallelCrawlerEngine)\n* Simplified pluggability/extensibility (CrawlerX \u0026 ParallelCrawlerEngine)\n* Avoid getting blocked by sites (AutoThrottling)\n* Automatically tune speed/concurrency (AutoTuning)\n\nAbotX used to be a commercial product but is now FREE! Use the AbotX.Lic file in the root of this repository.\n\n## Technical Details\n* Version 2.x targets .NET Standard 2.0 (compatible with .NET Framework 4.6.1+ or .NET Core 2+)\n* Version 1.x targets .NET Framework 4.0 (support ends soon, please upgrade)\n\n## Installing AbotX\nInstall AbotX using [NuGet](https://www.nuget.org/packages/Abotx/)\n  \n```command\nPM\u003e Install-Package AbotX\n```\n\nIf you have an AbotX.lic file, make sure it ends up in the bin directory of your application (i.e. in the same directory as the AbotX.dll file). \n\n## Quick Start \n\nAbotX adds advanced functionality, shortcuts and configurations to the rock solid [Abot C# Web Crawler](https://github.com/sjdirect/abot/blob/master/README.md). It is recommended that you start with Abot's documentation and quick start before coming here.  \n\nAbotX has two main entry points: CrawlerX and ParallelCrawlerEngine. CrawlerX is a single crawler instance (child of Abot's PoliteWebCrawler class) while ParallelCrawlerEngine creates and manages multiple instances of CrawlerX. If you just want to crawl a single site, CrawlerX is where you want to start. If you want to crawl a configurable number of sites concurrently within the same process, the ParallelCrawlerEngine is what you are after. 
\n\n#### Using AbotX\n```c#\nusing System;\nusing System.Collections.Generic;\nusing System.Threading;\nusing System.Threading.Tasks;\nusing Abot2;\nusing AbotX2.Crawler;\nusing AbotX2.Parallel;\nusing AbotX2.Poco;\nusing Serilog;\n\nnamespace AbotX2.Demo\n{\n    class Program\n    {\n        static async Task Main(string[] args)\n        {\n            //Use Serilog to log\n            Log.Logger = new LoggerConfiguration()\n                .MinimumLevel.Information()\n                .Enrich.WithThreadId()\n                .WriteTo.Console(outputTemplate: Constants.LogFormatTemplate)\n                .CreateLogger();\n\n            var siteToCrawl = new Uri(\"YourSiteHere\");\n\n            //Uncomment to demo major features\n            //await DemoCrawlerX_PauseResumeStop(siteToCrawl);\n            //await DemoCrawlerX_JavascriptRendering(siteToCrawl);\n            //await DemoCrawlerX_AutoTuning(siteToCrawl);\n            //await DemoCrawlerX_Throttling(siteToCrawl);\n            //await DemoParallelCrawlerEngine();\n        }\n\n        private static async Task DemoCrawlerX_PauseResumeStop(Uri siteToCrawl)\n        {\n            using (var crawler = new CrawlerX(GetSafeConfig()))\n            {\n                crawler.PageCrawlCompleted += (sender, args) =\u003e\n                {\n                    //Check out args.CrawledPage for any info you need\n                };\n                var crawlTask = crawler.CrawlAsync(siteToCrawl);\n\n                crawler.Pause();    //Suspend all operations\n\n                Thread.Sleep(7000);\n\n                crawler.Resume();   //Resume as if nothing happened\n\n                crawler.Stop(true); //Stop or abort the crawl\n\n                await crawlTask;\n            }\n        }\n\n        private static async Task DemoCrawlerX_JavascriptRendering(Uri siteToCrawl)\n        {\n            var pathToPhantomJSExeFolder = @\"[YourNugetPackagesLocationAbsolutePath]\\PhantomJS.2.1.1\\tools\\phantomjs\";\n     
       var config = new CrawlConfigurationX\n            {\n                IsJavascriptRenderingEnabled = true,\n                JavascriptRendererPath = pathToPhantomJSExeFolder,\n                IsSendingCookiesEnabled = true,\n                MaxConcurrentThreads = 1,\n                MaxPagesToCrawl = 1,\n                JavascriptRenderingWaitTimeInMilliseconds = 3000,\n                CrawlTimeoutSeconds = 20\n            };\n\n            using (var crawler = new CrawlerX(config))\n            {\n                crawler.PageCrawlCompleted += (sender, args) =\u003e\n                {\n                    //JS should be fully rendered here args.CrawledPage.Content.Text\n                };\n\n                await crawler.CrawlAsync(siteToCrawl);\n            }\n        }\n\n        private static async Task DemoCrawlerX_AutoTuning(Uri siteToCrawl)\n        {\n            var config = GetSafeConfig();\n            config.AutoTuning = new AutoTuningConfig\n            {\n                IsEnabled = true,\n                CpuThresholdHigh = 85,\n                CpuThresholdMed = 65,\n                MinAdjustmentWaitTimeInSecs = 10\n            };\n            //Optional, configure how aggressively to speed up or down during throttling\n            config.Accelerator = new AcceleratorConfig();\n            config.Decelerator = new DeceleratorConfig();\n\n            //Now the crawl is able to \"AutoTune\" itself if the host machine\n            //is showing signs of stress.\n            using (var crawler = new CrawlerX(config))\n            {\n                crawler.PageCrawlCompleted += (sender, args) =\u003e\n                {\n                    //Check out args.CrawledPage for any info you need\n                };\n                await crawler.CrawlAsync(siteToCrawl);\n            }\n        }\n\n        private static async Task DemoCrawlerX_Throttling(Uri siteToCrawl)\n        {\n            var config = GetSafeConfig();\n            
config.AutoThrottling = new AutoThrottlingConfig\n            {\n                IsEnabled = true,\n                ThresholdHigh = 2,\n                ThresholdMed = 2,\n                MinAdjustmentWaitTimeInSecs = 10\n            };\n            //Optional, configure how aggressively to speed up or down during throttling\n            config.Accelerator = new AcceleratorConfig();\n            config.Decelerator = new DeceleratorConfig();\n\n            //Now the crawl is able to \"Throttle\" itself if the site being crawled\n            //is showing signs of stress.\n            using (var crawler = new CrawlerX(config))\n            {\n                crawler.PageCrawlCompleted += (sender, args) =\u003e\n                {\n                    //Check out args.CrawledPage for any info you need\n                };\n                await crawler.CrawlAsync(siteToCrawl);\n            }\n        }\n\n        private static async Task DemoParallelCrawlerEngine()\n        {\n            var siteToCrawlProvider = new SiteToCrawlProvider();\n            siteToCrawlProvider.AddSitesToCrawl(new List\u003cSiteToCrawl\u003e\n            {\n                new SiteToCrawl{ Uri = new Uri(\"YOURSITE1\") },\n                new SiteToCrawl{ Uri = new Uri(\"YOURSITE2\") },\n                new SiteToCrawl{ Uri = new Uri(\"YOURSITE3\") },\n                new SiteToCrawl{ Uri = new Uri(\"YOURSITE4\") },\n                new SiteToCrawl{ Uri = new Uri(\"YOURSITE5\") }\n            });\n\n            var config = GetSafeConfig();\n            config.MaxConcurrentSiteCrawls = 3;\n                \n            var crawlEngine = new ParallelCrawlerEngine(\n                config, \n                new ParallelImplementationOverride(config, \n                    new ParallelImplementationContainer()\n                    {\n                        SiteToCrawlProvider = siteToCrawlProvider,\n                        WebCrawlerFactory = new WebCrawlerFactory(config)//Same config will be 
used for every crawler\n                    })\n                );                \n            \n            var crawlCounts = new Dictionary\u003cGuid, int\u003e();\n            var siteStartingEvents = 0;\n            var allSitesCompletedEvents = 0;\n            crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =\u003e\n            {\n                var crawlId = Guid.NewGuid();\n                eventArgs.Crawler.CrawlBag.CrawlId = crawlId;\n            };\n            crawlEngine.SiteCrawlStarting += (sender, args) =\u003e\n            {\n                Interlocked.Increment(ref siteStartingEvents);\n            };\n            crawlEngine.SiteCrawlCompleted += (sender, eventArgs) =\u003e\n            {\n                lock (crawlCounts)\n                {\n                    crawlCounts.Add(eventArgs.CrawledSite.SiteToCrawl.Id, eventArgs.CrawledSite.CrawlResult.CrawlContext.CrawledCount);\n                }\n            };\n            crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =\u003e\n            {\n                Interlocked.Increment(ref allSitesCompletedEvents);\n            };\n\n            await crawlEngine.StartAsync();\n        }\n\n        private static CrawlConfigurationX GetSafeConfig()\n        {\n            /*The following settings will help you avoid getting your ip banned\n             by the sites you are trying to crawl. The idea is to crawl\n             only 10 pages and wait 2 seconds between http requests\n             */\n            return new CrawlConfigurationX\n            {\n                MaxPagesToCrawl = 10,\n                MinCrawlDelayPerDomainMilliSeconds = 2000\n            };\n        }\n    }\n}\n\n```\n## CrawlerX\nCrawlerX is an object that represents an individual crawler that crawls a single site at a time. 
It is a subclass of Abot's PoliteWebCrawler and adds some useful functionality.\n\n#### Example Usage\nCreate an instance and register for events...\n```c#\nvar crawler = new CrawlerX();\ncrawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;\ncrawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;\ncrawler.PageCrawlDisallowed += crawler_PageCrawlDisallowed;\ncrawler.PageLinksCrawlDisallowed += crawler_PageLinksCrawlDisallowed;\n```\nWorking with some common events...\n```c#\nvoid crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)\n{\n    PageToCrawl pageToCrawl = e.PageToCrawl;\n    Console.WriteLine(\"About to crawl link {0} which was found on page {1}\", pageToCrawl.Uri.AbsoluteUri,   pageToCrawl.ParentUri.AbsoluteUri);\n}\n\nvoid crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)\n{\n    CrawledPage crawledPage = e.CrawledPage;\n\n    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)\n        Console.WriteLine(\"Crawl of page failed {0}\", crawledPage.Uri.AbsoluteUri);\n    else\n        Console.WriteLine(\"Crawl of page succeeded {0}\", crawledPage.Uri.AbsoluteUri);\n\n    if (string.IsNullOrEmpty(crawledPage.Content.Text))\n        Console.WriteLine(\"Page had no content {0}\", crawledPage.Uri.AbsoluteUri);\n}\n\nvoid crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)\n{\n    CrawledPage crawledPage = e.CrawledPage;\n    Console.WriteLine(\"Did not crawl the links on page {0} due to {1}\", crawledPage.Uri.AbsoluteUri, e.DisallowedReason);\n}\n\nvoid crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)\n{\n    PageToCrawl pageToCrawl = e.PageToCrawl;\n    Console.WriteLine(\"Did not crawl page {0} due to {1}\", pageToCrawl.Uri.AbsoluteUri, e.DisallowedReason);\n}\n```\nRun the crawl synchronously\n```c#\nvar result = crawler.Crawl(new Uri(\"YourSiteHere\"));\n```\nRun the crawl 
asynchronously\n```c#\nvar result = await crawler.CrawlAsync(new Uri(\"YourSiteHere\"));\n```\n#### Easy Override\nCrawlerX has default implementations for all its dependencies. However, there are times when you may want to override one or all of those implementations. Below is an example of how you would plug in your own implementations. The new ImplementationOverride class makes plugging in nested dependencies much easier than it used to be with Abot. It will handle finding exactly where that implementation is needed.\n\n```c#\nvar impls = new ImplementationOverride(config, new ImplementationContainer {\n    HyperlinkParser = new YourImpl1(),\n    PageRequester = new YourImpl2()\n});\n\nvar crawler = new CrawlerX(config, impls);\n```\n#### Pause And Resume\nPause and resume work as you would expect. However, be aware that any in-progress http requests will be finished and processed, and any events related to those will be fired.\n```c#\nvar crawler = new CrawlerX();\n\ncrawler.PageCrawlCompleted += (sender, args) =\u003e\n{\n    //You will be interested in args.CrawledPage \u0026 args.CrawlContext\n};\n\nvar crawlerTask = crawler.CrawlAsync(new Uri(\"http://blahblahblah.com\"));\n\nSystem.Threading.Thread.Sleep(3000);\ncrawler.Pause();\nSystem.Threading.Thread.Sleep(10000);\ncrawler.Resume();\n\nvar result = crawlerTask.Result;\n```\n\n#### Stop\nStopping the crawl is as simple as calling Stop(). The call to Stop() tells AbotX to not make any new http requests but to finish any that are in progress. 
Any events and processing of the in progress requests will finish before CrawlerX stops the crawl.\n```c#\nvar crawler = new CrawlerX();\n\ncrawler.PageCrawlCompleted += (sender, args) =\u003e\n{\n    //You will be interested in args.CrawledPage \u0026 args.CrawlContext\n};\n\nvar crawlerTask = crawler.CrawlAsync(new Uri(\"http://blahblahblah.com\"));\n\nSystem.Threading.Thread.Sleep(3000);\ncrawler.Stop();\nvar result = crawlerTask.Result;\n```\nBy passing true to the Stop() method, AbotX will stop the crawl more abruptly. Anything in progress will be aborted.\n```c#\ncrawler.Stop(true);\n```\n#### Speed Up\nCrawlerX can be \"sped up\" by calling the SpeedUp() method. The call to SpeedUp() tells AbotX to increase the number of concurrent http requests to the currently running sites. You can call this method as many times as you like. Adjustments are made instantly so you should see more concurrency immediately.\n\n```c#\ncrawler.CrawlAsync(new Uri(\"http://localhost:1111/\"));\n\nSystem.Threading.Thread.Sleep(3000);\ncrawler.SpeedUp();\n\nSystem.Threading.Thread.Sleep(3000);\ncrawler.SpeedUp();\n```\nSee the \"Configure Speed Up And Slow Down\" section for more details on how to control exactly what happens when SpeedUp() is called.\n\n#### Slow Down\n\nCrawlerX can be \"slowed down\" by calling the SlowDown() method. The call to SlowDown() tells AbotX to reduce the number of concurrent http requests to the currently running sites. You can call this method as many times as you like. 
Any currently executing http requests will finish normally before any adjustments are made.\n\n```c#\ncrawler.CrawlAsync(new Uri(\"http://localhost:1111/\"));\n\nSystem.Threading.Thread.Sleep(3000);\ncrawler.SlowDown();\n\nSystem.Threading.Thread.Sleep(3000);\ncrawler.SlowDown();\n```\nSee the \"Configure Speed Up And Slow Down\" section for more details on how to control exactly what happens when SlowDown() is called.\n\n## Parallel Crawler Engine\nA crawler instance can crawl a single site quickly. However, if you have to crawl 10,000 sites quickly you need the ParallelCrawlerEngine. It allows you to crawl a configurable number of sites concurrently to maximize throughput.\n\n#### Example Usage\nThe concurrency is configurable by setting MaxConcurrentSiteCrawls in the config. The default value is 3 so the following block of code will crawl three sites simultaneously.\n```c#\nstatic void Main(string[] args)\n{\n    var config = new CrawlConfigurationX(); //MaxConcurrentSiteCrawls defaults to 3\n\n    var siteToCrawlProvider = new SiteToCrawlProvider();\n    siteToCrawlProvider.AddSitesToCrawl(new List\u003cSiteToCrawl\u003e\n    {\n        new SiteToCrawl{ Uri = new Uri(\"http://somesitetocrawl1.com/\") },\n        new SiteToCrawl{ Uri = new Uri(\"http://somesitetocrawl2.com/\") },\n        new SiteToCrawl{ Uri = new Uri(\"http://somesitetocrawl3.com/\") },\n    });\n\n    //Create the crawl engine instance\n    var impls = new ParallelImplementationOverride(\n        config,\n        new ParallelImplementationContainer\n        {\n            SiteToCrawlProvider = siteToCrawlProvider,\n            WebCrawlerFactory = yourWebCrawlerFactory //YOU NEED TO IMPLEMENT THIS!!!!\n        }\n    );\n\n    var crawlEngine = new ParallelCrawlerEngine(config, impls);\n\n    //Register for site level events\n    crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =\u003e\n    {\n        Console.WriteLine(\"Completed crawling all sites\");\n    };\n    crawlEngine.SiteCrawlCompleted += (sender, eventArgs) =\u003e\n    {\n        
Console.WriteLine(\"Completed crawling site {0}\", eventArgs.CrawledSite.SiteToCrawl.Uri);       \n    };\n    crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =\u003e\n    {\n        //Register for crawler level events. These are Abot's events!!!\n        eventArgs.Crawler.PageCrawlCompleted += (abotSender, abotEventArgs) =\u003e\n        {\n            Console.WriteLine(\"You have the crawled page here in abotEventArgs.CrawledPage...\");\n        };\n    };\n\n    crawlEngine.StartAsync();\n\n    Console.WriteLine(\"Press enter key to stop\");\n    Console.Read();\n}\n```\n#### Easy Override Of Default Implementations\nParallelCrawlerEngine allows easy override of one or all of its dependent implementations. Below is an example of how you would plug in your own implementations (same as above). The new ParallelImplementationOverride class makes plugging in nested dependencies much easier than it used to be. It will handle finding exactly where that implementation is needed.\n\n```c#\nvar impls = new ParallelImplementationOverride(config, new ParallelImplementationContainer {\n    SiteToCrawlProvider = yourSiteToCrawlProvider,\n    WebCrawlerFactory = yourFactory,\n        ...(Excluded)\n});\n\nvar crawlEngine = new ParallelCrawlerEngine(config, impls);\n```\n\n#### Pause And Resume\nPause and resume on the ParallelCrawlerEngine simply relay the command to each active CrawlerX instance. However, be aware that any in-progress http requests will be finished and processed, and any events related to those will be fired.\n\n```c#\ncrawlEngine.StartAsync();\n\nSystem.Threading.Thread.Sleep(3000);\ncrawlEngine.Pause();\nSystem.Threading.Thread.Sleep(10000);\ncrawlEngine.Resume();\n```\n\n#### Stop\nStopping the crawl is as simple as calling Stop(). The call to Stop() tells AbotX to not make any new http requests but to finish any that are in progress. 
Any events and processing of the in progress requests will finish before each CrawlerX instance stops its crawl as well.\n\n```c#\ncrawlEngine.StartAsync();\n\nSystem.Threading.Thread.Sleep(3000);\ncrawlEngine.Stop();\n```\n\nBy passing true to the Stop() method, it will stop each CrawlerX instance more abruptly. Anything in progress will be aborted.\n\n```c#\ncrawlEngine.Stop(true);\n```\n\n#### Speed Up\nThe ParallelCrawlerEngine can be \"sped up\" by calling the SpeedUp() method. The call to SpeedUp() tells AbotX to increase the number of concurrent site crawls that are currently running. You can call this method as many times as you like. Adjustments are made instantly so you should see more concurrency immediately.\n\n```c#\ncrawlEngine.StartAsync();\n\nSystem.Threading.Thread.Sleep(3000);\ncrawlEngine.SpeedUp();\n\nSystem.Threading.Thread.Sleep(3000);\ncrawlEngine.SpeedUp();\n```\n\nSee the \"Configure Speed Up And Slow Down\" section for more details on how to control exactly what happens when SpeedUp() is called.\n\n#### Slow Down\nThe ParallelCrawlerEngine can be \"slowed down\" by calling the SlowDown() method. The call to SlowDown() tells AbotX to reduce the number of concurrent site crawls that are currently running. You can call this method as many times as you like. Any currently executing crawls will finish normally before any adjustments are made.\n\n```c#\ncrawlEngine.StartAsync();\n\nSystem.Threading.Thread.Sleep(3000);\ncrawlEngine.SlowDown();\n\nSystem.Threading.Thread.Sleep(3000);\ncrawlEngine.SlowDown();\n```\n\nSee the \"Configure Speed Up And Slow Down\" section for more details on how to control exactly what happens when SlowDown() is called.\n\n\n\n## Configure Speed Up And Slow Down\nMultiple features trigger AbotX to speed up or to slow down crawling. The Accelerator and Decelerator are two independently configurable components that determine exactly how aggressively AbotX reacts to a situation that triggers a SpeedUp or SlowDown. 
The defaults work fine for most cases, but the following options give you further control.\n\n#### Accelerator\n\nName | Description | Used By\n--- | --- | ---\nconfig.Accelerator.ConcurrentSiteCrawlsIncrement | The number to increment the MaxConcurrentSiteCrawls for each call to the SpeedUp() method. This deals with site crawl concurrency, NOT the number of concurrent http requests to a single site crawl. | ParallelCrawlerEngine\nconfig.Accelerator.ConcurrentRequestIncrement\t| The number to increment the MaxConcurrentThreads for each call to the SpeedUp() method. This deals with the number of concurrent http requests for a single crawl. |\tCrawlerX\nconfig.Accelerator.DelayDecrementInMilliseconds\t| If there is a configured (manual or programmatically determined) delay in between requests to a site, this is the amount of milliseconds to remove from that configured value on every call to the SpeedUp() method.\t| CrawlerX\nconfig.Accelerator.MinDelayInMilliseconds |\tIf there is a configured (manual or programmatically determined) delay in between requests to a site, this is the minimum amount of milliseconds to delay no matter how many calls to the SpeedUp() method. |\tCrawlerX\nconfig.Accelerator.ConcurrentSiteCrawlsMax\t| The maximum amount of concurrent site crawls to allow no matter how many calls to the SpeedUp() method.\t| ParallelCrawlerEngine\nconfig.Accelerator.ConcurrentRequestMax\t| The maximum amount of concurrent http requests to a single site no matter how many calls to the SpeedUp() method.\t| CrawlerX\n\n#### Decelerator\n\nName | Description | Used By\n--- | --- | ---\nconfig.Decelerator.ConcurrentSiteCrawlsDecrement |\tThe number to decrement the MaxConcurrentSiteCrawls for each call to the SlowDown() method. 
This deals with site crawl concurrency, NOT the number of concurrent http requests to a single site crawl.\t| ParallelCrawlerEngine\nconfig.Decelerator.ConcurrentRequestDecrement\t| The number to decrement the MaxConcurrentThreads for each call to the SlowDown() method. This deals with the number of concurrent http requests for a single crawl. |\tCrawlerX\nconfig.Decelerator.DelayIncrementInMilliseconds |\tIf there is a configured (manual or programmatically determined) delay in between requests to a site, this is the amount of milliseconds to add to that configured value on every call to the SlowDown() method.\t| CrawlerX\nconfig.Decelerator.MaxDelayInMilliseconds\t| The maximum value the delay can be.\t| CrawlerX\nconfig.Decelerator.ConcurrentSiteCrawlsMin |\tThe minimum amount of concurrent site crawls to allow no matter how many calls to the SlowDown() method.\t| ParallelCrawlerEngine\nconfig.Decelerator.ConcurrentRequestMin |\tThe minimum amount of concurrent http requests to a single site no matter how many calls to the SlowDown() method.\t| CrawlerX\n\n\n## Javascript Rendering\nMany web pages on the internet today use javascript to create the final page rendering. Most web crawlers do not render the javascript but instead just process the raw html sent back by the server. Use this feature to render javascript before processing.\n\n#### Additional Installation Step\nIf you plan to use Javascript rendering there is an additional step for the time being. Unfortunately, NuGet has proven to be a train wreck as .NET has advanced (.NET Core vs Standard, PackageReference vs Packages.config, dotnet pack vs nuget pack, etc.). This has caused some packages that AbotX depends on to no longer install correctly. Specifically, the PhantomJS package no longer adds the phantomjs.exe file to your project and marks it for output to the bin directory.\n\nThe workaround is to manually add this file to your project, set it as \"Content\" and \"Copy If Newer\". 
This will make sure the phantomjs.exe file is in the bin when AbotX needs it. This package is already referenced by AbotX so you will have a copy of this file at \"[YourNugetPackagesLocationAbsolutePath]\\PhantomJS.2.1.1\\tools\\phantomjs\". Another option would be to tell AbotX where to look for the file by using the CrawlConfigurationX.JavascriptRendererPath config value. This path is of the DIRECTORY that contains the phantomjs.exe file.\n\n#### Performance Considerations\nRendering javascript is a much slower operation than just requesting the page source. The browser has to make the initial request to the web server for the page source. Then it must request, wait for and load all the external resources. Care must be taken in how you configure AbotX when this feature is enabled. A modern machine with an Intel I7 processor and 8+ gigs of ram could crawl 30-50 sites concurrently, with each of those crawls spawning 10+ threads. However, if javascript rendering is enabled that same configuration would overwhelm the host machine.\n\n#### Safe Configuration\nThe following is an example of how to configure Abot/AbotX to run with javascript rendering enabled for a modern host machine that has an Intel I7 processor and at least 16GB of ram. 
If it has 4 cores and 8 logical processors, it should be able to handle this configuration under normal circumstances.\n\n```c#\nvar config = new CrawlConfigurationX\n{\n    IsJavascriptRenderingEnabled = true,\n    JavascriptRenderingWaitTimeInMilliseconds = 3000, //How long to wait for js to process \n    MaxConcurrentSiteCrawls = 1,                      //Only crawl a single site at a time\n    MaxConcurrentThreads = 8,                         //Logical processor count to avoid cpu thrashing\n};\nvar crawler = new CrawlerX(config);\n\n//Add optional decision whether javascript should be rendered\ncrawler.ShouldRenderPageJavascript((crawledPage, crawlContext) =\u003e\n{\n    if(crawledPage.Uri.AbsoluteUri.Contains(\"ghost\"))\n        return new CrawlDecision {Allow = false, Reason = \"scared to render ghost javascript\"};\n\n    return new CrawlDecision { Allow = true };\n}); //You can implement IDecisionMakerX interface for even more control\nvar crawlerTask = crawler.CrawlAsync(new Uri(\"http://blahblahblah.com\"));\n```\n\n## Auto Throttling\nMost websites you crawl cannot or will not handle the load of a web crawler. 
Auto Throttling automatically slows down the crawl speed if the website being crawled is showing signs of stress or unwillingness to respond to the frequency of http requests.\n\n#### Example Usage\n```c#\nvar config = new CrawlConfigurationX\n{\n    AutoThrottling = new AutoThrottlingConfig\n    {\n        IsEnabled = true,\n        ThresholdHigh = 10,                 //default\n        ThresholdMed = 5,                   //default\n        ThresholdTimeInMilliseconds = 5000, //default\n        MinAdjustmentWaitTimeInSecs = 30    //default\n    },\n    Decelerator = new DeceleratorConfig\n    {\n        ConcurrentSiteCrawlsDecrement = 2,      //default\n        ConcurrentRequestDecrement = 2,         //default\n        DelayIncrementInMilliseconds = 2000,    //default\n        MaxDelayInMilliseconds = 15000,         //default\n        ConcurrentSiteCrawlsMin = 1,            //default\n        ConcurrentRequestMin = 1                //default\n    },\n    MaxRetryCount = 3,\n};\n```\n\nUsing CrawlerX (single instance of a crawler)\n```c#\nvar crawler = new CrawlerX(config);\ncrawler.CrawlAsync(new Uri(url));\n```\n\nUsing ParallelCrawlerEngine (multiple instances of crawlers)\n```c#\nvar crawlEngine = new ParallelCrawlerEngine(config);\n```\n\nConfigure the sensitivity to what will trigger throttling\n\nName |\tDescription\t| Used By\n--- | --- | ---\nconfig.AutoThrottling.IsEnabled |\tWhether to enable the AutoThrottling feature |\tCrawlerX\nconfig.AutoThrottling.ThresholdHigh\t| The number of \"stressed\" requests before considering a crawl as under high stress |\tCrawlerX\nconfig.AutoThrottling.ThresholdMed |\tThe number of \"stressed\" requests before considering a crawl as under medium stress\t| CrawlerX\nconfig.AutoThrottling.ThresholdTimeInMilliseconds |\tThe number of elapsed milliseconds in response time that would consider the response \"stressed\" |\tCrawlerX\nconfig.AutoThrottling.MinAdjustmentWaitTimeInSecs\t| The minimum number of seconds since the last 
throttled request to wait before attempting to check/adjust throttling again. We want to give the last adjustment a chance to work before adjusting again.\t| CrawlerX\n\nSee the \"Configure Speed Up And Slow Down\" section for more details on how to control exactly what happens during AutoThrottling in regards to slowing down the crawl (Decelerator).\n\n## Auto Tuning\nIt's difficult to predict what your machine can handle when the sites you will crawl/process all require different levels of machine resources. Auto tuning automatically monitors the host machine's resource usage and adjusts the crawl speed and concurrency to maximize throughput without overrunning it.\n\n#### Example Usage\n```c#\nvar config = new CrawlConfigurationX\n{\n    AutoTuning = new AutoTuningConfig\n    {\n        IsEnabled = true,\n        CpuThresholdHigh = 85,              //default\n        CpuThresholdMed = 65,               //default\n        MinAdjustmentWaitTimeInSecs = 30    //default\n    },\n    Accelerator = new AcceleratorConfig\n    {\n        ConcurrentSiteCrawlsIncrement = 2,      //default\n        ConcurrentRequestIncrement = 2,         //default\n        DelayDecrementInMilliseconds = 2000,    //default\n        MinDelayInMilliseconds = 0,             //default\n        ConcurrentSiteCrawlsMax = config.MaxConcurrentSiteCrawls,   //default is 0\n        ConcurrentRequestMax = config.MaxConcurrentThreads          //default is 0\n    },\n    Decelerator = new DeceleratorConfig\n    {\n        ConcurrentSiteCrawlsDecrement = 2,      //default\n        ConcurrentRequestDecrement = 2,         //default\n        DelayIncrementInMilliseconds = 2000,    //default\n        MaxDelayInMilliseconds = 15000,         //default\n        ConcurrentSiteCrawlsMin = 1,            //default\n        ConcurrentRequestMin = 1                //default\n    },\n    MaxRetryCount = 3,\n};\n```\nUsing CrawlerX (single instance of a crawler)\n```c#\nvar crawler = new 
CrawlerX(config);\ncrawler.CrawlAsync(new Uri(url));\n```\n\nUsing ParallelCrawlerEngine (multiple instances of crawlers)\n```c#\nvar crawlEngine = new ParallelCrawlerEngine(config);\n```\n\nConfigure the sensitivity to what will trigger tuning\n\nName |\tDescription |\tUsed By\n--- | --- | ---\nconfig.AutoTuning.IsEnabled |\tWhether to enable the AutoTuning feature\t| CrawlerX \u0026 ParallelCrawlerEngine\nconfig.AutoTuning.CpuThresholdHigh |\tThe avg cpu percentage before considering a host as under high stress |\tCrawlerX \u0026 ParallelCrawlerEngine\nconfig.AutoTuning.CpuThresholdMed\t| The avg cpu percentage before considering a host as under medium stress\t| CrawlerX \u0026 ParallelCrawlerEngine\nconfig.AutoTuning.MinAdjustmentWaitTimeInSecs |\tThe minimum number of seconds since the last tuned action to wait before attempting to check/adjust tuning again. We want to give the last adjustment a chance to work before adjusting again.\t| CrawlerX \u0026 ParallelCrawlerEngine\n\nSee the \"Configure Speed Up And Slow Down\" section for more details on how to control exactly what happens during AutoTuning in regards to speeding up and slowing down the crawl (Accelerator \u0026 Decelerator).\n\n\u003cbr /\u003e\u003cbr /\u003e\u003cbr /\u003e\n\u003chr /\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsjdirect%2Fabotx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsjdirect%2Fabotx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsjdirect%2Fabotx/lists"}