https://github.com/jacraig/spidey
A multi threaded web crawler library that is generic enough to allow different engines to be swapped in.
- Host: GitHub
- URL: https://github.com/jacraig/spidey
- Owner: JaCraig
- License: apache-2.0
- Created: 2017-09-27T12:30:00.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2024-04-12T00:57:33.000Z (about 2 years ago)
- Last Synced: 2024-04-12T09:21:41.197Z (about 2 years ago)
- Topics: crawler, webcrawler
- Language: C#
- Homepage: https://jacraig.github.io/Spidey/
- Size: 17 MB
- Stars: 11
- Watchers: 5
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
README
# Spidey
[Publish workflow](https://github.com/JaCraig/Spidey/actions/workflows/dotnet-publish.yml) | [NuGet package](https://www.nuget.org/packages/Spidey/)
Spidey is a flexible and extensible .NET library for crawling web content. It is designed for .NET Core applications and provides a modular architecture, allowing you to customize or extend any part of the crawling pipeline.
## Features
- Simple API for crawling websites
- Highly configurable via the `Options` class
- Dependency injection support (IoC/DI)
- Easily replaceable subsystems (engine, parser, scheduler, etc.)
- Callback-based result handling
- NuGet package available
## Quick Start
Install the NuGet package:
```powershell
dotnet add package Spidey
```
## Setting up the Library
Register Spidey in your app's service collection using the `RegisterSpidey` extension method:
```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using Spidey;

var services = new ServiceCollection();
services.RegisterSpidey();

// Optionally, register your Options configuration
services.AddSingleton(new Options
{
    // Callback invoked for each discovered item
    ItemFound = result => Console.WriteLine($"Found: {result.Url}"),
    Allow = new List<string> { "http://mywebsite", "http://mywebsite2" },
    FollowOnly = new List<string> { /* regex patterns */ },
    Ignore = new List<string> { /* regex patterns */ },
    StartLocations = new List<string> { "http://mywebsite", "http://mywebsite2" },
    UrlReplacements = new Dictionary<string, string> { /* { "old", "new" } */ },
    // Other options as needed
});

// Resolve the crawler from the container
var provider = services.BuildServiceProvider();
var crawler = provider.GetRequiredService<Crawler>();
```
Alternatively, you can instantiate `Crawler` and `Options` directly without DI:
```csharp
var options = new Options
{
    ItemFound = result => Console.WriteLine($"Found: {result.Url}"),
    // ...other options
};
var crawler = new Crawler(options);
```
## Options Configuration
The `Options` class configures the crawler's behavior. Key properties include:
- `ItemFound` (`Action<ResultFile>`): Callback invoked when a new page is discovered.
- `Allow` (`List<string>`): Regex patterns for URLs allowed to be crawled.
- `FollowOnly` (`List<string>`): Regex patterns for pages whose links should be followed.
- `Ignore` (`List<string>`): Regex patterns for URLs to ignore.
- `StartLocations` (`List<string>`): Initial URLs to start crawling from.
- `UrlReplacements` (`Dictionary<string, string>`): URL replacements applied during crawling.
- `NetworkCredentials` (`NetworkCredential`): Optional credentials for authentication.
- `UseDefaultCredentials` (`bool`): Use default system credentials.
- `Proxy` (`IWebProxy`): Optional proxy settings.
Example callback method:
```csharp
void OnItemFound(ResultFile result)
{
    Console.WriteLine($"Discovered: {result.Url} (Status: {result.StatusCode})");
    // Additional processing...
}
```
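The `Allow` and `Ignore` options described above are lists of regex patterns. The following is a minimal sketch of how such patterns could be evaluated against a URL. It is not Spidey's actual implementation: the `UrlFilterSketch` class, the `ShouldCrawl` method, and the rule that a URL must match at least one allow pattern and no ignore pattern are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Spidey: illustrates allow/ignore matching.
public static class UrlFilterSketch
{
    // Assumption: a URL is crawlable when it matches at least one allow
    // pattern and does not match any ignore pattern.
    public static bool ShouldCrawl(
        string url,
        IEnumerable<string> allow,
        IEnumerable<string> ignore)
    {
        // Regex.IsMatch searches for the pattern anywhere in the input.
        bool allowed = allow.Any(pattern => Regex.IsMatch(url, pattern));
        bool ignored = ignore.Any(pattern => Regex.IsMatch(url, pattern));
        return allowed && !ignored;
    }
}
```

Because `Regex.IsMatch` looks for a match anywhere in the input, an unanchored pattern such as `http://mywebsite` also matches `http://mywebsite2/page`. Anchoring patterns (for example `^http://mywebsite/`) may be worth considering when the sites you crawl share a prefix.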
## Basic Usage
Once configured, start the crawl process:
```csharp
crawler.StartCrawl();
```
The library will handle link discovery, content downloading, and result parsing. Your callback will be invoked for each discovered item.
## Customization
Spidey is built with extensibility in mind. The system is divided into the following subsystems, each replaceable via DI:
1. **Content Parser (`IContentParser`)** – Parses downloaded data into `ResultFile` objects.
2. **Engine (`IEngine`)** – Handles HTTP requests and content downloading.
3. **Link Discoverer (`ILinkDiscoverer`)** – Extracts links from content.
4. **Processor (`IProcessor`)** – Processes parsed content (default: invokes your callback).
5. **Scheduler (`IScheduler`)** – Manages work distribution.
6. **Pipeline (`IPipeline`)** – Orchestrates the crawling process.
To customize, implement the relevant interface from `Spidey.Engines.Interfaces` and register your implementation in the service provider. If you call `RegisterSpidey()`, registration is handled for you automatically. If you instantiate `Crawler` directly, you must compose the pipeline manually.
## FAQ
**Q: Can I run the crawler on multiple nodes?**
A: The default scheduler is single-node only. For distributed crawling, implement a custom scheduler (e.g., using a database or message queue) to coordinate work between instances.
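As a rough illustration of the coordination a distributed scheduler needs, the sketch below shows a shared work store that hands out each URL at most once. Everything here is hypothetical: `IWorkStore` and `InMemoryWorkStore` are not Spidey types, and the sketch does not show how to plug into `IScheduler`, whose members are not described in this README. The in-memory version only stands in for the database or message queue a real multi-node setup would use.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical abstraction, not a Spidey interface: the shared state that
// multiple crawler instances would coordinate through.
public interface IWorkStore
{
    // Returns false if the URL has already been queued at some point.
    bool TryEnqueue(string url);

    // Returns false if no work is currently pending.
    bool TryDequeue(out string url);
}

// In-memory stand-in. A multi-node deployment would replace this with an
// implementation backed by a database or message queue.
public sealed class InMemoryWorkStore : IWorkStore
{
    private readonly ConcurrentDictionary<string, byte> _seen =
        new ConcurrentDictionary<string, byte>();
    private readonly ConcurrentQueue<string> _pending =
        new ConcurrentQueue<string>();

    public bool TryEnqueue(string url)
    {
        // TryAdd is atomic, so only one caller wins for a given URL.
        if (!_seen.TryAdd(url, 0))
        {
            return false;
        }

        _pending.Enqueue(url);
        return true;
    }

    public bool TryDequeue(out string url) => _pending.TryDequeue(out url);
}
```

The key design point is that deduplication and hand-off happen in one shared place, so two instances never process the same URL. Whatever backing store you choose would need to provide the same atomic "add if not seen" behavior.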
## Build Process
Requirements:
- Visual Studio 2022
Clone the project and open the solution (`Spidey.sln`) in Visual Studio to build.
## License
See [LICENSE](LICENSE) for details.