https://github.com/jacraig/spidey

A multi-threaded web crawler library that is generic enough to allow different engines to be swapped in.
crawler webcrawler

Last synced: 12 months ago

Host: GitHub
URL: https://github.com/jacraig/spidey
Owner: JaCraig
License: apache-2.0
Created: 2017-09-27T12:30:00.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2024-04-12T00:57:33.000Z (over 2 years ago)
Last Synced: 2024-04-12T09:21:41.197Z (over 2 years ago)
Topics: crawler, webcrawler
Language: C#
Homepage: https://jacraig.github.io/Spidey/
Size: 17 MB
Stars: 11
Watchers: 5
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md

# Spidey

[![.NET Publish](https://github.com/JaCraig/Spidey/actions/workflows/dotnet-publish.yml/badge.svg)](https://github.com/JaCraig/Spidey/actions/workflows/dotnet-publish.yml) [![NuGet](https://img.shields.io/nuget/v/Spidey.svg)](https://www.nuget.org/packages/Spidey/)

Spidey is a flexible and extensible .NET library for crawling web content. It is designed for .NET Core applications and provides a modular architecture, allowing you to customize or extend any part of the crawling pipeline.

## Features

- Simple API for crawling websites

- Highly configurable via the `Options` class

- Dependency injection support (IoC/DI)

- Easily replaceable subsystems (engine, parser, scheduler, etc.)

- Callback-based result handling

- NuGet package available

## Quick Start

Install the NuGet package:

```powershell
dotnet add package Spidey
```

## Setting up the Library

Register Spidey in your app's service collection using the `RegisterSpidey` extension method:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using Spidey;

var services = new ServiceCollection();
services.RegisterSpidey();

// Optionally, register your Options configuration
services.AddSingleton(new Options
{
    ItemFound = result => Console.WriteLine($"Found: {result.Url}"),
    Allow = new List<string> { "http://mywebsite", "http://mywebsite2" },
    FollowOnly = new List<string> { /* regex patterns */ },
    Ignore = new List<string> { /* regex patterns */ },
    StartLocations = new List<string> { "http://mywebsite", "http://mywebsite2" },
    UrlReplacements = new Dictionary<string, string> { /* { "old", "new" } */ },
    // Other options as needed
});

var provider = services.BuildServiceProvider();

// Resolve the crawler from the container
var crawler = provider.GetRequiredService<Crawler>();
```

Alternatively, you can instantiate `Crawler` and `Options` directly without DI:

```csharp
var options = new Options
{
    ItemFound = result => Console.WriteLine($"Found: {result.Url}"),
    // ...other options
};

var crawler = new Crawler(options);
```

## Options Configuration

The `Options` class configures the crawler's behavior. Key properties include:

- `ItemFound` (`Action<ResultFile>`): Callback invoked when a new page is discovered.

- `Allow` (`List<string>`): Regex patterns for URLs allowed to be crawled.

- `FollowOnly` (`List<string>`): Regex patterns for pages whose links should be followed.

- `Ignore` (`List<string>`): Regex patterns for URLs to ignore.

- `StartLocations` (`List<string>`): Initial URLs to start crawling from.

- `UrlReplacements` (`Dictionary<string, string>`): Replacements applied to URLs during crawling (old value to new value).

- `NetworkCredentials` (`NetworkCredential`): Optional credentials for authentication.

- `UseDefaultCredentials` (`bool`): Use default system credentials.

- `Proxy` (`IWebProxy`): Optional proxy settings.

Example callback method:

```csharp
void OnItemFound(ResultFile result)
{
    Console.WriteLine($"Discovered: {result.Url} (Status: {result.StatusCode})");
    // Additional processing...
}
```
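Since `Allow`, `FollowOnly`, and `Ignore` take regex patterns, it can help to test candidate patterns against sample URLs before starting a crawl. The sketch below uses only `System.Text.RegularExpressions`. The `MatchesAny` helper and the example patterns are hypothetical, and the way the two lists are combined here (allowed and not ignored) is an assumption for illustration, not a statement of how Spidey evaluates them internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UrlFilterSketch
{
    // Hypothetical helper (not part of Spidey): true when any pattern matches the URL.
    public static bool MatchesAny(IEnumerable<string> patterns, string url) =>
        patterns.Any(pattern => Regex.IsMatch(url, pattern, RegexOptions.IgnoreCase));

    public static void Main()
    {
        // Example patterns only; adjust them to your own site.
        var allow = new List<string> { @"^https?://mywebsite(/|$)" };
        var ignore = new List<string> { @"\.(png|jpe?g|gif)$", @"/logout\b" };

        var samples = new[]
        {
            "http://mywebsite/docs/index.html",
            "http://mywebsite/images/logo.png",
            "http://othersite/page",
        };

        foreach (var url in samples)
        {
            // Assumed rule for this sketch: crawl when allowed and not ignored.
            var crawl = MatchesAny(allow, url) && !MatchesAny(ignore, url);
            Console.WriteLine($"{url} -> {(crawl ? "crawl" : "skip")}");
        }
    }
}
```

Anchoring the allow pattern with `^` and `(/|$)` keeps it from matching look-alike hosts such as `mywebsite.evil.example`, which an unanchored pattern would accept.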

## Basic Usage

Once configured, start the crawl process:

```csharp
crawler.StartCrawl();
```

The library will handle link discovery, content downloading, and result parsing. Your callback will be invoked for each discovered item.
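Because Spidey is described as multi-threaded, the callback may run on more than one thread at a time (an assumption; the README does not state this explicitly). A minimal sketch of a callback target that is safe under concurrent calls is shown below. `FoundUrlCollector` is a hypothetical helper, and `Parallel.For` stands in for the crawler so the example runs without Spidey.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper (not part of Spidey): collects distinct URLs from concurrent callers.
public sealed class FoundUrlCollector
{
    private readonly ConcurrentDictionary<string, byte> _urls =
        new ConcurrentDictionary<string, byte>();

    // Safe to call from several threads at once; duplicate URLs are ignored.
    public void Record(string url) => _urls.TryAdd(url, 0);

    public int Count => _urls.Count;
}

public static class CollectorDemo
{
    public static void Main()
    {
        var collector = new FoundUrlCollector();

        // With Spidey, the wiring would presumably look like:
        //   ItemFound = result => collector.Record($"{result.Url}")

        // Stand-in for the crawler: many threads reporting pages, with duplicates.
        Parallel.For(0, 1000, i => collector.Record($"http://mywebsite/page{i % 100}"));

        Console.WriteLine(collector.Count); // 100 distinct URLs
    }
}
```

A plain `List<string>` or `HashSet<string>` in the same position would need an explicit lock, since neither is safe for concurrent writes.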

## Customization

Spidey is built with extensibility in mind. The system is divided into the following subsystems, each replaceable via DI:

1. **Content Parser (`IContentParser`)** – Parses downloaded data into `ResultFile` objects.

2. **Engine (`IEngine`)** – Handles HTTP requests and content downloading.

3. **Link Discoverer (`ILinkDiscoverer`)** – Extracts links from content.

4. **Processor (`IProcessor`)** – Processes parsed content (default: invokes your callback).

5. **Scheduler (`IScheduler`)** – Manages work distribution.

6. **Pipeline (`IPipeline`)** – Orchestrates the crawling process.

To customize a subsystem, implement the relevant interface from `Spidey.Engines.Interfaces` and register your implementation with the service collection. If you call `RegisterSpidey()`, registration is handled for you automatically. If you instantiate `Crawler` directly, you must compose the pipeline manually.

## FAQ

**Q: Can I run the crawler on multiple nodes?**

A: The default scheduler is single-node only. For distributed crawling, implement a custom scheduler (e.g., using a database or message queue) to coordinate work between instances.
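One building block a distributed setup often needs is a rule for deciding which node owns a given URL, so that two instances do not crawl the same page. The sketch below shows hash-based partitioning. It is a different and simpler technique than the database or message queue mentioned above, and it is not tied to Spidey's `IScheduler` interface, whose members are not shown here. `UrlPartitioner` and `OwnerOf` are hypothetical names.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper (not part of Spidey): assigns each URL to one of N nodes.
public static class UrlPartitioner
{
    // Maps a URL to a node index in [0, nodeCount) using SHA-256, so every
    // process computes the same owner. string.GetHashCode is not suitable
    // here because .NET Core randomizes it per process.
    public static int OwnerOf(string url, int nodeCount)
    {
        if (url == null) throw new ArgumentNullException(nameof(url));
        if (nodeCount <= 0) throw new ArgumentOutOfRangeException(nameof(nodeCount));

        using var sha = SHA256.Create();
        byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(url));
        uint bucket = BitConverter.ToUInt32(hash, 0);
        return (int)(bucket % (uint)nodeCount);
    }

    public static void Main()
    {
        const int nodeCount = 3;
        var urls = new[] { "http://mywebsite/a", "http://mywebsite/b", "http://mywebsite2/" };

        foreach (var url in urls)
        {
            Console.WriteLine($"{url} -> node {OwnerOf(url, nodeCount)}");
        }
    }
}
```

Links discovered on one node but owned by another still have to be handed off, which is where a shared queue or database would come in. URLs should also be normalized (case, trailing slashes) before hashing so that equivalent addresses land on the same node.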

## Build Process

Requirements:

- Visual Studio 2022

Clone the project and open the solution (`Spidey.sln`) in Visual Studio to build.

## License

See [LICENSE](LICENSE) for details.
