Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/prophetlamb/scrapeaas
Modern toolkit for implementing web scrapers as an ASP.NET service.
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/prophetlamb/scrapeaas
- Owner: ProphetLamb
- License: mit
- Created: 2023-09-25T05:55:08.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-12-31T13:39:49.000Z (about 1 year ago)
- Last Synced: 2024-01-01T12:24:36.451Z (about 1 year ago)
- Language: C#
- Size: 502 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Scrape as a service
ScrapeAAS integrates existing packages and ASP.NET features into a tool stack that lets you, the developer, design your scraping service in a familiar environment.
## Quickstart
Add `ASP.NET Hosting`, `ScrapeAAS`, a validator of your choice (here [Dawn.Guard](https://github.com/safakgur/guard), RIP), an object mapper of your choice (here [AutoMapper](https://automapper.org/)), and the database or message queue you feel most comfortable with (here [EF Core](https://learn.microsoft.com/en-us/ef/core/get-started/overview/first-app?tabs=netcore-cli) with SQLite).
```bash
dotnet add package Microsoft.Extensions.Hosting
dotnet add package ScrapeAAS
dotnet add package Dawn.Guard
dotnet add package AutoMapper.Extensions.Microsoft.DependencyInjection
dotnet add package Microsoft.EntityFrameworkCore.Sqlite
```

**[Full example](./examples/RedditDotnetScraper/) of scraping the [r/dotnet subreddit](https://old.reddit.com/r/dotnet).**

Create a crawler, a service that periodically triggers scraping.
```csharp
var builder = Host.CreateApplicationBuilder(args);
// Register the crawler, the spider, and the sink with the host.
builder.Services
    .AddAutoMapper()
    .AddScrapeAAS()
    .AddHostedService<RedditSubredditCrawler>()
    .AddDataflow<RedditPostSpider>()
    .AddDataflow<RedditSqliteSink>();

sealed class RedditSubredditCrawler : BackgroundService {
    private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
    private readonly IDataflowPublisher<RedditSubreddit> _publisher;
    ...
    protected override async Task ExecuteAsync(CancellationToken stoppingToken) {
        // ... execute service scope periodically
    }

    // Publishes the subreddit to crawl; subscribed spiders pick it up from there.
    private async Task CrawlAsync(IDataflowPublisher<RedditSubreddit> publisher, CancellationToken stoppingToken)
    {
        _logger.LogInformation("Crawling /r/dotnet");
        await publisher.PublishAsync(new("dotnet", new("https://old.reddit.com/r/dotnet")), stoppingToken);
        _logger.LogInformation("Crawling complete");
    }
}
```

Implement your spiders, services that collect and normalize data.
```csharp
sealed class RedditPostSpider : IDataflowHandler<RedditSubreddit> {
    private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
    private readonly IDataflowPublisher<RedditPost> _publisher;
    ...

    private async Task ParseRedditTopLevelPosts(RedditSubreddit subreddit, CancellationToken stoppingToken)
    {
        Url root = new("https://old.reddit.com/");
        _logger.LogInformation("Parsing top level posts from {RedditSubreddit}", subreddit);
        var document = await _browserPageLoader.LoadAsync(subreddit.Url, stoppingToken);
        _logger.LogInformation("Request complete");
        var queriedContent = document
            .QuerySelectorAll("div.thing")
            .AsParallel()
            // Collect the raw strings for each post from the page.
            .Select(div => new
            {
                PostUrl = div.QuerySelector("a.title")?.GetAttribute("href"),
                Title = div.QuerySelector("a.title")?.TextContent,
                Upvotes = div.QuerySelector("div.score.unvoted")?.GetAttribute("title"),
                Comments = div.QuerySelector("a.comments")?.TextContent,
                CommentsUrl = div.QuerySelector("a.comments")?.GetAttribute("href"),
                PostedAt = div.QuerySelector("time")?.GetAttribute("datetime"),
                PostedBy = div.QuerySelector("a.author")?.TextContent,
            })
            // Validate and normalize; items that fail to parse are logged by the exception handler.
            .Select(queried => new RedditPost(
                new(root, Guard.Argument(queried.PostUrl).NotEmpty()),
                Guard.Argument(queried.Title).NotEmpty(),
                long.Parse(queried.Upvotes.AsSpan()),
                Regex.Match(queried.Comments ?? "", "^\\d+") is { Success: true } commentCount ? long.Parse(commentCount.Value) : 0,
                new(queried.CommentsUrl),
                DateTimeOffset.Parse(queried.PostedAt.AsSpan()),
                new(Guard.Argument(queried.PostedBy).NotEmpty())
            ), IExceptionHandler.Handle((ex, item) => _logger.LogInformation(ex, "Failed to parse {RedditTopLevelPostBrief}", item)));
        foreach (var item in queriedContent)
        {
            await _publisher.PublishAsync(item, stoppingToken);
        }
        _logger.LogInformation("Parsing complete");
    }
}
```

Add a sink, a service that commits the scraped data to disk or the network.
```csharp
sealed class RedditSqliteSink : IAsyncDisposable, IDataflowHandler<RedditSubreddit>, IDataflowHandler<RedditPost>
{
    private readonly RedditPostSqliteContext _context;
    private readonly IMapper _mapper;
    ...
    public async ValueTask DisposeAsync()
    {
        await _context.Database.EnsureCreatedAsync();
        await _context.SaveChangesAsync();
    }

    public async ValueTask HandleAsync(RedditSubreddit message, CancellationToken cancellationToken = default)
    {
        // RedditSubredditDto is an assumed name for the mapped entity type.
        var messageDto = _mapper.Map<RedditSubredditDto>(message);
        await _context.Database.EnsureCreatedAsync(cancellationToken);
        await _context.Subreddits.AddAsync(messageDto, cancellationToken);
    }

    public async ValueTask HandleAsync(RedditPost message, CancellationToken cancellationToken = default)
    {
        // RedditPostDto is an assumed name for the mapped entity type.
        var messageDto = _mapper.Map<RedditPostDto>(message);
        // Reuse the stored user if this author has been seen before.
        if (await _context.Users.FindAsync(new object[] { message.PostedBy.Id }, cancellationToken) is { } existingUser)
        {
            messageDto.PostedById = existingUser.Id;
            messageDto.PostedBy = existingUser;
        }
        await _context.Database.EnsureCreatedAsync(cancellationToken);
        await _context.Posts.AddAsync(messageDto, cancellationToken);
    }
}
```

## Why not [WebReaper](https://github.com/pavlovtech/WebReaper) or [DotnetSpider](https://github.com/dotnetcore/DotnetSpider)?
I have tried both tool stacks and found them wanting, so I tried to do better by delegating as much work as reasonable to existing projects.
In addition to my own goals, having evaluated both libraries, I want to keep all their pros and discard all their cons.
The verbosity of this library sits comfortably between WebReaper and DotnetSpider, but closer to the DotnetSpider end of things.

- Integration into ASP.NET Hosting.
- No dependencies at the core of the project. Instead, package a reasonable set of addons by default.
- Use and expose integrated NuGet packages in addons when possible to allow developers to benefit from existing ecosystems.

### Evaluation of [DotnetSpider](https://github.com/dotnetcore/DotnetSpider)
The overall data flow in `ScrapeAAS` is adopted from `DotnetSpider`: Crawler --> Spider --> Sink.
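To make the Crawler --> Spider --> Sink flow concrete, here is a minimal in-memory sketch of the pub/sub idea. Everything in it (`InMemoryBus`, `IHandler<T>`, `Subreddit`, `Post`, `SubredditSpider`, `ListSink`) is a hypothetical name made up for illustration; it is not the `ScrapeAAS` or `DotnetSpider` API, and it assumes .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Wire the pipeline: a Subreddit message reaches the spider, whose Post messages reach the sink.
var bus = new InMemoryBus();
var sink = new ListSink();
bus.Subscribe<Subreddit>(new SubredditSpider(bus));
bus.Subscribe<Post>(sink);

// The "crawler" step: publish the entry point and let the subscribers react.
await bus.PublishAsync(new Subreddit("dotnet"));
Console.WriteLine(string.Join(", ", sink.Titles)); // prints "Hello from r/dotnet"

// Hypothetical message types, for illustration only.
record Subreddit(string Name);
record Post(string Title);

// Hypothetical handler contract; not the ScrapeAAS interface.
interface IHandler<in T>
{
    ValueTask HandleAsync(T message, CancellationToken ct = default);
}

// Minimal in-memory pub/sub bus: handlers are looked up by message type.
sealed class InMemoryBus
{
    private readonly Dictionary<Type, List<Func<object, CancellationToken, ValueTask>>> _handlers = new();

    public void Subscribe<T>(IHandler<T> handler) where T : notnull
    {
        if (!_handlers.TryGetValue(typeof(T), out var list))
        {
            list = new List<Func<object, CancellationToken, ValueTask>>();
            _handlers[typeof(T)] = list;
        }
        list.Add((message, ct) => handler.HandleAsync((T)message, ct));
    }

    public async ValueTask PublishAsync<T>(T message, CancellationToken ct = default) where T : notnull
    {
        if (!_handlers.TryGetValue(typeof(T), out var list)) return;
        foreach (var handle in list) await handle(message, ct);
    }
}

// Spider: turns a Subreddit message into Post messages.
sealed class SubredditSpider : IHandler<Subreddit>
{
    private readonly InMemoryBus _bus;
    public SubredditSpider(InMemoryBus bus) => _bus = bus;

    public async ValueTask HandleAsync(Subreddit message, CancellationToken ct = default)
    {
        // A real spider would load and parse a page here; this one emits a placeholder post.
        await _bus.PublishAsync(new Post($"Hello from r/{message.Name}"), ct);
    }
}

// Sink: stores whatever it receives.
sealed class ListSink : IHandler<Post>
{
    public List<string> Titles { get; } = new();

    public ValueTask HandleAsync(Post message, CancellationToken ct = default)
    {
        Titles.Add(message.Title);
        return ValueTask.CompletedTask;
    }
}
```

Because the stages only know about message types, a second sink or an extra spider can be attached with another `Subscribe` call without touching the existing classes.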
- Pro: Pub/Sub event handling for decoupled data flow.
- Pro: Easy extensibility by tapping events.
- Con: Terrible debugging experience using model annotations.
- Con: Smelly `dynamic`-riddled design when storing to a database.
- Con: Retry policies missing.
- Con: Much boilerplate necessary.

### Evaluation of [WebReaper](https://github.com/pavlovtech/WebReaper)
The [Puppeteer](https://pptr.dev/) browser handling in `ScrapeAAS` is a mixture of the [lifetime tracking http handler](https://source.dot.net/#Microsoft.Extensions.Http/DefaultHttpClientFactory.cs) and the [WebReaper Puppeteer integration](https://github.com/pavlovtech/WebReaper/blob/master/WebReaper/Core/Loaders/Concrete/PuppeteerPageLoader.cs).
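The lifetime tracking idea can be illustrated with a small sketch: an expensive resource is reused until its lifetime elapses, then replaced, and the old instance is retired rather than disposed on the spot. This is a simplified illustration, not the `ScrapeAAS` or `DefaultHttpClientFactory` code. `ExpiringResource<T>` and `FakeBrowser` are hypothetical names, the clock is injected so expiry can be shown without waiting, and retired instances are disposed by an explicit call rather than by tracking outstanding references.

```csharp
using System;
using System.Collections.Generic;

// Demonstration with a fake clock.
var now = DateTimeOffset.UnixEpoch;
var created = 0;
var pool = new ExpiringResource<FakeBrowser>(
    factory: () => new FakeBrowser(++created),
    lifetime: TimeSpan.FromMinutes(2),
    clock: () => now);

var first = pool.Get();          // creates browser #1
var again = pool.Get();          // within the lifetime: same instance
now += TimeSpan.FromMinutes(3);
var second = pool.Get();         // lifetime elapsed: browser #2 is created, #1 is retired
Console.WriteLine($"{first.Id} {ReferenceEquals(first, again)} {second.Id}"); // prints "1 True 2"

pool.DisposeRetired();           // the caller decides when retired instances are safe to close
Console.WriteLine(first.Disposed); // prints "True"

// Stand-in for a browser instance; hypothetical, for illustration only.
sealed class FakeBrowser : IDisposable
{
    public FakeBrowser(int id) => Id = id;
    public int Id { get; }
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

// Reuses one instance until it expires, then rotates to a fresh one.
sealed class ExpiringResource<T> where T : class, IDisposable
{
    private readonly Func<T> _factory;
    private readonly TimeSpan _lifetime;
    private readonly Func<DateTimeOffset> _clock;
    private readonly object _gate = new();
    private readonly Queue<T> _retired = new();
    private T? _current;
    private DateTimeOffset _expiresAt;

    public ExpiringResource(Func<T> factory, TimeSpan lifetime, Func<DateTimeOffset> clock)
    {
        _factory = factory;
        _lifetime = lifetime;
        _clock = clock;
    }

    public T Get()
    {
        lock (_gate)
        {
            var now = _clock();
            if (_current is null || now >= _expiresAt)
            {
                // Retire instead of disposing: callers may still hold the old instance.
                if (_current is not null) _retired.Enqueue(_current);
                _current = _factory();
                _expiresAt = now + _lifetime;
            }
            return _current;
        }
    }

    // Disposes every retired instance and returns how many were disposed.
    public int DisposeRetired()
    {
        lock (_gate)
        {
            var count = _retired.Count;
            while (_retired.TryDequeue(out var old)) old.Dispose();
            return count;
        }
    }
}
```

Rotating on a timer bounds how long a single browser process lives, while deferring disposal avoids closing an instance that a running scrape is still using.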
- Pro: Simple declarative builder API. No boilerplate needed.
- Pro: Easy extensibility by implementing interfaces.
- Pro: Puppeteer browser.
- Con: Unable to control data flow.
- Con: Unable to parse data.
- Con: No ASP.NET or **any** DI integration possible.
- Con: Dependencies for optional extensions, such as `Redis`, `MySql`, and `RabbitMq`, are always included in the package.