https://github.com/coverified/spider
A microservice with web-crawler/spider capabilities that only follows and indexes URLs of the provided host domain(s)
- Host: GitHub
- URL: https://github.com/coverified/spider
- Owner: coverified
- License: bsd-3-clause
- Created: 2021-08-01T17:30:53.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-09-30T08:25:27.000Z (over 4 years ago)
- Last Synced: 2025-02-16T15:34:07.077Z (about 1 year ago)
- Topics: akka, crawler, graphql, hacktoberfest, microservice, spider
- Language: Scala
- Homepage:
- Size: 279 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# spider
A microservice that crawls a set of sites by following links to pages on the relevant domains.
Only URLs belonging to the provided host(s) are considered.
Newly discovered URLs are entered into a GraphQL database.
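The host restriction described above can be sketched as follows. This is an illustrative example, not the project's actual code; the `HostFilter` object and its method name are hypothetical:

```scala
import java.net.URI

object HostFilter {
  /** True if `url` points at one of the allowed hosts (exact match).
    * Links to any other host are ignored by the crawler.
    */
  def isRelevant(url: String, allowedHosts: Set[String]): Boolean =
    Option(new URI(url).getHost).exists(allowedHosts.contains)
}
```

A crawler loop would call such a predicate on every extracted link and only enqueue those for which it returns `true`.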
## Used Frameworks / Libraries
_(not comprehensive, but the most important ones)_
- [akka](https://akka.io/)
- [Caliban Client](https://ghostdogpr.github.io/caliban/) to talk to GraphQL endpoint
- [Sentry](https://sentry.io/welcome/) (error reporting)
## Configuration
Configuration is done via environment variables.
The following parameters are available:
- `API_URL` - GraphQL API URL (**required**)
- `AUTH_SECRET` - GraphQL authentication secret (**required**)
- `SCRAPE_PARALLELISM` - number of pages the crawler visits in parallel (default: 100)
- `SCRAPE_INTERVAL` - time interval between page hits (default: 500 ms)
- `SCRAPE_TIMEOUT` - timeout for each page load attempt (default: 20,000 ms)
- `SHUTDOWN_TIMEOUT` - time after which the spider exits if no new URLs have been found (default: 15,000 ms)
- `MAX_RETRIES` - maximum number of retries after a page load attempt fails (default: 0)
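A minimal sketch of how these variables might be loaded, using the documented defaults. The `SpiderConfig` case class, its field names, and the `load` helper are illustrative assumptions, not the project's actual code:

```scala
import scala.concurrent.duration._

// Illustrative configuration holder; field names mirror the env variables above.
final case class SpiderConfig(
    apiUrl: String,
    authSecret: String,
    scrapeParallelism: Int,
    scrapeInterval: FiniteDuration,
    scrapeTimeout: FiniteDuration,
    shutdownTimeout: FiniteDuration,
    maxRetries: Int
)

object SpiderConfig {
  /** Reads the configuration from `env` (defaults to the process environment).
    * Required variables raise an error; all others fall back to their defaults.
    */
  def load(env: Map[String, String] = sys.env): SpiderConfig = SpiderConfig(
    apiUrl = env.getOrElse("API_URL", sys.error("API_URL is required")),
    authSecret = env.getOrElse("AUTH_SECRET", sys.error("AUTH_SECRET is required")),
    scrapeParallelism = env.get("SCRAPE_PARALLELISM").map(_.toInt).getOrElse(100),
    scrapeInterval = env.get("SCRAPE_INTERVAL").map(_.toLong.millis).getOrElse(500.millis),
    scrapeTimeout = env.get("SCRAPE_TIMEOUT").map(_.toLong.millis).getOrElse(20000.millis),
    shutdownTimeout = env.get("SHUTDOWN_TIMEOUT").map(_.toLong.millis).getOrElse(15000.millis),
    maxRetries = env.get("MAX_RETRIES").map(_.toInt).getOrElse(0)
  )
}
```

Passing the environment as a `Map` keeps the loader testable without mutating process state.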