https://github.com/coverified/spider

A microservice with web-crawler/spider capabilities that only follows and indexes URLs of the provided host domain(s).

akka crawler graphql hacktoberfest microservice spider

# spider
A microservice that crawls a set of sites by following links to pages on the relevant domains.
Only URLs whose host matches one of the provided hosts are followed; links to other hosts are ignored.
New URLs are stored in the database via its GraphQL API.
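
The host restriction described above can be illustrated with a small filter. This is a minimal sketch, not the spider's actual code: `HostFilter`, `isAllowed`, and the example hosts are hypothetical names, and exact (case-insensitive) host matching is an assumption, since the README does not say how subdomains are treated.

```scala
import java.net.URI
import scala.util.Try

object HostFilter {
  // Hypothetical helper: true if `url` parses and its host is one of `allowedHosts`.
  // Assumption: hosts are compared exactly, ignoring case (no subdomain matching).
  def isAllowed(url: String, allowedHosts: Set[String]): Boolean =
    Try(new URI(url)).toOption
      .flatMap(uri => Option(uri.getHost)) // getHost is null for e.g. mailto: links
      .exists(host => allowedHosts.contains(host.toLowerCase))

  def main(args: Array[String]): Unit = {
    val hosts = Set("www.example.org")
    val links = List(
      "https://www.example.org/news",
      "https://other.example.com/page",
      "mailto:someone@example.org"
    )
    // Only links on an allowed host would be followed.
    links.filter(isAllowed(_, hosts)).foreach(println)
  }
}
```

With the sample input, only the first link passes the filter; the other two are dropped because their host is either different or absent.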

## Used Frameworks / Libraries
_(not comprehensive, but the most important ones)_

- [akka](https://akka.io/)
- [Caliban Client](https://ghostdogpr.github.io/caliban/) to talk to the GraphQL endpoint
- [Sentry](https://sentry.io/welcome/) (error reporting)

## Configuration
Configuration is done through environment variables.
The following parameters are available:

- `API_URL` - GraphQL API URL (**required**)
- `AUTH_SECRET` - GraphQL authentication secret (**required**)
- `SCRAPE_PARALLELISM` - number of pages the crawler visits in parallel (default: 100)
- `SCRAPE_INTERVAL` - time interval between page hits (default: 500ms)
- `SCRAPE_TIMEOUT` - timeout for each page load attempt (default: 20000ms)
- `SHUTDOWN_TIMEOUT` - time after which the spider exits if no new URLs have been found (default: 15000ms)
- `MAX_RETRIES` - maximum number of retries after a failed page load attempt (default: 0)
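
As an illustration of how these variables and their defaults fit together, the following sketch reads them into a typed configuration. It is not the project's actual loader: `SpiderConfig` and `fromEnv` are hypothetical names, and it assumes that durations are passed as plain millisecond counts and that the defaults listed above mean 500, 20000, and 15000 milliseconds.

```scala
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical config holder; fields mirror the environment variables above.
final case class SpiderConfig(
    apiUrl: String,
    authSecret: String,
    scrapeParallelism: Int,
    scrapeInterval: FiniteDuration,
    scrapeTimeout: FiniteDuration,
    shutdownTimeout: FiniteDuration,
    maxRetries: Int
)

object SpiderConfig {
  // Assumption: durations are given as integer millisecond values.
  def fromEnv(env: Map[String, String]): Either[String, SpiderConfig] = {
    // Required variables must be present and non-empty.
    def required(key: String): Either[String, String] =
      env.get(key).filter(_.nonEmpty).toRight(s"Missing required variable $key")

    // Optional integer variables fall back to the documented default.
    def int(key: String, default: Int): Either[String, Int] =
      env.get(key) match {
        case None => Right(default)
        case Some(raw) =>
          Try(raw.trim.toInt).toEither.left.map(_ => s"$key is not an integer: $raw")
      }

    for {
      apiUrl      <- required("API_URL")
      authSecret  <- required("AUTH_SECRET")
      parallelism <- int("SCRAPE_PARALLELISM", 100)
      interval    <- int("SCRAPE_INTERVAL", 500)
      timeout     <- int("SCRAPE_TIMEOUT", 20000)
      shutdown    <- int("SHUTDOWN_TIMEOUT", 15000)
      retries     <- int("MAX_RETRIES", 0)
    } yield SpiderConfig(
      apiUrl, authSecret, parallelism,
      interval.millis, timeout.millis, shutdown.millis, retries
    )
  }
}
```

Calling `SpiderConfig.fromEnv(sys.env)` yields either an error message naming the offending variable or a fully populated configuration, so a missing `API_URL` or `AUTH_SECRET` is reported before any crawling starts.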