https://github.com/coverified/spider
A microservice with web-crawler/spider capabilities that only follows and indexes URLs of the provided host domain(s)
- Host: GitHub
- URL: https://github.com/coverified/spider
- Owner: coverified
- License: bsd-3-clause
- Created: 2021-08-01T17:30:53.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-09-30T08:25:27.000Z (over 4 years ago)
- Last Synced: 2025-02-16T15:34:07.077Z (about 1 year ago)
- Topics: akka, crawler, graphql, hacktoberfest, microservice, spider
- Language: Scala
- Homepage:
- Size: 279 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# spider
A microservice that crawls a set of sites by following links to pages on the relevant domains.
Only URLs belonging to the provided host(s) are considered.
Newly discovered URLs are entered into a GraphQL database.
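The host restriction described above can be sketched as follows. This is an illustrative example, not the project's actual code; the `HostFilter` object and its method name are hypothetical:

```scala
import java.net.URI

object HostFilter {
  /** True if `url` points at one of the allowed hosts (exact match).
    * Links to any other host are ignored by the crawler.
    */
  def isRelevant(url: String, allowedHosts: Set[String]): Boolean =
    Option(new URI(url).getHost).exists(allowedHosts.contains)
}
```

A crawler loop would call such a predicate on every extracted link and only enqueue those for which it returns `true`.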
## Used Frameworks / Libraries
_(not comprehensive, but the most important ones)_
- [akka](https://akka.io/)
- [Caliban Client](https://ghostdogpr.github.io/caliban/) to talk to GraphQL endpoint
- [Sentry](https://sentry.io/welcome/) (error reporting)
## Configuration
Configuration is done via environment variables.
The following parameters are available:
- `API_URL` - GraphQL API URL (**required**)
- `AUTH_SECRET` - GraphQL authentication secret (**required**)
- `SCRAPE_PARALLELISM` - number of pages the crawler visits in parallel (default: 100)
- `SCRAPE_INTERVAL` - time interval between page hits (default: 500 ms)
- `SCRAPE_TIMEOUT` - timeout for each page load attempt (default: 20,000 ms)
- `SHUTDOWN_TIMEOUT` - time after which the spider exits if no new URLs have been found (default: 15,000 ms)
- `MAX_RETRIES` - maximum number of retries after a page load attempt fails (default: 0)
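A minimal sketch of how these variables might be loaded, using the documented defaults. The `SpiderConfig` case class, its field names, and the `load` helper are illustrative assumptions, not the project's actual code:

```scala
import scala.concurrent.duration._

// Illustrative configuration holder; field names mirror the env variables above.
final case class SpiderConfig(
    apiUrl: String,
    authSecret: String,
    scrapeParallelism: Int,
    scrapeInterval: FiniteDuration,
    scrapeTimeout: FiniteDuration,
    shutdownTimeout: FiniteDuration,
    maxRetries: Int
)

object SpiderConfig {
  /** Reads the configuration from `env` (defaults to the process environment).
    * Required variables raise an error; all others fall back to their defaults.
    */
  def load(env: Map[String, String] = sys.env): SpiderConfig = SpiderConfig(
    apiUrl = env.getOrElse("API_URL", sys.error("API_URL is required")),
    authSecret = env.getOrElse("AUTH_SECRET", sys.error("AUTH_SECRET is required")),
    scrapeParallelism = env.get("SCRAPE_PARALLELISM").map(_.toInt).getOrElse(100),
    scrapeInterval = env.get("SCRAPE_INTERVAL").map(_.toLong.millis).getOrElse(500.millis),
    scrapeTimeout = env.get("SCRAPE_TIMEOUT").map(_.toLong.millis).getOrElse(20000.millis),
    shutdownTimeout = env.get("SHUTDOWN_TIMEOUT").map(_.toLong.millis).getOrElse(15000.millis),
    maxRetries = env.get("MAX_RETRIES").map(_.toInt).getOrElse(0)
  )
}
```

Passing the environment as a `Map` keeps the loader testable without mutating process state.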