https://github.com/montferret/worker

Containerized Ferret worker

chrome crawler docker dsl ferret go hacktoberfest hacktoberfest2020 scraping scraping-websites service worker

Host: GitHub
URL: https://github.com/montferret/worker
Owner: MontFerret
License: apache-2.0
Created: 2020-05-08T12:22:00.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2023-03-29T01:59:10.000Z (almost 2 years ago)
Last Synced: 2024-11-04T15:52:29.394Z (3 months ago)
Topics: chrome, crawler, docker, dsl, ferret, go, hacktoberfest, hacktoberfest2020, scraping, scraping-websites, service, worker
Language: Go
Homepage:
Size: 1.68 MB
Stars: 14
Watchers: 4
Forks: 7
Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# Worker

**Worker** is a simple HTTP server that accepts FQL queries, executes them and returns their results.
The OpenAPI v2 schema can be found [here](https://raw.githubusercontent.com/MontFerret/cli/master/reference/ferret-worker.yaml).

## Quick start

The Worker ships with a dedicated Docker image that includes headless Google Chrome, so you can run queries using the `cdp` driver:

Docker Hub:
```sh
docker run -d -p 8080:8080 montferret/worker
```
GitHub Container Registry:
```sh
docker run -d -p 8080:8080 ghcr.io/montferret/worker
```

Alternatively, if you want to use your own version of Chrome, you can run the Worker locally.

By installing the binary:

```shell
curl https://raw.githubusercontent.com/MontFerret/worker/master/install.sh | sh
worker
```

Or by building locally:

```sh
make
```

And then just make a POST request:

![worker](https://raw.githubusercontent.com/MontFerret/worker/master/assets/postman.png)

## System Resource Requirements
- 2 CPUs
- 2 GB of RAM

## Usage

### Endpoints

#### POST /
Executes a given query. The payload must have the following shape:

```
Query {
  text: String!
  params: Map
}
```

#### GET /info
Returns information about the worker, including details about Chrome, Ferret, and the worker itself. The response has the following shape:

```
Info {
  ip: String!
  version: Version! {
    worker: String!
    chrome: ChromeVersion! {
      browser: String!
      protocol: String!
      v8: String!
      webkit: String!
    }
    ferret: String!
  }
}
```
```
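
The shape above maps naturally onto Go structs. The following sketch assumes the endpoint responds with JSON whose keys match the field names in the schema. The type names, the `decodeInfo` helper, and the sample values are placeholders introduced for illustration, not output from a real worker.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChromeVersion, Version and Info mirror the shape documented for GET /info.
// The JSON keys are assumed to match the schema field names.
type ChromeVersion struct {
	Browser  string `json:"browser"`
	Protocol string `json:"protocol"`
	V8       string `json:"v8"`
	WebKit   string `json:"webkit"`
}

type Version struct {
	Worker string        `json:"worker"`
	Chrome ChromeVersion `json:"chrome"`
	Ferret string        `json:"ferret"`
}

type Info struct {
	IP      string  `json:"ip"`
	Version Version `json:"version"`
}

// decodeInfo parses a GET /info response body into an Info value.
func decodeInfo(data []byte) (Info, error) {
	var info Info
	if err := json.Unmarshal(data, &info); err != nil {
		return Info{}, fmt.Errorf("decode info: %w", err)
	}
	return info, nil
}

func main() {
	// Placeholder payload for illustration only; real values come from the worker.
	sample := []byte(`{
		"ip": "127.0.0.1",
		"version": {
			"worker": "example-worker",
			"chrome": {
				"browser": "example-browser",
				"protocol": "example-protocol",
				"v8": "example-v8",
				"webkit": "example-webkit"
			},
			"ferret": "example-ferret"
		}
	}`)

	info, err := decodeInfo(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("worker:", info.Version.Worker)
	fmt.Println("chrome:", info.Version.Chrome.Browser)
	fmt.Println("ferret:", info.Version.Ferret)
}
```

In a real client, the bytes passed to `decodeInfo` would be the body of an HTTP GET to `/info` on the worker.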

#### GET /health
Health check endpoint (for example, for Kubernetes probes). Returns an empty 200 response.

### Run commands

```bash
  -log-level="debug"
      log level
  -port=8080
      port to listen
  -body-limit=1000
      maximum size of request body in kb. 0 means no limit.
  -request-limit=20
      amount of requests per second for each IP. 0 means no limit.
  -request-limit-time-window=180
      amount of seconds for request rate limit time window.
  -cache-size=100
      amount of cached queries. 0 means no caching.
  -chrome-ip="127.0.0.1"
      Google Chrome remote IP address
  -chrome-port=9222
      Google Chrome remote debugging port
  -no-chrome=false
      disable Chrome driver
  -version=false
      show version
  -help=false
      show this list
```