:spider_web: Awesome scenario based crawler
https://github.com/68publishers/crawler
- Host: GitHub
- URL: https://github.com/68publishers/crawler
- Owner: 68publishers
- License: mit
- Created: 2023-04-01T02:43:42.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-06T23:58:01.000Z (10 months ago)
- Last Synced: 2024-05-20T23:06:27.126Z (7 months ago)
- Topics: crawlee, crawler, crawling, node, nodejs, scraper, scraping
- Language: JavaScript
- Homepage:
- Size: 1.73 MB
- Stars: 7
- Watchers: 2
- Forks: 0
- Open Issues: 1
- Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md
# Crawler
Scenario-based crawler written in Node.js
## Table of Contents
* [About](#about)
* [Development setup](#development-setup)
* [Production setup](#production-setup)
* [Environment variables](#environment-variables)
* [Rest API and Queues board](#rest-api-and-queues-board)
* [Working with scenarios](#working-with-scenarios)
* [Working with scenario schedulers](#working-with-scenario-schedulers)
* [Tutorial: Creating the first scenario](#tutorial-creating-the-first-scenario)
* [Integrations](#integrations)
* [License](#license)

## About
Crawler is a standalone application written in Node.js built on top of [Express.js](https://github.com/expressjs/express), [Crawlee](https://github.com/apify/crawlee), [Puppeteer](https://github.com/puppeteer/puppeteer) and [BullMQ](https://github.com/taskforcesh/bullmq), allowing you to crawl data from web pages by defining scenarios. This is all controlled through the Rest API.
![Crawler admin](docs/images/main.png)
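The scenario format itself is covered in [Working with scenarios](#working-with-scenarios). As a rough sketch of the workflow, a run could be triggered through the API roughly like this (the `/api/scenarios` path and the JSON fields are hypothetical placeholders, not the documented API; consult the Swagger UI described in [Rest API and Queues board](#rest-api-and-queues-board) for the real contract):
```sh
# Hypothetical sketch only: the endpoint path and JSON fields are assumptions
# for illustration; see /api-docs for the actual specification.
$ curl -u <user>:<password> \
  -H 'Content-Type: application/json' \
  -d '{"name": "my-scenario", "config": {}}' \
  http://localhost:3000/api/scenarios
```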
## Development setup
### Prerequisites
- Docker compose
- Make

### Installation
```sh
$ git clone https://github.com/68publishers/crawler.git crawler
$ cd crawler
$ make init
```

### Creating a user
HTTP Basic authorization is required for API access and administration, so create a user for the application:
```sh
$ docker exec -it crawler-app npm run user:create
```

## Production setup
### Prerequisites
- Docker
- Postgres `>=14.6`
- Redis `>=7`

For production use, Redis must be configured as follows (a minimal launch sketch follows the list):
1. Configure persistence with the `Append-only file` strategy - https://redis.io/docs/management/persistence/#aof-advantages
2. Set the `Max memory policy` to `noeviction` - https://redis.io/docs/reference/eviction/#eviction-policies
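Both settings correspond to standard Redis configuration directives. A minimal sketch, assuming Redis is started directly (the same values can go into `redis.conf` as `appendonly yes` and `maxmemory-policy noeviction`):
```sh
# Run Redis with AOF persistence enabled and eviction disabled,
# so BullMQ queue data is never silently dropped under memory pressure.
$ redis-server --appendonly yes --maxmemory-policy noeviction
```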
### Installation

First, run the database migrations:
```sh
$ docker run \
--network <network> \
-e DB_URL=postgres://<user>:<password>@<host>:<port>/<database> \
--entrypoint '/bin/sh' \
-it \
--rm \
68publishers/crawler:latest \
-c 'npm run migrations:up'
```

Then download the `seccomp` file, which is required to run Chrome:
```sh
$ curl -C - -O https://raw.githubusercontent.com/68publishers/crawler/main/.docker/chrome/chrome.json
```

And run the application:
```sh
$ docker run \
--init \
--network <network> \
-e APP_URL=<app-url> \
-e DB_URL=postgres://<user>:<password>@<host>:<port>/<database> \
-e REDIS_HOST=<redis-host> \
-e REDIS_PORT=<redis-port> \
-e REDIS_AUTH=<redis-password> \
-p 3000:3000 \
--security-opt seccomp=$(pwd)/chrome.json \
-d \
--name 68publishers_crawler \
68publishers/crawler:latest
```

### Creating a user
HTTP Basic authorization is required for API access and administration, so create a user for the application:
```sh
$ docker exec -it 68publishers_crawler npm run user:create
```
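With a user created, you can check that the application is up by requesting the Swagger UI with the new credentials (a sketch assuming the default `3000` port mapping used above):
```sh
# Expect an HTTP 200 with the Swagger UI page when the app is healthy;
# whether this particular endpoint enforces the credentials is not specified here.
$ curl -u <user>:<password> http://localhost:3000/api-docs
```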
## Environment variables

| Name | Required | Default | Description |
|---------------------|----------|-----------------------------|-------------------------------------------------------------------------------------------------------------------------|
| APP_URL | yes | - | Full origin of the application e.g. `https://www.example.com`. The variable is used to create links to screenshots etc. |
| APP_PORT | no | `3000` | Port to which the application listens |
| DB_URL | yes | - | Connection string to the Postgres database, e.g. `postgres://root:root@localhost:5432/crawler` |
| REDIS_HOST | yes | - | Redis hostname |
| REDIS_PORT | yes | - | Redis port |
| REDIS_AUTH | no | - | Optional redis password |
| REDIS_DB | no | `0` | Redis database number |
| WORKER_PROCESSES | no | `5` | Number of workers that process the queue of running scenarios |
| CRAWLEE_STORAGE_DIR | no | `./var/crawlee` | Directory where crawler stores runtime data |
| CHROME_PATH | no | `/usr/bin/chromium-browser` | Path to Chromium executable file |
| SENTRY_DSN | no | - | Logging to Sentry is enabled if the variable is set |
| SENTRY_SERVER_NAME | no | `crawler` | Server name that is passed to the Sentry logger |
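For production containers, the variables can also be collected in an env file and passed to `docker run` with `--env-file` (the file name and all values below are placeholders):
```sh
# crawler.env - example values only, substitute your own
APP_URL=https://crawler.example.com
DB_URL=postgres://crawler:secret@db:5432/crawler
REDIS_HOST=redis
REDIS_PORT=6379
```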
## Rest API and Queues board

The Rest API specification (Swagger UI) is available at the `/api-docs` endpoint, usually `http://localhost:3000/api-docs` in the development setup. All endpoints can be tried out there.
Alternatively, the specification can be viewed [online](https://petstore.swagger.io/?url=https://raw.githubusercontent.com/68publishers/crawler/main/public/openapi.json).
Bull Board is available at `/admin/queues`. Here you can see all scenarios that are currently running or have already finished.
## Working with scenarios
@todo
## Working with scenario schedulers
@todo
## Tutorial: Creating the first scenario
@todo
## Integrations
- PHP Client for Crawler's API - [68publishers/crawler-client-php](https://github.com/68publishers/crawler-client-php)
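Assuming the client follows the usual Composer naming convention derived from its repository (an assumption; check the client's README for the authoritative package name):
```sh
$ composer require 68publishers/crawler-client-php
```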
## License
The package is distributed under the MIT License. See [LICENSE](LICENSE.md) for more information.