https://github.com/mevljas/gov.si-crawler-playwright

A standalone crawler that crawls only .gov.si web sites using Playwright.

# Gov.si crawler playwright

A standalone crawler that crawls only .gov.si websites using [Playwright](https://playwright.dev/python/).
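
The domain restriction can be illustrated with a short sketch. This is not the project's actual code: the helper name `is_gov_si` is hypothetical, and it assumes that "only .gov.si" means the hostname is `gov.si` itself or ends in `.gov.si`.

```python
from urllib.parse import urlsplit


def is_gov_si(url: str) -> bool:
    """Return True if the URL's host is gov.si or a subdomain of it.

    Hypothetical helper for illustration; the crawler's real check may differ.
    """
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
    if host is None:
        return False
    host = host.rstrip(".")  # tolerate a trailing root dot
    # Match on the dot boundary so "notgov.si" is rejected.
    return host == "gov.si" or host.endswith(".gov.si")
```

Matching on the hostname rather than on the raw URL string avoids accepting look-alike addresses such as `https://gov.si.example.com/`.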

## Project setup

### Setup environment variables

```bash
cp .env.example .env
```

Edit the **.env** file if necessary. The number of threads can be set with the *N_THREADS* parameter.
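
As a rough sketch of how a setting like *N_THREADS* might be consumed, the snippet below reads it from the environment and sizes a thread pool. Only the variable name comes from this README. The `read_n_threads` helper, the default of 4, and the fallback behaviour are assumptions, and the project may load `.env` differently (for example through a dotenv library).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_n_threads(environ=os.environ, default: int = 4) -> int:
    """Parse N_THREADS, falling back to `default` (an assumed value)."""
    raw = environ.get("N_THREADS", "")
    try:
        value = int(raw)
    except ValueError:
        # Missing or non-numeric setting: use the fallback.
        return default
    return value if value > 0 else default


if __name__ == "__main__":
    n_threads = read_n_threads()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Placeholder work items; a real crawler would submit page fetches.
        results = list(pool.map(len, ["a", "bb", "ccc"]))
    print(n_threads, results)
```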

### Run Docker Postgres database

```bash
docker-compose up -d ieps-db
```

### Create and use virtual env

```bash
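# Create a virtual environment in ./env (venv is part of the Python 3 standard library)
python -m venv env
# Activate it for the current shell session
source env/bin/activate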
```

Alternatively, you can set it up using PyCharm.

### Install requirements

```bash
pip install -r requirements.txt
```

### Install Playwright browsers (chromium, firefox, webkit)

```bash
playwright install
```

### Run database migrations

```bash
python migrate.py
```

## Run the crawler

```bash
python main.py
```

## PgAdmin (optional)

You can run the pgAdmin Docker container with the following command:

```bash
docker-compose up -d pgadmin
```

Open pgAdmin 4 in your web browser at [http://localhost:5050/](http://localhost:5050/).
Log in with `admin@admin.com` as the email address and `root` as the password.