https://github.com/mevljas/gov.si-crawler-playwright
A standalone crawler that crawls only .gov.si web sites using Playwright.
- Host: GitHub
- URL: https://github.com/mevljas/gov.si-crawler-playwright
- Owner: mevljas
- License: mit
- Created: 2023-03-05T15:19:25.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2025-01-17T17:12:46.000Z (over 1 year ago)
- Last Synced: 2025-01-17T18:28:18.903Z (over 1 year ago)
- Topics: crawler, multithreading, playwright, sqlachemy
- Language: Python
- Homepage:
- Size: 107 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Gov.si crawler playwright
A standalone crawler that crawls only .gov.si web sites using [Playwright](https://playwright.dev/python/).
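The restriction to `.gov.si` sites can be illustrated with a small host check. This is only a minimal sketch of one way to express such a filter, not the crawler's actual code; the function name `is_gov_si` is hypothetical, and treating the bare `gov.si` host as in scope is an assumption.

```python
from urllib.parse import urlsplit


def is_gov_si(url: str) -> bool:
    """Return True if the URL's host is gov.si or one of its subdomains."""
    # hostname is None for URLs without a network location.
    host = (urlsplit(url).hostname or "").lower()
    # Match on the full domain label so that "fakegov.si" is rejected.
    return host == "gov.si" or host.endswith(".gov.si")


if __name__ == "__main__":
    for candidate in (
        "https://www.gov.si/",
        "https://e-uprava.gov.si/podrocja",
        "https://example.com/gov.si",
    ):
        print(candidate, is_gov_si(candidate))
```

Checking the parsed hostname rather than the raw string avoids accepting URLs that merely contain `gov.si` in their path or query.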
## Project setup
### Setup environment variables
```bash
cp .env.example .env
```
Edit the **.env** file if necessary. The number of crawler threads can be set with the *N_THREADS* parameter.
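As a rough illustration of how a thread-count setting like *N_THREADS* might be consumed, the sketch below reads it from the process environment and sizes a thread pool accordingly. It assumes the values from **.env** have already been loaded into the environment; `read_thread_count`, `crawl_placeholder`, and the default of 4 are hypothetical and not taken from the project.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_thread_count(default: int = 4) -> int:
    """Read N_THREADS from the environment, falling back to a default."""
    raw = os.environ.get("N_THREADS", "")
    try:
        value = int(raw)
    except ValueError:
        # Missing or non-numeric value: use the default.
        return default
    # Reject zero and negative values as well.
    return value if value > 0 else default


def crawl_placeholder(worker_id: int) -> str:
    # Hypothetical stand-in for a real per-thread crawl loop.
    return f"worker {worker_id} done"


if __name__ == "__main__":
    n_threads = read_thread_count()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for line in pool.map(crawl_placeholder, range(n_threads)):
            print(line)
```

Falling back to a default keeps a missing or mistyped value from crashing startup; the real crawler may handle this differently.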
### Run Docker Postgres database
```bash
docker-compose up -d ieps-db
```
### Create and use virtual env
```bash
pip install virtualenv
python -m venv env
source env/bin/activate
```
Alternatively, you can set up the virtual environment in PyCharm.
### Install requirements
```bash
pip install -r requirements.txt
```
### Install Playwright browsers (Chromium, Firefox, WebKit)
```bash
playwright install
```
### Run database migrations
```bash
python migrate.py
```
## Run the crawler
```bash
python main.py
```
## pgAdmin (optional)
You can start the pgAdmin Docker container with the following command:
```bash
docker-compose up -d pgadmin
```
Open pgAdmin 4 in your web browser at [http://localhost:5050/](http://localhost:5050/).
Log in with the email address admin@admin.com and the password root.