https://github.com/julinium/pmmp-scraper
Python application to scrape the PMMP website (Moroccan Public Procurement Market) and save cleaned data into a postgresql database.
https://github.com/julinium/pmmp-scraper
automation beautifulsoup chromedriver chromium python requests scraping selenium
Last synced: about 2 months ago
JSON representation
Python application to scrape the PMMP website (Moroccan Public Procurement Market) and save cleaned data into a postgresql database.
- Host: GitHub
- URL: https://github.com/julinium/pmmp-scraper
- Owner: Julinium
- License: gpl-3.0
- Created: 2025-05-27T12:23:40.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-06-14T16:52:19.000Z (about 1 year ago)
- Last Synced: 2025-06-14T17:38:57.754Z (about 1 year ago)
- Topics: automation, beautifulsoup, chromedriver, chromium, python, requests, scraping, selenium
- Language: Python
- Homepage: https://emarches.com
- Size: 112 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# pmmp-scraper
pmmp-scraper is a Python application aiming to scrape the PMMP website and store cleaned data in a separate database.
It's intended for learning pusposes only.
# Legal ?
Always remember: scraping may be illegal or cause you to be banned or blacklisted.
Before scraping a website, please make sure the owner of the website allows it.
# Docker ?
This application uses Chromium web browser and relies on Cron jobs to update database frequently. It's not pretty practical to run it as a docker container. Instead, it runs well on the server/hypervisor.
# Scraping a different website ?
This won't probably work out of the box. Because scraping depends on the target website technology and replies ...
# How to use ?
1. Clone the repo, extract and cd...
2. Setup your settings in .env file.
3. Run crony/worker.sh --level debug --links crawl --found ignore.
Please refer to app/setting.py for more info about the args.
4. Optionally, setup a cron job to run the script periodically.
# .env files
1. .env:
SITE_ROOT = "https://www.xxx.tld/" # Target website constants
SITE_INDEX = "https://www.xxx.tld/index.php"
LINK_PREFIX = 'https://www.xxx.tld/index.php?page=yyy&zzz='
LINK_STITCH = '&aaa='
DB_SERVER = '0.0.0.0' # Postgresql Database engine
DB_PORT = 9999
DB_NAME = "dbname"
DB_USER = "dbuser"
DB_PASS = "$trongP@ssw0rd-999"
MEDIA_ROOT = '/var/opt/path/to/media' # Preferably absolute paths. Make sure they exist and are writeable.
SELENO_DIR = '/var/opt/main/path'
2. .env.creds.json: # Credentials to use to download DCE files. Use as many as possible. They are randomly shuffled.
[
{"fname": "John", "lname": "Doe", "email": "john@doe"},
...
{"fname": "Jean", "lname": "Dupont", "email": "jean@dupont"}
]
3. .env.ua.json: # User-agents strings to use to navigate the target website. Use as many as possible. DO NOT include modile devices. They are randomly shuffled.
[
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
...
"Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
]