{"id":28447500,"url":"https://github.com/julinium/pmmp-scraper","last_synced_at":"2026-05-04T23:33:33.705Z","repository":{"id":295808443,"uuid":"991299480","full_name":"Julinium/pmmp-scraper","owner":"Julinium","description":"Python application to scrape the PMMP website (Moroccan Public Procurement Market) and save cleaned data into a postgresql database.","archived":false,"fork":false,"pushed_at":"2025-06-14T16:52:19.000Z","size":115,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-06-14T17:38:57.754Z","etag":null,"topics":["automation","beautifulsoup","chromedriver","chromium","python","requests","scraping","selenium"],"latest_commit_sha":null,"homepage":"https://emarches.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Julinium.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-27T12:23:40.000Z","updated_at":"2025-06-14T16:52:22.000Z","dependencies_parsed_at":"2025-05-27T13:44:06.876Z","dependency_job_id":"81131175-f3d6-43fc-872e-d850ece96075","html_url":"https://github.com/Julinium/pmmp-scraper","commit_stats":null,"previous_names":["julinium/pmmp-scraper"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Julinium/pmmp-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Julinium%2Fpmmp-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Julinium%2Fpmmp-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/Julinium%2Fpmmp-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Julinium%2Fpmmp-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Julinium","download_url":"https://codeload.github.com/Julinium/pmmp-scraper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Julinium%2Fpmmp-scraper/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261940661,"owners_count":23233573,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","beautifulsoup","chromedriver","chromium","python","requests","scraping","selenium"],"created_at":"2025-06-06T12:00:43.663Z","updated_at":"2026-05-04T23:33:33.673Z","avatar_url":"https://github.com/Julinium.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pmmp-scraper\npmmp-scraper is a Python application that scrapes the PMMP website and stores cleaned data in a separate database.\nIt's intended for learning purposes only.\n\n# Legal ?\nAlways remember: scraping may be illegal or cause you to be banned or blacklisted.\nBefore scraping a website, please make sure the owner of the website allows it.\n\n# Docker ?\nThis application uses the Chromium web browser and relies on cron jobs to update the database frequently. It's not very practical to run it as a Docker container. Instead, it runs well directly on the server/hypervisor.\n\n# Scraping a different website ?\nThis probably won't work out of the box. 
Because scraping depends on the target website's technology and responses ...\n\n# How to use ?\n1. Clone the repo, extract and cd...\n2. Set up your settings in the .env file.\n3. Run crony/worker.sh --level debug --links crawl --found ignore. \nPlease refer to app/setting.py for more info about the args.\n4. Optionally, set up a cron job to run the script periodically.\n\n# .env files\n1. .env:\n    SITE_ROOT = \"https://www.xxx.tld/\" # Target website constants\n    SITE_INDEX = \"https://www.xxx.tld/index.php\"\n    LINK_PREFIX = 'https://www.xxx.tld/index.php?page=yyy\u0026zzz='\n    LINK_STITCH = '\u0026aaa='\n\n    DB_SERVER = '0.0.0.0' # PostgreSQL database engine\n    DB_PORT = 9999\n    DB_NAME = \"dbname\"\n    DB_USER = \"dbuser\"\n    DB_PASS = \"$trongP@ssw0rd-999\"\n\n    MEDIA_ROOT  = '/var/opt/path/to/media' # Preferably absolute paths. Make sure they exist and are writable.\n    SELENO_DIR = '/var/opt/main/path'\n\n2. .env.creds.json: # Credentials used to download DCE files. Use as many as possible. They are randomly shuffled.\n    [\n        {\"fname\": \"John\", \"lname\": \"Doe\", \"email\": \"john@doe\"},\n        ...\n        {\"fname\": \"Jean\", \"lname\": \"Dupont\", \"email\": \"jean@dupont\"}\n    ]\n\n3. .env.ua.json: # User-agent strings used to navigate the target website. Use as many as possible. DO NOT include mobile devices. They are randomly shuffled.\n    [\n        \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36\",\n        ...\n        \"Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148\"\n    ]","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjulinium%2Fpmmp-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjulinium%2Fpmmp-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjulinium%2Fpmmp-scraper/lists"}