{"id":30820786,"url":"https://github.com/oksana-vtk/consultants-scraper","last_synced_at":"2026-05-08T14:04:25.957Z","repository":{"id":313159009,"uuid":"1050247193","full_name":"oksana-vtk/Consultants-Scraper","owner":"oksana-vtk","description":"Consultant Scraper - Dynamic web scraping using Selenium for paginated data","archived":false,"fork":false,"pushed_at":"2025-09-04T07:26:38.000Z","size":9,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-04T09:35:14.638Z","etag":null,"topics":["beautifulsoup","pandas","python","selenium","webscraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oksana-vtk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-04T06:49:58.000Z","updated_at":"2025-09-04T07:26:41.000Z","dependencies_parsed_at":"2025-09-04T09:49:42.981Z","dependency_job_id":null,"html_url":"https://github.com/oksana-vtk/Consultants-Scraper","commit_stats":null,"previous_names":["oksana-vtk/consultants-scraper"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/oksana-vtk/Consultants-Scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oksana-vtk%2FConsultants-Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oksana-vtk%2FConsultants-Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/oksana-vtk%2FConsultants-Scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oksana-vtk%2FConsultants-Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oksana-vtk","download_url":"https://codeload.github.com/oksana-vtk/Consultants-Scraper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oksana-vtk%2FConsultants-Scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273887826,"owners_count":25185759,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-06T02:00:13.247Z","response_time":2576,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","pandas","python","selenium","webscraper"],"created_at":"2025-09-06T10:02:38.970Z","updated_at":"2026-05-08T14:04:20.921Z","avatar_url":"https://github.com/oksana-vtk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Consultant Scraper - Dynamic web scraping using Selenium for paginated data\nThis project demonstrates how to dynamically scrape listing data from a paginated directory \nusing Selenium, with filters, scroll interaction, and contact detail extraction \nfrom profile pages.\n\n### Project Execution Steps\nThis project automates the extraction of consultant listings and their details from a 
website using Selenium, BeautifulSoup, and pandas.\nIt consists of two main scripts that run sequentially:\n\n- `show_more.py` – collects all consultant names and profile links for selected countries.\n- `consultants_details.py` – visits each profile and scrapes additional details such as About, Headquarters, Website, Email, and Phone.\n\nThe output is a clean dataset of consultants with structured information.\n\n### 1. Run `show_more.py` - Scraping listings for Country_1\nA Python script was developed to select *Country_1* in the Country Location filter and \nperform dynamic, paginated scraping of listing cards. \nIt automatically loads additional results by interacting with the **“Show more”** button. \n\nFor each card, it collects: \n\n- ✅ Name\n- ✅ Link to the listing’s profile\n- ✅ Country (based on the selected filter)\n\nTotal listings found under Country_1: 2,016\nData saved to: listings_country_1.csv (separator: *)\n\n### 2. Run `show_more.py` - Scraping listings for Country_2\nA Python script was developed to select *Country_2* in the Country Location filter and perform dynamic, \npaginated scraping of listing cards. \nIt automatically loads additional results by interacting with the **“Show more”** button. \n\nFor each card, it collects: \n\n- ✅ Name\n- ✅ Link to the listing’s profile\n- ✅ Country (based on the selected filter)\n\nTotal listings found under Country_2: 762\nData saved to: listings_country_2.csv (separator: *)\n\n### 3. 
Data Cleaning \u0026 Deduplication\nUsing Power Query, both datasets were analyzed and merged by matching each listing’s profile link \nto remove duplicates (some consultants appear in multiple countries).\n\n- Country_1, but not Country_2 listings: 1,410\n- Both Country_1 and Country_2 listings: 606\n- Country_2, but not Country_1 listings: 156\n\nTotal unique links across both countries: 2,172\nData saved to: listings_both_countries.csv (separator: *)\n\nThe merged dataset contains the following columns:\n- ✅ Name\n- ✅ Link to the listing’s profile\n- ✅ Country_1 (listed under Country_1 filter)\n- ✅ Country_2 (listed under Country_2 filter)\n- ✅ Index (for internal merging and reference purposes)\n\n### 4. Run `consultants_details.py` - Collecting consultant details\n\nUsing the list of internal links to each listing’s profile, a second Python script \nwas developed to visit every profile page and extract the available details \n(headquarters, website, email, and phone, if listed).\n\n**Total unique internal links for scraping: 2,172**\n\nThe final output includes 2,172 rows with the following fields:\n- ✅ Name\n- ✅ Link to the listing’s profile\n- ✅ Country_1 (listed under the Country_1 filter)\n- ✅ Country_2 (listed under the Country_2 filter)\n- ✅ Index (for internal merging and reference purposes)\n- ✅ About (Short Name) (the listing’s short name from the “About” section)\n- ✅ Headquarters (if listed) \n- ✅ Website (if listed)\n- ✅ Email (if listed)\n- ✅ Phone number (if listed)\n\nFinal dataset saved to: listings_both_countries_details.csv (separator: *)\n\n### Features\n\n- Headless browsing with Selenium (Chrome).\n- Support for multiple countries (configured via .env).\n- Continuous \"Show More\" scrolling and clicking to load all results.\n- Logging to both console and file (per country).\n- Backup saves to prevent data loss during scraping.\n- Auto-restart of browser sessions to avoid memory issues.\n- CSV export with UTF-8 encoding and * delimiter for compatibility.\n- 
Deduplication of multi-country results in Power Query.\n\n### ⚙️ Technologies Used\n\n- Python\n- Selenium (with headless Chrome)\n- BeautifulSoup\n- pandas\n- .env configuration\n- Power Query (for merging and deduplication)\n\n### Notes\n\n- Run scripts sequentially: first show_more.py, then consultants_details.py.\n- Logs are saved to .log files specified in .env.\n- ChromeDriver must match your local Chrome version. Place it in your PATH.\n- To avoid blocking, random delays are added between requests.\n- For more countries, extend .env and duplicate function calls in show_more.py.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foksana-vtk%2Fconsultants-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foksana-vtk%2Fconsultants-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foksana-vtk%2Fconsultants-scraper/lists"}