{"id":31782054,"url":"https://github.com/farhadmohmand66/opportunity_scraper","last_synced_at":"2026-05-14T21:34:52.613Z","repository":{"id":318006400,"uuid":"1069675497","full_name":"farhadmohmand66/opportunity_scraper","owner":"farhadmohmand66","description":"web scraper tool, scraping fours different website","archived":false,"fork":false,"pushed_at":"2025-10-04T12:51:13.000Z","size":30,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-04T14:17:21.163Z","etag":null,"topics":["python","selenium","webautomation","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/farhadmohmand66.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-04T11:59:46.000Z","updated_at":"2025-10-04T12:51:17.000Z","dependencies_parsed_at":"2025-10-04T14:17:24.118Z","dependency_job_id":"a8b0f9cc-fe40-4e9d-83ec-90beb3a53c62","html_url":"https://github.com/farhadmohmand66/opportunity_scraper","commit_stats":null,"previous_names":["farhadmohmand66/opportunity_scraper"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/farhadmohmand66/opportunity_scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farhadmohmand66%2Fopportunity_scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farhadmohmand66%2Fopportunity_scraper/tags","releas
es_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farhadmohmand66%2Fopportunity_scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farhadmohmand66%2Fopportunity_scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/farhadmohmand66","download_url":"https://codeload.github.com/farhadmohmand66/opportunity_scraper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farhadmohmand66%2Fopportunity_scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279003387,"owners_count":26083579,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python","selenium","webautomation","webscraping"],"created_at":"2025-10-10T09:14:14.067Z","updated_at":"2025-10-10T09:14:15.495Z","avatar_url":"https://github.com/farhadmohmand66.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Opportunity Scraper 🎯\n\nA comprehensive web scraping system that extracts opportunities (volunteering, events, scholarships, etc.) 
from multiple websites, specifically filtering for Bulgaria-eligible opportunities.\n\n## 📁 Project Structure\n\n```\nopportunity_scraper/\n├── scrapers/\n│   ├── __init__.py\n│   ├── opportunit4u_scraper.py\n│   ├── european_youth_scraper.py\n│   ├── smokinya_scraper.py\n│   └── eurodesk_scraper.py\n├── data/\n│   ├── opportunit4u_data.json\n│   ├── european_youth_portal_bulgaria_eligible.json\n│   ├── smokinya_bulgaria_eligible.json\n│   └── eurodesk_learning.json\n├── config/\n│   ├── __init__.py\n│   ├── config.py\n│   ├── category_keywords.json\n│   ├── country.json\n│   └── world_cities.json\n├── main.py\n├── requirements.txt\n├── translator.py\n└── README.md\n```\n\n## 🚀 Quick Start\n\n### ⚙️ Installation\n\nClone the repo and install dependencies:\n\n```bash\ngit clone https://github.com/farhadmohmand66/opportunity_scraper.git\ncd opportunity_scraper\npip install -r requirements.txt\n```\n\n### Configuration\n\nIn `config/config.py`, set your OpenAI API key:\n\n```python\nOPENAI_API_KEY = \"your-openai-api-key-here\"\n```\n\n### Running the Scrapers\n\nRun all scrapers:\n\n```bash\npython main.py\n```\n\n### Or run individual scrapers, then run merge_all_json.py\n\nRun a specific scraper:\n\n```bash\npython scrapers/eurodesk_scraper.py\npython scrapers/european_youth_scraper.py\npython scrapers/opportunit4u_scraper.py\npython scrapers/smokinya_scraper.py\n```\n\nIf you run individual scrapers, you must then run `merge_all_json.py` to merge their outputs:\n\n```bash\npython merge_all_json.py\n```\n\n## 🛠️ Scrapers Overview\n\n1. **Eurodesk Scraper**\n   - **Website:** [Eurodesk Learning](https://programmes.eurodesk.eu/learning)\n   - **Features:**\n     - Filters by Bulgaria eligibility\n     - Extracts online/onsite opportunities\n     - Uses category keyword matching\n     - Skips \"UPCOMING\" opportunities\n   - **Output:** `data/eurodesk_learning.json`\n\n2. 
**European Youth Portal Scraper**\n   - **Website:** [European Youth Portal](https://youth.europa.eu/go-abroad/volunteering/opportunities_en)\n   - **Features:**\n     - Specifically for volunteering opportunities\n     - Bulgaria eligibility filtering\n     - Load more functionality\n     - Automatic category detection\n   - **Output:** `data/european_youth_portal_bulgaria_eligible.json`\n\n3. **Opportunit4u Scraper**\n   - **Website:** [Opportunit4u](https://www.opportunit4u.com/)\n   - **Features:**\n     - Load more pagination\n     - Bulgaria eligibility based on description analysis\n     - Location extraction from titles\n     - Multiple opportunity types\n   - **Output:** `data/opportunit4u_data.json`\n\n4. **Smokinya Scraper**\n   - **Website:** [Smokinya](https://smokinya.com/)\n   - **Features:**\n     - Uses OpenAI GPT for intelligent data extraction\n     - Advanced entity recognition\n     - Automatic category classification\n     - Smart location detection\n   - **Output:** `data/smokinya_bulgaria_eligible.json`\n\n---\n\n## 🚀 Features\n\n- Scrapes opportunities from:\n  - [Opportunit4u](https://www.opportunit4u.com/)\n  - [European Youth Portal](https://youth.europa.eu/)\n  - [Smokinya Foundation](https://smokinya.com/)\n  - [Eurodesk Learning](https://programmes.eurodesk.eu/learning)\n  \n- Normalizes and merges different schemas into a single dataset.\n- Outputs a combined JSON file: **`data/all_opportunities.json`**\n\n---\n\n## Proposed Unified Schema\n\n```json\n{\n  \"postNo\": 1,\n  \"title\": \"string\",\n  \"title_bg\": \"string\",\n  \"city\": \"string or null\",\n  \"country\": \"string or null\",\n  \"description\": \"string\",\n  \"description_bg\": \"string\",\n  \"validUntil\": \"string (date or CURRENT)\",\n  \"originalDate\": \"string (raw date as scraped)\",\n  \"type\": \"string\",\n  \"modeOfWork\": \"string\",\n  \"categories\": [\"list of strings\"],\n  \"applicationUrl\": \"string\",\n  \"bannerImage\": \"string\",\n  \"bulgariaEligible\": true/false 
(optional, default false),\n  \"source\": \"string\"\n}\n```\n\n### Field Descriptions:\n- **postNo:** Sequential number of the opportunity\n- **title:** Opportunity title/name\n- **title_bg:** Opportunity title/name in Bulgarian\n- **city:** Location city (extracted from text)\n- **country:** Location country (extracted from text)\n- **description:** Full opportunity description\n- **description_bg:** Full opportunity description in Bulgarian\n- **validUntil:** Application deadline date\n- **originalDate:** Raw date as scraped from the source\n- **type:** Opportunity type (volunteering, event, scholarship, etc.)\n- **modeOfWork:** remote, on-site, or hybrid\n- **categories:** List of relevant categories\n- **applicationUrl:** URL to apply/learn more\n- **bannerImage:** URL of the banner image\n- **bulgariaEligible:** Boolean indicating Bulgaria eligibility\n- **source:** Domain name of the source website\n\n### 🎯 Opportunity Types\n- **volunteering:** Volunteer programs and opportunities\n- **event:** Conferences, workshops, seminars\n- **scholarship:** Funding and financial aid\n- **competition:** Contests and challenges\n- **exchange:** Cultural and youth exchanges\n- **erasmus:** Erasmus+ programs\n- **training:** Skill development programs\n- **internship:** Professional internships\n\n### 🏢 Categories\nThe system recognizes these categories:\n- Programming, Business, Marketing, Journalism\n- Trade, Psychology, Cinema, Finance\n- Design, Music, Social Causes, Medicine\n- Ecology, Languages, Career Guidance, Science\n- Politics, Architecture, Health, Environment\n\n## ⚙️ Configuration Files\n\n### `category_keywords.json`\n```json\n{\n  \"Programming\": [\"programming\", \"coding\", \"software\", \"developer\"],\n  \"Business\": [\"business\", \"entrepreneurship\", \"startup\"],\n  ...\n}\n```\n\n### `country.json`\n```json\n{\n  \"countries\": [\"Bulgaria\", \"Germany\", \"France\", ...]\n}\n```\n\n### `world_cities.json`\n```json\n{\n  \"cities\": [\"Sofia\", \"Berlin\", \"Paris\", 
...]\n}\n```\n\n## 🔧 Technical Details\n\n### Dependencies\n- **selenium:** Web browser automation\n- **undetected-chromedriver:** Anti-detection Chrome driver\n- **openai:** AI-powered data extraction (Smokinya scraper)\n- **beautifulsoup4:** HTML parsing (backup)\n\n### Browser Requirements\n- Chrome browser installed\n- Automatic ChromeDriver management via undetected-chromedriver\n\n### Error Handling\n- Individual scraper failures don't stop the entire system\n- Detailed error logging for debugging\n- Automatic retry mechanisms\n\n## 🚨 Important Notes\n- **CAPTCHA Handling:** Eurodesk may show CAPTCHA - manual solving required\n- **Rate Limiting:** Built-in delays between requests to be respectful\n- **API Key:** OpenAI API key required for Smokinya scraper\n- **Browser Windows:** Scrapers open visible browser windows for interaction\n\n## 📈 Output Management\n- Each scraper saves to its own JSON file\n- Data is automatically deduplicated\n- Only Bulgaria-eligible opportunities are saved\n- Consistent data structure across all sources\n\n## 🆘 Troubleshooting\n### Common Issues:\n- **Import errors:** Make sure you're in the project root directory\n- **Chrome not found:** Install Google Chrome browser\n- **API key errors:** Check `config/config.py` file exists with a valid OpenAI key\n- **CAPTCHA blocks:** Manually solve CAPTCHA when the browser opens\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffarhadmohmand66%2Fopportunity_scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffarhadmohmand66%2Fopportunity_scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffarhadmohmand66%2Fopportunity_scraper/lists"}