https://github.com/farhadmohmand66/opportunity_scraper

Web scraper tool that scrapes four different websites
python selenium webautomation webscraping

Host: GitHub
URL: https://github.com/farhadmohmand66/opportunity_scraper
Owner: farhadmohmand66
Created: 2025-10-04T11:59:46.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-10-04T12:51:13.000Z (9 months ago)
Last Synced: 2025-10-04T14:17:21.163Z (9 months ago)
Topics: python, selenium, webautomation, webscraping
Language: Python
Homepage:
Size: 29.3 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Opportunity Scraper 🎯

A comprehensive web scraping system that extracts opportunities (volunteering, events, scholarships, etc.) from multiple websites, specifically filtering for Bulgaria-eligible opportunities.

## 📁 Project Structure

```
opportunity_scraper/
├── scrapers/
│   ├── __init__.py
│   ├── opportunit4u_scraper.py
│   ├── european_youth_scraper.py
│   ├── smokinya_scraper.py
│   └── eurodesk_scraper.py
├── data/
│   ├── opportunit4u_data.json
│   ├── european_youth_portal_bulgaria_eligible.json
│   ├── smokinya_bulgaria_eligible.json
│   └── eurodesk_learning.json
├── config/
│   ├── __init__.py
│   ├── config.py
│   ├── category_keywords.json
│   ├── country.json
│   └── world_cities.json
├── main.py
├── requirements.txt
├── translator.py
└── README.md
```

## 🚀 Quick Start

### ⚙️ Installation

Clone the repo and install dependencies:

```bash
git clone https://github.com/farhadmohmand66/opportunity_scraper.git
cd opportunity_scraper
pip install -r requirements.txt
```

### Configuration

In `config/config.py`, set your OpenAI API key:

```python
OPENAI_API_KEY = "your-openai-api-key-here"
```
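
A key written directly into a tracked file is easy to commit by accident. As an alternative, here is a minimal sketch that reads the key from an environment variable instead. The helper name `load_openai_api_key` and the choice of `OPENAI_API_KEY` as the variable name are assumptions for illustration, not part of the project.

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the Smokinya scraper."
        )
    return key


# Hypothetical usage inside config/config.py:
# OPENAI_API_KEY = load_openai_api_key()
```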

### Running the Scrapers

Run all scrapers:

```bash
python main.py
```
Alternatively, run a specific scraper on its own:

```bash
python scrapers/eurodesk_scraper.py
python scrapers/european_youth_scraper.py
python scrapers/opportunit4u_scraper.py
python scrapers/smokinya_scraper.py
```
If you run the scrapers individually, you must then run `merge_all_json.py` to combine their output:

```bash
python merge_all_json.py
```

## 🛠️ Scrapers Overview

1. **Eurodesk Scraper**
   - **Website:** [Eurodesk Learning](https://programmes.eurodesk.eu/learning)
   - **Features:**
     - Filters by Bulgaria eligibility
     - Extracts online/onsite opportunities
     - Uses category keyword matching
     - Skips "UPCOMING" opportunities
   - **Output:** `data/eurodesk_learning.json`

2. **European Youth Portal Scraper**
   - **Website:** [European Youth Portal](https://youth.europa.eu/go-abroad/volunteering/opportunities_en)
   - **Features:**
     - Specifically for volunteering opportunities
     - Bulgaria eligibility filtering
     - Load more functionality
     - Automatic category detection
   - **Output:** `data/european_youth_portal_bulgaria_eligible.json`

3. **Opportunit4u Scraper**
   - **Website:** [Opportunit4u](https://www.opportunit4u.com/)
   - **Features:**
     - Load more pagination
     - Bulgaria eligibility based on description analysis
     - Location extraction from titles
     - Multiple opportunity types
   - **Output:** `data/opportunit4u_data.json`

4. **Smokinya Scraper**
   - **Website:** [Smokinya](https://smokinya.com/)
   - **Features:**
     - Uses OpenAI GPT for intelligent data extraction
     - Advanced entity recognition
     - Automatic category classification
     - Smart location detection
   - **Output:** `data/smokinya_bulgaria_eligible.json`
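
This README does not spell out how "Bulgaria eligibility based on description analysis" is decided. One simple way to approach it is a phrase check over the description text. The sketch below is only an illustration: the function name `is_bulgaria_eligible` and the phrase list are assumptions, and the actual scrapers may use different rules.

```python
import re

# Illustrative phrases only; the project's real eligibility rules may differ.
ELIGIBILITY_PATTERNS = [
    r"\bbulgaria(n|ns)?\b",
    r"\beu member states?\b",
    r"\ball (eu|european) countries\b",
]


def is_bulgaria_eligible(description: str) -> bool:
    """Return True if any eligibility phrase appears in the description."""
    text = description.lower()
    return any(re.search(pattern, text) for pattern in ELIGIBILITY_PATTERNS)
```

A phrase check like this is cheap but coarse: it cannot tell "open to Bulgaria" from "not open to Bulgaria", so borderline cases would still need manual review or a smarter classifier.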

---

## 🚀 Features

- Scrapes opportunities from:
  - [Opportunit4u](https://www.opportunit4u.com/)
  - [European Youth Portal](https://youth.europa.eu/)
  - [Smokinya Foundation](https://smokinya.com/)
  - [Eurodesk Learning](https://programmes.eurodesk.eu/learning)

- Normalizes and merges different schemas into a single dataset.
- Outputs a combined JSON file: **`data/all_opportunities.json`**
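
The merge step can be pictured roughly as follows. This is a sketch, not the contents of `merge_all_json.py`: it assumes each per-scraper file holds a JSON array of objects, and the function name `merge_opportunities` is hypothetical.

```python
import json
from pathlib import Path


def merge_opportunities(input_paths, output_path):
    """Concatenate per-scraper JSON lists and renumber postNo sequentially."""
    merged = []
    for path in input_paths:
        path = Path(path)
        if not path.exists():
            continue  # a scraper that did not run simply contributes nothing
        with path.open(encoding="utf-8") as f:
            merged.extend(json.load(f))

    # Renumber so postNo is unique across all sources.
    for number, record in enumerate(merged, start=1):
        record["postNo"] = number

    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Bulgarian text readable in the output file.
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return merged
```

It could be called with the four files under `data/` as inputs and `data/all_opportunities.json` as the output path.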

---

## Proposed Unified Schema

```json
{
  "postNo": 1,
  "title": "string",
  "title_bg": "string",
  "city": "string or null",
  "country": "string or null",
  "description": "string",
  "description_bg": "string",
  "validUntil": "string (date or CURRENT)",
  "originalDate": "string (raw date as scraped)",
  "type": "string",
  "modeOfWork": "string",
  "categories": ["list of strings"],
  "applicationUrl": "string",
  "bannerImage": "string",
  "bulgariaEligible": "boolean (optional, default false)",
  "source": "string"
}
```

### Field Descriptions:
- **postNo:** Sequential number of the opportunity
- **title:** Opportunity title/name
- **title_bg:** Opportunity title/name in Bulgarian
- **city:** Location city (extracted from text)
- **country:** Location country (extracted from text)
- **description:** Full opportunity description
- **description_bg:** Full opportunity description in Bulgarian
- **validUntil:** Application deadline date (or `CURRENT`)
- **originalDate:** Raw date as scraped from the source
- **type:** Opportunity type (volunteering, event, scholarship, etc.)
- **modeOfWork:** remote, on-site, or hybrid
- **categories:** List of relevant categories
- **applicationUrl:** URL to apply/learn more
- **bannerImage:** URL of the banner image
- **bulgariaEligible:** Boolean indicating Bulgaria eligibility
- **source:** Domain name of the source website
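
To make the schema concrete, here is a minimal sketch of how a raw scraped record might be coerced into this shape. The names `DEFAULTS` and `normalize_record` are hypothetical, and the default values are assumptions based on the field list above.

```python
# Default value for every field of the unified schema (assumed defaults).
DEFAULTS = {
    "postNo": None,
    "title": "",
    "title_bg": "",
    "city": None,
    "country": None,
    "description": "",
    "description_bg": "",
    "validUntil": None,
    "originalDate": None,
    "type": "",
    "modeOfWork": "",
    "categories": [],
    "applicationUrl": "",
    "bannerImage": "",
    "bulgariaEligible": False,
    "source": "",
}


def normalize_record(raw: dict, source: str) -> dict:
    """Keep only known fields, fill in missing ones, and tag the source."""
    known = {key: value for key, value in raw.items() if key in DEFAULTS}
    record = {**DEFAULTS, **known}
    record["source"] = source
    # Copy the list so records never share the mutable default.
    record["categories"] = list(record["categories"] or [])
    record["bulgariaEligible"] = bool(record["bulgariaEligible"])
    return record
```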

### 🎯 Opportunity Types
- **volunteering:** Volunteer programs and opportunities
- **event:** Conferences, workshops, seminars
- **scholarship:** Funding and financial aid
- **competition:** Contests and challenges
- **exchange:** Cultural and youth exchanges
- **erasmus:** Erasmus+ programs
- **training:** Skill development programs
- **internship:** Professional internships

### 🏢 Categories
The system recognizes these categories:
- Programming, Business, Marketing, Journalism
- Trade, Psychology, Cinema, Finance
- Design, Music, Social Causes, Medicine
- Ecology, Languages, Career Guidance, Science
- Politics, Architecture, Health, Environment

## ⚙️ Configuration Files

### `category_keywords.json`
```json
{
"Programming": ["programming", "coding", "software", "developer"],
"Business": ["business", "entrepreneurship", "startup"],
...
}
```
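
Keyword matching against this file could look something like the sketch below. The function name `detect_categories` is hypothetical, and whole-word, case-insensitive matching is an assumption about how the scrapers compare text.

```python
import re


def detect_categories(text: str, category_keywords: dict) -> list:
    """Return every category that has at least one keyword in the text."""
    lowered = text.lower()
    matches = []
    for category, keywords in category_keywords.items():
        for keyword in keywords:
            # \b anchors avoid matching a keyword inside a longer word.
            if re.search(rf"\b{re.escape(keyword.lower())}\b", lowered):
                matches.append(category)
                break  # one hit is enough for this category
    return matches


# Example with the two entries shown above:
keywords = {
    "Programming": ["programming", "coding", "software", "developer"],
    "Business": ["business", "entrepreneurship", "startup"],
}
print(detect_categories("A coding bootcamp for startup founders", keywords))
```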

### `country.json`
```json
{
"countries": ["Bulgaria", "Germany", "France", ...]
}
```

### `world_cities.json`
```json
{
"cities": ["Sofia", "Berlin", "Paris", ...]
}
```
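
The city and country lists lend themselves to a simple lookup when extracting locations from titles or descriptions. The following is a sketch under that assumption; `find_first_match` and `extract_location` are hypothetical helper names.

```python
import re


def find_first_match(text: str, names: list):
    """Return the first name in `names` found in the text as a whole word, else None."""
    for name in names:
        if re.search(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE):
            return name
    return None


def extract_location(text: str, cities: list, countries: list) -> dict:
    """Look up a city and a country independently; either may be None."""
    return {
        "city": find_first_match(text, cities),
        "country": find_first_match(text, countries),
    }
```

Returning `None` for a missing value lines up with the `"string or null"` convention used for `city` and `country` in the unified schema.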

## 🔧 Technical Details

### Dependencies
- **selenium:** Web browser automation
- **undetected-chromedriver:** Anti-detection Chrome driver
- **openai:** AI-powered data extraction (Smokinya scraper)
- **beautifulsoup4:** HTML parsing (backup)

### Browser Requirements
- Chrome browser installed
- Automatic ChromeDriver management via undetected-chromedriver

### Error Handling
- Individual scraper failures don't stop the entire system
- Detailed error logging for debugging
- Automatic retry mechanisms
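
The pattern described here, isolating each scraper and retrying on failure, can be sketched as below. This is not the code in `main.py`; the function names, the number of attempts, and the delay are all assumptions.

```python
import logging
import time

logger = logging.getLogger("opportunity_scraper")


def run_with_retries(name, scrape, attempts=3, delay_seconds=5.0):
    """Call scrape() up to `attempts` times; return its result, or None if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return scrape()
        except Exception:
            # logger.exception records the full traceback for debugging.
            logger.exception("%s failed (attempt %d/%d)", name, attempt, attempts)
            if attempt < attempts:
                time.sleep(delay_seconds)
    return None


def run_all(scrapers, **retry_options):
    """Run every scraper; one failure never prevents the others from running."""
    return {
        name: run_with_retries(name, scrape, **retry_options)
        for name, scrape in scrapers.items()
    }
```

Here `scrapers` would be a dict mapping a label to a zero-argument callable, and a scraper that keeps failing ends up with `None` in the results instead of aborting the run.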

## 🚨 Important Notes
- **CAPTCHA Handling:** Eurodesk may show a CAPTCHA, which has to be solved manually
- **Rate Limiting:** Built-in delays between requests to be respectful
- **API Key:** OpenAI API key required for Smokinya scraper
- **Browser Windows:** Scrapers open visible browser windows for interaction

## 📈 Output Management
- Each scraper saves to its own JSON file
- Data is automatically deduplicated
- Only Bulgaria-eligible opportunities are saved
- Consistent data structure across all sources
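
The README does not say which field deduplication is keyed on. A plausible sketch, assuming `applicationUrl` identifies an opportunity, is shown below; the function name `deduplicate` is hypothetical.

```python
def deduplicate(records, key="applicationUrl"):
    """Keep the first record per key value; records without the key are all kept."""
    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        if value is not None:
            # Treat URLs that differ only by a trailing slash as the same.
            normalized = value.strip().rstrip("/")
            if normalized in seen:
                continue
            seen.add(normalized)
        unique.append(record)
    return unique
```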

## 🆘 Troubleshooting
### Common Issues:
- **Import errors:** Make sure you're in the project root directory
- **Chrome not found:** Install Google Chrome browser
- **API key errors:** Check `config/config.py` file exists with a valid OpenAI key
- **CAPTCHA blocks:** Manually solve CAPTCHA when the browser opens
