Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cinex10/real-time-job-board-etl-pipeline
An automated job scraping and notification system that fetches job listings, processes them asynchronously, and sends customized email digests based on user preferences.
- Host: GitHub
- URL: https://github.com/cinex10/real-time-job-board-etl-pipeline
- Owner: Cinex10
- Created: 2024-11-06T09:07:30.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-11-06T09:12:30.000Z (3 months ago)
- Last Synced: 2024-11-14T02:10:08.990Z (2 months ago)
- Language: Python
- Size: 35.2 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Job Board Pipeline
An automated job scraping and notification system powered by Apache Kafka. Fetches job listings asynchronously, processes them through a streaming pipeline, and sends customized email digests based on user preferences.
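As a rough illustration of the fetch, stream, filter, and digest flow described above, the sketch below uses an in-memory `asyncio.Queue` as a stand-in for the Kafka topic. The `Job` class, `PREFERENCES`, the sample listings, and the matching rule are all assumptions made for illustration; they are not taken from the project's code.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    title: str
    location: str


# Hypothetical user preferences (assumption, not the project's real schema).
PREFERENCES = {"title_keywords": ["python"], "locations": ["remote"]}

# Stubbed listings standing in for a response from the job source API.
SAMPLE_JOBS = [
    Job("Senior Python Developer", "Remote"),
    Job("Java Engineer", "Berlin"),
    Job("Python Data Engineer", "New York"),
]


async def fetch_jobs(queue):
    """Producer: push listings onto the queue (stand-in for a Kafka topic)."""
    for job in SAMPLE_JOBS:
        await asyncio.sleep(0)  # yield control, as a real network call would
        await queue.put(job)
    await queue.put(None)  # sentinel: no more listings


def matches(job, prefs):
    """True if the title contains a keyword and the location is accepted."""
    title = job.title.lower()
    has_keyword = any(k in title for k in prefs["title_keywords"])
    return has_keyword and job.location.lower() in prefs["locations"]


async def build_digest(queue, prefs):
    """Consumer: collect titles of listings that match the preferences."""
    digest = []
    while (job := await queue.get()) is not None:
        if matches(job, prefs):
            digest.append(job.title)
    return digest


async def run_pipeline():
    queue = asyncio.Queue()
    # Run producer and consumer concurrently; keep only the consumer's result.
    _, digest = await asyncio.gather(
        fetch_jobs(queue), build_digest(queue, PREFERENCES)
    )
    return digest


digest = asyncio.run(run_pipeline())
print(digest)
```

In the actual project a Kafka producer and consumer would take the place of the queue, and the resulting list would be rendered into an email template rather than printed.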
## 🚀 Features
- **Streaming Architecture**: Real-time job processing using Apache Kafka
- **Async Job Fetching**: Efficiently fetches jobs from source
- **Email Notifications**:
  - Daily digest of matching jobs
  - Responsive HTML email templates
  - Support for multiple email providers
- **Fault Tolerance**: Built-in message persistence and retry mechanisms

## 🛠️ Tech Stack
- Python 3.9+
- Apache Kafka
- kafka-python
- pydantic
- Jinja2

## 📋 Prerequisites
- Docker Engine 20.10+
- Docker Compose V2
- Git

## ⚙️ Configuration
1. **Kafka and SMTP Settings** (.env file):
```env
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
KAFKA_TOPIC=job_listings
KAFKA_GROUP_ID=job_processor

SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
[email protected]
SMTP_PASSWORD=your-app-password
```

## 🚀 Quick Start
1. Clone the repository:
```bash
git clone https://github.com/cinex10/real-time-job-board-etl-pipeline.git
cd real-time-job-board-etl-pipeline
```

2. Configure environment variables:
```bash
cp .env.example .env
# Edit .env with your settings
```

3. Start the services:
```bash
docker compose up -d
```

4. Check services status:
```bash
docker compose ps
```

## ⚙️ Configuration
1. **Environment Variables** (.env):
```env
# Kafka
KAFKA_BOOTSTRAP_SERVERS=kafka:9092
KAFKA_TOPIC=job_listings
KAFKA_GROUP_ID=job_processor

# SMTP
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
[email protected]
SMTP_PASSWORD=your-app-password

# Job Source
API_URL=your-job-api-url
API_KEY=your-api-key

# Schedule
SCHEDULE_START_HOUR=8
SCHEDULE_END_HOUR=1
SCHEDULE_INTERVAL_MINUTES=5

# Notification
[email protected],[email protected]
```

2. **Job Matching Criteria** (config/criteria.json):
```json
{
  "title_keywords": ["python", "data engineer"],
  "required_skills": ["python", "sql"],
  "locations": ["remote", "new york"]
}
```

## 🐳 Docker Commands
Start services:
```bash
# Start all services
docker compose up -d

# Start specific service
docker compose up -d pipeline

# View logs
docker compose logs -f pipeline

# Stop services
docker compose down

# Rebuild after changes
docker compose build pipeline
docker compose up -d pipeline
```

Check Kafka:
```bash
# List topics
docker compose exec kafka kafka-topics --list --bootstrap-server localhost:9092

# Create topic manually if needed
docker compose exec kafka kafka-topics \
  --create \
  --bootstrap-server localhost:9092 \
  --replication-factor 1 \
  --partitions 3 \
  --topic job_listings
```

## 📊 Monitoring
View logs:
```bash
# All services
docker compose logs -f

# Specific service
docker compose logs -f pipeline
docker compose logs -f kafka
```

Check containers:
```bash
# List containers
docker compose ps

# Container stats
docker stats
```

## 🐛 Troubleshooting
1. **Kafka Connection Issues**:
```bash
# Check Kafka is running
docker compose ps kafka

# Check Kafka logs
docker compose logs kafka

# Verify topic
docker compose exec kafka kafka-topics \
  --describe \
  --bootstrap-server localhost:9092 \
  --topic job_listings
```

2. **Pipeline Issues**:
```bash
# Check pipeline logs
docker compose logs pipeline

# Restart pipeline
docker compose restart pipeline

# Check environment variables
docker compose config
```

## 🔄 Development
1. Local development with Docker:
```bash
# Build with development configuration
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d

# Hot reload for code changes
docker compose -f docker-compose.yml -f docker-compose.dev.yml up --build pipeline
```

2. Update dependencies:
```bash
# Update requirements.txt
pip freeze > docker/requirements.txt

# Rebuild container
docker compose build pipeline
```

## 📝 Maintenance
Backup data:
```bash
# Create backup directory
mkdir -p backups

# Save Kafka topic metadata (partitions and configuration, not message data)
docker compose exec kafka kafka-topics \
  --describe \
  --bootstrap-server localhost:9092 \
  > backups/topics_$(date +%Y%m%d).txt
```

Clean up:
```bash
# Stop and remove the project's containers and networks
docker compose down

# Also remove volumes
docker compose down -v

# Clean all
docker compose down -v --rmi all --remove-orphans
```

## 🔒 Security Notes
1. Always use `.env` for sensitive data
2. Never commit `.env` to repository
3. Use secrets management in production
4. Regularly update dependencies
5. Monitor container logs for suspicious activity

## 🤝 Contributing
1. Fork the repository
2. Create feature branch
3. Commit changes
4. Push to branch
5. Create Pull Request

## 📝 License
[MIT](https://choosealicense.com/licenses/mit/)
---
*Built with Docker for easy deployment and scalability.*