https://github.com/mazzasaverio/data-job-parser
Job posting parser with structured outputs
data-engineering job-parser openai pydantic-v2 structured-outputs uv
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/mazzasaverio/data-job-parser
- Owner: mazzasaverio
- License: mit
- Created: 2025-05-25T13:16:20.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-05-26T07:37:10.000Z (about 1 year ago)
- Last Synced: 2025-06-23T00:39:06.926Z (12 months ago)
- Topics: data-engineering, job-parser, openai, pydantic-v2, structured-outputs, uv
- Language: Python
- Homepage:
- Size: 177 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Data Job Parser
Extract structured data from job postings using OpenAI's structured output capabilities.
## Features
- 🎯 **Smart Extraction**: Extract structured information from any job posting URL
- 🧠 **AI-Powered**: Uses OpenAI's GPT models with structured output for accurate parsing
- 📊 **Comprehensive Data**: Covers the main aspects of a job posting, including salary, skills, and requirements
- 🌐 **Advanced Scraping**: Playwright-based scraping handles JavaScript-heavy sites
- 💾 **File Storage**: Save as markdown and JSON with SHA-1 hash filenames for deduplication
- 🔄 **Reliability**: Automatic retries with exponential backoff for robust operation
- 📝 **Observability**: Detailed logging with Logfire integration
- 🐍 **Modern Python**: Full type hints and Python 3.8+ support
## Installation
```bash
pip install data-job-parser
```
After installation, install Playwright browsers:
```bash
playwright install chromium
```
## Quick Start
### Basic Usage
```python
from data_job_parser import JobPostingParser
# Initialize with OpenAI API key
parser = JobPostingParser(api_key="your-openai-api-key")
# Parse a job posting
job_data = parser.parse("https://example.com/job-posting")
# Access structured data
print(f"Title: {job_data.title}")
print(f"Company: {job_data.company}")
print(f"Location: {job_data.location.city}, {job_data.location.country}")
print(f"Salary: {job_data.salary.min_amount}-{job_data.salary.max_amount} {job_data.salary.currency}")
print(f"Skills: {', '.join(job_data.required_skills)}")
```
### Save Files
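```python
import asyncio
from data_job_parser import JobPostingParser

parser = JobPostingParser(api_key="your-openai-api-key")

async def main():
    # Parse and save both markdown and JSON
    job_data, markdown_path, json_path = await parser.parse_async(
        "https://jobs.pradagroup.com/job/Milan-Data-Engineer/1199629101/",
        save_markdown=True,
        save_json=True,
    )
    print(f"Markdown: {markdown_path}")
    print(f"JSON: {json_path}")

# `await` needs a running event loop, so wrap the call in asyncio.run()
asyncio.run(main())
```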
### Batch Processing
```python
from data_job_parser import JobPostingParser

parser = JobPostingParser(api_key="your-api-key")
urls = ["https://job1.com", "https://job2.com", "https://job3.com"]

for url in urls:
    try:
        job_data, md_path, json_path = parser.parse(
            url,
            save_markdown=True,
            save_json=True,
        )
        print(f"✅ {job_data.title} at {job_data.company}")
    except Exception as e:
        # Keep going so one bad URL does not abort the batch
        print(f"❌ Failed to parse {url}: {e}")
```
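For longer URL lists, the sequential loop above could be run concurrently instead. The following is a minimal sketch and not part of the library: `parse_many` and `fake_parse` are hypothetical names, and `fake_parse` stands in for `parser.parse_async` so the example runs without network access or an API key.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def parse_many(
    parse_one: Callable[[str], Awaitable[object]],
    urls: List[str],
    max_concurrency: int = 3,
) -> List[Tuple[str, object]]:
    """Run parse_one over urls with bounded concurrency.

    Returns (url, result) pairs in input order. A failed URL yields its
    exception object instead of a result, so one bad page does not stop
    the rest of the batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> object:
        # At most max_concurrency calls are in flight at any time
        async with semaphore:
            return await parse_one(url)

    results = await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )
    return list(zip(urls, results))


async def fake_parse(url: str) -> str:
    """Hypothetical stand-in for parser.parse_async (no network, no API key)."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise ValueError(f"cannot parse {url}")
    return f"parsed:{url}"
```

With the real parser, `parse_one` might be `lambda url: parser.parse_async(url, save_markdown=True, save_json=True)`. Keeping `max_concurrency` small is a cautious default, since each call presumably launches a browser page and an OpenAI request.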
## Configuration
### Environment Variables
Create a `.env` file in your project root:
```bash
# Required
OPENAI_API_KEY=your-openai-api-key
# Optional - Logging
LOGFIRE_TOKEN=your-logfire-token
# Optional - Model Configuration
OPENAI_MODEL=gpt-4-turbo-preview
# Optional - Playwright Settings
PLAYWRIGHT_HEADLESS=true
PLAYWRIGHT_TIMEOUT=60000
```
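To make the roles of these variables concrete, here is a small sketch of how an application might read them. The `Settings` class and `load_settings` function are hypothetical and are not the library's own configuration code. The fallback values are assumptions taken from the example file above, and the timeout is assumed to be in milliseconds, which is the unit Playwright uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in the .env file."""

    openai_api_key: str
    openai_model: Optional[str]
    logfire_token: Optional[str]
    playwright_headless: bool
    playwright_timeout_ms: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (os.environ by default)."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=api_key,
        openai_model=env.get("OPENAI_MODEL"),    # optional
        logfire_token=env.get("LOGFIRE_TOKEN"),  # optional
        # Assumed fallbacks: the values shown in the example .env file
        playwright_headless=env.get("PLAYWRIGHT_HEADLESS", "true").lower() == "true",
        playwright_timeout_ms=int(env.get("PLAYWRIGHT_TIMEOUT", "60000")),
    )
```

Note that Python does not read a `.env` file into `os.environ` on its own. Unless the library loads the file itself, the variables need to be exported in the shell or loaded with a dotenv helper first.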
### API Key Setup
**Option 1: Parameter**
```python
parser = JobPostingParser(api_key="your-api-key")
```
**Option 2: Environment Variable**
```bash
export OPENAI_API_KEY="your-api-key"
```
```python
parser = JobPostingParser() # Auto-loads from env
```
### Model Selection
```python
# Use a different OpenAI model
parser = JobPostingParser(
    api_key="your-api-key",
    model="gpt-4o",  # or gpt-3.5-turbo, etc.
)
```
### File Storage
Files are saved with SHA-1 hash filenames to prevent duplicates:
```
data/
├── markdown/
│   └── a1b2c3d4e5f6789.md
└── json/
    └── a1b2c3d4e5f6789.json
```
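The deduplication idea can be sketched as follows. This is an illustration rather than the library's code: `output_paths` is a hypothetical helper, and it assumes the hash is computed over the URL string, which the README does not state (the library may hash the page content instead).

```python
import hashlib
from pathlib import Path
from typing import Tuple


def output_paths(url: str, base_dir: str = "data") -> Tuple[Path, Path]:
    """Return (markdown_path, json_path) for a URL.

    Assumption: the hash input is the URL string. The same input always
    gives the same filename, so re-parsing overwrites instead of duplicating.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()  # 40 hex characters
    base = Path(base_dir)
    return base / "markdown" / f"{digest}.md", base / "json" / f"{digest}.json"
```

A full SHA-1 hex digest is 40 characters long, so real filenames would be longer than the shortened names in the tree above.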
## Data Model
The parser extracts comprehensive job information:
**Core Information**
- Title, company, location, description
- Work type (full-time, part-time, contract)
- Work mode (remote, hybrid, on-site)
- Experience level required
**Compensation & Benefits**
- Salary range with currency
- Benefits and perks
- Stock options, bonuses
**Skills & Requirements**
- Required technical skills
- Preferred/nice-to-have skills
- Education requirements
- Years of experience needed
**Additional Details**
- Team size and department
- Application process
- Company culture information
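The field groups above map naturally onto nested models. As a rough sketch, with standard-library dataclasses standing in for the library's Pydantic models: the attribute names `title`, `company`, `location.city`, `location.country`, `salary.min_amount`, `salary.max_amount`, `salary.currency`, and `required_skills` come from the Quick Start example, while the class names, types, optionality, and the `format_salary` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Location:
    city: Optional[str] = None
    country: Optional[str] = None


@dataclass
class Salary:
    min_amount: Optional[float] = None
    max_amount: Optional[float] = None
    currency: Optional[str] = None


@dataclass
class JobPosting:  # hypothetical name for the parsed result type
    title: str
    company: str
    location: Location = field(default_factory=Location)
    salary: Optional[Salary] = None  # assumption: postings may omit salary
    required_skills: List[str] = field(default_factory=list)


def format_salary(job: JobPosting) -> str:
    """Render the salary range, tolerating a missing or partial salary."""
    s = job.salary
    if s is None or s.min_amount is None or s.max_amount is None:
        return "not specified"
    return f"{s.min_amount:.0f}-{s.max_amount:.0f} {s.currency or ''}".strip()
```

Many postings do not list a salary, so guarding attribute access in this way is a reasonable precaution before printing fields as in the Quick Start.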
## Error Handling
The parser includes robust error handling:
```python
from data_job_parser import JobPostingParser
from data_job_parser.exceptions import ParsingError, ScrapingError

parser = JobPostingParser(api_key="your-api-key")

try:
    job_data = parser.parse("https://example.com/job")
except ScrapingError as e:
    print(f"Failed to scrape URL: {e}")
except ParsingError as e:
    print(f"Failed to parse content: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```
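The Features list mentions automatic retries with exponential backoff. The sketch below shows the general technique only and is not the library's internal implementation. `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument is injectable so the function can be tested without waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

If application-level retries are wanted on top of the built-in ones, a call could look like `retry_with_backoff(lambda: parser.parse(url), retry_on=(ScrapingError,))`. Retrying only on scraping failures is a design choice here, since a page that loads but cannot be parsed is unlikely to succeed on a second attempt.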
## Development
### Setup Development Environment
```bash
# Clone repository
git clone https://github.com/mazzasaverio/data-job-parser.git
cd data-job-parser
# Install with uv (recommended)
uv sync --dev
# Or with pip
pip install -e ".[dev]"
```
### Run Tests
```bash
# Run all tests
uv run pytest
# With coverage report
uv run pytest --cov=src/data_job_parser --cov-report=html
# Run specific test file
uv run pytest tests/test_parser.py -v
```
### Code Quality
```bash
# Format code
uv run ruff format .
# Lint code
uv run ruff check .
# Type checking
uv run mypy src/
```
### Release Process
1. **Update version** in both files:
   - `src/data_job_parser/__init__.py`
   - `pyproject.toml`
2. **Run quality checks**:
```bash
uv run pytest
uv run ruff check .
uv run mypy src/
```
3. **Commit and tag**:
```bash
git add .
git commit -m "chore: bump version to X.Y.Z"
git push origin master
git tag vX.Y.Z
git push origin vX.Y.Z
```
4. **Automated deployment**: GitHub Actions will automatically:
   - Run tests
   - Build package
   - Publish to PyPI
   - Create GitHub release
## Contributing
We welcome contributions! Please follow these steps:
1. **Fork** the repository
2. **Create** feature branch: `git checkout -b feature/amazing-feature`
3. **Make** your changes with tests
4. **Run** quality checks: `uv run pytest && uv run ruff check .`
5. **Commit** changes: `git commit -m 'feat: add amazing feature'`
6. **Push** branch: `git push origin feature/amazing-feature`
7. **Open** a Pull Request
### Development Guidelines
- Write tests for new features
- Follow existing code style
- Update documentation as needed
- Use conventional commit messages
## Requirements
- **Python**: 3.8+
- **OpenAI API Key**: Required for parsing
- **Internet Connection**: For web scraping and API calls
## Changelog
See [CHANGELOG.md](CHANGELOG.md) for version history and changes.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- **OpenAI** for structured output capabilities
- **Playwright** for robust web scraping
- **Pydantic** for data validation
- **Logfire** for observability
---
**Made with ❤️ by [Saverio Mazza](https://github.com/mazzasaverio)**