{"id":38683031,"url":"https://github.com/michellepellon/jobx","last_synced_at":"2026-01-17T10:17:09.659Z","repository":{"id":305967147,"uuid":"1015094112","full_name":"michellepellon/jobx","owner":"michellepellon","description":"A modern, powerful job scraper for LinkedIn, Indeed and beyond.","archived":false,"fork":false,"pushed_at":"2025-09-12T00:41:37.000Z","size":462,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-12T03:02:19.301Z","etag":null,"topics":["compensation","data","data-analysis","indeed","indeed-scraping","jobs","jobsearch","linkedin","linkedin-scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michellepellon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-07T02:03:09.000Z","updated_at":"2025-09-12T00:41:40.000Z","dependencies_parsed_at":"2025-07-23T00:16:30.036Z","dependency_job_id":"4fac906c-bf4a-4bf8-bbc8-53cabc641681","html_url":"https://github.com/michellepellon/jobx","commit_stats":null,"previous_names":["michellepellon/jobx"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/michellepellon/jobx","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michellepellon%2Fjobx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michellepellon%2Fjobx/tags","releases_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/michellepellon%2Fjobx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michellepellon%2Fjobx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michellepellon","download_url":"https://codeload.github.com/michellepellon/jobx/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michellepellon%2Fjobx/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28506040,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T06:57:29.758Z","status":"ssl_error","status_checked_at":"2026-01-17T06:56:03.931Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compensation","data","data-analysis","indeed","indeed-scraping","jobs","jobsearch","linkedin","linkedin-scraper"],"created_at":"2026-01-17T10:17:09.533Z","updated_at":"2026-01-17T10:17:09.613Z","avatar_url":"https://github.com/michellepellon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# jobx\n\nA modern, powerful job scraper for LinkedIn, Indeed and beyond.\n\n## ✨ Features\n\n- 🚀 **Concurrent scraping** from multiple job boards\n- 🎯 **Advanced filtering** by location, salary, job type, and more\n- 🔍 **Confidence scoring** to rank job relevance based on search terms\n- 📊 **Pandas integration** for data analysis 
and export\n- 💾 **Multiple output formats** including CSV and Apache Parquet\n- 🔒 **Type-safe** with full mypy compatibility\n- 📈 **Structured logging** with JSON output support\n- ⚡ **High performance** with async/await patterns\n- 🛡️ **Robust error handling** and retry mechanisms\n\n## 🚀 Quick Start\n\n### Installation\n\n```bash\n# Using uv (recommended)\nuv add jobx\n\n# Using pip\npip install jobx\n```\n\n### Basic Usage\n\n```python\nfrom jobx import scrape_jobs\n\n# Search for Python developer jobs\njobs = scrape_jobs(\n    search_term=\"python developer\",\n    location=\"New York, NY\",\n    results_wanted=50\n)\n\nprint(f\"Found {len(jobs)} jobs\")\nprint(jobs[[\"title\", \"company\", \"location\", \"confidence_score\"]].head())\n\n# Save to different formats\njobs.to_csv(\"jobs.csv\", index=False)\njobs.to_parquet(\"jobs.parquet\", index=False)\n```\n\n### Command Line Usage\n\n```bash\n# Save as CSV (default)\njobx -q \"data engineer\" -l \"San Francisco\" -o results.csv\n\n# Save as Parquet for better performance and compression\njobx -q \"data engineer\" -l \"San Francisco\" -o results.parquet -f parquet\n\n# Scrape from specific sites and save as Parquet\njobx -s linkedin indeed -q \"ML engineer\" -l \"Remote\" -n 100 -o ml_jobs.parquet -f parquet\n\n# Filter by confidence score (0.0-1.0) to get only highly relevant results\njobx -q \"python developer\" -l \"New York\" -c 0.7 -o relevant_jobs.csv\n\n# Track SERP position and identify competitor postings vs your company\njobx -q \"software engineer\" -l \"Seattle\" --track-serp --my-company \"Acme Corp\" \"Acme Inc\" -o serp_tracked.csv\n\n# Use environment variable for company names\nexport JOBX_MY_COMPANY=\"Acme Corp,Acme Inc\"\njobx -q \"data scientist\" -l \"Remote\" --track-serp -o tracked_jobs.parquet -f parquet\n```\n\n## 📊 Market Analysis Tool\n\n### Comprehensive Compensation Analysis\n\njobx includes a powerful market analysis tool for analyzing compensation data across geographic regions 
with support for:\n- **Multi-location job searches** across configured markets and centers\n- **Center-level payband comparison** with actual market data\n- **Tufte-style visualization** comparing your paybands to market statistics\n- **Statistical analysis** with percentiles, IQR, and distribution metrics\n\n```bash\n# Run market analysis with visualization\nsource venv/bin/activate\npython -m jobx.market_analysis.cli config.yaml --visualize --min-sample 10 -v -o output_dir\n\n# Options:\n#   --visualize         Generate compensation comparison charts\n#   --visualize-only    Only generate charts from existing data (skip job searches)\n#   --min-sample N      Minimum salary data points required (default: 30)\n#   -v, --verbose       Show detailed progress\n#   -o, --output        Output directory name\n```\n\n### Configuration Structure\n\nThe market analysis tool uses a YAML configuration with hierarchical structure:\n\n```yaml\nroles:\n  - id: rbt\n    name: \"Registered Behavior Technician (RBT)\"\n    pay_type: hourly\n    search_terms: [\"RBT\", \"behavior technician\", \"ABA therapist\"]\n  - id: bcba\n    name: \"Board Certified Behavior Analyst (BCBA)\"\n    pay_type: salary\n    search_terms: [\"BCBA\", \"behavior analyst\", \"clinical supervisor\"]\n\nregions:\n  - name: \"Southeast\"\n    markets:\n      - name: \"Florida\"\n        centers:\n          - name: \"Miami\"\n            location: \"Miami, FL\"\n            paybands:\n              rbt: {min: 18.00, max: 22.00}\n              bcba: {min: 70000, max: 85000}\n          - name: \"Orlando\"\n            location: \"Orlando, FL\"\n            paybands:\n              rbt: {min: 17.00, max: 21.00}\n              bcba: {min: 68000, max: 82000}\n```\n\n### Visualization Output\n\nThe tool generates Tufte-style comparison charts showing:\n- **Your payband range** (light green background area)\n- **Market IQR** (25th-75th percentile box in gray)\n- **Market median** (bold black line)\n- **Min/max 
whiskers** from actual job data\n- **Gap analysis** showing if market median is within/above/below your band\n\n## 🎯 Advanced Usage\n\n### Confidence Scoring\n\njobx includes a confidence scoring system that ranks job relevance based on how well they match your search terms and location:\n\n```python\nfrom jobx import scrape_jobs\n\njobs = scrape_jobs(\n    search_term=\"python developer\",\n    location=\"New York\",\n    results_wanted=100\n)\n\n# Jobs are automatically sorted by confidence score (highest first)\nprint(jobs[['title', 'company', 'confidence_score']].head(10))\n\n# Filter for highly relevant jobs (70%+ confidence)\nrelevant_jobs = jobs[jobs['confidence_score'] \u003e= 0.7]\nprint(f\"Found {len(relevant_jobs)} highly relevant jobs out of {len(jobs)} total\")\n\n# Analyze confidence distribution\nprint(f\"Average confidence: {jobs['confidence_score'].mean():.2%}\")\nprint(f\"Jobs with 80%+ confidence: {len(jobs[jobs['confidence_score'] \u003e= 0.8])}\")\n```\n\nThe confidence score (0.0-1.0) is calculated based on:\n- **Title match (50%)**: How well the job title matches your search terms\n- **Description match (30%)**: Keyword matching in the job description\n- **Location match (20%)**: Proximity to your specified location (remote jobs always score 1.0)\n\n### SERP Position Tracking\n\nTrack where job postings appear in search results (page and rank) to understand visibility and compare your company's postings against competitors:\n\n```python\nfrom jobx import scrape_jobs\n\n# Track SERP positions and identify your company's postings\njobs = scrape_jobs(\n    search_term=\"machine learning engineer\",\n    location=\"Boston\",\n    track_serp=True,\n    my_company_names=[\"Acme Corp\", \"Acme Inc.\"],\n    results_wanted=100\n)\n\n# Analyze SERP visibility\nprint(f\"Average position: {jobs['serp_absolute_rank'].mean():.1f}\")\nprint(f\"Page 1 listings: {len(jobs[jobs['serp_page_index'] == 0])}\")\nprint(f\"Sponsored posts: 
{jobs['serp_is_sponsored'].sum()}\")\n\n# Compare your company vs competitors\nmy_company_jobs = jobs[jobs['is_my_company'] == True]\ncompetitor_jobs = jobs[jobs['is_my_company'] == False]\n\nprint(f\"Your company: {len(my_company_jobs)} postings\")\nprint(f\"Average rank: {my_company_jobs['serp_absolute_rank'].mean():.1f}\")\nprint(f\"Competitors: {len(competitor_jobs)} postings\")\n```\n\nSERP tracking adds these columns to your results:\n- **serp_page_index**: 0-based page number (0 = first page)\n- **serp_index_on_page**: Position on the page (0-based)\n- **serp_absolute_rank**: Overall rank across all pages (1-based)\n- **serp_page_size_observed**: Number of organic results on the page\n- **serp_is_sponsored**: Whether the posting is promoted/sponsored\n- **company_normalized**: Normalized company name for matching\n- **is_my_company**: Whether it matches your configured company names\n\n### Multi-Site Concurrent Scraping\n\n```python\nfrom jobx import scrape_jobs\nfrom jobx.model import Site, JobType\n\njobs = scrape_jobs(\n    site_name=[Site.LINKEDIN, Site.INDEED],\n    search_term=\"software engineer\",\n    location=\"San Francisco, CA\",\n    distance=50,\n    job_type=JobType.FULL_TIME,\n    is_remote=True,\n    easy_apply=True,  # LinkedIn only\n    results_wanted=200,\n    enforce_annual_salary=True,\n    linkedin_fetch_description=True,\n    verbose=2\n)\n```\n\n### Salary Analysis and Filtering\n\n```python\n# Filter high-paying remote positions\nhigh_paying_remote = jobs[\n    (jobs['is_remote'] == True) \u0026\n    (jobs['min_amount'] \u003e= 120000) \u0026\n    (jobs['currency'] == 'USD') \u0026\n    (jobs['interval'] == 'yearly')\n]\n\n# Group by company and analyze\ncompany_stats = high_paying_remote.groupby('company').agg({\n    'min_amount': ['count', 'mean'],\n    'max_amount': 'mean',\n    'title': lambda x: ', '.join(x.unique()[:3])\n}).round(0)\n\nprint(company_stats)\n\n# Save filtered results as Parquet for efficient 
storage\nhigh_paying_remote.to_parquet(\"high_paying_remote_jobs.parquet\", \n                              compression='snappy',\n                              index=False)\n```\n\n## 📊 Data Structure\n\nEach job entry contains comprehensive information:\n\n```python\n# Core fields\njob_columns = [\n    'title', 'company', 'location', 'job_url', 'description',\n    'date_posted', 'is_remote', 'job_type', 'site',\n    'min_amount', 'max_amount', 'currency', 'interval',\n    'salary_source', 'emails', 'easy_apply', 'confidence_score'\n]\n\n# Location details\nlocation_fields = ['city', 'state', 'country']\n\n# Compensation analysis\nsalary_analysis = jobs.groupby('site').agg({\n    'min_amount': ['count', 'mean', 'median'],\n    'max_amount': ['mean', 'median'],\n    'is_remote': 'sum'\n})\n```\n\n## 🔧 Configuration\n\n### Environment Variables\n\n```bash\n# Enable structured JSON logging\nexport JOBX_LOG_JSON=true\n\n# Set logging level\nexport JOBX_LOG_LEVEL=DEBUG\n\n# Configure logging context\nexport JOBX_LOG_CONTEXT=true\n```\n\n## 🛠️ Development\n\n### Prerequisites\n\n- Python 3.10+\n- [uv](https://docs.astral.sh/uv/) (recommended) or pip\n\n### Development Setup\n\n```bash\n# Clone the repository\ngit clone https://github.com/michellepellon/jobx.git\ncd jobx\n\n# Install with development dependencies\nuv pip install -e \".[dev]\"\n\n# Run tests\nuv run pytest\n\n# Run linting\nuv run ruff check jobx/\nuv run ruff format jobx/\n\n# Run type checking\nuv run mypy jobx/\n\n# Run security scanning\nuv run bandit -r jobx/\n```\n\n### Running Tests\n\n```bash\n# Run all tests\nuv run pytest\n\n# Run with coverage\nuv run pytest --cov=jobx --cov-report=html\n\n# Run integration tests only\nuv run pytest -m integration\n\n# Run excluding slow tests\nuv run pytest -m \"not slow\"\n```\n\n### Code Quality\n\nThe project maintains high code quality standards:\n\n- **Formatting**: `ruff format` and `black`\n- **Linting**: `ruff` with comprehensive rule set\n- **Type 
checking**: `mypy` in strict mode\n- **Security**: `bandit` and `safety`\n- **Testing**: `pytest` with 90%+ coverage requirement\n\n## 🤝 Contributing\n\nContributions are welcome! Please read our [Contributing Guide](docs/contributing.md) for details on:\n\n- Code of conduct\n- Development workflow\n- Testing requirements\n- Documentation standards\n\n### Quick Contribution Setup\n\n```bash\n# Fork and clone the repo\ngit clone https://github.com/yourusername/jobx.git\ncd jobx\n\n# Create a feature branch\ngit checkout -b feature/your-feature-name\n\n# Install development dependencies\nuv pip install -e \".[dev]\"\n\n# Make your changes and run tests\nuv run pytest\nuv run ruff check jobx/\nuv run mypy jobx/\n\n# Commit and push\ngit commit -m \"feat: add your feature description\"\ngit push origin feature/your-feature-name\n```\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 📞 Support\n\n- 🐛 [Report bugs](https://github.com/michellepellon/jobx/issues)\n- 💬 [Request features](https://github.com/michellepellon/jobx/issues)\n- 📧 Contact: mgracepellon@gmail.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichellepellon%2Fjobx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichellepellon%2Fjobx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichellepellon%2Fjobx/lists"}