{"id":13432715,"url":"https://github.com/PaulMcInnis/JobFunnel","last_synced_at":"2025-03-17T10:32:27.587Z","repository":{"id":45764886,"uuid":"101350165","full_name":"PaulMcInnis/JobFunnel","owner":"PaulMcInnis","description":"Scrape job websites into a single spreadsheet with no duplicates.","archived":false,"fork":false,"pushed_at":"2024-10-15T17:14:13.000Z","size":2674,"stargazers_count":1881,"open_issues_count":8,"forks_count":218,"subscribers_count":37,"default_branch":"master","last_synced_at":"2024-12-04T15:47:10.941Z","etag":null,"topics":["automated","beautifulsoup","beautifulsoup4","csv","glassdoor","indeed","international","job","jobs","monster","python","scraper","search","tfidf","waterloo","yaml"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PaulMcInnis.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-25T00:51:25.000Z","updated_at":"2024-12-04T04:02:31.000Z","dependencies_parsed_at":"2024-11-19T08:16:30.947Z","dependency_job_id":null,"html_url":"https://github.com/PaulMcInnis/JobFunnel","commit_stats":{"total_commits":363,"total_committers":20,"mean_commits":18.15,"dds":0.6776859504132231,"last_synced_commit":"72faea29b0d737f0d5eaf0faa4ea33c9981df913"},"previous_names":[],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaulMcInnis%2FJobFunnel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaulMcInnis%2FJobFunnel/tags","releases_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaulMcInnis%2FJobFunnel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PaulMcInnis%2FJobFunnel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PaulMcInnis","download_url":"https://codeload.github.com/PaulMcInnis/JobFunnel/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244016806,"owners_count":20384210,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automated","beautifulsoup","beautifulsoup4","csv","glassdoor","indeed","international","job","jobs","monster","python","scraper","search","tfidf","waterloo","yaml"],"created_at":"2024-07-31T02:01:15.623Z","updated_at":"2025-03-17T10:32:27.577Z","avatar_url":"https://github.com/PaulMcInnis.png","language":"Python","readme":"\u003cimg src=\"logo/jobfunnel_banner.svg\" alt=\"JobFunnel Banner\" width=400/\u003e\u003cbr/\u003e\n[![Code Coverage](https://codecov.io/gh/PaulMcInnis/JobFunnel/branch/master/graph/badge.svg)](https://codecov.io/gh/PaulMcInnis/JobFunnel)\n\nAutomated tool for scraping job postings into a `.csv` file.\n\n### Benefits over job search sites:\n\n* Never see the same job twice!\n* No advertising.\n* See jobs from multiple job search websites all in one place.\n\n![masterlist.csv][masterlist]\n\n\n# Installation\n\n_JobFunnel requires [Python][python] 3.11 or later._\n\n```\npip install git+https://github.com/PaulMcInnis/JobFunnel.git\n```\n\n# Usage\nBy performing regular scraping and reviewing, you can cut through the noise 
of even the busiest job markets.\n\n## Configure\nYou can search for jobs with YAML configuration files or by passing command arguments.\n\nDownload the demo [settings.yaml][demo_yaml] by running the command below:\n\n```\nwget https://git.io/JUWeP -O my_settings.yaml\n```\n\n_NOTE:_\n* _It is recommended to provide as few search keywords as possible (e.g. `Python`, `AI`)._\n\n* _JobFunnel currently supports `CANADA_ENGLISH`, `USA_ENGLISH`, `UK_ENGLISH`, `FRANCE_FRENCH`, and `GERMANY_GERMAN` locales._\n\n## Scrape\n\nRun `funnel` with your settings YAML to populate your master CSV file with jobs from available providers:\n\n```\nfunnel load -s my_settings.yaml\n```\n\n## Review\n\nOpen the master CSV file and update the per-job `status`:\n\n* Set to `interested`, `applied`, `interview` or `offer` to reflect your progression on the job.\n\n* Set to `archive`, `rejected` or `delete` to remove a job from this search. You can review 'blocked' jobs within your `block_list_file`.\n\n# Advanced Usage\n\n* **Automating Searches** \u003cbr /\u003e\n  JobFunnel can be easily automated to run nightly with [crontab][cron]. \u003cbr /\u003e\n  For more information see the [crontab document][cron_doc].\n\n* **Writing your own Scrapers** \u003cbr /\u003e\n  If you have a job website you'd like to write a scraper for, you are welcome to implement it. Review the [Base Scraper][basescraper] for implementation details.\n\n* **Remote Work** \u003cbr /\u003e\n  Bypass a frustrating user experience looking for remote work by setting the search parameter `remoteness` to match your desired level, e.g. `FULLY_REMOTE`.\n\n* **Adding Support for X Language / Job Website** \u003cbr /\u003e\n  JobFunnel supports scraping jobs from the same job website across locales \u0026 domains. 
If you are interested in adding support, you may only need to define session headers and domain strings. Review the [Base Scraper][basescraper] for further implementation details.\n\n* **Blocking Companies** \u003cbr /\u003e\n  Filter undesired companies by adding them to your `company_block_list` in your YAML or pass them by command line as `-cbl`.\n\n* **Job Age Filter** \u003cbr /\u003e\n  You can set the maximum age of scraped listings (in days) with `max_listing_days`.\n\n* **Reviewing Jobs in Terminal** \u003cbr /\u003e\n  You can review the job list in the command line:\n  ```\n  column -s, -t \u003c master_list.csv | less -#2 -N -S\n  ```\n\n* **Respectful Delaying** \u003cbr /\u003e\n  Respectfully scrape your job posts with our built-in delaying algorithms.\n\n  To better understand how to configure delaying, check out [this Jupyter Notebook][delay_jp] which breaks down the algorithm step by step with code and visualizations.\n\n* **Recovering Lost Data** \u003cbr /\u003e\n  JobFunnel can re-build your master CSV from your `cache_folder` where all the historic scrape data is located:\n  ```\n  funnel --recover\n  ```\n\n* **Running by CLI** \u003cbr /\u003e\n  You can run JobFunnel using the CLI only. Review the command structure via:\n  ```\n  funnel inline -h\n  ```\n \n# CAPTCHA\n  JobFunnel does not solve CAPTCHAs. If, while scraping, you receive an\n  `Unable to extract jobs from initial search result page:\\` error,\n  open that URL in your browser and solve the CAPTCHA manually.\n\n# Developer Guide\n\nFor contributors and developers who want to work on JobFunnel, this section will guide you through setting up the development environment and the tools we use to maintain code quality and consistency.\n\n## Developer Mode Installation\n\nTo get started, install JobFunnel in **developer mode**. 
This will install all necessary dependencies, including development tools such as testing, linting, and formatting utilities.\n\nTo install JobFunnel in developer mode, use the following command:\n\n```bash\npip install -e '.[dev]'\n```\n\nThis command installs the package in an editable state, along with the `pre-commit` package used for automatic code quality checks.\n\n## Pre-Commit Hooks\n\nThe following pre-commit hooks are configured to run automatically when you commit changes to ensure the code follows consistent style and quality guidelines:\n\n- `Black`: Automatically formats Python code to ensure consistency.\n- `isort`: Sorts and organizes imports according to the Black style.\n- `Prettier`: Formats non-Python files such as YAML and JSON.\n- `Flake8`: Checks Python code for style guide violations.\n\nWhile the pre-commit package is installed when you run `pip install -e '.[dev]'`, you still need to initialize the hooks by running the following command once:\n\n```bash\npre-commit install\n```\n\n### How Pre-Commit Hooks Work\n\nThe pre-commit hooks will automatically run when you attempt to make a commit. If any formatting issues are found, the hooks will fix them (for Black and isort), or warn you about style violations (for Flake8). This ensures that all committed code meets the project’s quality standards.\n\nYou can also manually run the pre-commit hooks at any time with:\n\n```bash\npre-commit run --all-files\n```\n\nThis is useful to check the entire codebase before committing or as part of a larger code review. Please fix all style guide violations (or provide a reason to ignore them) before committing to the repository.\n\n## Running Tests\n\nWe use `pytest` to run tests and ensure that the code behaves as expected. 
Code coverage is automatically generated every time you run the tests.\n\nTo run all tests, use the following command:\n\n```bash\npytest\n```\n\nThis will execute the test suite and automatically generate a code coverage report.\n\nIf you want to see a detailed code coverage report, you can run:\n\n```bash\npytest --cov-report=term-missing\n```\n\nThis will display which lines of code were missed in the test coverage directly in your terminal output.\n\n\n\n\u003c!-- links --\u003e\n[requirements]:requirements.txt\n[masterlist]:demo/demo.png \"masterlist.csv\"\n[demo_yaml]:demo/settings.yaml\n[python]:https://www.python.org/\n[basescraper]:jobfunnel/backend/scrapers/base.py\n[cron]:https://en.wikipedia.org/wiki/Cron\n[cron_doc]:docs/crontab/readme.md\n[conc_fut]:https://docs.python.org/dev/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor\n[thread]: https://docs.python.org/3.11/library/threading.html\n[delay_jp]:https://github.com/bunsenmurder/Notebooks/blob/master/jobFunnel/delay_algorithm.ipynb\n","funding_links":[],"categories":["Python","HarmonyOS"],"sub_categories":["Windows Manager"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPaulMcInnis%2FJobFunnel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FPaulMcInnis%2FJobFunnel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPaulMcInnis%2FJobFunnel/lists"}