https://github.com/PaulMcInnis/JobFunnel
Scrape job websites into a single spreadsheet with no duplicates.
automated beautifulsoup beautifulsoup4 csv glassdoor indeed international job jobs monster python scraper search tfidf waterloo yaml
- Host: GitHub
- URL: https://github.com/PaulMcInnis/JobFunnel
- Owner: PaulMcInnis
- License: mit
- Created: 2017-08-25T00:51:25.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2022-01-05T04:22:00.000Z (almost 3 years ago)
- Last Synced: 2024-02-24T07:32:47.389Z (9 months ago)
- Topics: automated, beautifulsoup, beautifulsoup4, csv, glassdoor, indeed, international, job, jobs, monster, python, scraper, search, tfidf, waterloo, yaml
- Language: Python
- Homepage:
- Size: 2.38 MB
- Stars: 1,730
- Watchers: 37
- Forks: 204
- Open Issues: 8
Metadata Files:
- Readme: readme.md
- License: LICENSE
README
[![Build Status](https://travis-ci.com/PaulMcInnis/JobFunnel.svg?branch=master)](https://travis-ci.com/PaulMcInnis/JobFunnel)
[![Code Coverage](https://codecov.io/gh/PaulMcInnis/JobFunnel/branch/master/graph/badge.svg)](https://codecov.io/gh/PaulMcInnis/JobFunnel)

Automated tool for scraping job postings into a `.csv` file.
_[Since this project was developed, CAPTCHA has clamped down hard. Help us rebuild the backend and make this tool useful again!](https://github.com/PaulMcInnis/JobFunnel/discussions/148)_
### Benefits over job search sites:
* Never see the same job twice!
* No advertising.
* See jobs from multiple job search websites all in one place.

![masterlist.csv][masterlist]
# Installation
_JobFunnel requires [Python][python] 3.8 or later._
```
pip install git+https://github.com/PaulMcInnis/JobFunnel.git
```

# Usage
By performing regular scraping and reviewing, you can cut through the noise of even the busiest job markets.

## Configure
You can search for jobs with YAML configuration files or by passing command arguments.

Download the demo [settings.yaml][demo_yaml] by running the command below:
```
wget https://git.io/JUWeP -O my_settings.yaml
```

_NOTE:_
* _It is recommended to provide as few search keywords as possible (e.g. `Python`, `AI`)._
* _JobFunnel currently supports `CANADA_ENGLISH`, `USA_ENGLISH`, `UK_ENGLISH`, `FRANCE_FRENCH`, and `GERMANY_GERMAN` locales._
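
On systems without `wget`, the same download can be sketched with Python's standard library. This is only an illustration, not part of JobFunnel: the helper name `download_demo_settings` is hypothetical, and it assumes the short link from the command above still resolves.

```python
import urllib.request


def download_demo_settings(url="https://git.io/JUWeP", dest="my_settings.yaml"):
    """Save the demo settings file locally, like `wget <url> -O <dest>`.

    Hypothetical helper for illustration; not part of JobFunnel.
    """
    # urlretrieve follows HTTP redirects and returns (local_path, headers).
    local_path, _headers = urllib.request.urlretrieve(url, dest)
    return local_path
```

Calling `download_demo_settings()` with no arguments writes `my_settings.yaml` to the current directory, matching the `wget` command.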
## Scrape
Run `funnel` with your settings YAML to populate your master CSV file with jobs from available providers:
```
funnel load -s my_settings.yaml
```

## Review
Open the master CSV file and update the per-job `status`:
* Set to `interested`, `applied`, `interview` or `offer` to reflect your progression on the job.
* Set to `archive`, `rejected` or `delete` to remove a job from this search. You can review 'blocked' jobs within your `block_list_file`.
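
To keep track of progress between review passes, the statuses above can be tallied with a short script. This is a minimal sketch, not a JobFunnel feature: the function names are hypothetical, and it assumes the master CSV has a header row with a column named `status`.

```python
import csv
from collections import Counter

# Status values named in the Review section above.
ACTIVE = {"interested", "applied", "interview", "offer"}
REMOVE = {"archive", "rejected", "delete"}


def summarize_statuses(csv_path):
    """Count the jobs in the master CSV by their `status` value."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Assumption: the header row contains a column named "status".
        return Counter(row["status"].strip().lower() for row in reader)


def split_counts(counts):
    """Return totals for (in progress, marked for removal, everything else)."""
    active = sum(n for status, n in counts.items() if status in ACTIVE)
    removed = sum(n for status, n in counts.items() if status in REMOVE)
    return active, removed, sum(counts.values()) - active - removed
```

For example, `split_counts(summarize_statuses("master_list.csv"))` returns three totals: jobs you are pursuing, jobs marked for removal, and jobs with any other status.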
# Advanced Usage
* **Automating Searches**
JobFunnel can be easily automated to run nightly with [crontab][cron].
For more information, see the [crontab document][cron_doc].
* **Writing your own Scrapers**
If you have a job website you'd like to write a scraper for, you are welcome to implement it. Review the [Base Scraper][basescraper] for implementation details.
* **Remote Work**
Bypass a frustrating user experience looking for remote work by setting the search parameter `remoteness` to match your desired level, e.g. `FULLY_REMOTE`.
* **Adding Support for X Language / Job Website**
JobFunnel supports scraping jobs from the same job website across locales & domains. If you are interested in adding support, you may only need to define session headers and domain strings. Review the [Base Scraper][basescraper] for further implementation details.
* **Blocking Companies**
Filter undesired companies by adding them to your `company_block_list` in your YAML, or pass them on the command line with `-cbl`.
* **Job Age Filter**
You can set the maximum age of scraped listings (in days) with `max_listing_days`.
* **Reviewing Jobs in Terminal**
You can review the job list in the command line:
```
column -s, -t < master_list.csv | less -#2 -N -S
```
* **Respectful Delaying**
Respectfully scrape your job posts with our built-in delaying algorithms.
To better understand how to configure delaying, check out [this Jupyter Notebook][delay_jp], which breaks down the algorithm step by step with code and visualizations.
* **Recovering Lost Data**
JobFunnel can re-build your master CSV from your `cache_folder` where all the historic scrape data is located:
```
funnel --recover
```
* **Running by CLI**
You can run JobFunnel using the CLI only. Review the command structure via:
```
funnel inline -h
```
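
To illustrate what the `company_block_list` and `max_listing_days` options described above do conceptually, here is a minimal sketch of the two filters combined. It is not JobFunnel's actual implementation: the `keep_job` function and the `job` dictionary layout (`company`, `post_date`) are assumptions made for the example.

```python
from datetime import date, timedelta


def keep_job(job, company_block_list, max_listing_days, today=None):
    """Return True if a job passes both the company and the age filter.

    `job` is a hypothetical dict with "company" (str) and "post_date" (date).
    """
    today = today or date.today()
    # Compare company names case-insensitively, ignoring stray whitespace.
    blocked = {name.strip().lower() for name in company_block_list}
    if job["company"].strip().lower() in blocked:
        return False  # company is on the block list
    # Listings posted before this date are considered too old.
    oldest_allowed = today - timedelta(days=max_listing_days)
    return job["post_date"] >= oldest_allowed
```

A list of scraped jobs could then be narrowed with `[job for job in jobs if keep_job(job, ["Initech"], 30)]`, where `Initech` and `30` are placeholder values.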
# CAPTCHA
JobFunnel does not solve CAPTCHA. If, while scraping, you receive an
`Unable to extract jobs from initial search result page:\` error,
open that URL in your browser and solve the CAPTCHA manually.

[requirements]:requirements.txt
[masterlist]:demo/demo.png "masterlist.csv"
[demo_yaml]:demo/settings.yaml
[python]:https://www.python.org/
[basescraper]:jobfunnel/backend/scrapers/base.py
[cron]:https://en.wikipedia.org/wiki/Cron
[cron_doc]:docs/crontab/readme.md
[conc_fut]:https://docs.python.org/dev/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor
[thread]: https://docs.python.org/3.8/library/threading.html
[delay_jp]:https://github.com/bunsenmurder/Notebooks/blob/master/jobFunnel/delay_algorithm.ipynb