https://github.com/kayx23/indeed-scraper
Scrape job posts off Indeed Canada (ca.indeed.com)
bs4 scrapy selenium webscraping
Last synced: 5 days ago
- Host: GitHub
- URL: https://github.com/kayx23/indeed-scraper
- Owner: kayx23
- Created: 2021-01-22T01:02:10.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2021-12-05T21:49:59.000Z (about 3 years ago)
- Last Synced: 2024-12-06T09:41:59.662Z (2 months ago)
- Topics: bs4, scrapy, selenium, webscraping
- Language: Jupyter Notebook
- Homepage:
- Size: 469 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Indeed Job Scraper (Canada)
Built three scrapers to scrape Indeed jobs in Ontario, Canada, with:
1. requests & bs4
2. selenium & bs4
3. scrapy
### How many jobs are we expecting to scrape?
1500 jobs. Indeed displays 15 jobs a page so we have 100 pages to get through.
As of Jan 23, 2021, around 82,000 jobs in Ontario were listed on Indeed. Many of these postings are considered highly similar by Indeed and are therefore not displayed in a regular search.
I decided not to scrape these similar postings in this exercise.
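The page count above is simple arithmetic, and it can be sketched as a small generator of search URLs. This is only an illustration: the `l` and `start` query parameters, the offset step, and the `page_urls` helper are assumptions made for the example, not taken from the repository.

```python
import math
from urllib.parse import urlencode

TARGET_JOBS = 1500   # number of jobs to scrape (from the text above)
JOBS_PER_PAGE = 15   # jobs Indeed displayed per page (from the text above)


def page_urls(base="https://ca.indeed.com/jobs", location="Ontario"):
    """Yield one search URL per results page.

    Assumption: results are paged with a `start` offset and filtered with `l`.
    """
    pages = math.ceil(TARGET_JOBS / JOBS_PER_PAGE)
    for page in range(pages):
        query = urlencode({"l": location, "start": page * JOBS_PER_PAGE})
        yield f"{base}?{query}"


urls = list(page_urls())
print(len(urls))  # number of pages to request
```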
### Why are there three scrapers?
Because my initial attempts were countered by anti-scraping mechanisms such as [Google reCAPTCHA](https://www.google.com/recaptcha/about/). Google reCAPTCHA throws 5 to 10 reCAPTCHAs in one sitting when a large number of requests are detected from the same address, same user agent, etc.
I first wrote the scraper with **Requests** and **bs4**, which was stopped by reCAPTCHA about 900 jobs (10 minutes) in. Hoping to resolve the reCAPTCHAs manually, I switched to browser automation with **Selenium**, adding logic so that when a Google reCAPTCHA is thrown, the program pauses and waits for user input. The program did pause about 1000 jobs in and I was able to resolve the reCAPTCHAs manually, but for some unknown reason, the scraper always stopped after the reCAPTCHAs were resolved.
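The pause-and-wait idea described above could look roughly like the following. It is a minimal sketch, not the notebook's code: the `"recaptcha"` marker string and both helper names are hypothetical, and the page source is passed in as a callable so the same logic works with Selenium (for example `lambda: driver.page_source`) or with plain strings.

```python
def captcha_present(page_source, markers=("recaptcha",)):
    """Return True if the HTML appears to contain a CAPTCHA challenge.

    Assumption: a challenge page mentions one of `markers` somewhere in its HTML.
    """
    html = page_source.lower()
    return any(marker in html for marker in markers)


def wait_for_user_if_captcha(get_page_source, prompt=input, max_attempts=3):
    """Pause for the user while a CAPTCHA is showing.

    get_page_source: zero-argument callable returning the current page HTML.
    prompt: callable that blocks until the user responds (defaults to input).
    Returns True once the page no longer shows a CAPTCHA, else False.
    """
    for _ in range(max_attempts):
        # Re-read the page on every pass so a solved CAPTCHA is noticed.
        if not captcha_present(get_page_source()):
            return True
        prompt("reCAPTCHA detected. Solve it in the browser, then press Enter: ")
    return not captcha_present(get_page_source())
```

Re-reading the page after each pause is a design choice of this sketch; it is not a diagnosis of why the original scraper stopped after the reCAPTCHAs were resolved.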
At this stage, there are several solutions I considered:
* Continue to debug to figure out why the scraper was stopped after the manual resolution of reCAPTCHAs;
* Get past the reCAPTCHA by using speech-to-text to transcribe the audio file in the accessibility option (but this is clearly an abuse of the feature even if it works); or
* Rotate user agents and/or proxies to avoid triggering anti-scraping mechanisms.

I decided to go with the last option.
Instead of setting up user agent rotation manually, I found that it could be easily configured in [Scrapy](https://scrapy.org), which is also asynchronous. I refactored my script to use Scrapy together with the [scrapy-user-agents middleware](https://pypi.org/project/scrapy-user-agents/). The script successfully scraped all 1500 job posts in Ontario in about 3 minutes.
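For reference, the scrapy-user-agents package is usually enabled with a fragment like the one below in the project's `settings.py`. This follows the setup shown on the package's PyPI page; whether the repository's settings file matches it exactly is an assumption.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in user agent middleware...
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # ...and let scrapy-user-agents pick a random user agent per request.
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
}
```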
### How to use
The `requests + bs4` and `selenium + bs4` scrapers are in Jupyter notebooks.

To run the `scrapy` scraper and save the output to a JSON file (CSV and XML also work):
```
$ cd scrapy
$ scrapy crawl indeedSpider -O output.json
```
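Once the crawl finishes, the output can be sanity-checked against the expected 1500 jobs. The sketch below assumes only that `output.json` holds a JSON array of items, which is what Scrapy's JSON feed export writes; `load_jobs` is a hypothetical helper, and no item field names are assumed.

```python
import json


def load_jobs(path="output.json"):
    """Load the list of scraped items from a Scrapy JSON export."""
    with open(path, encoding="utf-8") as f:
        jobs = json.load(f)
    if not isinstance(jobs, list):
        raise ValueError("expected a JSON array of items")
    return jobs
```

Calling `len(load_jobs())` from the `scrapy` directory then gives the number of scraped posts.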