{"id":15333538,"url":"https://github.com/kayx23/indeed-scraper","last_synced_at":"2026-04-27T21:32:03.637Z","repository":{"id":105170950,"uuid":"331798191","full_name":"kayx23/Indeed-Scraper","owner":"kayx23","description":"Scrape job posts off Indeed Canada (ca.indeed.com)","archived":false,"fork":false,"pushed_at":"2021-12-05T21:49:59.000Z","size":480,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-27T14:52:51.049Z","etag":null,"topics":["bs4","scrapy","selenium","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kayx23.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-01-22T01:02:10.000Z","updated_at":"2025-02-07T12:55:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"2a6ab0c8-22b7-47c9-8ffc-9c5826ac2e9d","html_url":"https://github.com/kayx23/Indeed-Scraper","commit_stats":{"total_commits":6,"total_committers":2,"mean_commits":3.0,"dds":"0.16666666666666663","last_synced_commit":"45207ba1c66b133942d05ecf9f350bdbcb3d1649"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kayx23/Indeed-Scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kayx23%2FIndeed-Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kayx23%2FIndeed-Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kayx23%2FIndeed-Scraper/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kayx23%2FIndeed-Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kayx23","download_url":"https://codeload.github.com/kayx23/Indeed-Scraper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kayx23%2FIndeed-Scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32356598,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-27T20:07:02.737Z","status":"ssl_error","status_checked_at":"2026-04-27T20:07:00.910Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bs4","scrapy","selenium","webscraping"],"created_at":"2024-10-01T10:04:22.391Z","updated_at":"2026-04-27T21:32:03.622Z","avatar_url":"https://github.com/kayx23.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Indeed Job Scraper (Canada)\n\nBuilt three scrapers to scrape Indeed jobs in Ontario, Canada, with: \n1. requests \u0026 bs4\n2. selenium \u0026 bs4\n3. scrapy\n\n### How many jobs are we expecting to scrape? \n\n1500 jobs. Indeed displays 15 jobs a page so we have 100 pages to get through. \n\nAs of Jan 23, 2021, around 82,000 jobs in Ontario were listed on Indeed. 
Many of these postings are considered highly similar by Indeed and are therefore not displayed in a regular search: \n\n\u003cdiv style=\"text-align:center\"\u003e\u003cimg src=\"https://user-images.githubusercontent.com/39619599/105637543-2236d680-5e3c-11eb-9156-b4457da3cda0.png\"\u003e\u003c/div\u003e\n\nI decided not to scrape these similar postings in this exercise. \n\n### Why are there three scrapers? \nBecause my initial attempts were countered by anti-scraping mechanisms such as [Google reCAPTCHA](https://www.google.com/recaptcha/about/). \n\nGoogle reCAPTCHA throws 5 to 10 reCAPTCHAs in one sitting when a large number of requests is detected from the same address, the same user agent, etc. \n\nI first wrote the scraper with **Requests** and **bs4**, which was stopped by reCAPTCHA about 900 jobs/10 mins in. Hoping to resolve the reCAPTCHAs manually, I switched to the browser automation route with **Selenium**, adding logic so that when a Google reCAPTCHA is thrown, the program pauses and waits for user input. The program did pause about 1000 jobs in and I was able to resolve the reCAPTCHAs manually, but for reasons I could not identify, the scraper always stopped after the reCAPTCHAs were resolved. \n\n\nAt this stage, I considered several solutions: \n* Continue debugging to figure out why the scraper stopped after the manual resolution of reCAPTCHAs;\n* Get past the reCAPTCHA by using speech-to-text to transcribe the audio file in the accessibility option (but this is clearly an abuse of features even if it works); or \n* Rotate user agents and/or proxies to avoid triggering anti-scraping mechanisms \n\nI decided to go with the last option. \n\nInstead of setting up user agent rotation manually, I found that it could be set up easily with [Scrapy](https://scrapy.org), which is also asynchronous. I refactored my script to use Scrapy and used [scrapy user agent middleware](https://pypi.org/project/scrapy-user-agents/). 
The script successfully scraped all 1500 job posts in Ontario in about 3 minutes.\n\n### How to use\nThe `requests + bs4` and `selenium + bs4` scrapers are in Jupyter Notebooks. \n\nTo run the `scrapy` scraper and save the output to a JSON file (CSV and XML also work): \n```\n$ cd scrapy\n$ scrapy crawl indeedSpider -O output.json\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkayx23%2Findeed-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkayx23%2Findeed-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkayx23%2Findeed-scraper/lists"}