https://github.com/oxylabs/how-to-scrape-indeed
A tutorial for collecting job postings from Indeed using Python and Oxylabs Web Scraper API.
https://github.com/oxylabs/how-to-scrape-indeed
api job-posting python scraper-api web-scraper web-scraping
Last synced: about 1 year ago
JSON representation
A tutorial for collecting job postings from Indeed using Python and Oxylabs Web Scraper API.
- Host: GitHub
- URL: https://github.com/oxylabs/how-to-scrape-indeed
- Owner: oxylabs
- Created: 2024-01-19T08:50:56.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-11T13:17:02.000Z (over 1 year ago)
- Last Synced: 2025-03-29T13:10:00.163Z (about 1 year ago)
- Topics: api, job-posting, python, scraper-api, web-scraper, web-scraping
- Language: Python
- Homepage: https://oxylabs.io/products/scraper-api/web
- Size: 49.8 KB
- Stars: 151
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: readme.md
Awesome Lists containing this project
README
# How to Scrape Indeed
[](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)
Here's the process of extracting job postings from [Indeed](https://www.indeed.com/) with the help of Oxylabs [Web Scraper API](https://oxylabs.io/products/scraper-api/web) (**1-week free trial**) and Python.
For the complete guide with in-depth explanations and visuals, check our [blog post](https://oxylabs.io/blog/how-to-scrape-indeed).
## Project setup
### Creating a virtual environment
```python
python -m venv indeed_env #Windows
python3 -m venv indeed_env #Macand Linux
```
### Activating the virtual environment
```python
.\indeed_env\Scripts\Activate#Windows
source indeed_env/bin/activate #Macand Linux
```
### Installing libraries
```python
$ pip install requests
```
## Overview of Web Scraper API
The following is an example that shows how Web Scraper API works.
```python
# scraper_api_demo.py
import requests
payload = {
"source": "universal",
"url": "https://www.indeed.com"
}
response = requests.post(
url="https://realtime.oxylabs.io/v1/queries",
json=payload,
auth=(username,password),
)
print(response.json())
```
## Web Scraper API parameters
### Parsing the page title and retrieving results in JSON
```python
"title": {
"_fns": [
{
"_fn": "xpath_one",
"_args": ["//title/text()"]
}
]
}
},
```
If you send this as `parsing_instructions`, the output would be the following JSON.
```python
{ "title": "Job Search | Indeed", "parse_status_code": 12000 }
```
Note that the `parse_status_code` means a successful response.
The following code prints the title of the Indeed page.
```python
# indeed_title.py
import requests
payload = {
"source": "universal",
"url": "https://www.indeed.com",
"parse": True,
"parsing_instructions": {
"title": {
"\_fns": [
{
"\_fn": "xpath_one",
"\_args": [
"//title/text()"
]
}
]
}
},
}
response = requests.post(
url="https://realtime.oxylabs.io/v1/queries",
json=payload,
auth=('username', 'password'),
)
print(response.json()['results'][0]['content'])
```
## Scraping Indeed job postings
### Selecting a job listing
```python
`.job_seen_beacon`
```
### Creating the placeholder for a job listing
```
"job_listings": {
"_fns": [
{
"_fn": "css",
"_args": [".job_seen_beacon"]
}
],
"_items": {
"job_title": {
"_fns": [
{
"_fn": "xpath_one",
"_args": [".//h2[contains(@class,'jobTitle')]/a/span/text()"]
}
]
},
"company_name": {
"_fns": [
{
"_fn": "xpath_one",
"_args": [".//span[@data-testid='company-name']/text()"]
}
]
},
```
### Adding other selectors
```json
{
"source": "universal",
"url": "https://www.indeed.com/jobs?q=work+from+home&l=San+Francisco%2C+CA",
"parse": true,
"parsing_instructions": {
"job_listings": {
"_fns": [
{
"_fn": "css",
"_args": [".job_seen_beacon"]
}
],
"_items": {
"job_title": {
"_fns": [
{
"_fn": "xpath_one",
"_args": [".//h2[contains(@class,'jobTitle')]/a/span/text()"]
}
]
},
"company_name": {
"_fns": [
{
"_fn": "xpath_one",
"_args": [".//span[@data-testid='company-name']/text()"]
}
]
}
}
}
}
}
```
For other data points, see the file [here](src/job_search_payload.json).
### Saving the payload as a separator JSON file
```python
# parse_jobs.py
import requests
import json
payload = {}
with open("job_search_payload.json") as f:
payload = json.load(f)
response = requests.post(
url="https://realtime.oxylabs.io/v1/queries",
json=payload,
auth=("username", "password"),
)
print(response.status_code)
with open("result.json", "w") as f:
json.dump(response.json(), f, indent=4)
```
## Exporting to JSON and CSV
```python
# parse_jobs.py
with open("results.json", "w") as f:
json.dump(data, f, indent=4)
df = pd.DataFrame(data["results"][0]["content"]["job_listings"])
df.to_csv("job_search_results.csv", index=False)
```
## Final word
Check our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api) for more API parameters and variables found in this tutorial.
If you have any questions, feel free to contact us at support@oxylabs.io.