https://github.com/tylerlittlefield/glassdoor-scraper
:door: Scrape Glassdoor reviews with rvest
rstats rvest scraper scraping-websites
- Host: GitHub
- URL: https://github.com/tylerlittlefield/glassdoor-scraper
- Owner: tylerlittlefield
- Created: 2020-08-08T02:39:35.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-08-09T20:06:38.000Z (about 5 years ago)
- Last Synced: 2025-01-08T08:16:44.831Z (9 months ago)
- Topics: rstats, rvest, scraper, scraping-websites
- Language: R
- Homepage:
- Size: 16.6 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.Rmd
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# glassdoor-scraper
A demonstration of scraping Glassdoor reviews using `rvest`. Note that the underlying functions rely on XPath expressions that I copied by clicking the element I wanted and inspecting it. Glassdoor's markup will probably change over time, and when it does the scripts will fail. As of `r Sys.Date()`, it seems to work pretty well.
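To illustrate why copied XPaths are fragile, here is a minimal sketch of XPath-based extraction with `rvest`. It is not the code in `R/scrape.R`: the HTML snippet, the XPath expressions, and the helpers `extract_text()` and `scrape_sketch()` are all hypothetical and only stand in for the real thing. It assumes rvest 1.0.0 or later (for `html_element()` and `html_text2()`).

```r
library(rvest)

# Hypothetical markup standing in for a reviews page; Glassdoor's real HTML differs.
html <- '
  <div id="reviews">
    <div class="review"><h2>Great team</h2><span class="pros">Smart colleagues</span></div>
    <div class="review"><h2>Long hours</h2></div>
  </div>'

# Hypothetical helper: text of the first XPath match under each node,
# NA where a node has no match.
extract_text <- function(nodes, xpath) {
  html_text2(html_element(nodes, xpath = xpath))
}

# Hypothetical scraper: returns one row per review, or NULL when the
# container XPath matches nothing (for example after a site redesign).
scrape_sketch <- function(page) {
  nodes <- html_elements(page, xpath = '//div[@class="review"]')
  if (length(nodes) == 0) {
    return(NULL)
  }
  data.frame(
    review_title = extract_text(nodes, "./h2"),
    pros = extract_text(nodes, './span[@class="pros"]'),
    stringsAsFactors = FALSE
  )
}

reviews <- scrape_sketch(read_html(html))

# Same scraper against markup whose class names have changed.
changed <- scrape_sketch(read_html('<div class="new-layout"><h2>Great team</h2></div>'))
```

Here `reviews` has one row per review, with `NA` where a field is missing, while `changed` is `NULL` because the container XPath no longer matches. Returning `NULL` rather than raising an error lets a caller drop failed pages with `Filter(Negate(is.null), ...)`.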
```{r}
source("R/scrape.R")

# example urls, we'll go with Google
tesla_url <- "https://www.glassdoor.com/Reviews/Tesla-Reviews-E43129"
apple_url <- "https://www.glassdoor.com/Reviews/Apple-Reviews-E1138"
google_url <- "https://www.glassdoor.com/Reviews/Google-Reviews-E9079"

# loop through n pages
pages <- 1:5
out <- lapply(pages, function(page) {
  Sys.sleep(1)
  try_scrape_reviews(google_url, page)
})

# name each result by its page number so that `.id` below still reports the
# real page when an earlier page failed and was filtered out
names(out) <- pages

# filter for stuff we successfully extracted
reviews <- bind_rows(Filter(Negate(is.null), out), .id = "page")

# remove any duplicates, parse the review time
reviews %>%
  distinct() %>%
  mutate(
    review_time = clean_review_datetime(review_time_raw),
    page = as.numeric(page)
  ) %>%
  select(
    page,
    review_id,
    review_time_raw,
    review_time,
    review_title,
    employee_role,
    employee_history,
    employeer_pros,
    employeer_cons,
    employeer_rating,
    work_life_balance,
    culture_values,
    career_opportunities,
    compensation_and_benefits,
    senior_management
  ) %>%
  glimpse()
```

## Session Info
```{r}
sessioninfo::session_info()
```