https://github.com/zahradnik-ondrej/jobscz-scraper
A simple data scraper of Jobs.cz written in multiple JS/TS libraries.
- Host: GitHub
- URL: https://github.com/zahradnik-ondrej/jobscz-scraper
- Owner: zahradnik-ondrej
- Created: 2023-08-29T12:34:08.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-13T12:27:42.000Z (about 1 year ago)
- Last Synced: 2024-11-10T15:41:17.493Z (about 2 months ago)
- Topics: javascript, jobs, playwright, puppeteer, selenium, typescript, web-scraping
- Language: TypeScript
- Homepage:
- Size: 6.4 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 💼 [Jobs.cz](https://www.jobs.cz/prace/) Scraper
### A simple data scraper of [Jobs.cz](https://www.jobs.cz/prace/) written in multiple JS/TS libraries.
A programming exercise and an experiment to determine which **JavaScript / TypeScript** library is the best option for web scraping.
*The libraries used are Puppeteer [^1], Playwright and Selenium.*
*(The instructions below were written for **Linux**, specifically **Ubuntu** 20.04 and 22.04, and assume that **Git** and **npm** are already installed on your system.)*
***
### Installation:
`git clone https://github.com/zahradnik-ondrej/jobscz-scraper.git`
`cd jobscz-scraper`
`cd puppeteer` or `cd playwright` or `cd selenium`
`./run.sh`
Go to `http://localhost:3000/` to access the input form for the **Puppeteer** [^1] script.
### Output:
You will find the scraped job postings in the `job-posts.json` file in the current project's directory or, in the case of the **Puppeteer** script, in the subdirectory named `scraper`. [^1]
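
As a minimal sketch of how the output file could be consumed afterwards: the helper below parses the file's text and checks that it is a list. It assumes that `job-posts.json` contains a top-level JSON array; the structure of the individual entries is not described here, so they are left untyped. The name `parseJobPosts` is hypothetical and is not part of this repository.

```typescript
// Hypothetical helper, not part of the repository.
// Assumption: job-posts.json holds a top-level JSON array of postings.
// The fields of each posting are not documented here, so entries stay `unknown`.
function parseJobPosts(jsonText: string): unknown[] {
  const data: unknown = JSON.parse(jsonText);
  if (!Array.isArray(data)) {
    throw new TypeError("expected a JSON array of job postings");
  }
  return data;
}
```

With Node.js, the text could be obtained with `readFileSync("job-posts.json", "utf8")` from `node:fs` and passed to the helper, after which `.length` gives the number of scraped postings.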
***
### Observations:
**Puppeteer** and **Selenium** are equally fast in this specific case.
**Puppeteer** and **Selenium** are approximately **3.69** times faster than **Playwright** in this specific case.
**Playwright** offers the most intuitive built-in functions for interacting with the web browser, making it the most suitable for beginners.
**Selenium** also offers many built-in functions but they are not as intuitive.
**Puppeteer** offers very little in this respect, and it's best to write your own wrapper functions that suit your specific needs, but it offers the most modularity, which makes this process easier than with the others. [^2]
Both **Playwright** and **Selenium** offer support for multiple browsers aside from **Chrome** *(unlike **Puppeteer**, which has only experimental support for **Edge** via [puppeteer-core](https://www.npmjs.com/package/puppeteer-core) and **Firefox** via [puppeteer-firefox](https://www.npmjs.com/package/puppeteer-firefox))*.
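
The speed comparison above boils down to timing one run of each script and dividing the elapsed times. The sketch below shows one way this could be done; `timeRun` and `speedup` are hypothetical helpers and do not describe how the figures in this README were actually measured.

```typescript
// Hypothetical helper: runs one async task and returns the elapsed wall-clock time in ms.
async function timeRun(run: () => Promise<void>): Promise<number> {
  const start = Date.now();
  await run();
  return Date.now() - start;
}

// How many times faster the quicker run is, given both elapsed times in ms.
function speedup(slowMs: number, fastMs: number): number {
  if (fastMs <= 0) {
    throw new RangeError("fastMs must be positive");
  }
  return slowMs / fastMs;
}
```

For example, if each scraper were wrapped in an async function, `speedup(await timeRun(slowScraper), await timeRun(fastScraper))` would give the ratio. A single run is noisy (network latency, browser start-up), so averaging several runs would give a more reliable figure.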
[^1]: Note that the **Puppeteer** script also provides a graphical web interface at `http://localhost:3000/`, where you can specify which job listings to scrape, because it's the library that I chose to go with in my project.
[^2]: You can check out my [🧰 puppethelper - A Puppeteer helper package for automated QA web testing](https://github.com/zahradnik-ondrej/puppethelper) which has many useful functions for interacting with the web browser out-of-the-box plus a little extra.
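
To illustrate the kind of wrapper function mentioned in the observations, here is a minimal sketch that waits for an element before clicking it. It is not taken from puppethelper or from this repository; `PageLike` and `clickWhenReady` are hypothetical names. The interface lists only two methods, `waitForSelector` and `click`, which exist under these names on Puppeteer's `Page`, so the sketch needs no dependencies.

```typescript
// Hypothetical structural type: only the two page methods this sketch uses.
// Assumption: a real Puppeteer Page can be passed wherever PageLike is expected.
interface PageLike {
  waitForSelector(selector: string, options?: { timeout?: number }): Promise<unknown>;
  click(selector: string): Promise<void>;
}

// Hypothetical wrapper: wait until the element is present, then click it.
async function clickWhenReady(
  page: PageLike,
  selector: string,
  timeoutMs = 5000,
): Promise<void> {
  await page.waitForSelector(selector, { timeout: timeoutMs });
  await page.click(selector);
}
```

Typing the parameter structurally rather than importing a library type keeps the helper easy to test with a fake page object, and the same function would also accept any other object that exposes these two methods.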