https://github.com/sonyarianto/webscrapingjs
Scraping the web with confidence.
javascript playwright scraper scraping typescript web-scraping
- Host: GitHub
- URL: https://github.com/sonyarianto/webscrapingjs
- Owner: sonyarianto
- License: mit
- Created: 2023-10-04T14:30:50.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-11T15:06:25.000Z (about 1 year ago)
- Last Synced: 2023-10-12T15:12:47.065Z (about 1 year ago)
- Topics: javascript, playwright, scraper, scraping, typescript, web-scraping
- Language: TypeScript
- Homepage:
- Size: 90.8 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# webscrapingjs
Scraping the web with confidence.

## Introduction
The goal of `webscrapingjs` is to provide as many scraper scripts as possible, covering various topics, categories, and web sources. It is a collection of scraper scripts, each targeting a different website.
## How to run
In this directory, run the following command:
```bash
npm install
```

This installs all the dependencies needed to run the project and creates a `node_modules` directory.

We provide a quick example, `example.ts`, which you can try by running this command:
```bash
npx jiti example.ts
```

Note: jiti is my favorite tool for runtime TypeScript and ESM support in Node.js.

Here is a sample of the results:
```json
[
  {
    "_internal_page": 1,
    "title": "Bertemu PM Azerbaijan, MenPAN-RB: Kita Akan Terus Perkuat Kolaborasi",
    "link": "https://news.detik.com/berita/d-6975988/bertemu-pm-azerbaijan-menpan-rb-kita-akan-terus-perkuat-kolaborasi",
    "image_url_on_list": "https://awsimages.detik.net.id/community/media/visual/2023/10/11/kemenpan-rb-1_43.jpeg?w=210&q=90",
    "image_url_on_list_2": "https://awsimages.detik.net.id/community/media/visual/2023/10/11/kemenpan-rb-1_43.jpeg",
    "image_url_on_detail": "https://awsimages.detik.net.id/api/wm/2023/10/11/kemenpan-rb-1_169.jpeg?wid=54&w=650&v=1&t=jpeg",
    "image_url_on_detail_2": "https://awsimages.detik.net.id/api/wm/2023/10/11/kemenpan-rb-1_169.jpeg",
    "local_category": "Berita",
    "local_tags": [
      "kemenpanrb",
      "kemenpan rb"
    ],
    "authors": [
      "Hana Nushratu Uzma"
    ],
    "short_description": "Menteri Pendayagunaan Aparatur Negara dan Reformasi Birokrasi (MenPAN-RB) Abdullah Azwar Anas bertemu Perdana Menteri (PM) Azerbaijan Ali Asadov di Baku.",
    "published_datetime": "2023-10-11T09:15:30+07:00",
    "published_datetime_utc": "2023-10-11T02:15:30.000Z",
    "_internal_index": 8
  }
]
```

As you can see, the script simply calls the `scrape()` function on the imported module. You can extend the logic by saving the results to a database or adding any further processing.
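
As a rough illustration of such further processing, the sketch below removes duplicate links, sorts the items newest first, and serializes them as JSON Lines. The `scrape()` function here is only a hypothetical stand-in that returns fixed data, since the real module path and signature are not shown in this README. `ScrapedItem`, `dedupeAndSort`, and `toJsonLines` are likewise names invented for this example, and the `example.com` links are placeholders.

```typescript
// A few fields taken from the sample output; the real items carry more.
interface ScrapedItem {
  title: string;
  link: string;
  published_datetime_utc: string; // ISO 8601 in UTC, as in the sample
}

// Hypothetical stand-in for the module's `scrape()` function (assumption:
// it resolves to an array of items). It returns fixed placeholder data.
async function scrape(): Promise<ScrapedItem[]> {
  return [
    { title: "B", link: "https://example.com/b", published_datetime_utc: "2023-10-11T02:15:30.000Z" },
    { title: "A", link: "https://example.com/a", published_datetime_utc: "2023-10-11T05:00:00.000Z" },
    { title: "B again", link: "https://example.com/b", published_datetime_utc: "2023-10-11T02:15:30.000Z" },
  ];
}

// Keep the first item seen for each link, then sort newest first.
// Same-format UTC ISO strings sort correctly as plain strings.
function dedupeAndSort(items: ScrapedItem[]): ScrapedItem[] {
  const seen = new Set<string>();
  const unique = items.filter((item) => {
    if (seen.has(item.link)) return false;
    seen.add(item.link);
    return true;
  });
  return unique.sort((a, b) =>
    b.published_datetime_utc.localeCompare(a.published_datetime_utc),
  );
}

// One JSON document per line, convenient for appending to a file or a queue.
function toJsonLines(items: ScrapedItem[]): string {
  return items.map((item) => JSON.stringify(item)).join("\n");
}

async function main(): Promise<void> {
  const results = dedupeAndSort(await scrape());
  console.log(toJsonLines(results));
}

main();
```

The string returned by `toJsonLines` could then be written to a file or passed to whatever storage layer you prefer; that part is left out to keep the sketch free of extra dependencies.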
## Testing
We use Vitest to run the tests. Testing is crucial here because it detects problems in each scraper script. A failing test deserves attention, because it may mean the source website has changed (DOM structure, class names, selector IDs, etc.).
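
To make the idea concrete, here is a minimal sketch of the kind of check such a test could perform on a scraped item. It is not taken from this repository's test suite. `findProblems` and the reduced `ScrapedItem` type are hypothetical names, and the field names and values come from the sample output shown earlier.

```typescript
// Subset of the fields in the sample output; real items carry more.
interface ScrapedItem {
  title: string;
  link: string;
  published_datetime: string; // local time with offset
  published_datetime_utc: string; // same instant in UTC
}

// Hypothetical helper: lists the problems found in one item.
// An empty array means the item looks well formed.
function findProblems(item: ScrapedItem): string[] {
  const problems: string[] = [];
  if (item.title.trim() === "") problems.push("title is empty");
  if (!/^https?:\/\//.test(item.link)) problems.push("link is not an absolute URL");

  const published = new Date(item.published_datetime);
  if (Number.isNaN(published.getTime())) {
    problems.push("published_datetime is not a valid date");
  } else if (published.toISOString() !== item.published_datetime_utc) {
    problems.push("published_datetime_utc does not match published_datetime");
  }
  return problems;
}

// Values copied from the sample result.
const sample: ScrapedItem = {
  title: "Bertemu PM Azerbaijan, MenPAN-RB: Kita Akan Terus Perkuat Kolaborasi",
  link: "https://news.detik.com/berita/d-6975988/bertemu-pm-azerbaijan-menpan-rb-kita-akan-terus-perkuat-kolaborasi",
  published_datetime: "2023-10-11T09:15:30+07:00",
  published_datetime_utc: "2023-10-11T02:15:30.000Z",
};

console.log(findProblems(sample)); // no problems are reported for this item
```

Inside a Vitest test, a check like this would typically be wrapped in an assertion such as `expect(findProblems(item)).toEqual([])`, so that an empty title or a broken link caused by a changed selector fails the run. The test suite itself is run with the command that follows.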
```bash
npm run test

# or target a specific directory whose name contains a phrase
# npm run test -- detik_com
```

## Scraping techniques
Each script may use a different technique to do the scraping. Here is the list of techniques we use:
- [x] Scraping using the fetch API and JSDOM (for websites that are not rendered with JavaScript)
## Questions and professional services
If you have any questions, please open an issue on this repository. Professional support and consulting are also available; please contact me at <>.
## Sponsor
If you like this project, please consider sponsoring me on this repository. Your sponsorship will help us maintain this project and create more open source projects in the future. Thank you.
## License
MIT
Maintained by Sony Arianto Kurniawan <> and contributors.