Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/daveanthonyc/webscraper-test
Webscraper using Puppeteer to scrape 7000+ rows of data.
JSON representation
- Host: GitHub
- URL: https://github.com/daveanthonyc/webscraper-test
- Owner: daveanthonyc
- Created: 2024-07-17T07:53:42.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-08-03T08:19:56.000Z (3 months ago)
- Last Synced: 2024-10-17T22:28:55.693Z (29 days ago)
- Topics: dataminig, nodejs, puppeteer, webscraping
- Language: JavaScript
- Homepage:
- Size: 10.4 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Webscraper using Puppeteer
This project scrapes the paginated table of registered marriage celebrants in Australia, filtered to those in NSW. There were issues finding a reliable exit condition to stop the scraping loop, so I hard-coded a fixed number of iterations: each iteration scrapes the current table page and paginates to the next.
The website it scrapes is https://marriage.ag.gov.au/statecelebrants/state.
The table's HTML structure isn't straightforward to scrape: it contains header rows interspersed inside the table body, and some columns hold no data. The resulting scrape outputs **7000+** rows of data. :)
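The loop described above can be sketched roughly as follows. This is a sketch only, assuming Puppeteer's `page` API; the selectors, helper names, and page count are illustrative, not taken from the repo's `scrape.js`:

```javascript
// Keep a row only if it has at least one non-empty cell, skipping the
// inline header rows and empty columns described above.
function isDataRow(cells) {
  return cells.length > 0 && cells.some((c) => c.trim() !== "");
}

// Collect the text of every <td> in every <tr>; "table tr" is an
// assumed selector, not the site's real markup.
async function scrapeTable(page) {
  const rows = await page.$$eval("table tr", (trs) =>
    trs.map((tr) =>
      Array.from(tr.querySelectorAll("td"), (td) => td.textContent.trim())
    )
  );
  return rows.filter(isDataRow);
}

// Hard-coded iteration count instead of an exit condition, as
// described above; 700 and the "next" selector are placeholders.
async function scrapeAll(page, pageCount = 700) {
  const all = [];
  for (let i = 0; i < pageCount; i++) {
    all.push(...(await scrapeTable(page)));
    await page.click("a[rel='next']"); // hypothetical pagination control
    await page.waitForNetworkIdle();
  }
  return all;
}

module.exports = { isDataRow, scrapeTable, scrapeAll };
```

Filtering in Node (after `$$eval` returns) rather than in the browser keeps the page-side callback trivial and makes the row filter easy to unit-test.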
# Installation
*Clone repo*
```bash
git clone git@github.com:daveanthonyc/Webscraper-Test.git
```

*Install dependencies*
```bash
npm install
```

# Run the script
```bash
node scrape.js
```

You should see a browser instance launch, apply the 'NSW' filter, and then paginate through to the end of the results.
After the script finishes, you should find an output.xlsx file in your project directory.