https://github.com/martinxcvi/node-web-scraper

Web scraper developed with Node.js and the Puppeteer library.

Host: GitHub
URL: https://github.com/martinxcvi/node-web-scraper
Owner: MartinXCVI
Created: 2025-01-03T22:36:54.000Z (5 months ago)
Default Branch: main
Last Pushed: 2025-01-10T21:09:43.000Z (4 months ago)
Last Synced: 2025-01-24T07:13:03.414Z (4 months ago)
Language: JavaScript
Size: 14.6 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md


# Web Scraper with Puppeteer

## 📄 Introduction
This project is a web scraper built using Node.js and Puppeteer. It extracts information about books from the [Books to Scrape](https://books.toscrape.com) website, including:

- **Title**
- **Price**
- **Stock availability**
- **Rating**
- **Link**

Optionally, it also scrapes additional details for each book:

- **Description**
- **Genre**
- **UPC (Universal Product Code)**

Scraped data is saved as a JSON file for further processing or analysis.

## 📑 Features
- Scrapes multiple pages of books with pagination support.
- Option to include detailed information for each book.
- Saves data in a structured JSON format.
- Command-line arguments and environment variables for flexibility.
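
As a rough illustration of the pagination feature, the sketch below builds the list of catalogue page URLs to visit. The `page-N.html` URL pattern and the `buildPageUrls` helper are assumptions made for this example; the actual script may instead follow the site's "next" link.

```javascript
// Assumed catalogue root for Books to Scrape (not stated in this README).
const BASE_URL = "https://books.toscrape.com/catalogue";

// Hypothetical helper: returns the URLs for pages 1..maxPages.
function buildPageUrls(maxPages) {
  return Array.from(
    { length: maxPages },
    (_, index) => `${BASE_URL}/page-${index + 1}.html`
  );
}

console.log(buildPageUrls(3));
```

Generating the URLs up front keeps the page loop simple, at the cost of relying on a fixed URL scheme.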

## 🛠️ Installation
1. Clone the repository:
```bash
git clone https://github.com/MartinXCVI/node-web-scraper.git
```

2. Navigate to the project directory:
```bash
cd node-web-scraper
```
3. Install dependencies:
```bash
npm install
```

## 📄 Usage
Run the script with Node.js:

```bash
node scrape.js [maxPages] [scrapeDetails]
```

### Arguments
- `maxPages` (optional): Number of pages to scrape (default: `10`).
- `scrapeDetails` (optional): Set to `true` to include additional book details (default: `false`).

### Environment Variables
Alternatively, you can use environment variables:

- `MAX_PAGES`: Number of pages to scrape.
- `SCRAPE_DETAILS`: Set to `true` to include additional book details.

### Example
To scrape 5 pages with detailed book information:
```bash
node scrape.js 5 true
```

Or using environment variables:
```bash
MAX_PAGES=5 SCRAPE_DETAILS=true node scrape.js
```
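
The two invocation styles above could be resolved into a single options object along these lines. This is a minimal sketch, not the code in `scrape.js`: the `resolveOptions` helper is hypothetical, and the precedence (command-line argument first, then environment variable, then default) is an assumption.

```javascript
// Hypothetical helper: merges CLI arguments, environment variables, and defaults.
// Assumed precedence: CLI argument > environment variable > documented default.
function resolveOptions(argv, env) {
  const [maxPagesArg, scrapeDetailsArg] = argv;

  // Fall back to the documented default of 10 when the value is missing or invalid.
  const rawMaxPages = maxPagesArg ?? env.MAX_PAGES ?? "10";
  const parsed = Number.parseInt(rawMaxPages, 10);
  const maxPages = Number.isInteger(parsed) && parsed > 0 ? parsed : 10;

  // Only the string "true" (case-insensitive) enables detail scraping.
  const rawDetails = scrapeDetailsArg ?? env.SCRAPE_DETAILS ?? "false";
  const scrapeDetails = String(rawDetails).toLowerCase() === "true";

  return { maxPages, scrapeDetails };
}

// process.argv holds [nodePath, scriptPath, ...userArgs], so skip the first two.
const options = resolveOptions(process.argv.slice(2), process.env);
console.log(options);
```

Passing `argv` and `env` in as parameters, rather than reading the globals inside the function, makes the resolution logic easy to test in isolation.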

### Output
The scraped data is saved to a file named `books.json` in the project directory. With detail scraping enabled, each entry looks like this:

```json
[
  {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "stock": "In stock",
    "rating": "Three",
    "link": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "description": "It's hard to imagine a world without A Light in the Attic...",
    "genre": "Poetry",
    "upc": "a897fe39b1053632"
  },
  ...
]
```
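
Because `price` and `rating` are stored as strings, further analysis usually needs them converted to numbers first. The sketch below shows one possible post-processing step; the `normalizeBook` helper, the `RATING_WORDS` table, and the added `inStock` field are illustrative and not part of the project.

```javascript
// Maps the rating words used in the output to numeric values.
const RATING_WORDS = { One: 1, Two: 2, Three: 3, Four: 4, Five: 5 };

// Hypothetical helper: converts string fields into analysis-friendly types.
function normalizeBook(book) {
  // Drop the currency symbol and any other character that is not a digit or dot.
  const price = Number.parseFloat(book.price.replace(/[^0-9.]/g, ""));
  return {
    ...book,
    price,
    // Unknown rating words become null rather than a misleading number.
    rating: RATING_WORDS[book.rating] ?? null,
    inStock: book.stock.toLowerCase().includes("in stock"),
  };
}

const sample = {
  title: "A Light in the Attic",
  price: "£51.77",
  stock: "In stock",
  rating: "Three",
};
console.log(normalizeBook(sample));
```

To apply it to the real output, read the file with `fs.readFileSync("books.json", "utf8")`, parse it with `JSON.parse`, and map `normalizeBook` over the resulting array.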

## 🚫 Error Handling
- If a page fails to load, the script logs an error and moves to the next page.
- If additional details for a book fail to scrape, the error is logged, and the script continues with the next book.
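
The log-and-continue behaviour described above can be sketched as a loop with a `try`/`catch` around each page. This is an illustration under stated assumptions rather than the script's actual code: `scrapeAllPages` is a hypothetical name, and `fakeScrapePage` stands in for the real Puppeteer logic so the control flow can run without a browser.

```javascript
// Hypothetical loop: `scrapePage` is injected, so any async page scraper can be used.
async function scrapeAllPages(maxPages, scrapePage) {
  const books = [];
  for (let page = 1; page <= maxPages; page++) {
    try {
      const pageBooks = await scrapePage(page);
      books.push(...pageBooks);
    } catch (error) {
      // Log the failure and move on to the next page instead of aborting the run.
      console.error(`Failed to scrape page ${page}: ${error.message}`);
    }
  }
  return books;
}

// Stand-in for a real page scraper: page 2 fails, the other pages succeed.
async function fakeScrapePage(page) {
  if (page === 2) throw new Error("navigation timeout");
  return [{ title: `Book from page ${page}` }];
}

scrapeAllPages(3, fakeScrapePage).then((books) => {
  console.log(books.map((book) => book.title));
});
```

The same pattern applies one level down: wrapping each book's detail scrape in its own `try`/`catch` lets a single failing book be skipped without losing the rest of the page.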

## 📚 Learn More
- [Node.js latest documentation](https://nodejs.org/docs/latest/api/)
- [Puppeteer official documentation](https://pptr.dev/category/introduction)
- [Books to Scrape website](https://books.toscrape.com/)

## 🧑‍💻 Developer

- [**MartinXCVI**](https://github.com/MartinXCVI)

---

Feel free to contribute, modify or adapt this project to your needs. 🤝
