https://github.com/martinxcvi/node-web-scraper
Web scraper developed with Node.js and the Puppeteer library.
- Host: GitHub
- URL: https://github.com/martinxcvi/node-web-scraper
- Owner: MartinXCVI
- Created: 2025-01-03T22:36:54.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-01-10T21:09:43.000Z (4 months ago)
- Last Synced: 2025-01-24T07:13:03.414Z (4 months ago)
- Language: JavaScript
- Size: 14.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
# Web Scraper with Puppeteer
## 📄 Introduction
This project is a web scraper built using Node.js and Puppeteer. It extracts information about books from the [Books to Scrape](https://books.toscrape.com) website, including:

- **Title**
- **Price**
- **Stock availability**
- **Rating**
- **Link**

Optionally, it also scrapes additional details for each book:

- **Description**
- **Genre**
- **UPC (Universal Product Code)**

Scraped data is saved as a JSON file for further processing or analysis.
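
As an illustration of the "further processing" step, here is a minimal sketch that converts the string fields of a scraped record into typed values. It assumes records shaped like the sample output shown later in this README (`price` such as `"£51.77"`, `rating` as an English word, `stock` as free text). The helper name `normalizeBook` and the added fields `priceValue`, `ratingValue` and `inStock` are hypothetical and not part of the project.

```javascript
// Hypothetical post-processing helper; not part of the project's scrape.js.
// Maps the rating words used in the sample output to numbers.
const RATING_WORDS = new Map([
  ["One", 1],
  ["Two", 2],
  ["Three", 3],
  ["Four", 4],
  ["Five", 5],
]);

function normalizeBook(book) {
  return {
    ...book,
    // Strip the currency symbol and keep only digits and the decimal point.
    priceValue: Number.parseFloat(String(book.price).replace(/[^0-9.]/g, "")),
    // Unknown rating words become null instead of throwing.
    ratingValue: RATING_WORDS.get(book.rating) ?? null,
    // Assumption: availability text contains "In stock" when the book is available.
    inStock: /in stock/i.test(String(book.stock)),
  };
}

const sample = {
  title: "A Light in the Attic",
  price: "£51.77",
  stock: "In stock",
  rating: "Three",
};
console.log(normalizeBook(sample));
```

The original string fields are kept alongside the derived ones, so the saved JSON can still be compared with what appears on the site.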
## 📑 Features
- Scrapes multiple pages of books with pagination support.
- Option to include detailed information for each book.
- Saves data in a structured JSON format.
- Command-line arguments and environment variables for flexibility.

## 🛠️ Installation
1. Clone the repository:
```bash
git clone https://github.com/MartinXCVI/node-web-scraper.git
```
2. Navigate to the project directory:
```bash
cd node-web-scraper
```
3. Install dependencies:
```bash
npm install
```

## 📄 Usage
Run the script with Node.js:

```bash
node scrape.js [maxPages] [scrapeDetails]
```

### Arguments
- `maxPages` (optional): Number of pages to scrape (default: `10`).
- `scrapeDetails` (optional): Set to `true` to include additional book details (default: `false`).

### Environment Variables
Alternatively, you can use environment variables:

- `MAX_PAGES`: Number of pages to scrape.
- `SCRAPE_DETAILS`: Set to `true` to include additional book details.

### Example
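
As a complement to the command-line examples, the following is a minimal sketch of how a script could resolve `maxPages` and `scrapeDetails` from positional arguments and environment variables. The defaults (`10` and `false`) come from this README. The helper name `resolveOptions`, the rule that command-line arguments take precedence over environment variables, and the fallback to the default on invalid input are assumptions, since the README does not describe how `scrape.js` handles these cases.

```javascript
// Hypothetical helper; the project's scrape.js may resolve options differently.
function resolveOptions(argv, env) {
  const [maxPagesArg, scrapeDetailsArg] = argv;

  // Assumption: a positional argument wins over the environment variable.
  const rawMaxPages = maxPagesArg ?? env.MAX_PAGES;
  const parsedMaxPages = Number.parseInt(rawMaxPages, 10);
  // Fall back to the documented default of 10 when the value is missing or invalid.
  const maxPages =
    Number.isInteger(parsedMaxPages) && parsedMaxPages > 0 ? parsedMaxPages : 10;

  // Only the literal string "true" (case-insensitive) enables detail scraping.
  const rawDetails = scrapeDetailsArg ?? env.SCRAPE_DETAILS ?? "false";
  const scrapeDetails = String(rawDetails).toLowerCase() === "true";

  return { maxPages, scrapeDetails };
}

// process.argv.slice(2) drops the node binary and script path.
const options = resolveOptions(process.argv.slice(2), process.env);
console.log(options);
```

Keeping the resolution in a pure function that receives `argv` and `env` makes it easy to test without launching a browser.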
To scrape 5 pages with detailed book information:
```bash
node scrape.js 5 true
```

Or using environment variables:
```bash
MAX_PAGES=5 SCRAPE_DETAILS=true node scrape.js
```

### Output
The scraped data is saved to a file named `books.json` in the project directory. The data format looks like this:

```json
[
  {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "stock": "In stock",
    "rating": "Three",
    "link": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "description": "It's hard to imagine a world without A Light in the Attic...",
    "genre": "Poetry",
    "upc": "a897fe39b1053632"
  },
  ...
]
```

## 🚫 Error Handling
- If a page fails to load, the script logs an error and moves to the next page.
- If additional details for a book fail to scrape, the error is logged, and the script continues with the next book.

## 📚 Learn More
- [Node.js latest documentation](https://nodejs.org/docs/latest/api/)
- [Puppeteer official documentation](https://pptr.dev/category/introduction)
- [Books to Scrape website](https://books.toscrape.com/)

## 🧑‍💻 Developer:
- [**MartinXCVI**](https://github.com/MartinXCVI)
---
Feel free to contribute, modify or adapt this project to your needs. 🤝