https://github.com/venkatarangan/productsdigest
A Python-based web scraper that fetches details from specified product webpages, especially Amazon product pages.
amazon beautifulsoup4 pdf-generation pymupdf selenium-python
- Host: GitHub
- URL: https://github.com/venkatarangan/productsdigest
- Owner: venkatarangan
- License: mit
- Created: 2024-11-05T15:57:15.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-12T06:42:38.000Z (over 1 year ago)
- Last Synced: 2025-01-18T21:19:22.590Z (over 1 year ago)
- Topics: amazon, beautifulsoup4, pdf-generation, pymupdf, selenium-python
- Language: Python
- Homepage: https://venkatarangan.com/blog/category/technology/coding/
- Size: 962 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ProductDigest
## Objective
This project provides a tool to automatically extract and compile webpage details into a well-formatted PDF document. It is particularly tailored to handle Amazon product pages, capturing prices and other key details, but also works with general URLs to gather metadata and generate page previews.
## What It Does
- Reads a list of URLs from a file (`urls.txt`).
- Fetches the title, timestamp, and thumbnail of each webpage.
- Specifically for Amazon URLs, retrieves additional details such as product pricing and description.
- Compiles these details into a structured PDF file (`webpage_details.pdf`), making it convenient for users to view key webpage information offline.
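The first and third bullets can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `read_urls` and `is_amazon_url` are hypothetical, and treating any host with an `amazon` label as an Amazon page is an assumed heuristic.

```python
from pathlib import Path
from urllib.parse import urlparse


def read_urls(path="urls.txt"):
    """Return the non-empty lines of the URL file, stripped of whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def is_amazon_url(url):
    """Guess whether a URL points at an Amazon site (assumed heuristic)."""
    host = (urlparse(url).hostname or "").lower()
    # Matches www.amazon.in or amazon.co.uk, but not notamazon.com.
    return "amazon" in host.split(".")
```

A caller could then loop over `read_urls()` and send each URL to either the Amazon-specific routine or the general one, depending on what `is_amazon_url` returns.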
## Sample Output

The sample output image (not reproduced here) shows a page with product information from Amazon India, generated during a trial run.
## How It Works
1. **URL Processing**: The script reads URLs from a text file (`urls.txt`), handling each URL line-by-line.
2. **Data Extraction**:
   - Uses `Selenium` for automated browsing and scraping.
   - For Amazon pages, specialized routines extract product pricing and details.
   - General URLs are parsed for titles and preview images.
3. **PDF Generation**: Combines the extracted data, arranging each entry with a title, thumbnail, and timestamp, and generates a PDF using `PyMuPDF`.
4. **Error Handling**: Incorporates retry mechanisms for failed URL loads to improve reliability.
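The retry behaviour in step 4 might look like the sketch below. The README does not say how many attempts are made or how long the script waits between them, so `attempts`, `delay`, and the helper name `load_with_retries` are assumptions. `load` stands in for whatever performs the page load, such as a Selenium driver's `get` method.

```python
import time


def load_with_retries(load, url, attempts=3, delay=2.0):
    """Call load(url) until it succeeds or the attempts run out.

    Returns True on success and False if every attempt raised an exception.
    """
    for attempt in range(1, attempts + 1):
        try:
            load(url)
            return True
        except Exception as exc:  # a real script would catch narrower errors
            print(f"Attempt {attempt}/{attempts} failed for {url}: {exc}")
            if attempt < attempts:
                time.sleep(delay)  # pause before trying again
    return False
```

With Selenium this could be called as `load_with_retries(driver.get, url)`, skipping the URL when it returns `False`.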
## Required Packages
To run this script, the following Python packages are required:
- `PyMuPDF` (imported as `fitz`) for PDF creation.
- `selenium` for web scraping and page automation.
- `webdriver_manager` to manage the Edge WebDriver.
- `Pillow` (imported as `PIL`) for image processing.
- `requests` for HTTP requests.
- `beautifulsoup4` for HTML parsing.
Additionally, ensure that:
- Microsoft Edge is installed on your system.
- An internet connection is available.
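One way to confirm that the packages listed above are importable before running the script is a small check like this. It is only a sketch: the helper name `missing_modules` is hypothetical, and the list uses import names rather than pip names.

```python
import importlib.util

# Import names (not pip names) for the packages listed above.
REQUIRED_MODULES = ["fitz", "selenium", "webdriver_manager", "PIL", "requests", "bs4"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the import names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

The check only looks the modules up; it does not import them, and it does not verify that Microsoft Edge is installed.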
## Installation
1. Clone the repository:
```bash
git clone https://github.com/venkatarangan/ProductsDigest.git
cd ProductsDigest
```
2. Install the required Python packages:
```bash
pip install PyMuPDF selenium webdriver_manager Pillow requests beautifulsoup4
```
3. Ensure Microsoft Edge is installed and up-to-date for compatibility with `Selenium`.
## Usage
1. Create a text file named `urls.txt` in the project directory, listing the URLs to process, with one URL per line.
2. Run the script:
```bash
python ProductDigest.py
```
3. The output PDF, `webpage_details.pdf`, will be generated in the project directory.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## Acknowledgement
The basic code was generated from several prompts using GPT-4o and Claude 3.5 Sonnet in Abacus.AI, with further adjustments made to improve accuracy and customize functionality.
## Disclaimer
All product information, price details, and images are the property of their respective owners, including Amazon India. This project uses such information solely for educational and personal purposes, with no commercial intent.