https://github.com/faisal-fida/officemonster-scraper

This project is a comprehensive web scraping application designed to collect product data from a specified e-commerce website. The project consists of two main Jupyter notebooks: URLS Scraper.ipynb and Product Scraper.ipynb.

Host: GitHub
URL: https://github.com/faisal-fida/officemonster-scraper
Owner: faisal-fida
Created: 2023-10-21T08:14:58.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-10-06T08:15:12.000Z (3 months ago)
Last Synced: 2024-11-10T21:16:28.698Z (about 2 months ago)
Topics: eccomerce-platform, python3, scraper
Language: Jupyter Notebook
Homepage:
Size: 1.08 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Officemonster

## Table of Contents

- [Installation](#installation)
- [Usage](#usage)
- [Complexities](#complexities)
- [Solutions](#solutions)
- [Challenges](#challenges)
- [License](#license)

## Installation

1. Clone the repository:

```sh
git clone https://github.com/faisal-fida/officemonster-scraper.git
cd officemonster-scraper
```

2. Install the dependencies:

```sh
pip install -r requirements.txt
```

3. Make sure you have the required CSV files in the `URLS` directory.

## Usage

### URLS Scraper

1. Open `URLS Scraper.ipynb` and run it. The notebook reads every CSV file in the `URLS` directory, concatenates them, and saves the combined data to `combined_urls.csv`.
2. The notebook then extracts the URLs from `combined_urls.csv` and saves them to `urls.csv`.
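
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `combine_url_csvs`, the column name `url`, and the de-duplication step are all assumptions.

```python
from pathlib import Path

import pandas as pd


def combine_url_csvs(csv_dir, url_column="url"):
    """Concatenate every CSV in csv_dir and pull out the URL column.

    `url_column` is an assumed column name; adjust it to match the real files.
    """
    # Sort the paths so the combined output is reproducible between runs.
    frames = [pd.read_csv(path) for path in sorted(Path(csv_dir).glob("*.csv"))]
    if not frames:
        raise FileNotFoundError(f"no CSV files found in {csv_dir}")

    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    combined = pd.concat(frames, ignore_index=True)

    # Assumption: empty and repeated URLs are not wanted in urls.csv.
    urls = combined[url_column].dropna().drop_duplicates().reset_index(drop=True)
    return combined, urls
```

Calling `combine_url_csvs("URLS")` and writing the two results with `to_csv("combined_urls.csv", index=False)` and `to_csv("urls.csv", index=False)` would produce the two files named in the steps.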

### Product Scraper

1. Open `Product Scraper.ipynb` and run it. The notebook reads the URLs from `urls.csv` and scrapes the product details for each one.
2. It uses BeautifulSoup to parse the HTML and extract product information such as title, price, images, and descriptions.
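
As an illustration of the parsing step, the sketch below extracts the four fields from one page's HTML. The CSS selectors and the function name `parse_product` are hypothetical, since the target site's markup is not described here; only the BeautifulSoup calls themselves are standard.

```python
from bs4 import BeautifulSoup


def parse_product(html):
    """Extract title, price, image URLs and description from one product page."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector):
        # Return the stripped text of the first match, or None if it is missing,
        # so a page with an unexpected layout does not raise.
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    return {
        "title": text_of("h1.product-title"),        # hypothetical selector
        "price": text_of("span.price"),              # hypothetical selector
        # Only keep <img> tags that actually carry a src attribute.
        "images": [img["src"] for img in soup.select("div.gallery img[src]")],
        "description": text_of("div.description"),   # hypothetical selector
    }
```

Returning `None` for a missing field, rather than raising, lets the caller decide whether an incomplete record is still worth saving.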

## Complexities

- **Data Aggregation**: Combining multiple CSV files into a single DataFrame while maintaining data integrity.
- **Web Scraping**: Handling dynamic web content and possible changes in website structure.
- **Error Handling**: Managing HTTP errors and ensuring the script continues to run smoothly.

## Solutions

- **Efficient Data Handling**: Used `pandas` to read, concatenate, and save CSV files.
- **Robust Web Scraping**: Implemented functions to handle HTTP requests and parse HTML with BeautifulSoup, ensuring data is extracted even if some pages do not follow the expected structure.
- **Error Management**: Added try-except blocks to catch and log HTTP errors, allowing the script to skip problematic URLs and continue processing others.
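
The skip-and-continue approach to error management might look like the sketch below. It is an assumption-laden illustration: it uses the standard library's `urllib` rather than whichever HTTP client the notebooks use, and the names `fetch_html` and `scrape_all` are hypothetical.

```python
import logging
import urllib.request

logger = logging.getLogger("scraper")


def fetch_html(url, timeout=10):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape_all(urls, fetch=fetch_html):
    """Fetch every URL, logging and skipping the ones that fail."""
    pages, failed = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError) as exc:
            # urllib's URLError and HTTPError are OSError subclasses, as are
            # connection errors and timeouts; ValueError covers malformed URLs.
            logger.warning("skipping %s: %s", url, exc)
            failed.append(url)
    return pages, failed
```

Collecting the failed URLs in a separate list makes it possible to retry just those later instead of rerunning the whole scrape.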

## Challenges

- **Data Consistency**: Ensuring all CSV files have a consistent format and contain valid URLs.
- **Website Variability**: Handling variations in web page design that could affect scraping logic.
- **Performance**: Optimizing the script to handle large datasets and multiple HTTP requests efficiently.
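
One way to approach the data-consistency point is to check each value before it is scraped. The sketch below is only an assumption about what "valid" could mean here (an absolute `http` or `https` URL with a host); the helper names are hypothetical.

```python
from urllib.parse import urlparse


def is_valid_url(value):
    """Return True for strings that look like absolute http(s) URLs."""
    if not isinstance(value, str):
        # Missing CSV cells often arrive as None or float NaN.
        return False
    try:
        parts = urlparse(value.strip())
    except ValueError:
        # urlparse raises ValueError on some malformed inputs, e.g. bad IPv6 hosts.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def split_valid(values):
    """Partition values into (valid URLs, rejected values), preserving order."""
    valid = [v.strip() for v in values if is_valid_url(v)]
    invalid = [v for v in values if not is_valid_url(v)]
    return valid, invalid
```

Keeping the rejected values, rather than silently dropping them, makes it easier to see which input file needs fixing.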