Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/faisal-fida/officemonster-scraper
This project is a comprehensive web scraping application designed to collect product data from a specified e-commerce website. The project consists of two main Jupyter notebooks: URLS Scraper.ipynb and Product Scraper.ipynb.
- Host: GitHub
- URL: https://github.com/faisal-fida/officemonster-scraper
- Owner: faisal-fida
- Created: 2023-10-21T08:14:58.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-06T08:15:12.000Z (3 months ago)
- Last Synced: 2024-11-10T21:16:28.698Z (about 2 months ago)
- Topics: eccomerce-platform, python3, scraper
- Language: Jupyter Notebook
- Homepage:
- Size: 1.08 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Officemonster
This project is a comprehensive web scraping application designed to collect product data from a specified e-commerce website. The project consists of two main Jupyter notebooks: `URLS Scraper.ipynb` and `Product Scraper.ipynb`.
## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
- [Complexities](#complexities)
- [Solutions](#solutions)
- [Challenges](#challenges)
- [License](#license)

## Installation
1. Clone the repository:
```sh
git clone https://github.com/faisal-fida/officemonster-scraper.git
cd officemonster-scraper
```
2. Install the dependencies:
```sh
pip install -r requirements.txt
```
3. Make sure you have the required CSV files in the `URLS` directory.
## Usage
### URLS Scraper
1. Open `URLS Scraper.ipynb` to run the script that processes multiple CSV files in the `URLS` directory, concatenates them, and saves the combined URLs to `combined_urls.csv`.
2. The script then extracts URLs from `combined_urls.csv` and saves them to `urls.csv`; a sketch of this aggregation step follows below.
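The notebook code is not reproduced in this README, so the following is only a minimal sketch of the aggregation step, assuming the CSVs share a column named `url` (the actual column name is not documented here):

```python
from pathlib import Path

import pandas as pd

# Gather every CSV in the URLS directory and stack them into one DataFrame.
csv_files = sorted(Path("URLS").glob("*.csv"))
combined = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
combined.to_csv("combined_urls.csv", index=False)

# Keep only the URL column (assumed to be named "url"), dropping blanks
# and duplicates, and save the result for the product scraper.
urls = combined["url"].dropna().drop_duplicates()
urls.to_csv("urls.csv", index=False)
```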
### Product Scraper
1. Open `Product Scraper.ipynb` to run the script that reads URLs from `urls.csv` and scrapes product details.
2. The script uses BeautifulSoup to parse HTML content and extract relevant product information such as title, price, images, and descriptions; a sketch of this parsing step follows below.
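As a hedged illustration rather than the notebook's actual code: the CSS selectors below (`.product-title`, `.product-price`, `.product-description`, `.product-gallery img`) are placeholders, since the site's real markup is not described in this README.

```python
import requests
from bs4 import BeautifulSoup

def scrape_product(url: str) -> dict:
    """Fetch one product page and pull out the fields named in the README."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Placeholder selectors; the real page would need to be inspected
    # to choose the right ones.
    title = soup.select_one(".product-title")
    price = soup.select_one(".product-price")
    description = soup.select_one(".product-description")
    images = [img["src"] for img in soup.select(".product-gallery img") if img.get("src")]

    return {
        "url": url,
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
        "description": description.get_text(strip=True) if description else None,
        "images": images,
    }
```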
## Complexities
- **Data Aggregation**: Combining multiple CSV files into a single DataFrame while maintaining data integrity.
- **Web Scraping**: Handling dynamic web content and possible changes in website structure.
- **Error Handling**: Managing HTTP errors and ensuring the script continues to run smoothly.

## Solutions
- **Efficient Data Handling**: Used `pandas` to efficiently read, concatenate, and save CSV files.
- **Robust Web Scraping**: Implemented functions to handle HTTP requests and parse HTML with BeautifulSoup, ensuring data is extracted even if some pages do not follow the expected structure.
- **Error Management**: Added try-except blocks to catch and log HTTP errors, allowing the script to skip problematic URLs and continue processing others; a sketch of this pattern follows below.
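A minimal sketch of that try-except pattern, assuming the `urls.csv` layout from the earlier sketch; the output file name `pages.csv` is likewise an assumption:

```python
import logging

import pandas as pd
import requests

logging.basicConfig(level=logging.INFO)

urls = pd.read_csv("urls.csv")["url"]
pages = []

for url in urls:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        pages.append({"url": url, "html": response.text})
    except requests.RequestException as exc:
        # Log the failure and move on so one bad URL does not stop the run.
        logging.warning("Skipping %s: %s", url, exc)

pd.DataFrame(pages).to_csv("pages.csv", index=False)
```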
## Challenges
- **Data Consistency**: Ensuring all CSV files have a consistent format and contain valid URLs.
- **Website Variability**: Handling variations in web page design that could affect scraping logic.
- **Performance**: Optimizing the script to handle large datasets and multiple HTTP requests efficiently.
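The README does not say how this is optimized; one common approach, shown here purely as an illustration, is to reuse a single `requests.Session` and fan requests out over a small thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

urls = pd.read_csv("urls.csv")["url"].tolist()
session = requests.Session()  # reuses TCP connections across requests

def fetch(url: str) -> str | None:
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None

# A small worker pool keeps several requests in flight without
# overwhelming the target site.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))
```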