https://github.com/danielwte/shein-scraper
Shein Scraper: Enter category URL, get product URLs, details, reviews, and images. Data stored in JSON.
https://github.com/danielwte/shein-scraper
json mongodb proxies python scraper selenium shein user-agents
Last synced: 2 months ago
JSON representation
Shein Scraper: Enter category URL, get product URLs, details, reviews, and images. Data stored in JSON.
- Host: GitHub
- URL: https://github.com/danielwte/shein-scraper
- Owner: DanielWTE
- Created: 2023-07-29T20:29:32.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-01T18:00:13.000Z (6 months ago)
- Last Synced: 2025-05-05T19:58:49.158Z (2 months ago)
- Topics: json, mongodb, proxies, python, scraper, selenium, shein, user-agents
- Language: Python
- Homepage:
- Size: 102 KB
- Stars: 39
- Watchers: 1
- Forks: 7
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Shein Scraper Tool
A modular web scraping tool designed to collect product information from Shein, including product URLs, detailed product information, and reviews. Built with Python and Playwright, featuring advanced anti-detection measures and easy deployment options.
## Features
- Interactive CLI menu interface
- Modular scraping architecture
- Advanced anti-detection features:
- Dynamic browser fingerprinting
- Automated popup handling
- Cookie management
- Geolocation spoofing
- Request header customization
- Configurable scraping parameters
- Docker support for easy deployment
- JSON-based data storage## Project Structure
```
shein-scraper/
├── scraper/
│ ├── product_urls.py # Category page scraping
│ ├── product_details.py # Product information scraping
│ └── reviews.py # Review collection (WIP)
├── utils/
│ ├── browser_config.py # Anti-detection configuration
│ ├── popup_handler.py # Popup management
│ ├── captcha_handler.py # Captcha handling
│ ├── user_agents.py # User agent rotation
│ └── validator.py # URL validation
├── docker-compose.yml # Docker configuration
├── Dockerfile # Container definition
├── requirements.txt # Python dependencies
├── main.py # CLI entry point
└── README.md # Documentation
```## Quick Start
### Using Docker (Recommended)
1. Prerequisites:
- Install [Docker](https://docs.docker.com/get-docker/)2. Clone the repository:
```bash
git clone https://github.com/DanielWTE/shein-scraper.git
cd shein-scraper
```3. Create a local output directory:
```bash
mkdir output
```4. Build the Docker image:
```bash
docker build -t shein-scraper .
```5. Run the scraper in interactive mode with data persistence:
```bash
docker run -it --init -v $(pwd)/output:/app/output shein-scraper
```Note: The -v flag maps your local output directory to the container's output directory:
- Without volume mapping, data will be lost when the container stops
- With volume mapping, all scraped data is saved to your local ./output folder### Local Installation
1. Prerequisites:
- Python 3.12 or higher
- pip (Python package manager)
- Chrome browser2. Clone and setup:
```bash
git clone https://github.com/DanielWTE/shein-scraper.git
cd shein-scraper
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
playwright install chromium
```3. Run the scraper:
```bash
python main.py menu
```## Usage Guide
The tool offers three main functions through an interactive CLI menu:
### 1. Product URL Collector
- Extracts product URLs from category pages
- Input: Category URL (e.g., https://shein.com/women-dresses-c-1727.html)
- Output: JSON file with collected URLs in `output/product_urls_[timestamp].json`### 2. Product Details Extractor
- Gathers detailed product information
- Two modes:
- Single URL mode: Process one product
- Batch mode: Process multiple products from a previous URL collection
- Output: JSON file with product details in `output/product_details_[timestamp].json`### 3. Review Collector
- Currently under development## Data Format
### Product URLs JSON Structure
```json
{
"category_url": "https://shein.com/category",
"total_pages_scraped": 5,
"product_count": 120,
"urls": [
"https://shein.com/product1",
"https://shein.com/product2"
]
}
```### Product Details JSON Structure
```json
{
"total_products": 50,
"scrape_timestamp": 1709142400,
"products": [
{
"url": "https://shein.com/product",
"sku": "sw2401234567890",
"title": "Product Name",
"images": [
"https://shein.com/image1.jpg",
"https://shein.com/image2.jpg"
],
"scraped_at": 1709142400
}
]
}
```## Limitations & Known Issues
- Review collection functionality is under development
- No built-in proxy support
- Basic captcha handling that may require manual intervention
- Some anti-bot detection systems might still detect the scraper