{"id":16170481,"url":"https://github.com/danielwte/shein-scraper","last_synced_at":"2025-05-12T21:11:06.268Z","repository":{"id":184721265,"uuid":"672364648","full_name":"DanielWTE/shein-scraper","owner":"DanielWTE","description":"Shein Scraper: Enter category URL, get product URLs, details, reviews, and images. Data stored in JSON.","archived":false,"fork":false,"pushed_at":"2025-01-01T18:00:13.000Z","size":104,"stargazers_count":39,"open_issues_count":1,"forks_count":7,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-05T19:58:49.158Z","etag":null,"topics":["json","mongodb","proxies","python","scraper","selenium","shein","user-agents"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DanielWTE.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-29T20:29:32.000Z","updated_at":"2025-04-10T16:36:38.000Z","dependencies_parsed_at":"2024-11-02T13:41:28.208Z","dependency_job_id":"ef97dd07-9d1d-414d-835b-3c234502d0ff","html_url":"https://github.com/DanielWTE/shein-scraper","commit_stats":null,"previous_names":["danielwte/shein-scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielWTE%2Fshein-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielWTE%2Fshein-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielWTE%2Fshein-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/DanielWTE%2Fshein-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DanielWTE","download_url":"https://codeload.github.com/DanielWTE/shein-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253823453,"owners_count":21969848,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["json","mongodb","proxies","python","scraper","selenium","shein","user-agents"],"created_at":"2024-10-10T03:18:53.029Z","updated_at":"2025-05-12T21:11:06.247Z","avatar_url":"https://github.com/DanielWTE.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Shein Scraper Tool\n\nA modular web scraping tool designed to collect product information from Shein, including product URLs, detailed product information, and reviews. 
Built with Python and Playwright, featuring advanced anti-detection measures and easy deployment options.\n\n## Features\n\n- Interactive CLI menu interface\n- Modular scraping architecture\n- Advanced anti-detection features:\n  - Dynamic browser fingerprinting\n  - Automated popup handling\n  - Cookie management\n  - Geolocation spoofing\n  - Request header customization\n- Configurable scraping parameters\n- Docker support for easy deployment\n- JSON-based data storage\n\n## Project Structure\n\n```\nshein-scraper/\n├── scraper/\n│   ├── product_urls.py     # Category page scraping\n│   ├── product_details.py  # Product information scraping\n│   └── reviews.py         # Review collection (WIP)\n├── utils/\n│   ├── browser_config.py   # Anti-detection configuration\n│   ├── popup_handler.py    # Popup management\n│   ├── captcha_handler.py  # Captcha handling\n│   ├── user_agents.py      # User agent rotation\n│   └── validator.py        # URL validation\n├── docker-compose.yml      # Docker configuration\n├── Dockerfile             # Container definition\n├── requirements.txt       # Python dependencies\n├── main.py               # CLI entry point\n└── README.md             # Documentation\n```\n\n## Quick Start\n\n### Using Docker (Recommended)\n\n1. Prerequisites:\n   - Install [Docker](https://docs.docker.com/get-docker/)\n\n2. Clone the repository:\n   ```bash\n   git clone https://github.com/DanielWTE/shein-scraper.git\n   cd shein-scraper\n   ```\n\n3. Create a local output directory:\n   ```bash\n   mkdir output\n   ```\n\n4. Build the Docker image:\n   ```bash\n   docker build -t shein-scraper .\n   ```\n\n5. 
Run the scraper in interactive mode with data persistence:\n   ```bash\n   docker run -it --init -v $(pwd)/output:/app/output shein-scraper\n   ```\n\nNote: The -v flag maps your local output directory to the container's output directory:\n- Without volume mapping, scraped data stays inside the container and is lost when the container is removed\n- With volume mapping, all scraped data is saved to your local ./output folder\n\n### Local Installation\n\n1. Prerequisites:\n   - Python 3.12 or higher\n   - pip (Python package manager)\n   - Chrome browser\n\n2. Clone and set up:\n   ```bash\n   git clone https://github.com/DanielWTE/shein-scraper.git\n   cd shein-scraper\n   python -m venv venv\n   source venv/bin/activate  # On Windows: venv\\Scripts\\activate\n   pip install -r requirements.txt\n   playwright install chromium\n   ```\n\n3. Run the scraper:\n   ```bash\n   python main.py menu\n   ```\n\n## Usage Guide\n\nThe tool offers three main functions through an interactive CLI menu:\n\n### 1. Product URL Collector\n- Extracts product URLs from category pages\n- Input: Category URL (e.g., https://shein.com/women-dresses-c-1727.html)\n- Output: JSON file with collected URLs in `output/product_urls_[timestamp].json`\n\n### 2. Product Details Extractor\n- Gathers detailed product information\n- Two modes:\n  - Single URL mode: Process one product\n  - Batch mode: Process multiple products from a previous URL collection\n- Output: JSON file with product details in `output/product_details_[timestamp].json`\n\n### 3. 
Review Collector\n- Currently under development\n\n## Data Format\n\n### Product URLs JSON Structure\n```json\n{\n    \"category_url\": \"https://shein.com/category\",\n    \"total_pages_scraped\": 5,\n    \"product_count\": 120,\n    \"urls\": [\n        \"https://shein.com/product1\",\n        \"https://shein.com/product2\"\n    ]\n}\n```\n\n### Product Details JSON Structure\n```json\n{\n    \"total_products\": 50,\n    \"scrape_timestamp\": 1709142400,\n    \"products\": [\n        {\n            \"url\": \"https://shein.com/product\",\n            \"sku\": \"sw2401234567890\",\n            \"title\": \"Product Name\",\n            \"images\": [\n                \"https://shein.com/image1.jpg\",\n                \"https://shein.com/image2.jpg\"\n            ],\n            \"scraped_at\": 1709142400\n        }\n    ]\n}\n```\n\n## Limitations \u0026 Known Issues\n\n- Review collection functionality is under development\n- No built-in proxy support\n- Basic captcha handling that may require manual intervention\n- Some anti-bot detection systems might still detect the scraper\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanielwte%2Fshein-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanielwte%2Fshein-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanielwte%2Fshein-scraper/lists"}