https://github.com/theaifutureguy/Etsy-Scraping
https://github.com/theaifutureguy/Etsy-Scraping
Last synced: 9 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/theaifutureguy/Etsy-Scraping
- Owner: danieladdisonorg
- License: gpl-3.0
- Created: 2025-06-20T09:42:30.000Z (12 months ago)
- Default Branch: master
- Last Pushed: 2025-06-20T10:07:10.000Z (12 months ago)
- Last Synced: 2025-06-20T10:51:00.513Z (12 months ago)
- Language: Python
- Size: 0 Bytes
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: License.txt
Awesome Lists containing this project
README
# Etsy Web Scraper
A Python-based web scraper built with Selenium that extracts listing and shop information from Etsy based on user-defined search terms and pagination limits.
## Features
- Extract comprehensive listing data including pricing, ratings, reviews, and shop information
- Configurable search terms and page limits (up to 240 pages per term)
- Built-in IP blocking mitigation
- Shop name anonymization for privacy
- CSV output with 16 data columns
- Detailed progress tracking and summary reporting
## Prerequisites
- **Python:** 3.9+
- **Dependencies:** See `requirements.txt`
- **WebDriver:** Chrome/Firefox WebDriver (path configurable)
## Installation
1. Clone the repository:
```bash
git clone https://github.com/danieladdisonorg/Etsy-Scraping.git
cd Etsy-Scraping
```
2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
## Configuration
Before running the scraper, configure the following in `scraper_options.py`:
1. **WebDriver Path:** Set the path to your WebDriver executable
2. **Search Terms:** Define your list of search terms in the `search_terms` variable
3. **Page Limit:** Set the number of pages to scrape per search term (max: 240)
## Usage
Run the main scraper:
```bash
python main_scraper.py
```
The scraper will:
- Display progress for each page scraped
- Provide summary statistics for each search term
- Output results to `raw_data.csv`
### File Structure
- `main_scraper.py` - Main execution script
- `scraper_functions.py` - Core scraping functions
- `scraper_options.py` - Configuration settings
- `raw_data.csv` - Example output data
## Data Schema
The scraper extracts 16 columns of data per listing:
| Column | Description | Type |
|--------|-------------|------|
| `Title` | Listing title | String |
| `Shop_Name` | Anonymized shop identifier | Integer |
| `Is_Ad` | Advertisement flag (1/0) | Integer |
| `Star_Rating` | Shop's overall star rating | Float |
| `Num_Reviews` | Total shop reviews | Integer |
| `Price` | Item price | Float |
| `Is_Bestseller` | Bestseller tag flag (1/0) | Integer |
| `Num_Sales` | Total shop sales | Integer |
| `Num_Basket` | Items in customer baskets | Integer |
| `Description` | Listing description | String |
| `Days_to_Arrival` | Estimated delivery days | Integer |
| `Cost_Delivery` | Shipping cost | Float |
| `Returns_Accepted` | Return policy flag (1/0) | Integer |
| `Dispatched_From` | Origin country | String |
| `Num_Images` | Number of listing images | Integer |
| `Category` | Search term used | String |
## Performance
- **Current Speed:** ~15,500 records in 24 hours
- **Rate Limiting:** Built-in delays to prevent IP blocking
- **Memory Usage:** Optimized for large datasets
## Customization Options
### Remove Shop Name Anonymization
To retain actual shop names, comment out the anonymization code in `main_scraper.py`.
### Potential Enhancements
- **Performance:** Implement headless browsing and concurrent processing
- **Additional Data:** Sale status, personalization options, bulk availability
- **Review Analysis:** Extract and analyze customer reviews
- **Advanced Filtering:** Category-specific data extraction
## Known Limitations
- Scraping speed is conservative to avoid IP blocking
- Maximum 240 pages per search term (Etsy limitation)
- Delivery costs displayed in user's local currency
- Requires active internet connection and WebDriver maintenance
## Troubleshooting
### Common Issues
- **WebDriver errors:** Ensure WebDriver version matches your browser
- **IP blocking:** Increase delays in scraper settings
- **Missing data:** Check Etsy's page structure for changes
## Contributing
Contributions are welcome! Please feel free to:
- Submit bug reports and feature requests
- Improve scraping efficiency
- Add new data extraction capabilities
- Enhance error handling
## Resources
- [Web Scraping Best Practices](https://medium.com/swlh/improve-your-web-scraper-with-limited-retry-loops-python-35e21730cbf5)
- [Selenium Documentation](https://selenium-python.readthedocs.io/)
## License
This project is provided as-is for educational and research purposes. Please ensure compliance with Etsy's Terms of Service and robots.txt when using this scraper.
## Contact
For questions or support, please open an issue on GitHub.