https://github.com/official-imvoiid/multifetch
A high-performance web scraper for bulk image and GIF extraction from reliable sources — built for AI/ML data pipelines and large-scale media collection
aiml data dataset gifscraper imagescraper python pythontool tools webscraper windows
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/official-imvoiid/multifetch
- Owner: official-imvoiid
- License: mit
- Created: 2025-06-19T23:31:44.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-06-19T23:39:50.000Z (4 months ago)
- Last Synced: 2025-06-20T00:32:14.544Z (4 months ago)
- Topics: aiml, data, dataset, gifscraper, imagescraper, python, pythontool, tools, webscraper, windows
- Language: Python
- Homepage:
- Size: 46.9 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# WebScap CLI
A powerful command-line image scraping tool designed for AI/ML research, dataset creation, and personal use. WebScap provides easy access to images from multiple platforms to help build comprehensive datasets for AI model training.
## 🚀 Motivation
This project was born from the challenge of finding quality datasets for AI Text-To-Video model training. WebScap solves this problem by providing a simple, efficient way to gather large image datasets from various platforms, enabling:
- **Training Data Collection**: Build robust datasets for AI model training
- **Topic Understanding**: Help AI models understand specific subjects through visual data
- **Vision Capability Enhancement**: Improve model performance by including diverse image datasets

## ✨ Features
WebScap provides the following scrapers and image utilities:
- **Pinterest** - Creative inspiration and lifestyle images
- **DeviantArt** - Digital art and creative content
- **Pixiv Art** - Japanese illustration and artwork (Requires PHPSESSID)
- **Civitai** - AI-generated art and models (Requires API)
- **Google Images** - Comprehensive web image search
- **WebScap GIF** - Specialized GIF collection and animation scraping
- **StaticPage** - Extract images from static websites and HTML pages
- **Image Upscaler** - Enhance image quality automatically
- **Image Converter** - Convert images between different formats

## 📋 Requirements
### System Requirements
- **Operating System**: Windows 11
- **Python**: Version 3.8+
- **Browser**: Google Chrome (installed and set as default)

### API Requirements
- **Pixiv**: PHPSESSID token required
- **Civitai**: API key required

## 📊 Performance
- **Tested Capacity**: Successfully scraped 1,700+ images
- **API Calls**: Handles 200+ API requests efficiently
- **GIF Support**: Optimized for animated content collection
- **Static Pages**: Efficiently extracts images from HTML/CSS structures
- **Scalability**: Potentially supports larger volumes (untested)

## 🔒 Content Policy & NSFW Handling
WebScap respects platform-specific content policies and user preferences:
### NSFW Content Management
- **Default Behavior**: Platforms maintain their original NSFW/SFW structure
- **User Control**: Content filtering depends on your platform account settings
- **Safe Mode**: Enable "Safe=ON" in your account settings to avoid NSFW content on supported platforms
- **Platform Respect**: No modification of platform content policies - choice remains with users

### Supported Platforms NSFW Policy
- ✅ **Google Images**: Follows your SafeSearch settings
- ✅ **Pinterest**: Default Safe Setting is on
- ✅ **DeviantArt**: Default Safe Setting is on
- ✅ **Pixiv**: Follows account content filters
- ✅ **Civitai**: Respects platform content settings

## ⚠️ Important Disclaimer
**Developer Responsibility Notice**:
The developers are not responsible for user actions. Please use this tool responsibly and ethically.

### Acceptable Use
✅ **Permitted Uses:**
- AI/ML research and development
- Academic research projects
- Personal dataset creation
- Fair use educational purposes

❌ **Prohibited Uses:**
- Commercial redistribution without permission
- Violation of platform Terms of Service
- Copyright infringement
- Malicious or harmful activities

### Legal Compliance
- Always follow platform Terms of Service
- Respect copyright and intellectual property rights
- Use scraped content within fair use guidelines
- Ensure compliance with local laws and regulations

## 🛠️ Installation
```bash
# Clone the repository
git clone https://github.com/official-imvoiid/MultiFetch.git

# Navigate to project directory
cd MultiFetch

# Install dependencies
pip install -r requirements.txt
```

## 🔧 Configuration
### Required Setup
1. Ensure Chrome is installed and set as default browser
2. Obtain necessary API keys/tokens:
   - **Pixiv**: Get your PHPSESSID from browser cookies
   - **Civitai**: Register and obtain an API key

### Platform Account Settings
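
The PHPSESSID and API key from Required Setup are secrets, so it is safer to keep them out of source files. The sketch below shows one possible way to do that with environment variables. It is not WebScap's own configuration mechanism: the variable names `PIXIV_PHPSESSID` and `CIVITAI_API_KEY` and the helper `load_credentials` are hypothetical names chosen for illustration.

```python
import os


def load_credentials(env=None):
    """Collect optional scraper credentials from environment variables.

    PIXIV_PHPSESSID and CIVITAI_API_KEY are hypothetical names used only in
    this sketch; WebScap itself may expect the values to be supplied elsewhere.
    """
    env = os.environ if env is None else env
    wanted = {"pixiv": "PIXIV_PHPSESSID", "civitai": "CIVITAI_API_KEY"}
    found, missing = {}, []
    for platform, var in wanted.items():
        # Treat unset and whitespace-only values the same way.
        value = env.get(var, "").strip()
        if value:
            found[platform] = value
        else:
            missing.append(var)
    return found, missing


if __name__ == "__main__":
    creds, missing = load_credentials()
    for var in missing:
        print(f"{var} is not set; the matching scraper will be unavailable")
    print("Credentials found for:", ", ".join(sorted(creds)) or "none")
```

Checking for missing values before a session starts, rather than failing partway through a scrape, makes it clear which scrapers can run.
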
For optimal results and content filtering:
1. Configure your account settings on each platform
2. Set appropriate content filters (Safe=ON for family-friendly content)
3. Adjust privacy and content preferences as needed

## 📜 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
**Remember**: Always scrape responsibly and ethically. Respect platform terms of service and copyright laws.