https://github.com/arutselvan/imgscrapy
A simple and fast CLI for multithreaded image scraping with support for headless scraping of dynamic websites.
- Host: GitHub
- URL: https://github.com/arutselvan/imgscrapy
- Owner: Arutselvan
- License: mit
- Created: 2017-02-12T18:32:33.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2023-05-22T23:27:22.000Z (over 2 years ago)
- Last Synced: 2025-04-13T00:34:48.854Z (6 months ago)
- Topics: cli, downloader, image-downloader, image-downloader-python, image-scraper, python, scrapper
- Language: Python
- Homepage:
- Size: 24.4 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# imgscrapy


A simple CLI image scraper written in Python, with support for headless scraping of dynamic websites.

#### Installation
##### Build from source
+ `git clone https://github.com/arutselvan/ImgScrapy`
+ `cd ImgScrapy`
+ `python setup.py install`

##### As a Python package
```
pip install --user imgscrapy
```

#### Requirements
python>=3.6

#### Usage
```
usage: imgscrapy [-h] [-d DIRECTORY] [-i] [-n NFIRST] [-t NTHREADS] [-hd] [-to TIMEOUT] target_url

Downloads images from the given URL

positional arguments:
  target_url            URL to scrape images from

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Directory in which images should be downloaded
  -i, --injected        Scrape images from a dynamic website and JS injected images
  -n NFIRST, --nfirst NFIRST
                        Scrape the first n images
  -t NTHREADS, --nthreads NTHREADS
                        Maximum number of threads to use
  -hd, --head           Open chromium for scraping JS injected source/images
  -to TIMEOUT, --timeout TIMEOUT
                        Timeout value for obtaining page source
```
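
For readers who want to see how the options above fit together, here is a minimal `argparse` sketch that mirrors the help text. It is not taken from the imgscrapy source: the function name `build_parser`, the `int` types for `--nfirst`, `--nthreads`, and `--timeout`, and the example URL are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the options listed in the help text.

    Hypothetical reconstruction: the value types below are assumed,
    since the help text does not state them.
    """
    parser = argparse.ArgumentParser(
        prog="imgscrapy",
        description="Downloads images from the given URL",
    )
    parser.add_argument("target_url", help="URL to scrape images from")
    parser.add_argument("-d", "--directory",
                        help="Directory in which images should be downloaded")
    parser.add_argument("-i", "--injected", action="store_true",
                        help="Scrape images from a dynamic website and JS injected images")
    parser.add_argument("-n", "--nfirst", type=int,  # assumed integer count
                        help="Scrape the first n images")
    parser.add_argument("-t", "--nthreads", type=int,  # assumed integer count
                        help="Maximum number of threads to use")
    parser.add_argument("-hd", "--head", action="store_true",
                        help="Open chromium for scraping JS injected source/images")
    parser.add_argument("-to", "--timeout", type=int,  # assumed integer; unit not stated
                        help="Timeout value for obtaining page source")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["-i", "--nfirst", "5", "https://example.com/gallery"]
)
print(args.injected, args.nfirst, args.target_url)
```

Flags that are not passed fall back to `argparse` defaults here (`None` for valued options, `False` for the switches); the real tool may define different defaults.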
#### Examples
+ Download all images from a static website
```
imgscrapy <target_url>
```
+ Download the first 5 images from a dynamic website
```
imgscrapy -i --nfirst 5 <target_url>
```

##### Note
ImgScrapy uses [pyppeteer](https://github.com/miyakogi/pyppeteer), which uses Chromium for headless scraping. When scraping a dynamic website for the first time, Chromium will be downloaded automatically, which might take some time.

#### To Do
+ Write tests
+ Add support for Base64 images
+ Add support for embedded/inline svg files
+ Fix issues with headless browsing of dynamic sites with modal/popup
+ Fix issue with missing trailing slash in URL resolution
+ Add option to dump URL of downloaded/failed images

License
----
MIT
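
##### Note on trailing slashes in URL resolution
The To Do item about a missing trailing slash can be illustrated with the standard library. Whether imgscrapy resolves image paths with `urllib.parse.urljoin` is an assumption; the sketch only shows how standard relative-URL resolution treats a base URL with and without a trailing slash. The helper name `resolve_image_url` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_image_url(page_url: str, src: str) -> str:
    """Resolve an <img> src attribute against the page it was found on.

    Hypothetical helper, not part of imgscrapy.
    """
    return urljoin(page_url, src)


# Without a trailing slash, the last path segment ("gallery") is treated
# as a file name and is replaced by the relative path.
print(resolve_image_url("https://example.com/gallery", "img/cat.png"))
# https://example.com/img/cat.png

# With a trailing slash, "gallery/" is treated as a directory and is kept.
print(resolve_image_url("https://example.com/gallery/", "img/cat.png"))
# https://example.com/gallery/img/cat.png
```

Passing the target URL exactly as the site serves it (including any trailing slash) avoids the first case.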