https://github.com/dansuh17/facecrawler
Distributed, continuous web image crawler.
- Host: GitHub
- URL: https://github.com/dansuh17/facecrawler
- Owner: dansuh17
- Created: 2017-09-07T05:42:12.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2022-12-07T23:44:14.000Z (over 3 years ago)
- Last Synced: 2025-10-06T20:49:01.499Z (9 months ago)
- Topics: image, selenium, webdriver
- Language: Python
- Homepage:
- Size: 179 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 9
Metadata Files:
- Readme: README.md
README
# Crawlfish
Distributed, continuous image crawler.
## Requirements
Firefox web browser and geckodriver (the WebDriver for Firefox)
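Before crawling, it can help to confirm that geckodriver is installed. The following is a minimal sketch, assuming the crawler expects the `geckodriver` executable to be discoverable on `PATH` (the README does not say how the driver is located). The helper name `find_geckodriver` is hypothetical and not part of this project.

```python
import shutil


def find_geckodriver(name: str = "geckodriver"):
    """Return the full path of the geckodriver executable, or None if absent."""
    # shutil.which searches the directories listed in the PATH variable.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_geckodriver()
    if path is None:
        print("geckodriver not found on PATH; install it before crawling")
    else:
        print(f"geckodriver found at {path}")
```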
## Running
Creating a Python virtual environment is recommended:
`python3 -m venv ./venv`
Then install the required packages:
`pip3 install -r requirements.txt`
For monitoring to work, the monitor server must be running before any crawling node starts.
Start the monitor server with this command:
`python3 cherryServer.py`
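Because the monitor server has to be up before a crawling node starts, a small readiness check can be useful in launch scripts. This is only a sketch: the function `wait_for_server` is hypothetical, and the host and port used below are assumptions (check `cherryServer.py` for the address it actually binds to).

```python
import socket
import time


def wait_for_server(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # server not ready yet; retry shortly
    return False


if __name__ == "__main__":
    # Assumed address: replace with the host/port configured in cherryServer.py.
    MONITOR_HOST, MONITOR_PORT = "127.0.0.1", 8080
    if wait_for_server(MONITOR_HOST, MONITOR_PORT, timeout=5.0):
        print("monitor server is reachable")
    else:
        print("monitor server did not respond; start cherryServer.py first")
```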
Start crawling with the following command:
`python3 crawler.py --[option] [option_value]`
Available options are:
- `--site [site]` target site to crawl (instagram, facebook, etc.)
- `--filter [filter_type]` type of data filter to screen the data (face)
- `--nthread [number_of_threads]` number of threads used to load web driver and start crawling
- `--logpath [folder_name]` folder name to save the logs in
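To illustrate how the options listed above fit together, here is a rough sketch using `argparse` and `threading`. It is not the project's `crawler.py`: the default values, the helper names `build_parser` and `run_workers`, and the placeholder worker are all assumptions made for illustration.

```python
import argparse
import threading


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four documented options."""
    parser = argparse.ArgumentParser(description="Crawler option sketch")
    # Defaults below are illustrative assumptions, not the project's values.
    parser.add_argument("--site", default="instagram", help="target site to crawl")
    parser.add_argument("--filter", default="face", help="type of data filter")
    parser.add_argument("--nthread", type=int, default=1, help="number of threads")
    parser.add_argument("--logpath", default="logs", help="folder to save logs in")
    return parser


def run_workers(nthread, work):
    """Start nthread threads, each calling work(index), and wait for all of them."""
    threads = [threading.Thread(target=work, args=(i,)) for i in range(nthread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Placeholder worker: a real crawler would load a web driver here.
    run_workers(args.nthread, lambda i: print(f"worker {i}: {args.site} / {args.filter}"))
```

Under these assumptions, an invocation such as `python3 crawler.py --site instagram --filter face --nthread 4 --logpath logs` would start four worker threads.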
The status of crawling may be monitored using the monitor reader.
`python3 monitor_read.py`