https://github.com/dansuh17/facecrawler

Distributed, continuous web image crawler.

image selenium webdriver

Host: GitHub
URL: https://github.com/dansuh17/facecrawler
Owner: dansuh17
Created: 2017-09-07T05:42:12.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2022-12-07T23:44:14.000Z (over 3 years ago)
Last Synced: 2025-10-06T20:49:01.499Z (9 months ago)
Topics: image, selenium, webdriver
Language: Python
Homepage:
Size: 179 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 9
Metadata Files:
- Readme: README.md

# Crawlfish

Distributed, continuous image crawler.

## Requirements

Firefox web browser and geckodriver (the WebDriver for Firefox)
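
A quick way to confirm that both prerequisites are installed is to look for their executables on `PATH`. The sketch below assumes the binaries are named `firefox` and `geckodriver`, which is typical on Linux but may differ on other platforms. The helper name `missing_executables` is illustrative and not part of this project.

```python
import shutil


def missing_executables(names=("firefox", "geckodriver")):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("Firefox and geckodriver were found on PATH.")
```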

## Running

Creating a Python virtual environment is recommended.

`python3 -m venv ./venv`

Then install the required packages.

`pip3 install -r requirements.txt`

Monitoring requires a monitor server, which must be running before any crawling node starts.
Start running the monitor server using this command:

`python3 cherryServer.py`
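
Because the monitor server has to be up first, a launcher script could wait for its port before starting a crawling node. The sketch below is not part of this project. The helper name `wait_for_port` is hypothetical, and the address used in the usage note (`127.0.0.1`, port `8080`) is an assumption, so substitute whatever address `cherryServer.py` actually listens on.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to (host, port) succeeds or `timeout` seconds pass.

    Returns True once a connection is accepted, False if the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("127.0.0.1", 8080, timeout=10.0)` returns `True` once the assumed address accepts connections, at which point the crawler can be started.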

Start crawling using the following command:

`python3 crawler.py --[option] [option_value]`

Available options are:
- `--site [site]` target site to crawl (instagram, facebook, etc.)
- `--filter [filter_type]` type of data filter to screen the data (face)
- `--nthread [number_of_threads]` number of threads used to load web driver and start crawling
- `--logpath [folder_name]` folder name to save the logs in
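
For illustration, the options above map naturally onto Python's `argparse`. This is a minimal sketch, not the project's actual `crawler.py`: the defaults, the `int` type for `--nthread`, and the sample values `4` and `logs` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the four documented options (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Distributed, continuous image crawler."
    )
    parser.add_argument("--site", help="target site to crawl, e.g. instagram")
    parser.add_argument("--filter", help="data filter type, e.g. face")
    parser.add_argument("--nthread", type=int, default=1,
                        help="number of threads that load a web driver and crawl")
    parser.add_argument("--logpath", default="logs",
                        help="folder name to save the logs in")
    return parser


# Equivalent to:
#   python3 crawler.py --site instagram --filter face --nthread 4 --logpath logs
args = build_parser().parse_args(
    ["--site", "instagram", "--filter", "face", "--nthread", "4", "--logpath", "logs"]
)
print(args.site, args.filter, args.nthread, args.logpath)
```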

The crawl status can be checked with the monitor reader:

`python3 monitor_read.py`
