Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ghostwords/chameleon-crawler
Browser automation for Chameleon.
- Host: GitHub
- URL: https://github.com/ghostwords/chameleon-crawler
- Owner: ghostwords
- License: mpl-2.0
- Created: 2014-12-03T22:41:22.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2016-09-27T20:34:15.000Z (over 8 years ago)
- Last Synced: 2024-04-14T20:10:25.862Z (10 months ago)
- Topics: chromedriver, selenium
- Language: Python
- Homepage:
- Size: 80.1 KB
- Stars: 19
- Watchers: 8
- Forks: 7
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Chameleon Crawler
Browser automation for [Chameleon](https://github.com/ghostwords/chameleon).
## Setup
- Install Chromium, chromedriver, python3 and xvfb. On Ubuntu:
```
sudo apt-get install chromium-browser chromium-chromedriver python3 xvfb
```
- Install the project's Python dependencies (documented in [requirements.txt](requirements.txt)). You might do this with `virtualenv` and `pip`, or maybe Docker; a minimal sketch follows this list. Note this is a Python 3 project.
- Make sure `chromedriver` is on your `$PATH`. On Ubuntu it isn't by default, so fix that with a symlink:
```
sudo ln -s /usr/lib/chromium-browser/chromedriver /usr/local/bin/chromedriver
```
- If using Ubuntu 14.04, [fix chromedriver's shared libraries error](http://stackoverflow.com/questions/25695299/chromedriver-on-ubuntu-14-04-error-while-loading-shared-libraries-libui-base):
```
echo "/usr/lib/chromium-browser/libs" | sudo tee --append /etc/ld.so.conf.d/chrome_lib.conf >/dev/null
sudo ldconfig
```
- Finally, generate a Chameleon CRX package [by following development setup steps 1 and 4 in Chameleon's checkout](https://github.com/ghostwords/chameleon#development-setup).
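
For example, the dependency-installation step above might look like the following in a fresh checkout. This is a minimal sketch, assuming Python's built-in `venv` module in place of `virtualenv`; the `venv` directory name is arbitrary.

```
# clone the crawler and enter the checkout
git clone https://github.com/ghostwords/chameleon-crawler.git
cd chameleon-crawler

# create an isolated Python 3 environment and install the pinned dependencies
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# confirm chromedriver now resolves on $PATH
which chromedriver
```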
## Usage
Run `./crawl.py /path/to/chameleon.crx` to perform a crawl, or `./crawl.py -h` to see the optional arguments:
```
usage: crawl.py [-h] [--headless | --no-headless] [-n {1,2,3,4,5,6,7,8}] [-q]
                [-t SECONDS] [--urls URL_FILE_PATH]
                CHAMELEON_CRX_FILE_PATH

positional arguments:
  CHAMELEON_CRX_FILE_PATH
                        path to Chameleon CRX package

optional arguments:
  -h, --help            show this help message and exit
  --headless            use a virtual display (default)
  --no-headless
  -n {1,2,3,4,5,6,7,8}  how many browsers to use in parallel (default: 4)
  -q, --quiet           turn off standard output
  -t SECONDS, --timeout SECONDS
                        how many seconds to wait for pages to finish loading
                        before timing out (default: 20)
  --urls URL_FILE_PATH  path to URL list file (default: urls.txt)
```
Run `./view.py` and visit the displayed URL to review crawl results.
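
For instance, a crawl with more parallel browsers, a longer timeout, and a custom URL list might look like this. The `.crx` path and `top-sites.txt` file are placeholders, not files shipped with the project; all flags come from the help text above.

```
# crawl with 8 parallel browsers, a 30-second page-load timeout,
# and a custom URL list instead of the default urls.txt
./crawl.py -n 8 -t 30 --urls top-sites.txt /path/to/chameleon.crx

# review the collected results in a browser
./view.py
```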
## Roadmap
1. Crawl the Alexa Global Top 1,000,000 Sites: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip (a fetch sketch follows this list)
2. Analyze results:
- Discover fingerprinters
- Confirm detection of known fingerprinters
3. Tweak the heuristic to minimize false negatives/positives.
4. Create minisite to chart (the growth of?) fingerprinting across the Web.
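
As a starting point for step 1, the Alexa list could be fetched and converted into a URL file for `--urls`. This is a sketch under two assumptions: that the crawler expects one URL per line (the default `urls.txt` format is not documented here), and that the CSV rows follow Alexa's published `rank,domain` layout.

```
# download and unpack the Alexa top 1M list
wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
unzip top-1m.csv.zip

# keep the first 1,000 domains and prepend a scheme
# so each line is a crawlable URL
head -n 1000 top-1m.csv | cut -d, -f2 | sed 's|^|http://|' > top-sites.txt
```

## Code license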
Mozilla Public License Version 2.0