# Chameleon Crawler

Browser automation for [Chameleon](https://github.com/ghostwords/chameleon).

## Setup

- Install Chromium, chromedriver, python3 and xvfb. On Ubuntu:
```
sudo apt-get install chromium-browser chromium-chromedriver python3 xvfb
```

- Install the project's Python dependencies (documented in [requirements.txt](requirements.txt)). You can do this with `virtualenv` and `pip`, or with Docker. Note that this is a Python 3 project.
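
For example, one minimal way to do this with Python's built-in `venv` module and `pip` (any virtualenv tool works; the environment name here is arbitrary):
```
# Create and activate an isolated Python 3 environment, then install the dependencies.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```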

- Make sure `chromedriver` is in your `$PATH`. On Ubuntu it is not by default, so create a symlink:
```
sudo ln -s /usr/lib/chromium-browser/chromedriver /usr/local/bin/chromedriver
```

- If using Ubuntu 14.04, [fix chromedriver's shared libraries error](http://stackoverflow.com/questions/25695299/chromedriver-on-ubuntu-14-04-error-while-loading-shared-libraries-libui-base):
```
echo "/usr/lib/chromium-browser/libs" | sudo tee --append /etc/ld.so.conf.d/chrome_lib.conf >/dev/null
sudo ldconfig
```

- Finally, generate a Chameleon CRX package [by following development setup steps 1 and 4 in a Chameleon checkout](https://github.com/ghostwords/chameleon#development-setup).

## Usage

Run `./crawl.py /path/to/chameleon.crx` to perform a crawl, or `./crawl.py -h` to see the optional arguments:

```
usage: crawl.py [-h] [--headless | --no-headless] [-n {1,2,3,4,5,6,7,8}] [-q]
                [-t SECONDS] [--urls URL_FILE_PATH]
                CHAMELEON_CRX_FILE_PATH

positional arguments:
  CHAMELEON_CRX_FILE_PATH
                        path to Chameleon CRX package

optional arguments:
  -h, --help            show this help message and exit
  --headless            use a virtual display (default)
  --no-headless
  -n {1,2,3,4,5,6,7,8}  how many browsers to use in parallel (default: 4)
  -q, --quiet           turn off standard output
  -t SECONDS, --timeout SECONDS
                        how many seconds to wait for pages to finish loading
                        before timing out (default: 20)
  --urls URL_FILE_PATH  path to URL list file (default: urls.txt)
```
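
For example, a crawl that uses two parallel browsers, a 30-second page timeout, and a custom URL list might look like the following (the file paths are placeholders):
```
./crawl.py -n 2 -t 30 --urls my_urls.txt /path/to/chameleon.crx
```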

Run `./view.py` and visit the displayed URL to review crawl results.

## Roadmap

1. Crawl the Alexa Global Top 1,000,000 Sites (http://s3.amazonaws.com/alexa-static/top-1m.csv.zip); see the sketch after this list for building a URL list from it.
2. Analyze results:
- Discover fingerprinters
- Confirm detection of known fingerprinters
3. Tweak the heuristic to minimize false negatives/positives.
4. Create a minisite to chart (the growth of?) fingerprinting across the Web.
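
A rough sketch of turning the Alexa CSV (one `rank,domain` pair per line) into a crawl list, assuming `urls.txt` expects one URL per line; check the bundled urls.txt for the exact format the crawler expects:
```
# Download and unpack the Alexa top 1M list.
curl -O http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
unzip top-1m.csv.zip

# Keep the domain column and prefix a scheme (one URL per line assumed).
cut -d, -f2 top-1m.csv | sed 's|^|http://|' > urls.txt
```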

## Code license

Mozilla Public License Version 2.0