https://github.com/openwpm/openwpm-crawler
A crawler that uses OpenWPM.
- Host: GitHub
- URL: https://github.com/openwpm/openwpm-crawler
- Owner: openwpm
- License: other
- Created: 2018-09-10T20:09:30.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2021-12-26T20:12:58.000Z (over 4 years ago)
- Last Synced: 2025-04-10T13:43:38.141Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 119 KB
- Stars: 12
- Watchers: 9
- Forks: 8
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
# OpenWPM Crawler
Launch OpenWPM crawls using Kubernetes [Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) workloads,
or stand up docker-compose services to run the crawl in a distributed fashion.
A Redis work queue is set up and loaded with the list of URLs to crawl.
Containers running either locally
or in the cloud execute the OpenWPM `crawler.py` script, which continuously fetches sites
from the queue and exits once no sites remain.
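
The work-queue pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in `crawler.py`: an in-memory `deque` stands in for the Redis queue so the example runs without a server, and the names `load_queue`, `run_worker`, and `visit` are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List


def load_queue(urls: List[str]) -> Deque[str]:
    """Load the list of URLs to crawl into the work queue.

    Assumption: a deque stands in for the Redis list used in a real deployment.
    """
    return deque(urls)


def run_worker(queue: Deque[str], visit: Callable[[str], None]) -> int:
    """Fetch sites from the queue until it is empty; return how many were visited."""
    visited = 0
    while True:
        try:
            url = queue.popleft()  # claim the next site from the front of the queue
        except IndexError:
            break  # no additional sites in the queue: the worker exits
        visit(url)  # hypothetical callback; a real worker would drive a browser here
        visited += 1
    return visited


if __name__ == "__main__":
    queue = load_queue(["http://example.com", "http://example.org"])
    seen: List[str] = []
    count = run_worker(queue, seen.append)
    print(f"visited {count} sites: {seen}")
```

Because each worker only pops from the shared queue, more containers can be added without assigning URLs to them up front; with a real Redis queue, the pop would be a Redis list operation rather than `deque.popleft`.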
## Preparations
Install all the required tools (using conda):
```bash
./install.sh
conda activate openwpm-crawler
```
## Run a crawl locally (using Kubernetes)
See [./deployment/local/README.md](./deployment/local/README.md).
## Run a crawl in Google Cloud Platform
See [./deployment/gcp/README.md](./deployment/gcp/README.md).
## Run a crawl locally (using docker-compose)
See [./deployment/local-compose/README.md](./deployment/local-compose/README.md).
This is the simplest option: it requires only docker-compose, which ships with
Docker on both Mac and Windows. However, behaviour might differ slightly from
cloud crawls.
## Analyze crawl results
```bash
jupyter notebook
```
After launching Jupyter, navigate to `analysis/Sample Analysis.ipynb` and choose `Kernel -> Change Kernel -> openwpm-crawler` in the menu.