# DuckDuckGo Tracker Radar Collector
🕸 Modular, multithreaded, [puppeteer](https://github.com/GoogleChrome/puppeteer)-based crawler used to generate third party request data for the [Tracker Radar](https://github.com/duckduckgo/tracker-radar).

## How do I use it?

### Use it from the command line

1. Clone this project locally (`git clone git@github.com:duckduckgo/tracker-radar-collector.git`)
2. Install all dependencies (`npm i`)
3. Run the command line tool:

```sh
npm run crawl -- -u "https://example.com" -o ./data/ -v
```

Available options:

- `-o, --output <path>` - (required) output folder where output files will be created
- `-u, --url <url>` - single URL to crawl
- `-i, --input-list <path>` - path to a text file with a list of URLs to crawl (one per line)
- `-d, --data-collectors <list>` - comma-separated list (e.g. `-d 'requests,cookies'`) of data collectors that should be used (all by default)
- `-c, --crawlers <number>` - override the default number of concurrent crawlers (the default is picked based on the number of CPU cores)
- `--reporters <list>` - comma-separated list (e.g. `--reporters 'cli,file,html'`) of reporters to be used ('cli' by default)
- `-v, --verbose` - instructs reporters to log additional information (e.g. the "cli" reporter's progress bar is not shown when verbose logging is enabled)
- `-l, --log-path <path>` - instructs reporters where all logs should be written
- `-f, --force-overwrite` - overwrite existing output files (by default entries with existing output files are skipped)
- `-3, --only-3p` - don't save any first-party data (e.g. requests or API calls for the same eTLD+1 as the main document)
- `-m, --mobile` - emulate a mobile device when crawling
- `-p, --proxy-config <host>` - optional SOCKS proxy host
- `-r, --region-code <code>` - optional two-letter region code, used for metadata only
- `-a, --disable-anti-bot` - disable the simple built-in anti-bot-detection script injected into every frame
- `--chromium-version <version>` - use a custom version of Chromium (e.g. "843427") instead of the default
- `--config <path>` - path to a config file that allows you to set all of the above settings (and more). Note that CLI flags have a higher priority than settings passed via config. You can find a sample config file in `tests/cli/sampleConfig.json`.
- `--autoconsent-action <action>` - automatic autoconsent action (requires the `cmps` collector). Possible values: `optIn`, `optOut`
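For example, a crawl of a whole URL list with only the request and cookie collectors, multiple reporters, and a log folder might look like this (the input file and folder paths are placeholders):

```sh
# crawl every URL listed in urls.txt, overwriting any existing output files
npm run crawl -- -i ./urls.txt -o ./data/ -d 'requests,cookies' --reporters 'cli,file,html' -l ./logs/ -f
```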

### Use it as a module

1. Install this project as a dependency (`npm i git+https://github.com/duckduckgo/tracker-radar-collector.git`).

2. Import it:

```js
// you can either import a "crawlerConductor" that runs multiple crawlers for you
const {crawlerConductor} = require('tracker-radar-collector');
// or a single crawler
const {crawler} = require('tracker-radar-collector');

// you will also need some data collectors (the /collectors/ folder contains all built-in collectors)
const {RequestCollector, CookieCollector, …} = require('tracker-radar-collector');
```

3. Use it:

```js
crawlerConductor({
    // required ↓
    // two formats available: plain URLs use the default collectors set below,
    // object entries use a custom set of collectors for that one URL
    urls: ['https://example.com', {url: 'https://duck.com', dataCollectors: [new ScreenshotCollector()]}, …],
    dataCallback: (url, result) => {…},
    // optional ↓
    dataCollectors: [new RequestCollector(), new CookieCollector()],
    failureCallback: (url, error) => {…},
    numberOfCrawlers: 12, // custom number of crawlers (there is a hard limit of 38 though)
    logFunction: (...msg) => {…}, // custom logging function
    filterOutFirstParty: true, // don't save any first-party data (false by default)
    emulateMobile: true, // emulate a mobile device (false by default)
    proxyHost: 'socks5://myproxy:8080', // SOCKS proxy host (none by default)
    antiBotDetection: true, // if the anti-bot-detection script should be injected (true by default)
    chromiumVersion: '843427', // Chromium version that should be downloaded and used instead of the default one
    maxLoadTimeMs: 30000, // how long crawlers should wait for the page to load, defaults to 30s
    extraExecutionTimeMs: 2500, // how long crawlers should wait after page load before collecting data, defaults to 2.5s
});
```
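As a rough sketch (not part of the library), the two callbacks above could persist results to disk much like the CLI tool does; the output folder, filename scheme, and the assumption that `url` can be converted to a `URL` are illustrative only:

```js
const fs = require('fs');
const path = require('path');

const OUTPUT_DIR = './data'; // assumed output folder, created beforehand

// illustrative callback: write each crawl result as JSON, named after the site's hostname
const dataCallback = (url, result) => {
    const hostname = new URL(String(url)).hostname; // works whether url is a string or a URL object
    const fileName = `${hostname.replace(/[^a-z0-9.-]/gi, '_')}.json`;
    fs.writeFileSync(path.join(OUTPUT_DIR, fileName), JSON.stringify(result, null, 2));
};

// illustrative callback: log failures so the crawl can continue with the remaining URLs
const failureCallback = (url, error) => {
    console.error(`Crawl failed for ${url}:`, error.message);
};
```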

**OR** (if you prefer to run a single crawler)

```js
// crawler will throw an exception if the crawl fails
const data = await crawler(new URL('https://example.com'), {
    // optional ↓
    collectors: [new RequestCollector(), new CookieCollector(), …],
    log: (...msg) => {…},
    urlFilter: (url) => {…}, // function that, for each request URL, decides if its data should be stored or not
    emulateMobile: false,
    emulateUserAgent: false, // don't use the default puppeteer UA (default true)
    proxyHost: 'socks5://myproxy:8080',
    browserContext: context, // if you prefer to create the browser context yourself (e.g. to use another browser or a non-incognito context) you can pass it here (by default the crawler creates an incognito context using standard Chromium for you)
    runInEveryFrame: () => {window.alert('injected')}, // function that should be executed in every frame (main + all subframes)
    executablePath: '/some/path/Chromium.app/Contents/MacOS/Chromium', // path to a custom Chromium installation that should be used instead of the default one
    maxLoadTimeMs: 30000, // how long the crawler should wait for the page to load, defaults to 30s
    extraExecutionTimeMs: 2500, // how long the crawler should wait after page load before collecting data, defaults to 2.5s
});
```

ℹī¸ Hint: check out `crawl-cli.js` and `crawlerConductor.js` to see how `crawlerConductor` and `crawler` are used in the wild.

## Output format

Each successfully crawled website will create a separate file named after the website (when using the CLI tool). The output data format is specified in `crawler.js` (see the `CollectResult` type definition).
Additionally, for each crawl a `metadata.json` file will be created containing the crawl configuration, system configuration, and some high-level stats.
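As a quick, non-authoritative sketch, the per-site output files can be loaded like any other JSON (the `./data/` folder name is just an example; see `CollectResult` in `crawler.js` for the actual fields):

```js
const fs = require('fs');
const path = require('path');

const OUTPUT_DIR = './data'; // example output folder from a CLI crawl

// load every per-site result, skipping the crawl-level metadata file
const results = fs.readdirSync(OUTPUT_DIR)
    .filter(f => f.endsWith('.json') && f !== 'metadata.json')
    .map(f => JSON.parse(fs.readFileSync(path.join(OUTPUT_DIR, f), 'utf8')));

// inspect the structure without assuming specific fields
console.log(`${results.length} sites crawled; top-level keys:`, Object.keys(results[0] || {}));
```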

## Data post-processing

An example post-processing script, which can be used as a template, can be found in `post-processing/summary.js`. Execute it from the command line like this:

```sh
node ./post-processing/summary.js -i ./collected-data/ -o ./result.json
```

ℹī¸ Hint: When dealing with huge amounts of data you may need to increase nodejs's memory limit e.g. `node --max_old_space_size=4096`.

## Creating new collectors

Each collector needs to extend `BaseCollector` and has to override the following methods:

- `id()` which returns the name of the collector (e.g. 'cookies')
- `getData(options)` which should return the collected data. `options` has the following properties:
  - `finalUrl` - the final URL of the main document (after all redirects) that you may want to use,
  - `filterFunction` - a function which, if provided, takes a URL and returns a boolean telling you if a given piece of data should be returned or filtered out based on its origin.

Additionally, each collector can override the following methods:

- `init(options)` which is called before the crawl begins
- `addTarget(targetInfo)` which is called whenever a new target is created (main page, iframe, web worker, etc.)
- `postLoad()` which is called after the page has loaded. This is the place for executing heavy page interactions (`extraExecutionTimeMs` is applied after this hook).

There are a couple of built-in collectors in the `collectors/` folder. `CookieCollector` is the simplest one and can be used as a template.
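As a rough sketch only (the collector name, counting logic, and the assumption that `targetInfo` exposes a `type` property are made up for illustration; the built-in collectors in `collectors/` are the authoritative reference), a minimal custom collector might look like this:

```js
const BaseCollector = require('./BaseCollector'); // assumed to sit next to the other collectors in collectors/

class TargetCountCollector extends BaseCollector {

    id() {
        return 'targetCount';
    }

    init(options) {
        // reset state before every crawl
        this._counts = {};
    }

    addTarget(targetInfo) {
        // count targets by type (page, iframe, worker, etc.) as they appear
        const type = targetInfo.type || 'unknown';
        this._counts[type] = (this._counts[type] || 0) + 1;
    }

    getData(options) {
        // nothing origin-specific here, so options.filterFunction is not needed
        return this._counts;
    }
}

module.exports = TargetCountCollector;
```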

Each new collector has to be added in two places to be discoverable:
- `crawlerConductor.js` - so that `crawlerConductor` knows about it (and it can be used in the CLI tool)
- `main.js` - so that the new collector can be imported by other projects

You can also add types to define the structure of the data exported by your collector. These should be added to the `CollectorData` type in `collectorsList.js`. This will add type hints to all places where the data is used in the code.