Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/webrecorder/browsertrix-old

Browsertrix: Containerized High-Fidelity Browser-Based Automated Crawling + Behavior System
https://github.com/webrecorder/browsertrix-old

Last synced: 3 days ago
JSON representation

Browsertrix: Containerized High-Fidelity Browser-Based Automated Crawling + Behavior System

Awesome Lists containing this project

README

        

**Depcrecated**: The Browsertrix system is being refactored into more modular individual components. The main component, [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) will soon support most of the same crawling features via an integrated Docker image that can be deployed via the command-line. The UI and scheduling components will soon be reimplemented as additional components.

Please see Browsertrix Crawler for latest development.


[![Code Style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black) [![Build Status](https://travis-ci.org/webrecorder/browsertrix.svg?branch=master)](https://travis-ci.org/webrecorder/browsertrix)

## High Fidelity Browser-Based Crawling Automation

Browsertrix is a brand new toolset from the Webrecorder project for automating browsers to perform complex scripted behaviors
as well as crawl multiple pages. (The name was originally used for an older project with similar goals).

Browsertrix is a system for orchestrating Docker-based Chrome browsers, crawling processes, behavior systems, web archiving capture and replay, and full-text search.

It includes the following features:
* Crawling via customizable YAML-based crawl spec
* High-fidelity browser-based crawlers (controlled via [webrecorder/autobrowser](https://github.com/webrecorder/autobrowser))
* Execution of complex, domain-specific in-page behaviors (provided by [webrecorder/behaviors](https://github.com/webrecorder/behaviors))
* Capture or replay into designated [pywb](https://github.com/webrecorder/pywb) collections
* Screenshot creation of each page (optional).
* Text extraction for each page and full text search via Solr (optional).
* Support for customized browser profiles to minimize capture of private information.

## Getting Started

### Installing Browsertrix

Browsertrix is currently designed to run with Docker and Docker Compose.
The Browsertrix CLI requires local Python 3.6+.

To install, run:

```bash
git clone https://github.com/webrecorder/browsertrix
cd browsertrix
python setup.py install
./install-browsers.sh
docker-compose build
docker-compose up -d
```

The `install-browsers.sh` script installs additional Docker images necessary for dynamic browser creation.
The script can be used to update the images as well.

### Installing Browsertrix CLI

The Browsertrix CLI is installed by running `python setup.py install` and includes full functionality for running crawls and creating browser profiles.

Once installed, browsertrix commands are available via the `browsertrix` command.

## Creating a Crawl

To create a crawl, first a crawl spec should be defined in a yaml file.
An example spec, [sample-crawls/example.yaml](sample-crawls/example.yaml) might look as follows:

```yaml
crawls:
- name: example
crawl_type: all-links
num_browsers: 1

coll: example
mode: record

seed_urls:
- https://www.iana.org/
```

Then, simply run `browsertrix crawl create sample-crawls/example.yaml --watch`

The `--watch` param will also result in the crawling browser opening in a new browser window via vnc connection.

If started successfully, the output will be similar to:
```
Crawl Created and Started: cf30281efc7a
Status: running
Opening Browser 1 of 1 (CKVEMACNI6YBUKLQI6UKKBLB) for crawl cf30281efc7a
```

To view all running crawls, simply run `browsertrix crawl list` which should result in output similar to:

```
CRAWL ID NAME STARTED DURATION STATUS CRAWL TYPE COLL MODE TO CRAWL PENDING SEEN BROWSERS TABS
cf30281efc7a example 0:00:35 ago 0:00:10 running all-links example record 15 1 25 1 1
```

To get more detailed info on the crawl, run `browsertrix crawl info --urls ` (where ` = cf30281efc7a` in this example)

To follow the crawl log in the console window, add the `--log` option (the log followed will be from the first browser).

### Crawling Options

Browsertrix supports a number of options, with a key option being the `crawl_type`, which can be:

- `single-page` -- crawl only the specified seed urls
- `all-links` -- crawl the seed url(s) and all links discovered until max depth is exceeded
- `same-domain` -- crawl the seed url(s) and all links discovered that are on the same domain or sub-domain (up to a depth of 100)
- `custom` -- Supports custom depth and scope rules!

The first 3 options are designed to be a simple way to specify common options, and more may be added later.

When using `custom`, the `crawl_depth` param can specify the crawl depth (hops) from each seed url.

The `scopes` list can contain one or more [urlcanon MatchRules](https://github.com/iipc/urlcanon/blob/master/python/urlcanon/rules.py#L70) specifying urls that are in scope for the crawl.

See [custom-scopes.yaml](sample_crawl_spec/custom-scopes.yaml) for an example on how to use the custom option.

The `coll` option specifies the pywb collection to use for crawling, and mode specifies `record` (default) or `replay` or
`live` (direct live web connection).

The `num_browsers` and `num_tabs` option allow for selecting total number of browsers and number of tabs per browser to use for this crawl.

The seed urls for the crawl should be provided in the `seed_urls` list.

The `cache` option specifies cacheing options for a crawl, with available options:
- `always` -- Strict cacheing via `Cache-Control` on almost every resource to limit duplicate urls in a single browser session (default option when omitted)
- `default` -- Keep default cacheing for a page
- `never` -- disables all cacheing for all urls.

All example crawl configs demonstrating these options are available in: [sample-crawls](sample-crawls/)

### In-Page Behaviors

For every page, Browsertrix runs a designated behavior before collecting outlinks, (optionally) taking screenshots,
and moving on to the next page.

The behaviors are served via a separate behavior API server. The current list of available behaviors is available at:
https://github.com/webrecorder/behaviors/tree/master/behaviors

The behaviors are built using a special library of behavior functions (preliminary docs available here:
https://webrecorder.github.io/behaviors/)

If no site-specific behavior is found, the default `autoscroll.js`

The `behavior_max_time` crawl option specifies the maximum time a behavior can run (current default is 60 seconds).
When crawling sites with infinite scroll, it is recommended to set the `behavior_max_time` to be much higher.

### pywb Collections and Access

All data crawled is placed in the `./webarchive/collections/` directory which corresponds to the [standard pywb directory structure conventions](https://pywb.readthedocs.io/en/latest/manual/configuring.html#directory-structure) eg. a collection `test` would be found under `./webarchive/collections/test`.

Collections are created automatically on first use and can also be managed via `wb-manager` with `webarchive` as the working directory.

The running pywb instance can also be accessed via `http://localhost:8180/`

### Replay Crawling and Screenshots

Currently, screenshot creation is automatically enabled when crawling in record mode and screenshots are added automatically
to the same collection.

Browsertrix supports crawling in replay mode, over an existing collection, which may be useful for QA processes,
especially when combined with screenshot creation.

By setting the `mode` and `screenshot_coll` properties for each crawl, it is possible to run Browsertrix over replay and generate screenshots into a different collection, which may be used for QA comparison.

Additional screenshot options are to be added soon. (Currently, the screenshot is taken after the behavior is run but this will likely change).

Crawl options can also be overriden via command line.

For example, given a crawl spec `./my_crawl.yaml`, one could first capture with:
```
browsertrix crawl create ./my_crawl.yaml --screenshot_coll screenshots-capture
```

and then run:
```
browsertrix crawl create ./my_crawl.yaml --screenshot_coll --mode replay screenshots-qa
```

By default, screenshots are saved with `urn:screenshot:` prefix.
Based on the above crawls, one could then query all capture and qa screenshots in pywb via:
```
http://localhost:8180/screenshots-capture/*/urn:screenshot:*
http://localhost:8180/screenshots-qa/*/urn:screenshot:*
```

Sample record and replay configs, [social-media.yaml](sample-crawls/social-media.yaml) and [social-media-replay.yaml](sample-crawls/social-media-replay.yaml), are also available.

(Note: The screenshot functionality will likely change and additional options will be added)

### Other Crawl operations

Other crawl operations include:
* `browsertrix crawl stop` for stopping a crawl
* `browsertrix crawl logs` for printing and following logs for one or all crawlers
* `browsertrix crawl watch ` for attaching and watching all the browsers in a given crawl.
* `browsertrix crawl remove` for removing a crawl
* `browsertrix crawl remove-all` for stopping and removing all crawls.

See `browsertrix crawl -h` for a complete reference of available commands.

## Full Text Search

Browsertrix now includes a prototype integration with Apache Solr. Text is extracted for each page, after taking a screenshot, and ingested into Solr. The extracted text (as provided via raw DOM text nodes) from all frames,
as well as the title, and url are indexed in Solr using default schema. (This is likely to evolve as well).

The search is available for each collection via the pywb replay interface at: `http://localhost:8180/`

The replay interface currently includes a list of pages, screenshot (if enabled) and ability to search the collection.

(Note: solr data is stored in the `./solr` volume, and may require a permission adjustment on certain systems via `chmod a+w ./solr`)

## Browser Profiles

It is often useful to prepare a browser, such as by logging into social media, other password protected sites
to be able to capture content that is not generally accessible. However, doing so during a crawl is tedious, and worse,
may result in passwords being recorded to WARC.

Browsertrix addresses this problem with the support of browser profiles. A profile can be created by running a base
Chrome browser, performing custom actions, and then 'saving' the running browser into a new 'profile' image.

To create a profile:

1. Run:
```browsertrix profile create```

2. This should start a new remote browser (Chrome 73 by default) and open it in a new window. You can now interact with the browser and log in to any sites as needed.

3. The command line should have the following message and a prompt to enter the profile name, eg. `logged-in`

```
A new browser window should have been opened
You can use the browser to log-in to accounts or otherwise prepare the browser profile
(The content will not be recorded to WARC)
When done, please enter a new name to save the browser profile:
```

4. Once the name is entered the profile is saved, and you can continue browsing to make a new profile, or select 'no' and close the browser.

If everything worked, running ```browsertrix profile list``` should show:

```
PROFILE BASE BROWSER
logged-in chrome:73
```

5. To use the profile, set the `profile` property in the crawl spec YAML, or simply include `--profile` in the command line:

```
browsertrix crawl create ./my_crawl.yaml --profile logged-in
```

The browsers used for the crawl will be a copy of the browser saved during profile creation.

`browsertrix profile remove` can be used to remove an unneeded profile.

Note: The profile functionality is brand new and subject to change. At present, it is tied to the particular browser Docker image used an extend the image. The system may switch to Docker volumes in the future.

## Testing

Browsertrix includes several test suites, also tested on automatically via Travis CI.

### Docker Integration Tests

Browsertrix includes a Docker-based test suite that runs crawls over content replayed from a WARC
(no live web content is accessed). This test suite requires Python 3.6+.

To run this test suite, run:

```bash
bash ./tests/start-test-compose.sh
pip install -U -r test-docker-requirements.txt
py.test --headless ./tests/test_live_crawl.py
bash ./tests/stop-test-compose.sh
```

The test suite does not perform any live crawling, but runs all the [tests/crawl_tests.yaml](tests/crawl_tests.yaml) in replay mode using an existing test WARC downloaded from S3.

### Local API Tests

To install and run local tests of the API (without Docker), run the following:
(Python 3.7+ is required)

```bash
pip install -U -r requirements.txt -r test-local-requirements.txt
py.test ./tests/test_api.py
```

## UI

Browsertrix also includes a UI (still under development) which will
have the same features as the CLI.

To access the browsertrix UI, load `http://localhost:8000/`

The frontend React app is found in `./frontend` and can be started via:

```
yarn run develop
```

(The develop server is started at `http://localhost:8001` to avoid conflict with production)

To build the production bundle, run:
```
yarn run build-prod
```

This should update the production server running at `http://localhost:8000`