https://github.com/edsu/browsertricky
A helper to run browsertrix-crawler locally
- Host: GitHub
- URL: https://github.com/edsu/browsertricky
- Owner: edsu
- Created: 2023-06-23T20:27:08.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-05-09T17:16:32.000Z (about 2 years ago)
- Last Synced: 2024-11-02T10:23:52.837Z (over 1 year ago)
- Language: Shell
- Size: 7.81 KB
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# browsertricky
This is a tiny script and directory structure I've used to make it a bit easier
to run and manage [browsertrix-crawler] for archiving websites without needing
to remember the Docker incantation. It works with either Docker or Podman (and
prefers Podman if it is available).
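The "prefer Podman, fall back to Docker" behavior can be sketched roughly as below. This is only an illustration of the idea, not the code in the `browsertricky` script itself; the function name `pick_runtime` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate found on PATH,
# or return non-zero if none of them is installed.
pick_runtime() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Podman is listed first, so it wins when both are installed.
if runtime=$(pick_runtime podman docker); then
    echo "using $runtime"
else
    echo "neither podman nor docker found" >&2
fi
```

Passing the candidates as arguments keeps the preference order visible in one place.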
If you'd like to use it:
```
$ git clone https://github.com/edsu/browsertricky.git
$ cd browsertricky
$ ./browsertricky example
```
Now go to https://replayweb.page and load the [WACZ] file that was created at `collections/example/example.wacz`.
That's not a terribly interesting example, so use the example config to create a new one:
```
$ cp config/example.yaml config/mysite.yaml
```
Edit `config/mysite.yaml`, adding information about the site you would like to archive:
1. Change the name of the collection from `example` to `mysite`
2. Change the `seeds` list to include a new URL like `https://mysite.com`
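After both edits, the relevant part of `config/mysite.yaml` might look like the output of this sketch. `collection` and `seeds` are browsertrix-crawler options; the helper `print_config` is hypothetical, and the shipped `config/example.yaml` may contain other settings that should be left in place.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal crawl config for the given
# collection name and seed URL. It writes to stdout only and does
# not touch any file under config/.
print_config() {
    name=$1
    seed=$2
    cat <<EOF
collection: $name
seeds:
  - $seed
EOF
}

print_config mysite https://mysite.com
```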
And run it!
```
$ ./browsertricky mysite
```
If you open http://localhost:9037 while the crawl is underway, you should see a screencast of the browser.
You can also check the crawl's progress:
```
$ ./progress mysite
mysite: 595/2517 [254M]
```
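If you want a percentage rather than raw counts, the progress line can be parsed with plain shell parameter expansion. The sketch below assumes the line has the shape `name: done/total [size]`, as in the sample output above; that the two numbers mean "pages crawled" and "pages queued in total" is an assumption, and `progress_percent` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: turn "name: done/total [size]" into an
# integer percentage (rounded down).
progress_percent() {
    counts=${1#*: }        # strip the "mysite: " prefix
    counts=${counts%% *}   # strip the " [254M]" suffix
    crawled=${counts%/*}   # part before the slash
    total=${counts#*/}     # part after the slash
    echo $(( crawled * 100 / total ))
}

progress_percent "mysite: 595/2517 [254M]"   # prints 23
```

To use it on a live crawl, pass it the script's output: `progress_percent "$(./progress mysite)"`.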
If you would like to write your own [custom behaviors], put them in the `custom-behaviors` directory.
Read the browsertrix-crawler [documentation] for all the options you can put in your YAML configuration files. There are quite a few!
[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[documentation]: https://github.com/webrecorder/browsertrix-crawler/blob/main/README.md
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[custom behaviors]: https://github.com/webrecorder/browsertrix-crawler#additional-custom-behaviors