https://github.com/grey-land/warc-browser
a cli toolkit for working with web archives
chromedp devtools go golang rod warc web-archive
- Host: GitHub
- URL: https://github.com/grey-land/warc-browser
- Owner: grey-land
- License: agpl-3.0
- Created: 2024-01-16T10:41:44.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-16T11:16:57.000Z (over 1 year ago)
- Last Synced: 2025-03-22T08:05:07.784Z (2 months ago)
- Topics: chromedp, devtools, go, golang, rod, warc, web-archive
- Language: Go
- Homepage:
- Size: 469 KB
- Stars: 2
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# warc-browser
A CLI toolkit for working with web archives.
warc-browser uses the *DevTools* protocol to automate compatible web browsers, captures all content for a given web page (HTML, CSS, JS, images, videos, PDFs, ...), and stores the results in a *[.warc][warc-doc]* file. It grew out of a need to archive web pages quickly and in a scriptable manner.
## Installation
```bash
make build
./warc-browser --help
```

## Usage
Archive a URL by running the browser in headless mode.
```bash
warc-browser --output-dir /tmp/archives browser --headless archive --url http://example.com
```

Attach to a running browser, list the available tabs, then capture a specific tab.
```bash
# Start chromium browser with remote debugging enabled
chromium --remote-debugging-port=9222 --url 'https://duckduckgo.com/?q=web+archive'
# List tabs of chromium
warc-browser browser -a
# Archive first tab
warc-browser browser -a archive -t 0
```

Start a web server that serves a simple UI for browsing the collected archives.
```bash
warc-browser ui
```

Open your browser at [localhost:8080](http://localhost:8080).
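For the attach workflow, it can help to see which tabs a browser started with `--remote-debugging-port=9222` exposes. The sketch below queries the DevTools HTTP endpoint `/json/list` directly with the Go standard library. It is an illustration only: warc-browser talks to the browser through go-rod, the names `target`, `parseTabs`, and `listTabs` are hypothetical, and the ordering printed here is not guaranteed to match the tab index that `warc-browser browser -a` reports.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// target holds a subset of the fields the DevTools endpoint returns per target.
type target struct {
	ID                   string `json:"id"`
	Type                 string `json:"type"`
	Title                string `json:"title"`
	URL                  string `json:"url"`
	WebSocketDebuggerURL string `json:"webSocketDebuggerUrl"`
}

// parseTabs decodes the JSON array and keeps only targets of type "page",
// dropping service workers, extensions, and other non-tab targets.
func parseTabs(data []byte) ([]target, error) {
	var all []target
	if err := json.Unmarshal(data, &all); err != nil {
		return nil, err
	}
	var tabs []target
	for _, t := range all {
		if t.Type == "page" {
			tabs = append(tabs, t)
		}
	}
	return tabs, nil
}

// listTabs fetches <base>/json/list from a browser with remote debugging enabled.
func listTabs(base string) ([]target, error) {
	resp, err := http.Get(base + "/json/list")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseTabs(data)
}

func main() {
	// Port 9222 matches the chromium example; adjust it if yours differs.
	tabs, err := listTabs("http://localhost:9222")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for i, t := range tabs {
		fmt.Printf("%d\t%s\t%s\n", i, t.Title, t.URL)
	}
}
```

If the request fails with a connection error, the browser was most likely started without remote debugging or on a different port.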
---
Software used:

1. [github.com/go-rod/rod][go-rod/rod] for browser automation
2. [github.com/nlnwa/gowarc][nlnwa/gowarc] for composing WARC records
3. [github.com/webrecorder/replayweb.page][webrecorder/replayweb.page] for visualizing records in the web UI

```
coverage: 61.2% of statements
```

[nlnwa/gowarc]: https://github.com/nlnwa/gowarc
[go-rod/rod]: https://github.com/go-rod/rod
[warc-doc]: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
[webrecorder/replayweb.page]: https://github.com/webrecorder/replayweb.page