Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/turicas/crau
Easy-to-use Web archiver
- Host: GitHub
- URL: https://github.com/turicas/crau
- Owner: turicas
- License: lgpl-3.0
- Created: 2019-10-26T19:21:34.000Z (over 4 years ago)
- Default Branch: develop
- Last Pushed: 2023-02-19T17:38:33.000Z (over 1 year ago)
- Last Synced: 2024-05-04T13:46:42.602Z (about 2 months ago)
- Language: Python
- Size: 55.7 KB
- Stars: 53
- Watchers: 4
- Forks: 8
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-web-archiving - crau - crau is the way (most) Brazilians pronounce crawl; it's the easiest command-line tool for archiving the Web and playing archives: you just need a list of URLs. *(Stable)* (Tools & Software / Acquisition)
README
# crau: Easy-to-use Web Archiver
*crau* is the way (most) Brazilians pronounce *crawl*. It's the easiest
command-line tool for archiving the Web and playing archives: you just need a
list of URLs.

## Installation
`pip install crau`
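If you'd rather not install it globally, a standard virtualenv works too (this
is generic Python practice, not something crau requires; the environment name
below is just a placeholder):

```bash
# Optional: install crau inside an isolated virtualenv
python3 -m venv crau-env
source crau-env/bin/activate
pip install crau
```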
## Running
### Archiving
Archive a list of URLs by passing them via command-line:
```bash
crau archive myarchive.warc.gz http://example.com/page-1 http://example.org/page-2 ... http://example.net/page-N
```

or passing a text file (one URL per line):
```bash
echo "http://example.com/page-1" > urls.txt
echo "http://example.org/page-2" >> urls.txt
echo "http://example.net/page-N" >> urls.txtcrau archive myarchive.warc.gz -i urls.txt
```

Run `crau archive --help` for more options.
### Extracting data from an archive
List archived URLs in a WARC file:
```bash
crau list myarchive.warc.gz
```

Extract a file from an archive:
```bash
crau extract myarchive.warc.gz https://example.com/page.html extracted-page.html
```

### Playing the archived data on your Web browser
Run a server on [localhost:8080](http://localhost:8080) to play your archive:
```bash
crau play myarchive.warc.gz
```

### Packing downloaded files into a WARC
If you've mirrored a website using `wget -r`, `httrack` or a similar tool so
that you have the files in your file system, you can use `crau` to create a
WARC file from them. Run:

```bash
crau pack [--inner-directory=path] start-url path-or-archive warc-filename
```

Where:
- `start-url`: base URL you've downloaded (this will be joined with the
actual file names to create the complete URL).
- `path-or-archive`: path where the files are located. Can also be a
`.tar.gz`, `.tar.bz2`, `.tar.xz` or `.zip` archive. `crau` will retrieve all
files recursively.
- `warc-filename`: file to be created.
- `--inner-directory`: used when a TAR/ZIP archive is passed to filter which
directory inside the archive will be used to retrieve files. Example: you
have an archive with a `backup/` directory at the root and a
`www.example.com/` inside of it, so the files are actually inside
`backup/www.example.com/` - just pass
`--inner-directory=backup/www.example.com/` and only the files inside this
path will be considered (in this example, the file
`backup/www.example.com/contact.html` will be archived as
`/contact.html`). See the full example after this list.
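For instance, a minimal sketch of a `pack` invocation, assuming a site mirrored
into `backup/www.example.com/` inside a hypothetical `site-backup.tar.gz`:

```bash
# All file and host names here are made-up placeholders for illustration
crau pack \
    --inner-directory=backup/www.example.com/ \
    https://www.example.com/ \
    site-backup.tar.gz \
    myarchive.warc.gz
```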
## Why not X?

There are other archiving tools, of course. The motivation to start this
project was the lack of easy, fast and robust software to archive URLs - I just
wanted to execute one command without thinking and get a WARC file. Depending
on your problem, crau may not be the best answer - check out more archiving
tools in
[awesome-web-archiving](https://github.com/iipc/awesome-web-archiving#acquisition).

### Why not [GNU Wget](https://www.gnu.org/software/wget/)?
- Lacks parallel downloading;
- Some versions just crash with a segmentation fault depending on the website;
- Lots of options make the task of archiving difficult;
- There's no easy way to extend its behavior.

### Why not [Wpull](https://wpull.readthedocs.io/en/master/)?
- Lots of options make the task of archiving difficult;
- Easier to extend than wget, but still difficult compared to crau (since
crau uses [scrapy](https://scrapy.org/)).

### Why not [crawl](https://git.autistici.org/ale/crawl)?
- Lacks some features and it's difficult to contribute to (the [Gitlab instance
where it's hosted](https://git.autistici.org/ale/crawl) doesn't allow
registration);
- Has some bugs when collecting page dependencies (like static assets
referenced inside a CSS file);
- Has a bug where it enters a loop (if a static asset returns an HTML page
instead of the expected file, it ignores depth and keeps trying to get that
page's dependencies - if any of those dependencies has the same problem, it
keeps recursing to infinite depth).

### Why not [archivenow](https://github.com/oduwsdl/archivenow)?
This tool makes it easy to use archiving services such as
[archive.is](https://archive.is) from the command-line and can also archive
locally, but when archiving locally it calls wget to do the job.

## Contributing
Clone the repository:
```bash
git clone https://github.com/turicas/crau.git
```

Install development dependencies (you may want to create a virtualenv):
```bash
cd crau && pip install -r requirements-development.txt
```

Install an editable version of the package:
```bash
pip install -e .
```

Modify everything you want to, commit to another branch, and then create a pull
request on GitHub.
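A typical flow might look like this (the branch name and commit message below
are placeholders, not project conventions):

```bash
# Hypothetical example: work on a feature branch and push it for review
git checkout -b my-feature
# ...edit code...
git commit -am "Describe your change"
git push origin my-feature  # then open a pull request on GitHub
```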