https://github.com/maxcountryman/warc-parquet

🗄️ A simple CLI for converting WARC to Parquet.

crawling duckdb parquet warc web-archiving

# warc-parquet


🗄️ A utility for converting WARC to Parquet.

## 📦 Install

The binary may be installed via `cargo`:

```sh
$ cargo install warc-parquet
```

To use the crate in your project, add the following to your `Cargo.toml` file:

```toml
[dependencies]
warc-parquet = "0.6.1"
```

## 🤸 Usage

### The Binary

Once installed, the `warc-parquet` utility can be used to transform WARC into Parquet:

```sh
$ wget --warc-file example 'https://example.com'
$ cat example.warc.gz | warc-parquet --gzipped > example.zstd.parquet
```

`warc-parquet` is meant to fit organically into the UNIX ecosystem. As such, processing multiple WARCs at once is straightforward:

```sh
$ wget --warc-file github 'https://github.com'
$ cat example.warc.gz github.warc.gz | warc-parquet --gzipped > combined.zstd.parquet
```
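
Concatenating the compressed files directly is possible because the gzip format allows several members back to back in a single stream. The sketch below shows this with plain `gzip` and two small text files standing in for WARCs; the file names are illustrative, and it is an assumption here that `warc-parquet --gzipped` reads such multi-member streams the same way `gzip -d` does.

```sh
# Build two small stand-in files (plain text here, not real WARC records).
tmp=$(mktemp -d)
printf 'first\n' > "$tmp/a.txt"
printf 'second\n' > "$tmp/b.txt"
gzip -c "$tmp/a.txt" > "$tmp/a.gz"
gzip -c "$tmp/b.txt" > "$tmp/b.gz"

# Concatenating the compressed files yields one multi-member gzip stream,
# which decompresses to the contents of both inputs in order.
combined=$(cat "$tmp/a.gz" "$tmp/b.gz" | gzip -d)
printf '%s\n' "$combined"
```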

It's also simple to preprocess via standard UNIX piping:

```sh
$ cat example.warc.gz | gzip -d | warc-parquet > example.zstd.parquet
```

Various compression options, including the option to forgo compression altogether, are also available:

```sh
$ cat example.warc.gz | warc-parquet --gzipped --compression gzip > example.gz.parquet
```

> 💡 `warc-parquet --help` displays complete options and usage information.
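
When one Parquet file per WARC is preferable to a combined file, a small loop is enough. The following is a minimal sketch, not part of the tool: `convert_all` and the `WARC_PARQUET` override variable are hypothetical names introduced here, and the `.zstd.parquet` suffix assumes the default output compression is zstd, as the file names in the examples above suggest.

```sh
# Convert every *.warc.gz in a directory into its own Parquet file.
# WARC_PARQUET is a hypothetical override for the binary's name or path.
convert_all() {
  dir=$1
  for f in "$dir"/*.warc.gz; do
    # Skip the unexpanded pattern when the directory has no matches.
    [ -e "$f" ] || continue
    out="${f%.warc.gz}.zstd.parquet"
    # Only the --gzipped flag shown earlier is used here.
    "${WARC_PARQUET:-warc-parquet}" --gzipped < "$f" > "$out"
  done
}
```

Calling `convert_all .` would then write `example.zstd.parquet` next to `example.warc.gz`, and likewise for every other archive in the current directory.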

### The Crate

Refer to [the docs](https://docs.rs/warc-parquet) for more details about how to use the `Reader` within your own programs.

### DuckDB

There are any number of ways to consume Parquet once you have it. However, a natural fit might be
[DuckDB](https://duckdb.org):

```
$ duckdb
v0.3.3 fe9ba8003
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

D select type, id from 'example.zstd.parquet';
┌──────────┬─────────────────────────────────────────────────┐
│ type     │ id                                              │
├──────────┼─────────────────────────────────────────────────┤
│ warcinfo │                                                 │
│ request  │                                                 │
│ response │                                                 │
│ metadata │                                                 │
│ resource │                                                 │
│ resource │                                                 │
└──────────┴─────────────────────────────────────────────────┘

D describe select * from 'example.zstd.parquet';
┌─────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│ column_name             │ column_type │ null │ key │ default │ extra │
├─────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ id                      │ VARCHAR     │ YES  │     │         │       │
│ content_length          │ UINTEGER    │ YES  │     │         │       │
│ date                    │ TIMESTAMP   │ YES  │     │         │       │
│ type                    │ VARCHAR     │ YES  │     │         │       │
│ content_type            │ VARCHAR     │ YES  │     │         │       │
│ concurrent_to           │ VARCHAR     │ YES  │     │         │       │
│ block_digest            │ VARCHAR     │ YES  │     │         │       │
│ payload_digest          │ VARCHAR     │ YES  │     │         │       │
│ ip_address              │ VARCHAR     │ YES  │     │         │       │
│ refers_to               │ VARCHAR     │ YES  │     │         │       │
│ target_uri              │ VARCHAR     │ YES  │     │         │       │
│ truncated               │ VARCHAR     │ YES  │     │         │       │
│ warc_info_id            │ VARCHAR     │ YES  │     │         │       │
│ filename                │ VARCHAR     │ YES  │     │         │       │
│ profile                 │ VARCHAR     │ YES  │     │         │       │
│ identified_payload_type │ VARCHAR     │ YES  │     │         │       │
│ segment_number          │ UINTEGER    │ YES  │     │         │       │
│ segment_origin_id       │ VARCHAR     │ YES  │     │         │       │
│ segment_total_length    │ UINTEGER    │ YES  │     │         │       │
│ body                    │ BLOB        │ YES  │     │         │       │
└─────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
```
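
The same kind of query can also be run non-interactively. As a rough sketch, the hypothetical helper `summary_sql` below builds a query that counts records and sums `content_length` per record `type`, using only columns from the schema above. It is passed to the `duckdb` CLI's `-c` option only if both the CLI and the file are present; the file name is the one from the earlier examples.

```sh
# Print a per-type summary query for the given Parquet file.
# summary_sql is an illustrative helper, not part of warc-parquet or DuckDB.
summary_sql() {
  printf "select type, count(*) as records, sum(content_length) as total_bytes from '%s' group by type order by records desc;\n" "$1"
}

# Run the query only when the duckdb CLI and the example file are available.
if command -v duckdb >/dev/null 2>&1 && [ -f example.zstd.parquet ]; then
  duckdb -c "$(summary_sql example.zstd.parquet)"
fi
```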

## 🦺 Safety

This crate uses `#![forbid(unsafe_code)]` to ensure everything is implemented in 100% safe Rust.

## 👯 Contributing

We appreciate all kinds of contributions. Thank you!

[docs]: https://docs.rs/warc-parquet