Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/maxcountryman/warc-parquet
🗄️ A simple CLI for converting WARC to Parquet.
- Host: GitHub
- URL: https://github.com/maxcountryman/warc-parquet
- Owner: maxcountryman
- Created: 2022-06-20T21:55:40.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-05T01:09:18.000Z (2 months ago)
- Last Synced: 2024-10-27T01:37:10.655Z (16 days ago)
- Topics: crawling, duckdb, parquet, warc, web-archiving
- Language: Rust
- Homepage:
- Size: 150 KB
- Stars: 106
- Watchers: 5
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
README
# warc-parquet

🗄️ A utility for converting WARC to Parquet.

## 📦 Install

The binary may be installed via `cargo`:
```sh
$ cargo install warc-parquet
```

To use the crate in your project, add the following to your `Cargo.toml` file:

```toml
[dependencies]
warc-parquet = "0.6.1"
```

## 🤸 Usage

### The Binary
Once installed, the `warc-parquet` utility can be used to transform WARC into Parquet:
```sh
$ wget --warc-file example 'https://example.com'
$ cat example.warc.gz | warc-parquet --gzipped > example.zstd.parquet
```

`warc-parquet` is meant to fit organically into the UNIX ecosystem. As such, processing multiple WARCs at once is straightforward:

```sh
$ wget --warc-file github 'https://github.com'
$ cat example.warc.gz github.warc.gz | warc-parquet --gzipped > combined.zstd.parquet
```

It's also simple to preprocess via standard UNIX piping:

```sh
$ cat example.warc.gz | gzip -d | warc-parquet > example.zstd.parquet
```

Various compression options, including the option to forgo compression altogether, are also available:

```sh
$ cat example.warc.gz | warc-parquet --gzipped --compression gzip > example.gz.parquet
```

> 💡 `warc-parquet --help` displays complete options and usage information.

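
The multi-file example above feeds two `.warc.gz` files through a single pipe. That is possible because the gzip format allows independently compressed members to be concatenated into one valid stream. The sketch below shows this property using only `gzip` and small stand-in payloads. The scratch directory and the `a.gz` / `b.gz` names are hypothetical, and it is an assumption, implied by the example above, that `--gzipped` accepts such multi-member input.

```sh
# Work in a scratch directory so the current directory is left untouched.
tmp=$(mktemp -d)

# Compress two small payloads independently (stand-ins for two WARC files).
printf 'first\n'  | gzip > "$tmp/a.gz"
printf 'second\n' | gzip > "$tmp/b.gz"

# Concatenating the compressed files yields a multi-member gzip stream;
# decompressing it returns both payloads back to back.
combined=$(cat "$tmp/a.gz" "$tmp/b.gz" | gzip -d)
printf '%s\n' "$combined"

# Clean up the scratch directory.
rm -rf "$tmp"
```

The same reasoning explains why `cat example.warc.gz | gzip -d | warc-parquet` and `cat example.warc.gz | warc-parquet --gzipped` are interchangeable ways to feed the tool.
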
### The Crate
Refer to [the docs](https://docs.rs/warc-parquet) for more details about how to use the `Reader` within your own programs.
### DuckDB
There are any number of ways to consume Parquet once you have it. However, a natural fit might be [DuckDB](https://duckdb.org):

```
$ duckdb
v0.3.3 fe9ba8003
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select type, id from 'example.zstd.parquet';
┌──────────┬─────────────────────────────────────────────────┐
│ type     │ id                                              │
├──────────┼─────────────────────────────────────────────────┤
│ warcinfo │                                                 │
│ request  │                                                 │
│ response │                                                 │
│ metadata │                                                 │
│ resource │                                                 │
│ resource │                                                 │
└──────────┴─────────────────────────────────────────────────┘
D describe select * from 'example.zstd.parquet';
┌─────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│ column_name             │ column_type │ null │ key │ default │ extra │
├─────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ id                      │ VARCHAR     │ YES  │     │         │       │
│ content_length          │ UINTEGER    │ YES  │     │         │       │
│ date                    │ TIMESTAMP   │ YES  │     │         │       │
│ type                    │ VARCHAR     │ YES  │     │         │       │
│ content_type            │ VARCHAR     │ YES  │     │         │       │
│ concurrent_to           │ VARCHAR     │ YES  │     │         │       │
│ block_digest            │ VARCHAR     │ YES  │     │         │       │
│ payload_digest          │ VARCHAR     │ YES  │     │         │       │
│ ip_address              │ VARCHAR     │ YES  │     │         │       │
│ refers_to               │ VARCHAR     │ YES  │     │         │       │
│ target_uri              │ VARCHAR     │ YES  │     │         │       │
│ truncated               │ VARCHAR     │ YES  │     │         │       │
│ warc_info_id            │ VARCHAR     │ YES  │     │         │       │
│ filename                │ VARCHAR     │ YES  │     │         │       │
│ profile                 │ VARCHAR     │ YES  │     │         │       │
│ identified_payload_type │ VARCHAR     │ YES  │     │         │       │
│ segment_number          │ UINTEGER    │ YES  │     │         │       │
│ segment_origin_id       │ VARCHAR     │ YES  │     │         │       │
│ segment_total_length    │ UINTEGER    │ YES  │     │         │       │
│ body                    │ BLOB        │ YES  │     │         │       │
└─────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
```

## 🦺 Safety

This crate uses `#![forbid(unsafe_code)]` to ensure everything is implemented in 100% safe Rust.
## 👯 Contributing
We appreciate all kinds of contributions. Thank you!
[docs]: https://docs.rs/warc-parquet