https://github.com/remram44/cdchunking-rs
Content-Defined Chunking for Rust
https://github.com/remram44/cdchunking-rs
chunk chunking rolling-hash-functions rust
Last synced: 3 months ago
JSON representation
Content-Defined Chunking for Rust
- Host: GitHub
- URL: https://github.com/remram44/cdchunking-rs
- Owner: remram44
- License: mit
- Created: 2017-08-03T14:04:42.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2024-12-17T19:46:28.000Z (6 months ago)
- Last Synced: 2025-03-10T20:16:15.143Z (3 months ago)
- Topics: chunk, chunking, rolling-hash-functions, rust
- Language: Rust
- Homepage: https://remram44.github.io/cdchunking-rs/
- Size: 43.9 KB
- Stars: 18
- Watchers: 2
- Forks: 4
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
[](https://github.com/remram44/cdchunking-rs/actions)
[](https://crates.io/crates/cdchunking)
[](https://docs.rs/cdchunking)
[](https://saythanks.io/to/remram44)Content-Defined Chunking
========================This crates provides a way to device a stream of bytes into chunks, using methods that choose the splitting point from the content itself. This means that adding or removing a few bytes in the stream would only change the chunks directly modified. This is different from just splitting every n bytes, because in that case every chunk is different unless the number of bytes changed is a multiple of n.
Content-defined chunking is useful for data de-duplication. It is used in many backup software, and by the rsync data synchronization tool.
This crate exposes both easy-to-use methods, implementing the standard `Iterator` trait to iterate on chunks in an input stream, and efficient zero-allocation methods that reuse an internal buffer.
Using this crate
----------------First, add a dependency on this crate by adding the following to your `Cargo.toml`:
```
cdchunking = 1.0
```And your `lib.rs`:
```rust
extern crate cdchunking;
```Then create a `Chunker` object using a specific method, for example the ZPAQ algorithm:
```rust
use cdchunking::{Chunker, ZPAQ};let chunker = Chunker::new(ZPAQ::new(13)); // 13 bits = 8 KiB block average
```There are multiple way to get chunks out of some input data.
### From an in-memory buffer: iterate on slices
If your whole input data is in memory at once, you can use the `slices()` method. It will return an iterator on slices of this buffer, allowing to handle those chunks with no additional allocation.
```rust
for slice in chunker.slices(data) {
println("{:?}", slice);
}
```### From a file object: read chunks into memory
If you are reading from a file, or any object that implements `Read`, you can use `Chunker` to read whole chunks directly. Use the `whole_chunks()` method to get an iterator on chunks, read as new `Vec` objects.
```rust
for chunk in chunker.whole_chunks(reader) {
let chunk = chunk.expect("Error reading from file");
println!("{:?}", chunk);
}
```You can also read all the chunks from the file and collect them in a `Vec` (of `Vec`s) using the `all_chunks()` method. It will take care of the IO errors for you, returning an error if any of the chunks failed to read.
```rust
let chunks: Vec> = chunker.all_chunks(reader)
.expect("Error reading from file");
for chunk in chunks {
println!("{:?}", chunk);
}
```### From a file object: streaming chunks with zero allocation
If you are reading from a file to write to another, you might deem the allocation of intermediate `Vec` objects unnecessary. If you want, you can have `Chunker` provide you chunks data from the internal read buffer, without allocating anything else. In that case, note that a chunk might be split between multiple read operations. This method will work fine with any chunk sizes.
Use the `stream()` method to do this. Note that because an internal buffer is reused, we cannot implement the `Iterator` trait, so you will have to use a while loop:
```rust
let mut chunk_iterator = chunker.stream(reader);
while let Some(chunk) = chunk_iterator.read() {
let chunk = chunk.unwrap();
match chunk {
ChunkInput::Data(d) => {
print!("{:?}, ", d);
}
ChunkInput::End => println!(" end of chunk"),
}
}
```