https://github.com/ethan-wickstrom/rrrs


RRRS: Rust(ic) Rapid Random Sampler
===================================

Welcome to RRRS, a rapid, hyper-optimized CSV random sampling tool designed with performance and efficiency at its core. Crafted meticulously in Rust, RRRS offers an unparalleled solution for extracting random data samples from CSV files swiftly and effortlessly.

🤨 Why RRRS
-----------

Born out of the frustrating, repetitive process of sampling from unwieldy or enormous CSV files during my time at Washington University in St. Louis, **RRRS (Rust(ic) Rapid Random Sampler)** is more than just a tool; it's a perhaps slightly redundant but fun mission to over-optimize the all-too-familiar chore of data sampling. As a student in data-heavy courses, I was constantly bogged down by the inefficiency of the existing methods: importing massive datasets into spreadsheet software, waiting for them to load, and then struggling with plugins or scripting to extract the samples I needed. It was clear that there had to be a better way. So, instead of doing my homework, I worked on this.

Enter **RRRS**. Built in Rust for speed and efficiency, RRRS is my answer to those frustrating hours. It's designed to make random sampling from large CSV files not just faster, but a seamless part of your workflow. It is for anyone who has run into the same nuisance, turning what was once a bottleneck into a smooth, efficient process. With RRRS, I'm excited to share a solution that helped me and can now support data enthusiasts and professionals alike in their analytical work.

**Results**: Only 8.5 seconds to process and sample 100,000 rows of data from ~1.3 million rows of a 5.1-gigabyte dataset ([link to dataset](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024)).

```bash
(base) ethanwickstrom@Ethans-MacBook-Pro-2 ~ % rrrs -i job_summary.csv -o ./datasets

Please enter the number of rows to sample from the input file:
100000
CSV written
Sampling completed [00:00:08] [----------------------------------------] 1/1 (0s) Sampled data has been successfully written.
Elapsed time: 8.47s
```

🚀 Features
-----------

* **Rapid Random Sampling**: Quickly extract random samples from large CSV files.
* **Hyper-Optimized Performance**: Leveraging Rust's powerful system-level capabilities for maximum speed.
* **User-Friendly**: Simple command-line interface to easily specify input and output.
* **Flexibility**: Customizable random sampling according to your data analysis needs.
* **Cross-Platform Compatibility**: Runs on macOS and Linux; on Windows it can be used through the Windows Subsystem for Linux.
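
This README does not say which algorithm RRRS uses internally. As a rough illustration of how a fixed-size uniform sample can be drawn from a large file in a single pass, here is a minimal sketch of reservoir sampling (Algorithm R). Everything in it is an assumption made for illustration: the names `XorShift64` and `reservoir_sample` are hypothetical, and the tiny built-in generator only exists so the sketch needs no external crates.

```rust
/// Tiny xorshift64 generator so this sketch has no dependencies.
/// Hypothetical helper, not part of RRRS; a real tool would more likely use the `rand` crate.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must be non-zero, so substitute a fixed constant for 0.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Returns a value in `0..bound`. The modulo adds a slight bias, acceptable for a sketch.
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Keeps a uniform random sample of up to `k` items from `rows`,
/// reading the input once and holding at most `k` items in memory.
fn reservoir_sample<T, I>(rows: I, k: usize, rng: &mut XorShift64) -> Vec<T>
where
    I: IntoIterator<Item = T>,
{
    let mut reservoir: Vec<T> = Vec::with_capacity(k);
    for (i, row) in rows.into_iter().enumerate() {
        if i < k {
            // Fill the reservoir with the first k rows.
            reservoir.push(row);
        } else {
            // Row i (0-based) replaces a random slot with probability k / (i + 1).
            let j = rng.below(i as u64 + 1) as usize;
            if j < k {
                reservoir[j] = row;
            }
        }
    }
    reservoir
}
```

Called as `reservoir_sample(lines, 100_000, &mut XorShift64::new(seed))`, where `lines` is any iterator over rows, this returns at most 100,000 rows without ever loading the whole file into memory. If the input has fewer rows than requested, all of them are returned.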

🛠 Usage
--------

To get started with RRRS, run it with an input file and an output directory:

`rrrs -i <input_file> -o <output_directory>`

Upon execution, RRRS prompts you for the number of rows to sample at random from your CSV file. The output is a new CSV file named after the input file, with a suffix indicating the number of sampled rows (e.g., `slogan_data-100`). This file is saved in the directory you run the command from, or in the output directory if you specify one.
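
To make the naming scheme concrete, here is a small sketch of how such an output path could be derived. It is not taken from the RRRS source: the function name `sampled_output_path` is hypothetical, and re-attaching the input's extension is an assumption, since the example above only shows `slogan_data-100`.

```rust
use std::path::{Path, PathBuf};

/// Builds an output path such as `<out_dir>/slogan_data-100.csv`
/// from the input path and the number of sampled rows.
/// Hypothetical helper; keeping the original extension is an assumption.
fn sampled_output_path(input: &Path, out_dir: &Path, rows: usize) -> PathBuf {
    // File name without its extension, e.g. "slogan_data" for "slogan_data.csv".
    let stem = input
        .file_stem()
        .map(|s| s.to_string_lossy().into_owned())
        .unwrap_or_else(|| String::from("sample"));

    // Append the row count as a suffix, then restore the extension if there was one.
    let mut name = format!("{}-{}", stem, rows);
    if let Some(ext) = input.extension() {
        name.push('.');
        name.push_str(&ext.to_string_lossy());
    }
    out_dir.join(name)
}
```

With these assumptions, an input of `slogan_data.csv`, an output directory of `./datasets`, and 100 rows would give `./datasets/slogan_data-100.csv`.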

📂 Directory Structure
----------------------

Understand the organization of RRRS with the following directory structure:

```bash
rrrs/
├── Cargo.toml                     # Project manifest
├── src/                           # Source files
│   ├── main.rs                    # Entry point
│   ├── library.rs                 # Library code
│   ├── args.rs                    # Argument parsing
│   └── library/                   # Library code
│       ├── sampler_ops/           # Sampling operations
│       │   └── sampler_ops.rs     # Sampling logic
│       └── csv_ops/               # CSV operations
│           ├── csv_loader.rs      # CSV loading functionality
│           └── csv_writer.rs      # CSV writing functionality
└── tests/                         # Automated tests
    ├── args_tests.rs              # Tests for argument parsing
    ├── csv_loader_tests.rs        # Tests for CSV loading
    ├── sampler_tests.rs           # Tests for sampling logic
    └── csv_writer_tests.rs        # Tests for CSV writing
```

📚 Getting Started
------------------

### macOS and Linux

To use RRRS, you need to have Rust installed on your machine. If you don't have Rust installed, install it using the following command: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`. *For more information, refer to the official Rust installation guide [here](https://www.rust-lang.org/tools/install).*

Once Rust is installed, you can install RRRS using the following command: `cargo install rrrs`.

### Windows

**Note**: RRRS is not yet supported on Windows. However, you can still use it by installing the Windows Subsystem for Linux.

### Building from Source

To build RRRS from source, you can clone the repository and build it using the following commands (*Note that this is primarily for development purposes*):

```bash
git clone git@github.com:ethan-wickstrom/rrrs.git
cd rrrs
cargo build --release
cp target/release/rrrs /usr/local/bin
```

🤝 Contributing
---------------

Contributions to RRRS are warmly welcomed. Feel free to open an issue or submit a pull request, whether it's bug reports, feature requests, or code contributions. Please refer to the contributing guidelines for more details.

📝 License
----------

RRRS is open-sourced under the Apache-2.0 license. See the LICENSE file for more details.