{"id":46289578,"url":"https://github.com/Vitruves/nail-parquet","last_synced_at":"2026-03-18T02:01:15.330Z","repository":{"id":298745864,"uuid":"1001008163","full_name":"Vitruves/nail-parquet","owner":"Vitruves","description":"Fast parquet command line tool with many functions, nailed it! ","archived":false,"fork":false,"pushed_at":"2025-12-21T11:08:43.000Z","size":359,"stargazers_count":75,"open_issues_count":1,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-02-28T06:40:48.112Z","etag":null,"topics":["cli","command-line-tool","data-science","database-management","parquet","parquet-format","xlsx"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Vitruves.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-12T17:01:01.000Z","updated_at":"2026-02-25T23:39:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"89eb91ad-0c61-4cae-889b-7e160819226e","html_url":"https://github.com/Vitruves/nail-parquet","commit_stats":null,"previous_names":["vitruves/nail-parquet"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/Vitruves/nail-parquet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vitruves%2Fnail-parquet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vitruves%2Fnail-parquet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vitruv
es%2Fnail-parquet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vitruves%2Fnail-parquet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Vitruves","download_url":"https://codeload.github.com/Vitruves/nail-parquet/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vitruves%2Fnail-parquet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30641684,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-18T01:41:58.583Z","status":"online","status_checked_at":"2026-03-18T02:00:07.824Z","response_time":104,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","command-line-tool","data-science","database-management","parquet","parquet-format","xlsx"],"created_at":"2026-03-04T08:00:26.109Z","updated_at":"2026-03-18T02:01:15.323Z","avatar_url":"https://github.com/Vitruves.png","language":"Rust","funding_links":[],"categories":["Tools"],"sub_categories":["Command-line"],"readme":"# nail - Lightning-Fast Data Analysis CLI\n\n**nail** is a high-performance command-line tool for analyzing, transforming, and exploring data files at blazing speed. 
Built with Rust, Apache Arrow, and DataFusion, it handles Parquet, CSV, JSON, and Excel files with ease—perfect for data engineers, analysts, and scientists who need quick insights without loading heavy tools.\n\n**🚀 Why nail?** Process gigabyte-scale datasets in seconds • SQL-powered operations • Zero configuration • Works offline • Single binary\n\n[![Crates.io](https://img.shields.io/crates/v/nail-parquet.svg)](https://crates.io/crates/nail-parquet)\n[![Downloads](https://img.shields.io/crates/d/nail-parquet.svg)](https://crates.io/crates/nail-parquet)\n[![License](https://img.shields.io/crates/l/nail-parquet.svg)](https://github.com/Vitruves/nail-parquet/blob/main/LICENSE)\n[![Rust](https://img.shields.io/badge/rust-1.70%2B-blue.svg)](https://www.rust-lang.org)\n\n![nail_parquet](https://github.com/user-attachments/assets/0251facf-0e9b-49d0-bbd4-5dd8a288997c)\n\n## Features\n\n- **Fast operations** on large datasets using Apache Arrow and DataFusion\n- **Multiple file formats** supported: Parquet, CSV, JSON, and Excel\n- **Comprehensive data operations**: inspection, statistics, filtering, sampling, transformations\n- **Data quality tools**: search, deduplication, size analysis, missing value handling\n- **Advanced features**: joins, unions, schema manipulation, stratified sampling\n- **File optimization**: compression, sorting, and encoding for better performance\n- **Data analysis tools**: binning, pivot tables, correlation analysis\n- **Flexible output**: console display or file output in multiple formats\n- **Production-ready** with robust error handling and verbose logging\n\n## System Requirements\n\n- **Operating System**: Linux (Ubuntu 24.04+ recommended), macOS, Windows\n- **Memory**: 4GB+ RAM (8GB+ recommended for large datasets)\n- **Storage**: SSD recommended for large file operations\n- **Dependencies**:\n  - Darwin: none\n  - Linux: `pkg-config` and `openssl` (package names might vary depending on your distro)\n\n## Use Cases\n\n**nail** is the 
perfect tool for:\n\n- **Parquet file viewer** - Quick inspection and exploration without Spark or Pandas\n- **CSV to Parquet converter** - Fast format conversion with automatic compression\n- **Excel data analysis** - Analyze `.xlsx` files without Excel or LibreOffice\n- **Command-line data science** - Perfect for SSH environments and automation scripts\n- **Offline data tool** - No cloud, no server dependencies - works completely offline\n- **ETL pipelines** - Transform and validate data in shell scripts and CI/CD\n- **Data quality checks** - Validate schemas, detect outliers, find duplicates quickly\n- **Quick statistics** - Get descriptive stats and correlations in seconds\n- **Large file processing** - Handle gigabyte-scale datasets that crash spreadsheet tools\n- **Pandas alternative** - 10x faster for read-only analysis tasks on large files\n\n## Installation\n\n```bash\ncargo install nail-parquet\n```\n\nor\n\n```bash\n# From source\ngit clone https://github.com/Vitruves/nail-parquet\ncd nail-parquet\ncargo build --release\nsudo cp target/release/nail /usr/local/bin/\n\n# Verify installation\nnail --help\n```\n\nor using `nix`:\n\n```bash\nnix shell nixpkgs#nail-parquet\n```\n\n\n## Global Options\n\nAll commands support these global flags:\n\n- `-v, --verbose` - Enable verbose output with timing and progress information\n- `-j, --jobs N` - Number of parallel jobs (default: half of available CPU cores)\n- `-o, --output FILE` - Output file path (prints to console if not specified)\n- `-f, --format FORMAT` - Output format: `json`, `csv`, `parquet`, `text` (auto-detect by default)\n- `-h, --help` - Display command help\n\n## Commands\n\n### Data Inspection\n\n#### `nail describe`\n\nShow a comprehensive overview of the file, including metadata, dimensions, column types, and data quality metrics.\n\n```bash\n# Display file overview with colored output\nnail describe data.parquet\n\n# Include verbose logging\nnail describe data.parquet --verbose\n```\n\n**Output 
includes:**\n\n- File metadata (path, format, size, timestamps)\n- Dimensions (rows, columns, estimated memory)\n- Column type distribution (numeric, string, date/time, boolean)\n- Data quality metrics (density, null values, duplicates)\n- Storage efficiency\n- Column name listings by type\n\n#### `nail head`\n\nDisplay the first N rows of a dataset.\n\n```bash\n# Basic usage\nnail head data.parquet\n\n# Display first 10 rows\nnail head data.parquet -n 10\n\n# Save to JSON file\nnail head data.parquet -n 5 -o sample.json\n\n# Verbose output with timing\nnail head data.parquet -n 3 --verbose\n```\n\n**Options:**\n\n- `-n, --number N` - Number of rows to display (default: 5)\n\n#### `nail tail`\n\nDisplay the last N rows of a dataset.\n\n```bash\n# Display last 5 rows\nnail tail data.parquet\n\n# Display last 20 rows with verbose logging\nnail tail data.parquet -n 20 --verbose\n\n# Save last 10 rows to CSV\nnail tail data.parquet -n 10 -o tail_sample.csv\n```\n\n**Options:**\n\n- `-n, --number N` - Number of rows to display (default: 5)\n\n#### `nail preview`\n\nRandomly sample and display N rows from the dataset. 
Supports both static display and interactive browsing mode.\n\n```bash\n# Random preview of 5 rows\nnail preview data.parquet\n\n# Reproducible random sample with seed\nnail preview data.parquet -n 10 --random 42\n\n# Interactive mode for browsing records one by one\nnail preview data.parquet --interactive\n```\n\n**Options:**\n\n- `-n, --number N` - Number of rows to display (default: 5)\n- `--random SEED` - Random seed for reproducible results\n- `-I, --interactive` - Interactive mode with scrolling (use arrow keys, q to quit)\n\n**Interactive Mode Controls:**\n\n- `←/→` or `h/l` - Navigate between records (previous/next)\n- `↑/↓` or `k/j` - Navigate between fields within a record\n- `q`, `Esc` - Quit interactive mode\n- `Ctrl+C` - Force quit\n\n#### `nail headers`\n\nList column names, optionally filtered by regex patterns.\n\n```bash\n# List all column headers\nnail headers data.parquet\n\n# Filter headers with regex\nnail headers data.parquet --filter \"^price.*\"\n\n# Save headers to file\nnail headers data.parquet -o columns.txt\n\n# JSON format output\nnail headers data.parquet -f json\n```\n\n**Options:**\n\n- `--filter REGEX` - Filter headers with regex pattern\n\n#### `nail schema`\n\nDisplay detailed schema information including column types and nullability.\n\n```bash\n# Display schema\nnail schema data.parquet\n\n# Save schema to JSON\nnail schema data.parquet -o schema.json\n\n# Verbose schema analysis\nnail schema data.parquet --verbose\n```\n\n#### `nail metadata`\n\nDisplay detailed Parquet file metadata, including schema, row groups, column chunks, compression, encoding, and statistics.\n\n```bash\n# Display basic metadata\nnail metadata data.parquet\n\n# Show all available metadata\nnail metadata data.parquet --all\n\n# Show detailed schema and row group information\nnail metadata data.parquet --schema --row-groups --detailed\n\n# Save all metadata to JSON\nnail metadata data.parquet --all -o metadata.json\n```\n\n**Options:**\n\n- `--schema` - 
Show detailed schema information\n- `--row-groups` - Show row group information\n- `--column-chunks` - Show column chunk information\n- `--compression` - Show compression information\n- `--encoding` - Show encoding information\n- `--statistics` - Show statistics information\n- `--all` - Show all available metadata\n- `--detailed` - Show metadata in detailed format\n\n#### `nail size`\n\nAnalyze file and memory usage with detailed size breakdowns.\n\n```bash\n# Basic size analysis\nnail size data.parquet\n\n# Show per-column size breakdown\nnail size data.parquet --columns\n\n# Show per-row analysis\nnail size data.parquet --rows\n\n# Show all size metrics\nnail size data.parquet --columns --rows\n\n# Raw bits output (no human-friendly formatting)\nnail size data.parquet --bits\n\n# Save size analysis to file\nnail size data.parquet --columns --rows -o size_report.txt\n```\n\n**Options:**\n\n- `--columns` - Show per-column sizes\n- `--rows` - Show per-row analysis\n- `--bits` - Show raw bits without human-friendly conversion\n\n#### `nail search`\n\nSearch for specific values across columns with flexible matching options.\n\n```bash\n# Basic search across all columns\nnail search data.parquet --value \"John\"\n\n# Search in specific columns\nnail search data.parquet --value \"error\" -c \"status,message,log\"\n\n# Case-insensitive search\nnail search data.parquet --value \"ACTIVE\" --ignore-case\n\n# Exact match only (no partial matches)\nnail search data.parquet --value \"complete\" --exact\n\n# Return row numbers only\nnail search data.parquet --value \"Bob\" --rows\n\n# Save search results\nnail search data.parquet --value \"error\" -o search_results.json\n```\n\n**Options:**\n\n- `--value VALUE` - Value to search for (required)\n- `-c, --columns PATTERN` - Comma-separated column names to search in\n- `--rows` - Return matching row numbers only\n- `--ignore-case` - Case-insensitive search\n- `--exact` - Exact match only (no partial matches)\n\n### Data Quality 
Tools\n\n#### `nail outliers`\n\nDetect and optionally remove outliers from numeric columns using various methods.\n\n```bash\n# Detect outliers using IQR method for specific column\nnail outliers data.parquet -c \"price\" --method iqr\n\n# Detect outliers using Z-score with a custom threshold, showing values\nnail outliers data.parquet -c \"revenue\" --method z-score --z-score-threshold 2.5 --show-values\n\n# Remove outliers using Modified Z-score method from multiple columns\nnail outliers data.parquet -c \"age,income\" --method modified-z-score --remove -o cleaned_data.parquet\n\n# Detect outliers using Isolation Forest method (simplified)\nnail outliers data.parquet -c \"score\" --method isolation-forest\n```\n\n**Options:**\n\n- `-c, --columns PATTERN` - Comma-separated column names or regex patterns for outlier detection\n- `--method METHOD` - Outlier detection method: `iqr`, `z-score`, `modified-z-score`, `isolation-forest` (default: iqr)\n- `--iqr-multiplier VALUE` - IQR multiplier for outlier detection (default: 1.5)\n- `--z-score-threshold VALUE` - Z-score threshold for outlier detection (default: 3.0)\n- `--show-values` - Show outlier values instead of just flagging them\n- `--include-row-numbers` - Include row numbers in output\n- `--remove` - Remove outliers from dataset and save cleaned data\n\n### Statistics \u0026 Analysis\n\n#### `nail stats`\n\nCompute statistical summaries for numeric and categorical columns with flexible percentile calculations and type filtering.\n\n```bash\n# Basic statistics (mean, Q25, Q50, Q75, unique count)\nnail stats data.parquet\n\n# Exhaustive statistics\nnail stats data.parquet --stats-type exhaustive\n\n# Statistics for specific columns\nnail stats data.parquet -c \"price,volume,quantity\"\n\n# Statistics with regex column selection\nnail stats data.parquet -c \"^(price|vol).*\" --stats-type exhaustive\n\n# Custom percentiles\nnail stats data.parquet -c \"revenue\" --percentiles \"0.1,0.5,0.9,0.95,0.99\"\n\n# Numeric 
columns only\nnail stats data.parquet --numeric-only\n\n# Categorical columns only\nnail stats data.parquet --categorical-only\n\n# Save statistics to file\nnail stats data.parquet --stats-type basic -o stats.json\n```\n\n**Options:**\n\n- `-c, --columns PATTERN` - Comma-separated column names or regex patterns\n- `-t, --stats-type TYPE` - Statistics type: `basic`, `exhaustive`, `hypothesis` (default: basic)\n- `-p, --percentiles VALUES` - Custom percentiles (comma-separated, e.g., '0.1,0.5,0.9')\n- `--numeric-only` - Include only numeric columns\n- `--categorical-only` - Include only categorical (string) columns\n\n**Statistics Types:**\n\n- **basic**: count, mean, quartiles (q25, q50, q75), and number of unique values.\n- **exhaustive**: count, mean, std dev, min, max, variance, duplicates, and unique values.\n- **hypothesis**: statistical significance tests (not yet implemented).\n\n#### `nail correlations`\n\nCompute correlation matrices between numeric columns with optional statistical significance testing.\n\n```bash\n# Basic Pearson correlation\nnail correlations data.parquet\n\n# Specific correlation types\nnail correlations data.parquet --type kendall\nnail correlations data.parquet --type spearman\n\n# Correlations for specific columns\nnail correlations data.parquet -c \"price,volume,quantity\"\n\n# Output as correlation matrix format\nnail correlations data.parquet --matrix\n\n# Include statistical significance tests (p-values for fisher, t-test, chi-sqr)\nnail correlations data.parquet --tests fisher_exact,t_test\n\n# Comprehensive correlation analysis with significance tests\nnail correlations data.parquet --tests fisher_exact -o correlations.json\n```\n\n**Options:**\n\n- `-c, --columns PATTERN` - Comma-separated column names or regex patterns\n- `-t, --type TYPE` - Correlation type: `pearson`, `kendall`, `spearman` (default: pearson)\n- `--matrix` - Output as correlation matrix format\n- `--tests` - Include statistical significance tests 
(`fisher_exact`, `chi_sqr`, `t_test`)\n- `--digits N` - Number of decimal places for correlation values (default: 4)\n\n#### `nail frequency`\n\nCompute frequency tables for categorical columns showing value counts, distributions, and percentages.\n\n```bash\n# Basic frequency table for a single column\nnail frequency data.parquet -c \"category\"\n\n# Multiple columns frequency analysis\nnail frequency data.parquet -c \"category,status,region\"\n\n# Save frequency table to file\nnail frequency data.parquet -c \"product_type\" -o frequency_table.csv\n\n# Verbose output with progress information\nnail frequency data.parquet -c \"category,status\" --verbose\n```\n\n**Options:**\n\n- `-c, --columns PATTERN` - Comma-separated column names to analyze (required).\n\n**Output:** Shows frequency counts with percentages for each value, helping identify data distribution patterns.\n\n### Data Manipulation\n\n#### `nail select`\n\nSelect specific columns and/or rows from the dataset.\n\n```bash\n# Select specific columns\nnail select data.parquet -c \"id,name,price\"\n\n# Select columns with regex\nnail select data.parquet -c \"^(id|price).*\"\n\n# Select specific rows\nnail select data.parquet -r \"1,5,10-20\"\n\n# Select both columns and rows\nnail select data.parquet -c \"id,price\" -r \"1-100\"\n\n# Save selection to new file\nnail select data.parquet -c \"id,name\" -o subset.parquet\n```\n\n**Options:**\n\n- `-c, --columns PATTERN` - Column names or regex patterns (comma-separated)\n- `-r, --rows SPEC` - Row numbers or ranges (e.g., \"1,3,5-10\")\n\n#### `nail drop`\n\nRemove specific columns and/or rows from the dataset.\n\n```bash\n# Drop specific columns\nnail drop data.parquet -c \"temp_col,debug_info\"\n\n# Drop columns matching pattern\nnail drop data.parquet -c \"^temp_.*\"\n\n# Drop specific rows\nnail drop data.parquet -r \"1,5,100-200\"\n\n# Drop both columns and rows\nnail drop data.parquet -c \"temp_col\" -r \"1-10\"\n```\n\n**Options:**\n\n- `-c, --columns 
PATTERN` - Column names or regex patterns to drop\n- `-r, --rows SPEC` - Row numbers or ranges to drop\n\n#### `nail filter`\n\nFilter data based on column conditions or row characteristics.\n\n```bash\n# Filter by column conditions\nnail filter data.parquet -c \"price\u003e100,volume\u003c1000\"\n\n# Multiple conditions with different operators\nnail filter data.parquet -c \"age\u003e=18,status=active,score!=0\"\n\n# String matching and numeric comparisons\nnail filter data.parquet -c \"name!=test,salary\u003c=50000,active=true\"\n\n# Filter to numeric columns only\nnail filter data.parquet --rows numeric-only\n\n# Remove rows with NaN values\nnail filter data.parquet --rows no-nan\n\n# Remove rows with zeros\nnail filter data.parquet --rows no-zeros\n\n# String columns only\nnail filter data.parquet --rows char-only\n```\n\n**Options:**\n\n- `-c, --columns CONDITIONS` - Column filter conditions (comma-separated). Supported operators:\n  - `=` (equals), `!=` (not equals)\n  - `\u003e` (greater than), `\u003e=` (greater or equal)\n  - `\u003c` (less than), `\u003c=` (less or equal)\n  - Examples: `age\u003e25`, `status=active`, `price\u003c=100`\n- `--rows FILTER` - Row filter type: `no-nan`, `numeric-only`, `char-only`, `no-zeros`\n\n#### `nail fill`\n\nFill missing values using various strategies.\n\n```bash\n# Fill with specific value\nnail fill data.parquet --method value --value 0\n\n# Fill specific columns with value\nnail fill data.parquet -c \"price,quantity\" --method value --value -1\n\n# Fill with mean (for numeric columns)\nnail fill data.parquet --method mean\n\n# Fill with median\nnail fill data.parquet --method median -c \"price,volume\"\n```\n\n**Options:**\n\n- `--method METHOD` - Fill method: `value`, `mean`, `median`, `mode`, `forward`, `backward` (default: value)\n- `--value VALUE` - Fill value (required for 'value' method)\n- `-c, --columns PATTERN` - Comma-separated column names to fill\n\n#### `nail rename`\n\nRename one or more 
columns.\n\n```bash\n# Rename a single column\nnail rename data.parquet --column \"old_name=new_name\" -o renamed.parquet\n\n# Rename multiple columns\nnail rename data.parquet --column \"id=user_id,val=value\" -o renamed.parquet\n```\n\n**Options:**\n\n- `-c, --column SPECS` - Column rename specs (`before=after`), comma-separated.\n\n#### `nail create`\n\nCreate new columns with expressions based on existing columns.\n\n```bash\n# Create a single new column\nnail create data.parquet --column \"total=price*quantity\" -o enhanced.parquet\n\n# Create multiple columns\nnail create data.parquet --column \"total=price*quantity,margin=(price-cost)\" -o enhanced.parquet\n\n# Filter rows while creating columns\nnail create data.parquet --column \"category_score=score*2\" --row-filter \"score\u003e50\" -o filtered_enhanced.parquet\n```\n\n**Options:**\n\n- `-c, --column SPECS` - Column creation specifications (`name=expression`), comma-separated.\n- `-r, --row-filter FILTER` - Row filter expression to apply before creating columns.\n\n#### `nail dedup`\n\nRemove duplicate rows or columns from the dataset.\n\n```bash\n# Remove duplicate rows (all columns considered)\nnail dedup data.parquet --row-wise\n\n# Remove duplicate rows based on specific columns\nnail dedup data.parquet --row-wise -c \"id,email\"\n\n# Keep last occurrence instead of first\nnail dedup data.parquet --row-wise --keep last\n\n# Remove duplicate columns (by name)\nnail dedup data.parquet --col-wise\n\n# Save deduplicated data\nnail dedup data.parquet --row-wise -o clean_data.parquet\n```\n\n**Options:**\n\n- `--row-wise` - Remove duplicate rows (conflicts with --col-wise)\n- `--col-wise` - Remove duplicate columns (conflicts with --row-wise)\n- `-c, --columns PATTERN` - Columns to consider for row-wise deduplication\n- `--keep STRATEGY` - Keep 'first' or 'last' occurrence (default: first)\n\n### Data Sampling \u0026 Transformation\n\n#### `nail sample`\n\nSample data using various 
strategies.\n\n```bash\n# Random sampling\nnail sample data.parquet -n 1000\n\n# Reproducible random sampling\nnail sample data.parquet -n 500 --method random --random 42\n\n# Stratified sampling\nnail sample data.parquet -n 1000 --method stratified --stratify-by category\n\n# First N rows\nnail sample data.parquet -n 100 --method first\n\n# Last N rows\nnail sample data.parquet -n 100 --method last\n```\n\n**Options:**\n\n- `-n, --number N` - Number of samples (default: 10)\n- `--method METHOD` - Sampling method: `random`, `stratified`, `first`, `last` (default: random)\n- `--stratify-by COLUMN` - Column name for stratified sampling\n- `--random SEED` - Random seed for reproducible results\n\n#### `nail shuffle`\n\nRandomly shuffle the order of rows in the dataset.\n\n```bash\n# Random shuffle\nnail shuffle data.parquet\n\n# Reproducible shuffle (Note: DataFusion's RANDOM() may not be deterministic)\nnail shuffle data.parquet --random 42\n\n# Shuffle and save to new file\nnail shuffle data.parquet -o shuffled.parquet --verbose\n```\n\n**Options:**\n\n- `--random SEED` - Random seed for reproducible results\n\n#### `nail sort`\n\nSort data by one or more columns with flexible sorting strategies and null handling.\n\n```bash\n# Sort by all columns (auto-detect data types)\nnail sort data.parquet\n\n# Sort by specific columns\nnail sort data.parquet -c \"price,date\"\n\n# Sort with specific strategies\nnail sort data.parquet -c \"date,amount,name\" -s \"date,numeric,alphabetic\"\n\n# Descending sort\nnail sort data.parquet -c \"revenue\" -d true\n\n# Multiple columns with mixed directions\nnail sort data.parquet -c \"category,price\" -d \"false,true\"\n\n# Handle nulls differently\nnail sort data.parquet -c \"score\" --nulls first\nnail sort data.parquet -c \"rating\" --nulls skip\n\n# Case-insensitive alphabetic sorting\nnail sort data.parquet -c \"name,category\" -s \"alphabetic\" --case-insensitive\n\n# Sort dates with custom format\nnail sort data.parquet -c 
\"date\" -s \"date\" --date-format \"mm-dd-yyyy\"\n\n# Sort time values with custom format\nnail sort data.parquet -c \"timestamp\" -s \"hour\" --hour-format \"hh:mm:ss\"\n```\n\n**Options:**\n\n- `-c, --column COLUMNS` - Columns to sort by (comma-separated or 'all') (default: all)\n- `-s, --strategy STRATEGIES` - Sort strategy per column: `numeric`, `date`, `alphabetic`, `alphabetic-numeric`, `numeric-alphabetic`, `hour`, `auto` (comma-separated)\n- `-d, --descending FLAGS` - Sort in descending order (comma-separated true/false per column)\n- `--nulls HANDLING` - Null value handling: `first`, `last`, `skip` (default: last)\n- `--date-format FORMAT` - Date format pattern (e.g., 'mm-dd-yyyy', 'dd/mm/yyyy', 'yyyy-mm-dd')\n- `--hour-format FORMAT` - Time format pattern (e.g., 'hh:mm:ss', 'mm:ss')\n- `--case-insensitive` - Case-insensitive alphabetic sorting\n\n**Sort Strategies:**\n\n- **auto**: Automatically detect based on data type\n- **numeric**: Sort numerically (converts strings to numbers if needed)\n- **date**: Sort by date (supports custom formats)\n- **alphabetic**: Sort alphabetically (supports case-insensitive)\n- **alphabetic-numeric**: Sort alphabetically first, then numerically\n- **numeric-alphabetic**: Sort numerically first, then alphabetically\n- **hour**: Sort by time/hour values\n\n#### `nail id`\n\nAdd a unique ID column to the dataset.\n\n```bash\n# Add simple numeric ID column\nnail id data.parquet --create\n\n# Add ID with custom name and prefix\nnail id data.parquet --create --id-col-name record_id --prefix \"REC-\"\n\n# Save with new ID column\nnail id data.parquet --create -o data_with_ids.parquet\n```\n\n**Options:**\n\n- `--create` - Create new ID column\n- `--prefix PREFIX` - Prefix for ID values (default: \"id\")\n- `--id-col-name NAME` - ID column name (default: \"id\")\n\n### Data Combination\n\n#### `nail diff`\n\nCompare two datasets and show differences using key-based or row-based comparison.\n\n```bash\n# Compare files using key 
columns\nnail diff old_data.parquet --compare new_data.parquet --keys \"id,timestamp\"\n\n# Show only changed rows\nnail diff v1.parquet --compare v2.parquet --keys \"id\" --changes-only\n\n# Show rows only in left file\nnail diff current.parquet --compare archive.parquet --keys \"record_id\" --left-only\n\n# Show rows only in right file\nnail diff baseline.parquet --compare updated.parquet --keys \"user_id\" --right-only\n\n# Row-by-row positional comparison (no keys)\nnail diff file1.csv --compare file2.csv\n\n# Save diff results\nnail diff old.parquet --compare new.parquet --keys \"id\" -o differences.parquet\n```\n\n**Options:**\n\n- `-c, --compare FILE` - Second file to compare with (required)\n- `-k, --keys COLUMNS` - Columns to use as primary key for comparison (comma-separated)\n- `--changes-only` - Show only rows that differ\n- `--left-only` - Show only rows in left file\n- `--right-only` - Show only rows in right file\n\n**Output:**\n\n- `diff_status` column indicates: `ADDED`, `REMOVED`, or `MODIFIED`\n- For key-based: Shows matching records with left/right values\n- For row-based: Shows records by position with left/right values\n\n#### `nail merge`\n\nJoin two datasets horizontally based on a common key column.\n\n```bash\n# Inner join (default)\nnail merge left.parquet --right right.parquet --key id -o merged.parquet\n\n# Left join - keep all records from left table\nnail merge customers.parquet --right orders.parquet --left-join --key customer_id -o customer_orders.parquet\n\n# Right join - keep all records from right table\nnail merge orders.parquet --right customers.parquet --right-join --key customer_id -o order_customers.parquet\n\n# Merge on columns with different names\nnail merge table1.parquet --right table2.parquet --key-mapping \"table1_id=table2_user_id\"\n```\n\n**Options:**\n\n- `--right FILE` - Right table file to merge with (required)\n- `--key COLUMN` - Join key column name (if same in both tables)\n- `--key-mapping MAPPING` - Join 
key mapping for different column names (`left_col=right_col`)\n- `--left-join` - Perform left join\n- `--right-join` - Perform right join\n\n#### `nail append`\n\nAppend multiple datasets vertically (union operation).\n\n```bash\n# Append files with matching schemas\nnail append base.parquet --files \"file1.parquet,file2.parquet\" -o combined.parquet\n\n# Append with verbose logging\nnail append base.parquet --files \"jan.parquet,feb.parquet,mar.parquet\" --verbose -o q1_data.parquet\n\n# Force append with schema differences (fills missing columns with nulls)\nnail append base.parquet --files \"different_schema.parquet\" --ignore-schema -o combined.parquet\n```\n\n**Options:**\n\n- `--files FILES` - Comma-separated list of files to append (required)\n- `--ignore-schema` - Ignore schema mismatches and force append\n\n#### `nail split`\n\nSplit dataset into multiple files based on ratios or stratification.\n\n```bash\n# Split by ratio\nnail split data.parquet --ratio \"0.7,0.3\" --names \"train,test\" --output-dir splits/\n\n# Stratified split\nnail split data.parquet --ratio \"0.8,0.2\" --stratified-by category --output-dir splits/\n\n# Reproducible split\nnail split data.parquet --ratio \"0.6,0.2,0.2\" --random 42 --output-dir splits/\n```\n\n**Options:**\n\n- `--ratio RATIOS` - Comma-separated split ratios (must sum to 1.0 or 100.0)\n- `--names NAMES` - Comma-separated output file names\n- `--output-dir DIR` - Output directory for split files\n- `--stratified-by COLUMN` - Column for stratified splitting\n- `--random SEED` - Random seed for reproducible splits\n\n### File Optimization\n\n#### `nail optimize`\n\nOptimize Parquet files by applying compression, sorting, and encoding techniques.\n\n```bash\n# Basic optimization with default compression (snappy)\nnail optimize input.parquet -o optimized.parquet\n\n# Optimize with specific compression type and level\nnail optimize input.parquet -o optimized.parquet --compression zstd --compression-level 5\n\n# Sort data 
while optimizing\nnail optimize input.parquet -o optimized.parquet --sort-by \"timestamp,id\"\n\n# Enable dictionary encoding for better compression\nnail optimize input.parquet -o optimized.parquet --dictionary\n\n# Comprehensive optimization and validation\nnail optimize input.parquet -o optimized.parquet --compression zstd --sort-by \"date,category\" --dictionary --validate --verbose\n```\n\n**Options:**\n\n- `--compression TYPE` - Compression type: `snappy`, `gzip`, `zstd`, `brotli` (default: snappy)\n- `--compression-level LEVEL` - Compression level (1-9, default: 6)\n- `--sort-by COLUMNS` - Comma-separated columns to sort by\n- `--dictionary` - Enable dictionary encoding\n- `--no-dictionary` - Disable dictionary encoding\n- `--validate` - Validate optimized file after creation\n\n### Data Analysis \u0026 Transformation\n\n#### `nail binning`\n\nBin continuous variables into categorical ranges for analysis. *Note: Current implementation supports one column at a time.*\n\n```bash\n# Equal-width binning with 5 bins\nnail binning data.parquet -c \"age\" -b 5 -o binned.parquet\n\n# Custom bins with specific edges\nnail binning data.parquet -c \"score\" -b \"0,50,80,90,100\" --method custom -o binned.parquet\n\n# Add bin labels\nnail binning data.parquet -c \"temperature\" -b 3 --labels \"Cold,Warm,Hot\" -o binned.parquet\n```\n\n**Options:**\n\n- `-c, --columns COLUMN` - Column to bin (required)\n- `-b, --bins BINS` - Number of bins or custom edges (e.g., \"5\" or \"0,10,50,100\") (default: 10)\n- `--method METHOD` - Binning method: `equal-width`, `custom` (default: equal-width)\n- `--labels LABELS` - Comma-separated bin labels\n- `--suffix SUFFIX` - Suffix for new binned column (default: \"_binned\")\n- `--drop-original` - Drop original column after binning\n\n#### `nail pivot`\n\nCreate pivot tables for data aggregation and cross-tabulation. 
*Note: Current implementation is a simplified group-by aggregation and does not create a wide-format pivot table.*\n\n```bash\n# Basic pivot table with sum aggregation\nnail pivot data.parquet -i \"region\" -c \"product\" -l \"sales\" -o pivot.parquet\n\n# Pivot with different aggregation functions\nnail pivot data.parquet -i \"category\" -c \"product\" -l \"revenue\" --agg mean -o avg_pivot.parquet\n```\n\n**Options:**\n\n- `-i, --index COLUMNS` - Row index columns (comma-separated) (required)\n- `-c, --columns COLUMNS` - Columns to pivot on (comma-separated) (required)\n- `-l, --values COLUMNS` - Value columns to aggregate (comma-separated)\n- `--agg FUNC` - Aggregation function: `sum`, `mean`, `count`, `min`, `max` (default: sum)\n- `--fill VALUE` - Fill missing values (default: \"0\")\n\n### Format Conversion \u0026 Utility\n\n#### `nail convert`\n\nConvert between different file formats.\n\n```bash\n# Convert Parquet to CSV\nnail convert data.parquet -o data.csv\n\n# Convert CSV to Parquet\nnail convert data.csv -o data.parquet\n\n# Convert Parquet to Excel\nnail convert data.parquet -o data.xlsx\n\n# Verbose conversion with progress\nnail convert large_dataset.csv -o large_dataset.parquet --verbose\n```\n\n**Supported Formats:**\n\n- **Input**: Parquet, CSV, JSON, Excel (xlsx)\n- **Output**: Parquet, CSV, JSON, Excel (xlsx)\n\n#### `nail count`\n\nCount the number of rows in a dataset.\n\n```bash\n# Basic row count\nnail count data.parquet\n\n# Count with verbose output\nnail count data.parquet --verbose\n```\n\n#### `nail update`\n\nCheck for newer versions of the `nail` tool.\n\n```bash\n# Check for updates\nnail update\n\n# Check with verbose output\nnail update --verbose\n```\n\n## Examples\n\n### Basic Data Exploration\n\n```bash\n# Quick dataset overview with comprehensive file info\nnail describe sales_data.parquet\n\n# Traditional exploration\nnail schema sales_data.parquet\nnail size sales_data.parquet --columns --rows\nnail head sales_data.parquet 
-n 10\nnail stats sales_data.parquet --stats-type basic\n\n# Advanced statistics with custom percentiles\nnail stats sales_data.parquet -c \"revenue,profit\" --percentiles \"0.25,0.5,0.75,0.90,0.95,0.99\"\n\n# Column inspection\nnail headers sales_data.parquet --filter \"price\"\nnail correlations sales_data.parquet -c \"price,quantity,discount\" --stats-tests t_test\n\n# Frequency analysis for categorical data\nnail frequency sales_data.parquet -c \"category,region,status\"\n```\n\n### Data Quality Investigation\n\n```bash\n# Comprehensive data quality overview\nnail describe data.parquet\n\n# Compare datasets to find differences\nnail diff yesterday.parquet --compare today.parquet --keys \"id\" --changes-only\n\n# Search for problematic values\nnail search data.parquet --value \"error\" --ignore-case\nnail search data.parquet --value \"null\" -c \"critical_fields\" --rows\n\n# Find and remove duplicates\nnail dedup data.parquet --row-wise -c \"id\" --verbose -o unique_data.parquet\n\n# Analyze data size and memory usage\nnail size data.parquet --columns --bits\n\n# Check for specific patterns in text fields\nnail search data.parquet --value \"@gmail.com\" -c \"email\" --exact\n\n# Frequency analysis to identify data quality issues\nnail frequency data.parquet -c \"status\" --verbose\n```\n\n### Data Enhancement and Transformation\n\n```bash\n# Create new calculated columns\nnail create sales_data.parquet --column \"total=price*quantity,profit=(price-cost)\" -o enhanced_sales.parquet\n\n# Rename columns for clarity\nnail rename enhanced_sales.parquet --column \"total=total_sale,profit=net_profit\" -o final_sales.parquet\n\n# Create complex derived metrics\nnail create customer_data.parquet --column \"lifetime_value=orders*avg_order\" -o customer_metrics.parquet\n\n# Filter and enhance in one step\nnail create large_dataset.parquet --column \"score=performance*weight\" --row-filter \"active=true\" -o active_scored.parquet\n```\n\n### Data Optimization and 
Processing Pipeline\n\n```bash\n# 1. Optimize raw data files for better performance\nnail optimize raw_data.parquet -o optimized_data.parquet --compression zstd --sort-by \"timestamp,customer_id\" --dictionary --verbose\n\n# 2. Create analytical features with binning\nnail binning optimized_data.parquet -c \"age\" -b \"18,25,35,50,65\" --method custom --labels \"18-24,25-34,35-49,50-64,65+\" -o aged_data.parquet\n\n# 3. Group data for analysis (using pivot's group-by functionality)\nnail pivot aged_data.parquet -i \"age_binned\" -c \"category\" -l \"revenue\" --agg sum -o age_revenue_summary.parquet\n\n# 4. Add derived metrics\nnail create age_revenue_summary.parquet --column \"avg_revenue=sum_revenue/count_revenue\" -o enhanced_summary.parquet\n\n# 5. Statistical analysis of optimized data\nnail stats enhanced_summary.parquet --stats-type exhaustive -o summary_stats.json\n```\n\n## Performance Tips\n\n1. **Use Parquet for large datasets** - Parquet is columnar and much faster than CSV for analytical operations.\n2. **Specify column patterns** - Use `-c` with regex patterns to operate only on relevant columns.\n3. **Chain operations** - Use intermediate files for complex multi-step transformations.\n4. **Adjust parallelism** - Use `-j` to control parallel processing based on your system.\n5. **Enable verbose mode** - Use `--verbose` to monitor performance and progress on large datasets.\n\n## Error Handling\n\nnail provides detailed error messages for common issues:\n\n- **File not found**: Clear indication of missing input files\n- **Schema mismatches**: Detailed information about incompatible schemas in merge/append operations\n- **Invalid expressions**: Specific feedback on malformed filter conditions or column patterns\n- **Memory issues**: Graceful handling of large datasets with appropriate error messages\n\n## License\n\nMIT License - see LICENSE file for details.\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. 
Add tests for new functionality\n4. Ensure all tests pass\n5. Submit a pull request\n\n## Support\n\nFor issues and questions:\n\n- GitHub Issues: https://github.com/Vitruves/nail-parquet/issues\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVitruves%2Fnail-parquet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FVitruves%2Fnail-parquet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVitruves%2Fnail-parquet/lists"}