{"id":19145389,"url":"https://github.com/paradigmxyz/tbl","last_synced_at":"2025-07-24T11:05:49.873Z","repository":{"id":252433221,"uuid":"814322910","full_name":"paradigmxyz/tbl","owner":"paradigmxyz","description":"tbl is a swiss army knife for parquet read and write operations","archived":false,"fork":false,"pushed_at":"2024-09-04T08:37:58.000Z","size":536,"stargazers_count":123,"open_issues_count":4,"forks_count":6,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-30T20:11:55.706Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/paradigmxyz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-12T19:28:17.000Z","updated_at":"2025-02-05T21:47:00.000Z","dependencies_parsed_at":"2024-08-14T02:18:47.989Z","dependency_job_id":"5a56e23a-d5d5-41f1-a877-d8ee0b84af4a","html_url":"https://github.com/paradigmxyz/tbl","commit_stats":null,"previous_names":["paradigmxyz/tbl"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradigmxyz%2Ftbl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradigmxyz%2Ftbl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradigmxyz%2Ftbl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paradigmxyz%2Ftbl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/paradigmxyz","download_url":"https:/
/codeload.github.com/paradigmxyz/tbl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247557767,"owners_count":20958047,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T07:39:57.905Z","updated_at":"2025-04-06T22:09:51.065Z","avatar_url":"https://github.com/paradigmxyz.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"\n# tbl ┳━┳\n\n`tbl` is a cli tool for reading and editing parquet files\n\n#### Goals of `tbl`:\n- be a swiss army knife for reading/editing parquet (kind of like [`jq`](https://github.com/jqlang/jq) is for JSON)\n- make it effortless to manage multi-file multi-schema parquet datasets\n- use a cli-native version of [polars](https://github.com/pola-rs/polars) syntax, so if you know python polars you already mostly know `tbl`\n\n#### Example use cases:\n- quickly look up schemas, row counts, and per-column storage usage\n- migrate from one schema to another, like add/remove/rename a column\n- perform these operations on multiple files in parallel\n\n\nTo discuss `tbl`, check out the [Paradigm Data Tools](https://t.me/paradigm_data) telegram group.\n\n\n## Contents\n1. [Installation](#installation)\n2. [Example Usage](#example-usage)\n    1. [Listing files](#listing-files)\n    2. [Looking up schemas](#looking-up-schemas)\n    3. [Selecting input files](#selecting-input-files)\n    4. [Performing edits](#performing-edits)\n    5. [Selecting output mode](#selecting-output-mode)\n4. [API Reference](#api-reference)\n    1. 
[`tbl`](#tbl)\n    2. [`tbl ls`](#tbl-ls)\n    3. [`tbl schema`](#tbl-schema)\n6. [FAQ](#faq)\n    1. [What is parquet?](#what-is-parquet)\n    2. [What other parquet cli tools exist?](#what-other-parquet-cli-tools-exist)\n    3. [Why use `tbl` when `duckdb` has a cli?](#why-use-tbl-when-duckdb-has-a-cli)\n    4. [What is the plan for `tbl`?](#what-is-the-plan-for-tbl)\n\n## Installation\n\n##### Install from crates.io\n```bash\ncargo install tbl-cli\n```\n\n##### Install from source\n```bash\ngit clone https://github.com/paradigmxyz/tbl\ncd tbl\ncargo install --path crates/tbl-cli\n```\n\n## Example Usage\n\n### Listing files\n\n`tbl` can list files and display their statistics, similar to the `ls` cli command.\n\nThe command `tbl ls` produces output:\n\n```\nblocks__00000000_to_00000999.parquet\nblocks__00001000_to_00001999.parquet\nblocks__00002000_to_00002999.parquet\nblocks__00003000_to_00003999.parquet\nblocks__00004000_to_00004999.parquet\nblocks__00005000_to_00005999.parquet\nblocks__00006000_to_00006999.parquet\nblocks__00007000_to_00007999.parquet\nblocks__00008000_to_00008999.parquet\nblocks__00009000_to_00009999.parquet\n... 
19,660 files not shown\n19,041,325 rows stored in 1.05 GB across 19,708 tabular files\n```\n\nSee full list of `tbl ls` options [below](#tbl-ls).\n\n### Looking up schemas\n\n`tbl` can display the schemas of parquet files.\n\nThe command `tbl schema` produces output:\n\n```\n1 unique schema, 19,041,325 rows, 19,708 files, 1.05 GB\n\n     column name  │   dtype  │  disk size  │  full size  │  disk %\n──────────────────┼──────────┼─────────────┼─────────────┼────────\n      block_hash  │  binary  │  649.97 MB  │  657.93 MB  │  63.78%\n          author  │  binary  │   40.52 MB  │   40.59 MB  │   3.98%\n    block_number  │     u32  │   76.06 MB  │   75.75 MB  │   7.46%\n        gas_used  │     u64  │   84.23 MB  │  133.29 MB  │   8.26%\n      extra_data  │  binary  │   46.66 MB  │   76.91 MB  │   4.58%\n       timestamp  │     u32  │   76.06 MB  │   75.75 MB  │   7.46%\nbase_fee_per_gas  │     u64  │   41.85 MB  │   49.58 MB  │   4.11%\n        chain_id  │     u64  │    3.74 MB  │    3.70 MB  │   0.37%\n```\n\nSee full list of `tbl schema` options [below](#tbl-schema).\n\n### Selecting input files\n\n`tbl` can operate on one file, or many files across multiple directories.\n\nThese input selection options can be used with each `tbl` subcommand:\n\n| input selection | command |\n| --- | --- |\n| Select all tabular files in current directory | `tbl` (default behavior) |\n| Select a single file | `tbl /path/to/file.parquet` |\n| Select files using a glob | `tbl *.parquet` |\n| Select files from multiple directories | `tbl /path/to/dir1 /path/to/dir2` |\n| Select files recursively | `tbl /path/to/dir --tree` |\n\n### Performing edits\n\n`tbl` can perform many different operations on the selected files:\n\n| operation | command |\n| --- | --- |\n| Rename a column | `tbl --rename old_name=new_name` |\n| Cast to a new type | `tbl --cast col1=u64 col2=String` |\n| Add new columns | `tbl --with-columns name:String date:Date=2024-01-01` |\n| Drop columns | `tbl --drop col1 col2 
col3` |\n| Filter rows | `tbl --filter col1=val1` \u003cbr\u003e `tbl --filter col1!=val1` \u003cbr\u003e `tbl --filter \"col1\u003eval1\"` \u003cbr\u003e `tbl --filter \"col1\u003cval1\"`\u003cbr\u003e `tbl --filter \"col1\u003e=val1\"` \u003cbr\u003e `tbl --filter \"col1\u003c=val1\"` |\n| Sort rows | `tbl --sort col1 col2:desc` |\n| Select columns | `tbl --select col1 col2 col3` |\n\nSee full list of transformation operations [below](#tbl).\n\n### Selecting output mode\n\n`tbl` can output its results in many different modes:\n\n| output mode | description | command |\n| --- | --- | --- |\n| Single File | output all results to single file | `tbl --output-file /path/to/file.parquet` |\n| Inplace | modify each file inplace | `tbl --inplace` |\n| New Directory | create equivalent files in a new directory | `tbl --output-dir /path/to/dir` |\n| Interactive | load dataframe in interactive python session | `tbl --df` |\n| Stdout | output data to stdout | `tbl` (default behavior) |\n\nSee full list of output options [below](#tbl).\n\n## API Reference\n\n#### `tbl`\n##### Output of `tbl -h`:\n\n```markdown\ntbl is a tool for reading and editing tabular data files\n\nUsage: tbl has two modes\n1. Summary mode: tbl [ls | schema] [SUMMARY_OPTIONS]\n2. Data mode:    tbl [DATA_OPTIONS]\n\nGet help with SUMMARY_OPTIONS using tbl [ls | schema] -h\n\nData mode is the default mode. DATA_OPTIONS are documented below\n\nOptional Subcommands:\n  ls      Display list of tabular files, similar to the cli `ls` command\n  schema  Display table representation of each schema in the selected files\n\nGeneral Options:\n  -h, --help                       display help message\n  -V, --version                    display version\n\nInput Options:\n  [PATHS]...                       input path(s) to use\n  -t, --tree                       recursively use all files in tree as inputs\n\nTransform Options:\n  -c, --columns \u003cCOLUMN\u003e...        
select only these columns [alias --select]\n      --drop \u003cDROP\u003e...             drop column(s)\n      --with-columns \u003cNEW_COL\u003e...  insert columns, syntax NAME:TYPE [alias --with]\n      --rename \u003cRENAME\u003e...         rename column(s), syntax OLD_NAME=NEW_NAME\n      --cast \u003cCAST\u003e...             change column type(s), syntax COLUMN=TYPE\n      --set \u003cCOLUMN\u003e...            set column values, syntax COLUMN=VALUE\n      --nullify \u003cCOLUMN\u003e...        set column values to null\n      --filter \u003cFILTER\u003e...         filter rows by values, syntax COLUMN=VALUE\n      --sort \u003cSORT\u003e...             sort rows, syntax COLUMN[:desc]\n      --head \u003cHEAD\u003e                keep only the first n rows [alias --limit]\n      --tail \u003cTAIL\u003e                keep only the last n rows\n      --offset \u003cOFFSET\u003e            skip the first n rows of table\n      --value-counts \u003cCOLUMN\u003e      compute value counts of column(s)\n\nOutput Options:\n      --no-summary                 skip printing a summary\n  -n, --n \u003cN\u003e                      number of rows to print in stdout, all for all\n      --csv                        output data as csv\n      --json                       output data as json\n      --jsonl                      output data as json lines\n      --hex                        encode binary columns as hex for output\n      --inplace                    modify files in place\n      --output-file \u003cFILE_PATH\u003e    write all data to a single new file\n      --output-dir \u003cDIR_PATH\u003e      rewrite all files into this output directory\n      --output-prefix \u003cPRE-FIX\u003e    prefix to add to output filenames\n      --output-postfix \u003cPOST-FIX\u003e  postfix to add to output filenames\n      --df                         load as DataFrame in interactive python session\n      --lf                         load as LazyFrame in interactive python session\n 
     --executable \u003cEXECUTABLE\u003e    python executable to use with --df or --lf\n      --confirm                    confirm that files should be edited\n      --dry                        dry run without editing files\n\nOutput Modes:\n1. output results in single file   --output-file /path/to/file.parquet\n2. modify each file inplace        --inplace\n3. copy files into a new dir       --output-dir /path/to/dir\n4. load as interactive python      --df | --lf\n5. output data to stdout           (default behavior)\n```\n\n#### `tbl ls`\n##### Output of `tbl ls -h`:\n\n```markdown\nDisplay list of tabular files, similar to the cli `ls` command\n\nUsage: tbl ls [OPTIONS] [PATHS]...\n\nArguments:\n  [PATHS]...  input path(s) to use\n\nOptions:\n  -t, --tree         recursively list all files in tree\n      --absolute     show absolute paths instead of relative\n      --n \u003cN\u003e        number of file names to print\n      --sort \u003cSORT\u003e  sort by number of rows, files, or bytes [default: bytes]\n\nGeneral Options:\n  -h, --help  display help message\n```\n\n#### `tbl schema`\n##### Output of `tbl schema -h`:\n\n```markdown\nDisplay table representation of each schema in the selected files\n\nUsage: tbl schema [OPTIONS] [PATHS]...\n\nArguments:\n  [PATHS]...  input path(s) to use\n\nOptions:\n  -t, --tree               recursively list all files in tree\n      --columns \u003cCOLUMNS\u003e  columns to print\n      --n \u003cN\u003e              number of schemas to print\n      --examples           show examples\n      --absolute           show absolute paths in examples\n      --sort \u003cSORT\u003e        sort by number of rows, files, or bytes [default: bytes]\n\nGeneral Options:\n  -h, --help  display help message\n```\n\n## FAQ\n\n### What is parquet?\n\n[Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) is a file format for storing tabular datasets. In many cases parquet is a simpler and faster alternative to using an actual database. 
Parquet has become an industry standard and its ecosystem of tools is growing rapidly.\n\n### What other parquet cli tools exist?\n\nThe most common tools are [`duckdb`](https://duckdb.org/docs/api/cli/overview), [`pqrs`](https://github.com/manojkarthick/pqrs), and [`parquet-cli`](https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md).\n\n### Why use `tbl` when `duckdb` has a cli?\n\n`duckdb` is an incredible tool. We recommend checking it out, especially when you're running complex workloads. However, there are 3 reasons you might prefer `tbl` as a cli tool:\n1. **CLI-Native:** Compared to `duckdb`'s SQL, `tbl` has a cli-native syntax. This makes `tbl` simpler to use with fewer keystrokes:\n    1. `duckdb \"DESCRIBE read_parquet('test.parquet')\"` vs `tbl schema test.parquet` \n    2. `duckdb \"SELECT * FROM read_parquet('test.parquet')\"` vs `tbl test.parquet`\n    3. `duckdb \"SELECT * FROM read_parquet('test.parquet') ORDER BY col1\"` vs `tbl test.parquet --sort col1`\n2. **High Level vs Low Level:** Sometimes SQL can also be a very low-level language. `tbl` and `polars` let you operate on a higher level of abstraction which reduces cognitive load:\n    1. `duckdb`: `duckdb \"SELECT col1, COUNT(col1) FROM read_parquet('test.parquet') GROUP BY col1\"`\n    2. `tbl`: `tbl test.parquet --value-counts col1`\n3. **Operational QoL:** `tbl` is built specifically for making it easy to manage large parquet archives. Features like `--tree`, `--inplace`, and multi-schema commands make life easier for archive management.\n\n### What is the plan for `tbl`?\n\nThere are a few features that we are currently exploring:\n1. **S3 and cloud buckets**: ability to read and write cloud bucket parquet files using the same operations that can be performed on local files\n2. **Re-partitioning**: ability to change how a set of parquet files are partitioned, such as changing the partition key or partition size\n3. 
**Direct python syntax**: ability to directly use python polars syntax to perform complex operations like `group_by()`, `join()`, and more\n4. **Idempotent Workflows**: ability to interrupt and re-run commands arbitrarily would make migrations more robust\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparadigmxyz%2Ftbl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fparadigmxyz%2Ftbl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparadigmxyz%2Ftbl/lists"}