{"id":19808683,"url":"https://github.com/luminousmen/data-toolset","last_synced_at":"2025-10-19T23:07:11.201Z","repository":{"id":194579467,"uuid":"691394391","full_name":"luminousmen/data-toolset","owner":"luminousmen","description":"Upgrade from avro-tools and parquet-tools jars to a more user-friendly Python package.","archived":false,"fork":false,"pushed_at":"2024-03-04T22:04:24.000Z","size":1448,"stargazers_count":1,"open_issues_count":8,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-12T23:48:07.446Z","etag":null,"topics":["avro","avro-tools","hacktoberfest","parquet","parquet-tools"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luminousmen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null}},"created_at":"2023-09-14T05:08:18.000Z","updated_at":"2024-11-15T06:20:36.000Z","dependencies_parsed_at":"2023-11-07T16:42:34.224Z","dependency_job_id":"27478e21-8e5b-4d20-ac03-43d7189fd3a8","html_url":"https://github.com/luminousmen/data-toolset","commit_stats":null,"previous_names":["luminousmen/data-toolset"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminousmen%2Fdata-toolset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminousmen%2Fdata-toolset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminousmen%2Fdata-toolset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminousmen%2Fdata-toolset/manifests","owner_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luminousmen","download_url":"https://codeload.github.com/luminousmen/data-toolset/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247301228,"owners_count":20916471,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avro","avro-tools","hacktoberfest","parquet","parquet-tools"],"created_at":"2024-11-12T09:14:31.195Z","updated_at":"2025-10-19T23:07:11.107Z","avatar_url":"https://github.com/luminousmen.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/luminousmen/data-toolset/master/branding/logo/logo.png\" width=\"200\"\u003e\n\u003c/div\u003e\n\n[![Master](https://github.com/luminousmen/data-toolset/actions/workflows/master.yml/badge.svg?branch=master)](https://github.com/luminousmen/data-toolset/actions/workflows/master.yml)\n[![codecov](https://codecov.io/gh/luminousmen/data-toolset/branch/master/graph/badge.svg?token=6V9IPSRCB0)](https://codecov.io/gh/luminousmen/data-toolset)\n\n# data-tools(et)\n\ndata-toolset is designed to simplify your data processing tasks by providing a more user-friendly alternative to the traditional JAR utilities like avro-tools and parquet-tools. 
With this Python package, you can effortlessly handle various data file formats, including Avro and Parquet, using a simple and intuitive command-line interface.\n\n## Installation\n\nPython 3.8, 3.9, and 3.10 are supported and tested (to some extent).\n\n```bash\npython -m pip install data-toolset\n```\n\n## Legacy\n\nDo you want polars to run on an old CPU (e.g. dating from before 2011), or on an x86-64 build of Python on Apple Silicon under Rosetta? Run `pip install polars-lts-cpu`. This version of polars is compiled without AVX target features.\n\n## Usage\n\n```bash\n$ data-toolset -h\nusage: data-toolset [-h] {head,tail,meta,schema,stats,query,validate,merge,count,to_json,to_csv,to_avro,to_parquet,random_sample} ...\n\npositional arguments:\n  {head,tail,meta,schema,stats,query,validate,merge,count,to_json,to_csv,to_avro,to_parquet,random_sample}\n                        commands\n    head                Print the first N records from a file\n    tail                Print the last N records from a file\n    meta                Print a file's metadata\n    schema              Print the Avro schema for a file\n    stats               Print statistics about a file\n    query               Query a file\n    validate            Validate a file\n    merge               Merge multiple files into one\n    count               Count the number of records in a file\n    to_json             Convert a file to JSON format\n    to_csv              Convert a file to CSV format\n    to_avro             Convert a file to Avro format\n    to_parquet          Convert a file to Parquet format\n    random_sample       Randomly sample records from a file\n```\n\n## Examples\n\nPrint the first 10 records of a Parquet file:\n\n```bash\n$ data-toolset head my_data.parquet -n 10\nshape: (1, 7)\n┌───────────┬─────┬──────────┬────────┬──────────────────────────┬────────────────────────────┬──────────────────┐\n│ character ┆ age ┆ is_human ┆ height ┆ quote                    ┆ 
friends                    ┆ appearance       │\n│ ---       ┆ --- ┆ ---      ┆ ---    ┆ ---                      ┆ ---                        ┆ ---              │\n│ str       ┆ i64 ┆ bool     ┆ f64    ┆ str                      ┆ list[str]                  ┆ struct[2]        │\n╞═══════════╪═════╪══════════╪════════╪══════════════════════════╪════════════════════════════╪══════════════════╡\n│ Alice     ┆ 10  ┆ true     ┆ 150.5  ┆ Curiouser and curiouser! ┆ [\"Rabbit\", \"Cheshire Cat\"] ┆ {\"blue\",\"small\"} │\n└───────────┴─────┴──────────┴────────┴──────────────────────────┴────────────────────────────┴──────────────────┘\n```\n\nQuery a Parquet file using a SQL-like expression:\n\n```bash\n$ data-toolset query my_data.parquet \"SELECT * FROM 'my_data.parquet' WHERE height \u003e 165\"\nshape: (2, 7)\n┌─────────────────┬─────┬──────────┬────────┬───────────────────────┬────────────────────────────────────┬───────────────────┐\n│ character       ┆ age ┆ is_human ┆ height ┆ quote                 ┆ friends                            ┆ appearance        │\n│ ---             ┆ --- ┆ ---      ┆ ---    ┆ ---                   ┆ ---                                ┆ ---               │\n│ str             ┆ i64 ┆ bool     ┆ f64    ┆ str                   ┆ list[str]                          ┆ struct[2]         │\n╞═════════════════╪═════╪══════════╪════════╪═══════════════════════╪════════════════════════════════════╪═══════════════════╡\n│ Mad Hatter      ┆ 35  ┆ true     ┆ 175.2  ┆ I'm late!             ┆ [\"Alice\"]                          ┆ {\"green\",\"tall\"}  │\n│ Queen of Hearts ┆ 50  ┆ false    ┆ 165.8  ┆ Off with their heads! 
┆ [\"White Rabbit\", \"King of Hearts\"] ┆ {\"red\",\"average\"} │\n└─────────────────┴─────┴──────────┴────────┴───────────────────────┴────────────────────────────────────┴───────────────────┘\n```\n\nGet basic data statistics: \n\n```bash\n$ data-toolset stats my_data.avro\nshape: (9, 8)\n┌────────────┬─────────────────┬───────────┬──────────┬────────────┬──────────────────────────┬─────────┬────────────┐\n│ describe   ┆ character       ┆ age       ┆ is_human ┆ height     ┆ quote                    ┆ friends ┆ appearance │\n│ ---        ┆ ---             ┆ ---       ┆ ---      ┆ ---        ┆ ---                      ┆ ---     ┆ ---        │\n│ str        ┆ str             ┆ f64       ┆ f64      ┆ f64        ┆ str                      ┆ str     ┆ str        │\n╞════════════╪═════════════════╪═══════════╪══════════╪════════════╪══════════════════════════╪═════════╪════════════╡\n│ count      ┆ 3               ┆ 3.0       ┆ 3.0      ┆ 3.0        ┆ 3                        ┆ 3       ┆ 3          │\n│ null_count ┆ 0               ┆ 0.0       ┆ 0.0      ┆ 0.0        ┆ 0                        ┆ 0       ┆ 0          │\n│ mean       ┆ null            ┆ 31.666667 ┆ 0.666667 ┆ 163.833333 ┆ null                     ┆ null    ┆ null       │\n│ std        ┆ null            ┆ 20.207259 ┆ 0.57735  ┆ 12.466889  ┆ null                     ┆ null    ┆ null       │\n│ min        ┆ Alice           ┆ 10.0      ┆ 0.0      ┆ 150.5      ┆ Curiouser and curiouser! ┆ null    ┆ null       │\n│ 25%        ┆ null            ┆ 10.0      ┆ null     ┆ 150.5      ┆ null                     ┆ null    ┆ null       │\n│ 50%        ┆ null            ┆ 35.0      ┆ null     ┆ 165.8      ┆ null                     ┆ null    ┆ null       │\n│ 75%        ┆ null            ┆ 50.0      ┆ null     ┆ 175.2      ┆ null                     ┆ null    ┆ null       │\n│ max        ┆ Queen of Hearts ┆ 50.0      ┆ 1.0      ┆ 175.2      ┆ Off with their heads!    
┆ null    ┆ null       │\n└────────────┴─────────────────┴───────────┴──────────┴────────────┴──────────────────────────┴─────────┴────────────┘\n```\n\nMerge multiple Avro files into one:\n\n```bash\n$ data-toolset merge file1.avro file2.avro file3.avro merged_file.avro\n```\n\nConvert an Avro file to Parquet:\n\n```bash\n$ data-toolset to_parquet my_data.avro output.parquet\n```\n\nConvert a Parquet file to JSON:\n\n```bash\n$ data-toolset to_json my_data.parquet output.json\n```\n\n## Contributing\n\nContributions are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue on GitHub.\n\n## TODO\n\n- optimizations [TBD]\n- benchmarking","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminousmen%2Fdata-toolset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluminousmen%2Fdata-toolset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminousmen%2Fdata-toolset/lists"}