# zig-parquet

A native Parquet library built for portability, embeddability, and low deployment friction.
Use it from Zig or through a C ABI.

[![CI](https://github.com/akeating/zig-parquet/actions/workflows/test.yml/badge.svg)](https://github.com/akeating/zig-parquet/actions/workflows/test.yml)
[![Zig](https://img.shields.io/badge/Zig-0.15.2-f7a41d?logo=zig)](https://ziglang.org/)
[![License](https://img.shields.io/badge/License-MIT%2FApache--2.0-blue.svg)](COPYRIGHT)

## Features

- **Embeddable Native Library** - Link Parquet support directly into native applications
- **Full Read/Write Support** - Read and write Parquet files with all physical and logical types
- **All Standard Encodings** - PLAIN, RLE, DICTIONARY, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, BYTE_STREAM_SPLIT
- **Nested Types** - Lists, structs, maps, and arbitrary nesting depth
- **Compression** - zstd, gzip, snappy, lz4, brotli (individually selectable; experimental pure Zig zstd, gzip, snappy available)
- **Logical Types** - STRING, DATE, TIME, TIMESTAMP (millis/micros/nanos), DECIMAL, UUID, INT annotations, FLOAT16, ENUM, JSON, BSON, INTERVAL, GEOMETRY, GEOGRAPHY
- **Dynamic Row API** - Runtime `DynamicWriter` / `DynamicReader` for all types and arbitrary nesting depth
- **Schema-Agnostic Reading** - Read any Parquet file without knowing the schema at compile time
- **Column Statistics** - Min/max/null_count in column metadata
- **Page-Level CRC Checksums** - Written by default, validated on read
- **Key-Value Metadata** - Read and write arbitrary file-level metadata
- **DataPage V1 and V2** - Read both page formats; write uses V1
- **Buffer and Callback Transports** - Read/write from memory, files, or custom I/O backends
- **Hardened Against Malformed Input** - Designed for safe casting, bounds checking, and no undefined behavior on untrusted data
- **C ABI with Arrow C Data Interface** - Call from C, C++, and other languages via ArrowSchema, ArrowArray, and ArrowArrayStream
- **Portable Deployment** - Native library and CLI for desktops, servers, edge devices, and serverless jobs
- **WASM Compatible** - 103 KB plain or 438 KB with all compression codecs (brotli-compressed)
- **CLI Tool** - `pqi` for inspecting and validating Parquet files

## CLI Tool

The `pqi` command-line tool is included for working with Parquet files:

```bash
# Build the CLI
cd cli && zig build

# Show schema
pqi schema data.parquet

# Preview rows
pqi head data.parquet -n 10

# Output all rows as JSON
pqi cat data.parquet --json

# Row count
pqi count data.parquet

# File statistics
pqi stats data.parquet

# Row group details
pqi rowgroups data.parquet

# File size breakdown
pqi size data.parquet

# Column detail across row groups
pqi column data.parquet price quantity

# Validate file integrity
pqi validate data.parquet
```

## Why zig-parquet?

If you need Parquet support inside a native application, zig-parquet is a straightforward way to ship it.

- **Embed directly** - Use Parquet from Zig or through the C ABI instead of shelling out to a separate tool or service
- **Keep deployment simple** - Stay native without requiring the JVM, Python, or the full Arrow C++ stack
- **Ship across targets** - Use the same core library on desktops, servers, edge devices, serverless workloads, and WASM
- **Start with the CLI** - Use `pqi` to inspect and validate files, then embed the same implementation in your application
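Because `pqi cat --json` emits JSON, the CLI composes with standard Unix tools. A small sketch, assuming the output is one JSON object per row (the filename and the `temperature` / `sensor_id` fields are illustrative, not part of `pqi` itself):

```bash
# Keep only rows above a threshold (field name is illustrative)
pqi cat sensors.parquet --json | jq 'select(.temperature > 25.0)'

# Project two fields into TSV for downstream tools
pqi cat sensors.parquet --json | jq -r '[.sensor_id, .temperature] | @tsv'
```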
## Installation

Add `zig-parquet` to your project using `zig fetch`. This will automatically download the package and update your `build.zig.zon` with the correct cryptographic hash:

```bash
zig fetch --save https://github.com/akeating/zig-parquet/releases/download/v0.1.6/zig-parquet-v0.1.6.tar.gz
```

Then in your `build.zig`:

```zig
const target = b.standardTargetOptions(.{});
const optimize = b.standardOptimizeOption(.{});

const parquet = b.dependency("parquet", .{
    .target = target,
    .optimize = optimize,
});
exe.root_module.addImport("parquet", parquet.module("parquet"));
exe.linkLibrary(parquet.artifact("parquet"));
```

## Quick Start

### Row-Based API (Recommended)

Define your schema at runtime, write rows with typed setters, and read back dynamically:

```zig
const std = @import("std");
const parquet = @import("parquet");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Write
    {
        const file = try std.fs.cwd().createFile("sensors.parquet", .{});
        defer file.close();

        var writer = try parquet.createFileDynamic(allocator, file);
        defer writer.deinit();

        // Columns are OPTIONAL by default; use .asRequired() for non-nullable
        const TypeInfo = parquet.TypeInfo;
        try writer.addColumn("sensor_id", TypeInfo.int32.asRequired(), .{});
        try writer.addColumn("timestamp", TypeInfo.timestamp_micros, .{});
        try writer.addColumn("temperature", TypeInfo.double_, .{});
        try writer.addColumn("location", TypeInfo.string, .{});
        writer.setCompression(.zstd);
        try writer.begin();

        try writer.setInt32(0, 1);
        try writer.setInt64(1, 1704067200000000);
        try writer.setDouble(2, 23.5);
        try writer.setBytes(3, "Building A");
        try writer.addRow();

        try writer.close();
    }

    // Read
    {
        const file = try std.fs.cwd().openFile("sensors.parquet", .{});
        defer file.close();

        var reader = try parquet.openFileDynamic(allocator, file, .{});
        defer reader.deinit();

        const rows = try reader.readAllRows(0);
        defer {
            for (rows) |row| row.deinit();
            allocator.free(rows);
        }

        for (rows) |row| {
            const id = if (row.getColumn(0)) |v| v.asInt32() orelse 0 else 0;
            const temp = if (row.getColumn(2)) |v| v.asDouble() orelse 0 else 0;
            std.debug.print("Sensor {}: {d}°C\n", .{ id, temp });
        }
    }
}
```

### Column Projection

Read only a subset of top-level columns (skips I/O for unrequested columns):

```zig
var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();

// Read only columns 1 and 3 (dense-packed: returned rows have 2 values)
const rows = try reader.readRowsProjected(0, &.{ 1, 3 });
defer {
    for (rows) |row| row.deinit();
    allocator.free(rows);
}

for (rows) |row| {
    const name = if (row.getColumn(0)) |v| v.asBytes() orelse "" else "";
    const score = if (row.getColumn(1)) |v| v.asDouble() orelse 0 else 0;
    std.debug.print("{s}: {d}\n", .{ name, score });
}
```

### Row Group Filtering

Use column statistics to skip row groups that don't match your criteria:

```zig
var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();

for (0..reader.getNumRowGroups()) |rg| {
    const stats = reader.getColumnStatistics(0, rg) orelse continue;
    const min_bytes = stats.min_value orelse stats.min orelse continue;
    const max_bytes = stats.max_value orelse stats.max orelse continue;
    const min = std.mem.readInt(i32, min_bytes[0..4], .little);
    const max = std.mem.readInt(i32, max_bytes[0..4], .little);

    if (target < min or target > max) continue; // skip this row group

    const rows = try reader.readAllRows(rg);
    defer {
        for (rows) |row| row.deinit();
        allocator.free(rows);
    }
    // ... process matching rows ...
}
```

### Row Iterator

Stream through all rows without managing row groups manually. Only one row group's data is held in memory at a time:

```zig
var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();

var iter = reader.rowIterator();
defer iter.deinit();

while (try iter.next()) |row| {
    const id = if (row.getColumn(0)) |v| v.asInt32() orelse 0 else 0;
    std.debug.print("id={}\n", .{id});
}
```

## Supported Types

### Physical Types

| Parquet Type | Zig Type |
|--------------|----------|
| BOOLEAN | `bool` |
| INT32 | `i32` |
| INT64 | `i64` |
| FLOAT | `f32` |
| DOUBLE | `f64` |
| BYTE_ARRAY | `[]const u8` |
| FIXED_LEN_BYTE_ARRAY | `[]const u8` |

### Logical Types

| Logical Type | TypeInfo Constant | Physical Storage |
|--------------|-------------------|------------------|
| STRING | `TypeInfo.string` | BYTE_ARRAY |
| DATE | `TypeInfo.date` | INT32 (days since epoch) |
| TIMESTAMP | `TypeInfo.timestamp_micros` | INT64 |
| TIME | `TypeInfo.time_micros` | INT64 |
| UUID | `TypeInfo.uuid` | FIXED_LEN_BYTE_ARRAY(16) |
| INTERVAL | `TypeInfo.interval` | FIXED_LEN_BYTE_ARRAY(12) |
| GEOMETRY | `TypeInfo.geometry` | BYTE_ARRAY (WKB) |
| GEOGRAPHY | `TypeInfo.geography` | BYTE_ARRAY (WKB) |
| DECIMAL | `TypeInfo.forDecimal(p, s)` | INT32/INT64/FIXED |
| JSON | `TypeInfo.json` | BYTE_ARRAY |
| BSON | `TypeInfo.bson` | BYTE_ARRAY |
| ENUM | `TypeInfo.enum_` | BYTE_ARRAY |
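The `TypeInfo` constants above plug directly into `addColumn`. A minimal sketch for DECIMAL and DATE columns; the storage table lists INT32/INT64/FIXED backing for DECIMAL, and it is an assumption here that a DECIMAL(10, 2) column takes its unscaled integer value through `setInt64`:

```zig
// DECIMAL(10, 2): values are stored as unscaled integers, so 12345 means 123.45
try writer.addColumn("price", parquet.TypeInfo.forDecimal(10, 2), .{});
// DATE: stored as INT32 days since the Unix epoch
try writer.addColumn("order_date", parquet.TypeInfo.date, .{});
try writer.begin();

try writer.setInt64(0, 12345); // 123.45, assuming INT64 backing at this precision
try writer.setInt32(1, 19723); // 2024-01-01 as days since 1970-01-01
try writer.addRow();
try writer.close();
```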
### Nested Types

Build arbitrary nested schemas at runtime using `SchemaNode`:

```zig
// list<struct<product_id: i32, quantity: i32, price: f64>>
const pid = try writer.allocSchemaNode(.{ .int32 = .{} });
const qty = try writer.allocSchemaNode(.{ .int32 = .{} });
const price = try writer.allocSchemaNode(.{ .double = .{} });
var fields = try writer.allocSchemaFields(3);
fields[0] = .{ .name = try writer.dupeSchemaName("product_id"), .node = pid };
fields[1] = .{ .name = try writer.dupeSchemaName("quantity"), .node = qty };
fields[2] = .{ .name = try writer.dupeSchemaName("price"), .node = price };
const item = try writer.allocSchemaNode(.{ .struct_ = .{ .fields = fields } });
const items = try writer.allocSchemaNode(.{ .list = item });
try writer.addColumnNested("items", items, .{});
```

Supports lists, structs, maps, and arbitrary nesting depth (e.g., `list<map<string, list<struct<...>>>>`).
See `examples/basic/03_nested_types.zig` for a complete example.

## Compression

All major Parquet compression codecs are supported, individually selectable at build time:

| Codec | Implementation | Notes |
|-------|---------------|-------|
| zstd | C libzstd 1.5.7 | Recommended default |
| gzip | C zlib 1.3.1 | Wide compatibility |
| snappy | C++ snappy 1.2.2 | Fast, moderate ratio |
| lz4 | C lz4 1.10.0 | Very fast |
| brotli | C brotli 1.2.0 | High ratio |
| zig-zstd | Pure Zig (experimental) | No C dependency; level-1 compressor + stdlib decompressor |
| zig-gzip | Pure Zig (experimental) | No C dependency; level-9 deflate compressor + stdlib decompressor |
| zig-snappy | Pure Zig (experimental) | No C/C++ dependency; full Snappy block format |

```zig
var writer = try parquet.createFileDynamic(allocator, file);
writer.setCompression(.zstd);
```

```bash
zig build                           # all codecs (default: C libs)
zig build -Dcodecs=none             # no compression (smallest binary)
zig build -Dcodecs=zstd,snappy      # only zstd and snappy
zig build -Dcodecs=zig-only         # all pure Zig codecs (no C/C++ deps)
```

See [COMPRESSION.md](COMPRESSION.md) for build sizes, API details, and the full set of build options.

### Per-Column and Per-Leaf Options

Set options per column at definition time, or per leaf path for nested types:
```zig
// Per-column options via addColumn
try writer.addColumn("timestamp", parquet.TypeInfo.int64, .{
    .encoding = .delta_binary_packed,
    .compression = .zstd,
});

// Per-leaf options for nested columns via setPathProperties
try writer.addColumnNested("address", struct_node, .{});
try writer.setPathProperties("address.city", .{ .compression = .zstd });
try writer.setPathProperties("address.zip", .{ .use_dictionary = false });
```

Global defaults apply to any column/leaf without an explicit override:

```zig
writer.setUseDictionary(false);        // disable dictionary encoding globally
writer.setIntEncoding(.delta_binary_packed);  // default for int columns
writer.setMaxPageSize(1_048_576);      // 1MB page size limit
```

## Spec Coverage

| Feature | | Notes |
|---------|:-:|-------|
| **Physical Types** | | |
| BOOLEAN, INT32, INT64, FLOAT, DOUBLE | ✅ | All primitive types |
| BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY | ✅ | Variable and fixed-length binary |
| INT96 | ✅ | Legacy timestamp support; read always, write via column API only (not DynamicWriter) |
| **Encodings** | | |
| PLAIN | ✅ | All physical types |
| RLE / BIT_PACKED | ✅ | Levels, dictionary indices, booleans |
| PLAIN_DICTIONARY / RLE_DICTIONARY | ✅ | Strings and integers |
| DELTA_BINARY_PACKED | ✅ | Sorted integers, timestamps |
| DELTA_LENGTH_BYTE_ARRAY | ✅ | Variable-length byte arrays |
| DELTA_BYTE_ARRAY | ✅ | Sorted strings (prefix compression) |
| BYTE_STREAM_SPLIT | ✅ | Float/double/int/fixed columns |
| **Compression** | | |
| UNCOMPRESSED | ✅ | |
| SNAPPY | ✅ | Via C++ library |
| GZIP | ✅ | C zlib (default) or pure Zig (experimental via `zig-gzip`) |
| ZSTD | ✅ | C libzstd (default) or pure Zig (experimental via `zig-zstd`) |
| LZ4_RAW | ✅ | Via lz4 |
| BROTLI | ✅ | Via brotli |
| LZ4 (non-raw) | ❌ | Hadoop-specific framing format |
| LZO | ❌ | Not implemented |
| **Logical Types** | | |
| STRING, ENUM, JSON, BSON | ✅ | BYTE_ARRAY with annotation |
| UUID | ✅ | FIXED_LEN_BYTE_ARRAY(16) |
| INT (8/16/32/64, signed/unsigned) | ✅ | Width annotations |
| DECIMAL | ✅ | INT32/INT64/FIXED backing |
| FLOAT16 | ✅ | Half-precision float |
| DATE | ✅ | Days since epoch |
| TIME (MILLIS/MICROS) | ✅ | Time of day |
| TIMESTAMP (MILLIS/MICROS) | ✅ | Instant or local |
| TIME/TIMESTAMP (NANOS) | ✅ | Full read/write support |
| INTERVAL | ✅ | Legacy ConvertedType (months/days/millis) |
| GEOMETRY / GEOGRAPHY | ✅ | GeoParquet 1.1 compatible |
| VARIANT | ⏳ | Future |
| **Nested Types** | | |
| LIST | ✅ | 3-level structure |
| MAP | ✅ | Key-value pairs |
| Nested structs | ✅ | Arbitrary depth |
| **Page Types** | | |
| DATA_PAGE (v1) | ✅ | |
| DATA_PAGE_V2 | ✅ | Read only; optimized split decompression |
| DICTIONARY_PAGE | ✅ | |
| **Features** | | |
| Column projection | ✅ | Read subset of columns; skips I/O for unselected columns |
| Row group filtering | ✅ | Statistics-based; skip row groups via min/max/null_count |
| Streaming iteration | ✅ | Row iterator; one row group in memory at a time |
| Column statistics | ✅ | min/max/null_count |
| Multi-page columns | ✅ | Large column support |
| Multi-row-group files | ✅ | |
| Bloom filters | ⏳ | Planned |
| Page Index | ⏳ | Planned |
| CRC checksums | ✅ | Page-level CRC32 |
| Encryption | 🔍 | Under review — Java/Python-only ecosystem support |

Legend: ✅ Supported | ⏳ Planned | 🔍 Under review | ❌ Unsupported

Files containing unsupported features return explicit errors rather than silently producing incorrect results.
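Because unsupported or malformed input surfaces as explicit errors, untrusted files can be rejected at open time. A minimal sketch using only the `openFileDynamic` call shown in the Quick Start; the specific error set is not documented here, so the handler just logs the error name:

```zig
const file = try std.fs.cwd().openFile("untrusted.parquet", .{});
defer file.close();

var reader = parquet.openFileDynamic(allocator, file, .{}) catch |err| {
    // Malformed structure or an unsupported feature lands here instead of
    // silently producing incorrect rows.
    std.debug.print("rejected untrusted.parquet: {s}\n", .{@errorName(err)});
    return;
};
defer reader.deinit();
```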
## WASM Support

Both `wasm32-wasi` and `wasm32-freestanding` targets are supported. WASI supports all codecs via `-Dcodecs=`; freestanding builds without compression. See [COMPRESSION.md](COMPRESSION.md) for per-codec WASM binary sizes.

Build for WASI:

```bash
cd zig-parquet
zig build -Dwasm_wasi -Doptimize=ReleaseSmall
# Output: zig-out/bin/parquet_wasi.wasm
```

Build for browser (freestanding, no compression):

```bash
cd zig-parquet
zig build -Dwasm_freestanding -Doptimize=ReleaseSmall
# Output: zig-out/bin/parquet_freestanding.wasm
```

Run with a WASI runtime:

```bash
wasmtime --dir=. zig-out/bin/parquet_wasi.wasm
```

See `examples/wasm_demo/` and `examples/wasm_freestanding/` for usage examples.

## Requirements

- **Zig 0.15.2**
- C compiler (for compression libraries; not needed with `-Dcodecs=none` or `-Dcodecs=zig-only`)
- C++ compiler (for Snappy; not needed if snappy is excluded)

## License

Licensed under either of

- [Apache License, Version 2.0](LICENSE-APACHE)
- [MIT License](LICENSE-MIT)

at your option.

## Contributing

Contributions welcome! Please read the existing code style and add tests for new functionality.

## Acknowledgments

- [Apache Parquet](https://parquet.apache.org/) specification
- [PyArrow](https://arrow.apache.org/docs/python/) for reference implementation and test file generation