https://github.com/ktrueda/go-parquet-tools

Faster parquet-tools

# go-parquet-tools

A Go alternative to [PyPI parquet-tools](https://pypi.org/project/parquet-tools/).

It shows the contents or schema of Parquet file(s) stored on local disk or on Amazon S3. It is not compatible with the original parquet-tools. Because it is implemented in Go, go-parquet-tools runs faster.

## Install

```bash
go install github.com/ktrueda/go-parquet-tools@latest
```

## Usage

```bash
go-parquet-tools csv test_resources/test1.parquet
one,two,three
-1,foo,true
,bar,false
2.5,baz,true
```
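The CSV output above can be consumed by other programs. The following is a minimal sketch, assuming that an empty field (as in `,bar,false`) stands for a Parquet null; `parseRows` is a hypothetical helper written for this example, not part of go-parquet-tools.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseRows reads CSV text with a header row and returns one map per record.
// Assumption: an empty field represents a null value, so it is stored as nil.
func parseRows(text string) ([]map[string]any, error) {
	records, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	header := records[0]
	rows := make([]map[string]any, 0, len(records)-1)
	for _, rec := range records[1:] {
		row := make(map[string]any, len(header))
		for i, name := range header {
			if rec[i] == "" {
				row[name] = nil
			} else {
				row[name] = rec[i]
			}
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	// Text copied from the csv example above.
	const out = "one,two,three\n-1,foo,true\n,bar,false\n2.5,baz,true\n"
	rows, err := parseRows(out)
	if err != nil {
		panic(err)
	}
	for _, r := range rows {
		fmt.Println(r["one"], r["two"], r["three"])
	}
}
```

Note that with this convention an empty string and a null cannot be told apart in the CSV; the `show --nil` option avoids the ambiguity by printing a chosen placeholder.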

```bash
go-parquet-tools show --nil None "test_resources/*"
+------+-----+-------+
| one  | two | three |
+------+-----+-------+
| -1   | foo | true  |
| None | bar | false |
| 2.5  | baz | true  |
| -1   | foo | true  |
| None | bar | false |
| 2.5  | baz | true  |
+------+-----+-------+
```
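Because the pattern `"test_resources/*"` is quoted, the shell passes it through unexpanded and the tool has to resolve it itself. A minimal sketch of such an expansion using the standard library follows; whether go-parquet-tools uses `filepath.Glob` internally is an assumption, and `expand` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// expand resolves a glob pattern to a sorted list of matching paths.
// A pattern that matches nothing yields an empty list, not an error.
func expand(pattern string) ([]string, error) {
	matches, err := filepath.Glob(pattern)
	if err != nil {
		return nil, err
	}
	sort.Strings(matches)
	return matches, nil
}

func main() {
	files, err := expand("test_resources/*")
	if err != nil {
		panic(err)
	}
	// Each matched file would be read in turn and its rows appended to the table.
	for _, f := range files {
		fmt.Println(f)
	}
}
```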

```bash
go-parquet-tools show s3://foo/test1.parquet
Downloaded s3://foo/test1.parquet to /var/folders/f3/9l_qwscs3z94m3yw255bw4l40000gn/T/9ed16365-58e2-40f2-a492-e8477b418a0f.parquet .
+-------+-----+-------+
| one   | two | three |
+-------+-----+-------+
| -1    | foo | true  |
|       | bar | false |
| 2.5   | baz | true  |
+-------+-----+-------+
```
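An `s3://` path like the one above consists of a bucket name followed by an object key. The sketch below shows one way to split it with `net/url`; `splitS3URI` is a hypothetical helper for illustration and is not taken from the go-parquet-tools source.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitS3URI splits "s3://bucket/key" into its bucket and key parts.
func splitS3URI(uri string) (bucket, key string, err error) {
	u, err := url.Parse(uri)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "s3" || u.Host == "" {
		return "", "", fmt.Errorf("not an S3 URI: %q", uri)
	}
	// The URL host is the bucket; the path without its leading slash is the key.
	return u.Host, strings.TrimPrefix(u.Path, "/"), nil
}

func main() {
	bucket, key, err := splitS3URI("s3://foo/test1.parquet")
	if err != nil {
		panic(err)
	}
	fmt.Println(bucket, key)
}
```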

```bash
go-parquet-tools inspect test_resources/test1.parquet
```

Example `inspect` output:

```bash
Version: 1
Schema:
######### schema #########
Type:
TypeLength:
RepetitionType: REQUIRED
Name: schema
NumChildren: 0xc0000288d8
ConvertedType:
Scale:
Precision:
FieldID:
LogicalType:
######### one #########
Type: DOUBLE
TypeLength:
RepetitionType: OPTIONAL
Name: one
NumChildren:
ConvertedType:
Scale:
Precision:
FieldID:
LogicalType:
######### two #########
Type: BYTE_ARRAY
TypeLength:
RepetitionType: OPTIONAL
Name: two
NumChildren:
ConvertedType: UTF8
Scale:
Precision:
FieldID:
LogicalType: LogicalType({STRING:StringType({}) MAP: LIST: ENUM: DECIMAL: DATE: TIME: TIMESTAMP: INTEGER: UNKNOWN: JSON: BSON: UUID:})
######### three #########
Type: BOOLEAN
TypeLength:
RepetitionType: OPTIONAL
Name: three
NumChildren:
ConvertedType:
Scale:
Precision:
FieldID:
LogicalType:
NumRows: 3
RowGroups:
Columns:
#########
FilePath
FileOffset 108
MetaData.Type DOUBLE
MetaData.Encodings [PLAIN_DICTIONARY PLAIN RLE]
MetaData.PathInSchema [one]
MetaData.Codec SNAPPY
MetaData.NumValues 3
MetaData.TotalUncompressedSize 100
MetaData.TotalCompressedSize 104
MetaData.KeyValueMetadata []
MetaData.DataPageOffset 36
MetaData.IndexPageOffset
MetaData.DictionaryPageOffset 0xc000028930
MetaData.Statistics Statistics({Max:[0 0 0 0 0 0 4 64] Min:[0 0 0 0 0 0 240 191] NullCount:0xc000028938 DistinctCount: MaxValue:[0 0 0 0 0 0 4 64] MinValue:[0 0 0 0 0 0 240 191]})
MetaData.EncodingStats [PageEncodingStats({PageType:DICTIONARY_PAGE Encoding:PLAIN_DICTIONARY Count:1}) PageEncodingStats({PageType:DATA_PAGE Encoding:PLAIN_DICTIONARY Count:1})]
MetaData.BloomFilterOffset
OffsetIndexOffset
OffsetIndexLength
ColumnIndexOffset
ColumnIndexLength
CryptoMeatadata
EncryptedColumnMetadata []
#########
FilePath
FileOffset 281
MetaData.Type BYTE_ARRAY
MetaData.Encodings [PLAIN_DICTIONARY PLAIN RLE]
MetaData.PathInSchema [two]
MetaData.Codec SNAPPY
MetaData.NumValues 3
MetaData.TotalUncompressedSize 76
MetaData.TotalCompressedSize 80
MetaData.KeyValueMetadata []
MetaData.DataPageOffset 238
MetaData.IndexPageOffset
MetaData.DictionaryPageOffset 0xc000028948
MetaData.Statistics Statistics({Max:[] Min:[] NullCount:0xc000028950 DistinctCount: MaxValue:[102 111 111] MinValue:[98 97 114]})
MetaData.EncodingStats [PageEncodingStats({PageType:DICTIONARY_PAGE Encoding:PLAIN_DICTIONARY Count:1}) PageEncodingStats({PageType:DATA_PAGE Encoding:PLAIN_DICTIONARY Count:1})]
MetaData.BloomFilterOffset
OffsetIndexOffset
OffsetIndexLength
ColumnIndexOffset
ColumnIndexLength
CryptoMeatadata
EncryptedColumnMetadata []
#########
FilePath
FileOffset 388
MetaData.Type BOOLEAN
MetaData.Encodings [PLAIN RLE]
MetaData.PathInSchema [three]
MetaData.Codec SNAPPY
MetaData.NumValues 3
MetaData.TotalUncompressedSize 40
MetaData.TotalCompressedSize 42
MetaData.KeyValueMetadata []
MetaData.DataPageOffset 346
MetaData.IndexPageOffset
MetaData.DictionaryPageOffset
MetaData.Statistics Statistics({Max:[1] Min:[0] NullCount:0xc000028970 DistinctCount: MaxValue:[1] MinValue:[0]})
MetaData.EncodingStats [PageEncodingStats({PageType:DATA_PAGE Encoding:PLAIN Count:1})]
MetaData.BloomFilterOffset
OffsetIndexOffset
OffsetIndexLength
ColumnIndexOffset
ColumnIndexLength
CryptoMeatadata
EncryptedColumnMetadata []
TotalByteSize: 226
NumRows: 3
SotringColumns: []
FileOffset: 0xc000028978
TotalCompressedSize: 0xc000028980
Ordinal: 0xc000028988
KeyValueMetaData: [KeyValue({Key:pandas Value:0xc000063330}) KeyValue({Key:ARROW:schema Value:0xc000063340})]
CreatedBy: 0xc000063350
ColumnOrders: [ColumnOrder({TYPE_ORDER:TypeDefinedOrder({})}) ColumnOrder({TYPE_ORDER:TypeDefinedOrder({})}) ColumnOrder({TYPE_ORDER:TypeDefinedOrder({})})]
EncryptionAlgorithm:
FooterSigningKeyMetadata: []
```
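The `Statistics` fields in the inspect output are printed as raw byte slices. Parquet stores these min/max values in PLAIN encoding, so a DOUBLE is 8 little-endian IEEE 754 bytes and a BYTE_ARRAY string is its UTF-8 bytes. The following sketch decodes the values shown above; `decodeDouble` is a hypothetical helper written for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeDouble interprets 8 bytes as a little-endian IEEE 754 double.
func decodeDouble(b []byte) float64 {
	return math.Float64frombits(binary.LittleEndian.Uint64(b))
}

func main() {
	// Byte slices copied from the Statistics lines of the inspect output.
	fmt.Println(decodeDouble([]byte{0, 0, 0, 0, 0, 0, 4, 64}))    // Max of column "one": 2.5
	fmt.Println(decodeDouble([]byte{0, 0, 0, 0, 0, 0, 240, 191})) // Min of column "one": -1
	fmt.Println(string([]byte{102, 111, 111}))                    // MaxValue of column "two": foo
	fmt.Println(string([]byte{98, 97, 114}))                      // MinValue of column "two": bar
}
```

The decoded values match the data printed by `csv` and `show`: column `one` ranges from -1 to 2.5, and column `two` from `bar` to `foo`.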

## Benchmark result

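In the benchmark below, go-parquet-tools is roughly 100x faster than PyPI parquet-tools: the `csv` command on the 3-row test file takes 6.6 ms on average versus 702.8 ms.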

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
| :-------------------------------------------------- | -----------: | -------: | -------: | -------: |
| `parquet-tools csv test_resources/test1.parquet` | 702.8 ± 19.9 | 676.2 | 739.4 | ≈ 106.5 |
| `go-parquet-tools csv test_resources/test1.parquet` | 6.6 ± 0.4 | 6.2 | 7.3 | 1.00 |

Benchmarks were measured with [hyperfine](https://github.com/sharkdp/hyperfine).
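As a sanity check on the speedup claim, the ratio of the two mean runtimes can be computed directly from the table. This is a small sketch; `speedup` is a hypothetical helper, and the inputs are the rounded means from the table, so the result is approximate.

```go
package main

import "fmt"

// speedup returns how many times faster the candidate is than the baseline,
// given their mean runtimes in the same unit.
func speedup(baselineMs, candidateMs float64) float64 {
	return baselineMs / candidateMs
}

func main() {
	// Means taken from the benchmark table: 702.8 ms vs 6.6 ms.
	fmt.Printf("%.1fx\n", speedup(702.8, 6.6)) // prints "106.5x"
}
```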