https://github.com/runsascoded/parquet-diff-test
Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows}.
- Host: GitHub
- URL: https://github.com/runsascoded/parquet-diff-test
- Owner: runsascoded
- Created: 2023-12-30T20:34:17.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-12-31T17:10:16.000Z (over 2 years ago)
- Last Synced: 2024-04-18T11:29:24.699Z (about 2 years ago)
- Topics: arrow, parquet, pyarrow
- Language: Python
- Homepage:
- Size: 42 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
# parquet-diff-test
Demonstrate differences in Parquet files generated by [pyarrow] on macOS vs. {Ubuntu, Windows} (see [arrow#39399](https://github.com/apache/arrow/issues/39399)).
## CLI
For each {engine, compression codec}:
- **Engine:** [pyarrow], [fastparquet]
- **Compression:** snappy, gzip, brotli, lz4, zstd
[`parquet-diff-test`] writes a simple Parquet file:
```python
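import os
import pandas as pd

df = pd.DataFrame([{ 'a': 111 }])
empty_df = df.iloc[:0]  # keep the schema, but drop every row
out_dir = f'out/{engine}/{compression}'  # `engine` and `compression` come from the enclosing loop
os.makedirs(out_dir, exist_ok=True)
parquet_path = f'{out_dir}/empty.parquet'
empty_df.to_parquet(parquet_path, engine=engine, compression=compression)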
```
In the same directory, it also writes:
- `metadata.json`, which includes:
  - the `pyarrow.parquet.ParquetFile(...).metadata` contents, as a dictionary
  - file size
  - file sha256 hash
- `xxd.txt`: an `xxd`-style hex dump (offset, hex bytes, ASCII column) of every byte in `empty.parquet`
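As a rough illustration of those two side files, here is a minimal standard-library sketch. The helper names `summarize` and `xxd_lines` and the dictionary keys are hypothetical; the actual layout of `metadata.json` and the way the tool produces `xxd.txt` are not shown here, so treat this as an approximation only.

```python
import hashlib

def summarize(path):
    """Return the file's size in bytes and its SHA-256 hex digest.

    The key names are illustrative, not the tool's real schema.
    """
    with open(path, "rb") as f:
        data = f.read()
    return {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}

def xxd_lines(data):
    """Yield lines in the style of `xxd`: offset, hex byte pairs, ASCII column."""
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        hexed = chunk.hex()
        # Group the hex digits two bytes (four digits) at a time.
        groups = " ".join(hexed[i:i + 4] for i in range(0, len(hexed), 4))
        # Non-printable bytes are shown as "." in the ASCII column.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        yield f"{offset:08x}: {groups:<39}  {text}"
```

Because the dump is plain text with one line per 16 bytes, `git diff` on `xxd.txt` shows which byte ranges changed between branches.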
## Results
The [test.yml](.github/workflows/test.yml) workflow runs `parquet-diff-test` on Ubuntu, macOS, and Windows, and pushes the results of each to a branch.
Here are the [`macos`] and [`windows`] branches compared to [`ubuntu`]:
- [`ubuntu..macos`]
- [`ubuntu..windows`]
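The same comparison can be approximated locally. The sketch below assumes two of the branches have been checked out into separate directories (the directory names in the usage note are hypothetical) and reports, per Parquet file, whether the bytes match. `compare_trees` is an illustrative helper, not part of this repo.

```python
import filecmp
from pathlib import Path

def compare_trees(left, right, pattern="**/*.parquet"):
    """Map each file under `left` matching `pattern` to True if `right`
    contains a byte-identical file at the same relative path."""
    left, right = Path(left), Path(right)
    result = {}
    for path in sorted(left.glob(pattern)):
        rel = path.relative_to(left)
        other = right / rel
        # shallow=False forces a content comparison instead of a stat() check.
        same = other.is_file() and filecmp.cmp(path, other, shallow=False)
        result[rel.as_posix()] = same
    return result
```

For example, `compare_trees("checkouts/ubuntu", "checkouts/macos")` would return a dictionary keyed by paths such as `out/pyarrow/snappy/empty.parquet`.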
### Summary
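- ✅ In all cases, Parquet files generated by [`fastparquet`] are identical across OSes.
- 🤔 Those generated by [`pyarrow`] differ in many cases: macOS output differs from Ubuntu's for every codec (❌), and Ubuntu and Windows differ from each other only for `gzip` (⚠️).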
#### pyarrow
| | Ubuntu | Windows | macOS |
|-------:|-------:|--------:|------:|
| brotli | ✅ | ✅ | ❌ |
| gzip | ⚠️ | ⚠️ | ❌ |
| lz4 | ✅ | ✅ | ❌ |
| snappy | ✅ | ✅ | ❌ |
| zstd | ✅ | ✅ | ❌ |
#### fastparquet
| | Ubuntu | Windows | macOS |
|-------:|-------:|--------:|------:|
| brotli | ✅ | ✅ | ✅ |
| gzip | ✅ | ✅ | ✅ |
| lz4 | ✅ | ✅ | ✅ |
| snappy | ✅ | ✅ | ✅ |
| zstd | ✅ | ✅ | ✅ |
### Full diffs
#### [`ubuntu..macos`]
- All [`fastparquet`] parquets are identical.
- All [`pyarrow`] parquets differ.
For example, [here's the diff][ubuntu..macos xxd] for {`pyarrow`, `snappy`}:
`git diff ubuntu..macos -- out/pyarrow/snappy/xxd.txt`
```diff
00000280: 7741 4141 4145 4141 6741 4367 4141 414e wAAAAEAAgACgAAAN
00000290: 7742 4141 4145 4141 4141 4151 4141 4141 wBAAAEAAAAAQAAAA
000002a0: 7741 4141 4149 4141 7741 4241 4149 4141 wAAAAIAAwABAAIAA
-000002b0: 6741 4141 4149 4141 4141 4541 4141 4141 gAAAAIAAAAEAAAAA
-000002c0: 5941 4141 4277 5957 356b 5958 4d41 414b YAAABwYW5kYXMAAK
-000002d0: 5942 4141 4237 496d 6c75 5a47 5634 5832 YBAAB7ImluZGV4X2
-000002e0: 4e76 6248 5674 626e 4d69 4f69 4262 6579 NvbHVtbnMiOiBbey
-000002f0: 4a72 6157 356b 496a 6f67 496e 4a68 626d JraW5kIjogInJhbm
-00000300: 646c 4969 7767 496d 3568 6257 5569 4f69 dlIiwgIm5hbWUiOi
-00000310: 4275 6457 7873 4c43 4169 6333 5268 636e BudWxsLCAic3Rhcn
-00000320: 5169 4f69 4177 4c43 4169 6333 5276 6343 QiOiAwLCAic3RvcC
-00000330: 4936 4944 4173 4943 4a7a 6447 5677 496a I6IDAsICJzdGVwIj
-00000340: 6f67 4d58 3164 4c43 4169 5932 3973 6457 ogMX1dLCAiY29sdW
-00000350: 3175 5832 6c75 5a47 5634 5a58 4d69 4f69 1uX2luZGV4ZXMiOi
-00000360: 4262 6579 4a75 5957 316c 496a 6f67 626e BbeyJuYW1lIjogbn
-00000370: 5673 6243 7767 496d 5a70 5a57 786b 5832 VsbCwgImZpZWxkX2
-00000380: 3568 6257 5569 4f69 4275 6457 7873 4c43 5hbWUiOiBudWxsLC
-00000390: 4169 6347 4675 5a47 467a 5833 5235 6347 AicGFuZGFzX3R5cG
-000003a0: 5569 4f69 4169 6457 3570 5932 396b 5a53 UiOiAidW5pY29kZS
-000003b0: 4973 4943 4a75 6457 3177 6556 3930 6558 IsICJudW1weV90eX
-000003c0: 426c 496a 6f67 496d 3969 616d 566a 6443 BlIjogIm9iamVjdC
-000003d0: 4973 4943 4a74 5a58 5268 5a47 4630 5953 IsICJtZXRhZGF0YS
-000003e0: 4936 4948 7369 5a57 356a 6232 5270 626d I6IHsiZW5jb2Rpbm
-000003f0: 6369 4f69 4169 5656 5247 4c54 6769 6658 ciOiAiVVRGLTgifX
-00000400: 3164 4c43 4169 5932 3973 6457 3175 6379 1dLCAiY29sdW1ucy
-00000410: 4936 4946 7437 496d 3568 6257 5569 4f69 I6IFt7Im5hbWUiOi
-00000420: 4169 5953 4973 4943 4a6d 6157 5673 5a46 AiYSIsICJmaWVsZF
-00000430: 3975 5957 316c 496a 6f67 496d 4569 4c43 9uYW1lIjogImEiLC
-00000440: 4169 6347 4675 5a47 467a 5833 5235 6347 AicGFuZGFzX3R5cG
-00000450: 5569 4f69 4169 6157 3530 4e6a 5169 4c43 UiOiAiaW50NjQiLC
-00000460: 4169 626e 5674 6348 6c66 6448 6c77 5a53 AibnVtcHlfdHlwZS
-00000470: 4936 4943 4a70 626e 5132 4e43 4973 4943 I6ICJpbnQ2NCIsIC
-00000480: 4a74 5a58 5268 5a47 4630 5953 4936 4947 JtZXRhZGF0YSI6IG
-00000490: 3531 6247 7839 5853 7767 496d 4e79 5a57 51bGx9XSwgImNyZW
-000004a0: 4630 6233 4969 4f69 4237 496d 7870 596e F0b3IiOiB7ImxpYn
-000004b0: 4a68 636e 6b69 4f69 4169 6348 6c68 636e JhcnkiOiAicHlhcn
-000004c0: 4a76 6479 4973 4943 4a32 5a58 4a7a 6157 JvdyIsICJ2ZXJzaW
-000004d0: 3975 496a 6f67 496a 4530 4c6a 4175 4d69 9uIjogIjE0LjAuMi
-000004e0: 4a39 4c43 4169 6347 4675 5a47 467a 5833 J9LCAicGFuZGFzX3
-000004f0: 5a6c 636e 4e70 6232 3469 4f69 4169 4d69 ZlcnNpb24iOiAiMi
-00000500: 3478 4c6a 5169 6651 4141 4151 4141 4142 4xLjQifQAAAQAAAB
+000002b0: 6741 4141 4330 4151 4141 4241 4141 414b gAAAC0AQAABAAAAK
+000002c0: 5942 4141 4237 496d 6c75 5a47 5634 5832 YBAAB7ImluZGV4X2
+000002d0: 4e76 6248 5674 626e 4d69 4f69 4262 6579 NvbHVtbnMiOiBbey
+000002e0: 4a72 6157 356b 496a 6f67 496e 4a68 626d JraW5kIjogInJhbm
+000002f0: 646c 4969 7767 496d 3568 6257 5569 4f69 dlIiwgIm5hbWUiOi
+00000300: 4275 6457 7873 4c43 4169 6333 5268 636e BudWxsLCAic3Rhcn
+00000310: 5169 4f69 4177 4c43 4169 6333 5276 6343 QiOiAwLCAic3RvcC
+00000320: 4936 4944 4173 4943 4a7a 6447 5677 496a I6IDAsICJzdGVwIj
+00000330: 6f67 4d58 3164 4c43 4169 5932 3973 6457 ogMX1dLCAiY29sdW
+00000340: 3175 5832 6c75 5a47 5634 5a58 4d69 4f69 1uX2luZGV4ZXMiOi
+00000350: 4262 6579 4a75 5957 316c 496a 6f67 626e BbeyJuYW1lIjogbn
+00000360: 5673 6243 7767 496d 5a70 5a57 786b 5832 VsbCwgImZpZWxkX2
+00000370: 3568 6257 5569 4f69 4275 6457 7873 4c43 5hbWUiOiBudWxsLC
+00000380: 4169 6347 4675 5a47 467a 5833 5235 6347 AicGFuZGFzX3R5cG
+00000390: 5569 4f69 4169 6457 3570 5932 396b 5a53 UiOiAidW5pY29kZS
+000003a0: 4973 4943 4a75 6457 3177 6556 3930 6558 IsICJudW1weV90eX
+000003b0: 426c 496a 6f67 496d 3969 616d 566a 6443 BlIjogIm9iamVjdC
+000003c0: 4973 4943 4a74 5a58 5268 5a47 4630 5953 IsICJtZXRhZGF0YS
+000003d0: 4936 4948 7369 5a57 356a 6232 5270 626d I6IHsiZW5jb2Rpbm
+000003e0: 6369 4f69 4169 5656 5247 4c54 6769 6658 ciOiAiVVRGLTgifX
+000003f0: 3164 4c43 4169 5932 3973 6457 3175 6379 1dLCAiY29sdW1ucy
+00000400: 4936 4946 7437 496d 3568 6257 5569 4f69 I6IFt7Im5hbWUiOi
+00000410: 4169 5953 4973 4943 4a6d 6157 5673 5a46 AiYSIsICJmaWVsZF
+00000420: 3975 5957 316c 496a 6f67 496d 4569 4c43 9uYW1lIjogImEiLC
+00000430: 4169 6347 4675 5a47 467a 5833 5235 6347 AicGFuZGFzX3R5cG
+00000440: 5569 4f69 4169 6157 3530 4e6a 5169 4c43 UiOiAiaW50NjQiLC
+00000450: 4169 626e 5674 6348 6c66 6448 6c77 5a53 AibnVtcHlfdHlwZS
+00000460: 4936 4943 4a70 626e 5132 4e43 4973 4943 I6ICJpbnQ2NCIsIC
+00000470: 4a74 5a58 5268 5a47 4630 5953 4936 4947 JtZXRhZGF0YSI6IG
+00000480: 3531 6247 7839 5853 7767 496d 4e79 5a57 51bGx9XSwgImNyZW
+00000490: 4630 6233 4969 4f69 4237 496d 7870 596e F0b3IiOiB7ImxpYn
+000004a0: 4a68 636e 6b69 4f69 4169 6348 6c68 636e JhcnkiOiAicHlhcn
+000004b0: 4a76 6479 4973 4943 4a32 5a58 4a7a 6157 JvdyIsICJ2ZXJzaW
+000004c0: 3975 496a 6f67 496a 4530 4c6a 4175 4d69 9uIjogIjE0LjAuMi
+000004d0: 4a39 4c43 4169 6347 4675 5a47 467a 5833 J9LCAicGFuZGFzX3
+000004e0: 5a6c 636e 4e70 6232 3469 4f69 4169 4d69 ZlcnNpb24iOiAiMi
+000004f0: 3478 4c6a 5169 6651 4141 4267 4141 4148 4xLjQifQAABgAAAH
+00000500: 4268 626d 5268 6377 4141 4151 4141 4142 BhbmRhcwAAAQAAAB
00000510: 5141 4141 4151 4142 5141 4341 4147 4141 QAAAAQABQACAAGAA
00000520: 6341 4441 4141 4142 4141 4541 4141 4141 cADAAAABAAEAAAAA
00000530: 4141 4151 4951 4141 4141 4841 4141 4141 AAAQIQAAAAHAAAAA
```
The `pyarrow` metadata is the same for both; I can't tell what explains the difference.
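One way to probe the diff further: the ASCII column in this region looks like base64 text. pyarrow stores a base64-encoded Arrow schema in a Parquet file's key-value metadata (under `ARROW:schema`); that this region is that entry is an assumption. The sketch below decodes two pairs of rows copied from the diff. The `skip=2` alignment was chosen by trial so that the neighbouring rows decode to readable JSON, and `decode_fragment` is a hypothetical helper.

```python
import base64

# ASCII columns copied from the diff above.
UBUNTU_ROWS = "gAAAAIAAAAEAAAAA" "YAAABwYW5kYXMAAK"  # offsets 0x2b0-0x2cf on ubuntu
MACOS_ROWS = "4xLjQifQAABgAAAH" "BhbmRhcwAAAQAAAB"   # offsets 0x4f0-0x50f on macos

def decode_fragment(text, skip=2):
    """Base64-decode a fragment that starts mid-stream.

    `skip` drops leading characters so decoding starts on a 4-character
    group boundary (an assumed alignment); a trailing partial group is
    dropped as well.
    """
    body = text[skip:]
    body = body[: len(body) - len(body) % 4]
    return base64.b64decode(body)

ubuntu_bytes = decode_fragment(UBUNTU_ROWS)
macos_bytes = decode_fragment(MACOS_ROWS)
print(ubuntu_bytes)
print(macos_bytes)
```

In the Ubuntu fragment, the length-prefixed string `pandas` appears before the JSON value begins; in the macOS fragment, the same length-prefixed string appears right after the JSON's closing `}`. That suggests the same key and value are laid out in a different order rather than carrying different content, which would fit with the logical metadata being identical. It does not by itself identify the cause.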
#### [`ubuntu..windows`]
- All `fastparquet` parquets are identical.
- `pyarrow` parquets are identical for every codec except `gzip`, where a single byte in the gzip stream header differs (offset `0x1b`: `03` on Ubuntu, `0a` on Windows).
`git diff ubuntu..windows -- out/pyarrow/gzip/xxd.txt`
```diff
00000000: 5041 5231 1504 1500 1528 4c15 0015 0012 PAR1.....(L.....
-00000010: 0000 1f8b 0800 0000 0000 0003 0300 0000 ................
+00000010: 0000 1f8b 0800 0000 0000 000a 0300 0000 ................
00000020: 0000 0000 0000 264c 1c15 0419 2500 0619 ......&L....%...
00000030: 1801 6115 0416 0016 1c16 4426 0026 0829 ..a.......D&.&.)
00000040: 1c15 0415 0015 0200 0000 1504 192c 3500 .............,5.
```
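The changed byte sits nine bytes after the gzip magic number `1f 8b` (which starts at offset `0x12`), and in the gzip header layout defined by RFC 1952 that position is the OS field. The sketch below parses the fixed 10-byte header from the second row of each side of the diff; `parse_gzip_header` is a hypothetical helper written for this illustration.

```python
import struct

# Second row (offset 0x10) of each xxd dump, copied from the diff above.
UBUNTU_ROW = bytes.fromhex("0000 1f8b 0800 0000 0000 0003 0300 0000")
WINDOWS_ROW = bytes.fromhex("0000 1f8b 0800 0000 0000 000a 0300 0000")

def parse_gzip_header(buf, offset=0):
    """Parse the fixed 10-byte gzip member header (RFC 1952) at `offset`."""
    magic, method, flags, mtime, xfl, os_code = struct.unpack_from(
        "<HBBIBB", buf, offset
    )
    if magic != 0x8B1F:  # bytes 1f 8b, read as a little-endian uint16
        raise ValueError("not a gzip header")
    return {"method": method, "flags": flags, "mtime": mtime,
            "xfl": xfl, "os": os_code}

# The gzip stream starts 2 bytes into the row, i.e. at file offset 0x12.
print(parse_gzip_header(UBUNTU_ROW, offset=2))
print(parse_gzip_header(WINDOWS_ROW, offset=2))
```

Every other header field matches between the two files; only `os` differs. RFC 1952 defines OS value 3 as Unix. The value 10 on Windows is presumably whatever the bundled compression library writes on that platform, which is an assumption not verified here. Either way, the field records where the stream was produced and does not affect the compressed payload.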
## Discussion
The discrepancy between macOS and Ubuntu has made some tests inconvenient; it would be nice to understand why it occurs.
### Docker
Interestingly, I see the same macOS diffs when running [`run.sh`] in an `ubuntu` Docker image on a macOS host machine.
[`parquet-diff-test`]: parquet_diff_test/cli.py
[`fastparquet`]: https://pypi.org/project/fastparquet/
[fastparquet]: https://pypi.org/project/fastparquet/
[pyarrow]: https://pypi.org/project/pyarrow/
[`pyarrow`]: https://pypi.org/project/pyarrow/
[`macos`]: https://github.com/runsascoded/parquet-diff-test/tree/macos
[`windows`]: https://github.com/runsascoded/parquet-diff-test/tree/windows
[`ubuntu`]: https://github.com/runsascoded/parquet-diff-test/tree/ubuntu
[`ubuntu..macos`]: https://github.com/runsascoded/parquet-diff-test/compare/ubuntu..macos
[`ubuntu..windows`]: https://github.com/runsascoded/parquet-diff-test/compare/ubuntu..windows
[ubuntu..macos xxd]: https://github.com/runsascoded/parquet-diff-test/compare/ubuntu..macos#diff-1aff51203a0bbf705859a61d542f15bfa553b121b30fea500f03024a8ae44258
[`run.sh`]: run.sh