Projects in Awesome Lists tagged with pyarrow
A curated list of projects in awesome lists tagged with pyarrow.
https://github.com/vaexio/vaex
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second
bigdata data-science dataframe hdf5 machine-learning machinelearning memory-mapped-file pyarrow python tabular-data visualization
Last synced: 12 Dec 2025
https://github.com/ibis-project/ibis
the portable Python dataframe library
bigquery clickhouse database datafusion duckdb impala mssql mysql pandas polars postgresql pyarrow pyspark python snowflake sql sqlite trino
Last synced: 13 May 2025
https://github.com/uber/petastorm
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
deep-learning machine-learning parquet parquet-files pyarrow pyspark pytorch sysml tensorflow
Last synced: 10 Apr 2025
https://github.com/gizmodata/gizmosql
GizmoSQL: High-Performance SQL Server
adbc apache-arrow apache-arrow-flight apache-arrow-flight-sql database databases duckdb gizmodata gizmosql ibis jdbc jwt-authentication pyarrow sql sqlalchemy sqlite sqlite3 tls
Last synced: 27 Feb 2026
https://github.com/dacort/faker-cli
Command-line interface to quickly generate fake CSV and JSON data
aws csv deltalake faker-provider json parquet pyarrow
Last synced: 31 Oct 2025
https://github.com/vertti/daffy
Lightweight DataFrame validation decorators for Pandas, Polars, Modin, and PyArrow. No custom types required.
data-quality data-validation dataframe dataframe-schema dataframe-validation decorator modin narwhals pandas polars pyarrow pydantic python python-decorator runtime-validation validation
Last synced: 21 Feb 2026
https://github.com/randomfractals/chicago-crimes
Exploring Chicago crimes dataset with Jupyter notebooks, DuckDB, Malloy and new Panel/PyScript data and dashboard tools.
chicago crimes duckdb julia jupyter-notebooks large-csv malloy malloydata parquet polars pyarrow
Last synced: 22 Mar 2025
https://github.com/ashvardanian/stringtape
Apache Arrow-compatible space-efficient "tape" class in pure Rust to be used with StringZilla for GPU, NUMA, and disk transfers of variable length strings
allocator apache-arrow arrow pyarrow string-manipulation tape
Last synced: 09 Mar 2026
https://github.com/kraina-ai/overturemaestro
An open-source tool for reading OvertureMaps data with multiprocessing and additional Quality-of-Life features
geo geospatial open-source openstreetmap overture-maps overturemaps pyarrow python
Last synced: 27 Jul 2025
https://github.com/icaropires/pdf2dataset
Converts a whole subdirectory with a big (or small) volume of PDF documents to a dataset (pandas DataFrame) with error tracking and choice of features
data-science distributed-computing distributed-systems ocr pandas-dataframe parallel parquet pdf pdf2image pdftotext pyarrow pytesseract pytesseract-ocr python python3 ray tesseract tesseract-ocr
Last synced: 13 Apr 2025
https://github.com/milesgranger/flaco
(PoC) A very memory-efficient way to read data from PostgreSQL
arrow postgresql pyarrow python rust
Last synced: 20 Mar 2025
https://github.com/danielavdar/pandas-pyarrow
Seamlessly switch Pandas DataFrame backend to PyArrow.
arrow backend db-dtypes dtypes pandas pandas-arrow pandas-dataframe pandas-pyarrow pyarrow python
Last synced: 14 Dec 2025
https://github.com/DanielAvdar/pandas-pyarrow
Seamlessly switch Pandas DataFrame backend to PyArrow.
arrow backend db-dtypes dtypes pandas pandas-arrow pandas-dataframe pandas-pyarrow pyarrow python
Last synced: 17 Apr 2025
https://github.com/trustedshops-public/schema2pyarrow
Converts AsyncApi and JsonSchema to PyArrow schema
asyncapi data-engineering datacontracts jsonschema pyarrow schema tslibraries
Last synced: 13 May 2025
https://github.com/legout/pydala2
Poor man's data lake - simple API to efficiently query your Parquet datasets using DuckDB or Polars
duckdb fsspec local localcache object-storage pandas polars pyarrow python
Last synced: 11 Oct 2025
https://github.com/lykmapipo/python-spark-log-analysis
Python scripts to process and analyze log files using PySpark.
apache-arrow apache-spark apache-spark-sql data-analysis data-extraction data-processing data-transformation log-analysis log-analyzer log-monitor lykmapipo pandas pyarrow pyspark python seaborn spark-ml spark-nlp sparkml-pipelines sql
Last synced: 22 Jun 2025
https://github.com/lykmapipo/nyc-tlc-trip-data
Python scripts to download, process, and analyze the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset
apache-arrow apache-spark data data-engineering data-extraction data-transformation etl fsspec geopandas joblib jupyterlab lykmapipo metadata nyc nyc-taxi-dataset pandas pyarrow python s3
Last synced: 17 Sep 2025
https://github.com/jaysnm/dremio-arrow
Dremio Arrow Flight Client
dataframe dremio dremio-arrow pandas pyarrow python r
Last synced: 18 Feb 2026
https://github.com/xbrianh/xdlake
A loose implementation of the deltalake protocol, written in Python on top of pyarrow, focused on extensibility, customizability, and distributed data.
databricks delta-lake deltalake deltatables hive parquet pyarrow python spark
Last synced: 27 Jun 2025
https://github.com/dr-saad-la/pyarrow-tuts
Pyarrow Tutorials
programming pyarrow python3 tutorials
Last synced: 09 Apr 2025
https://github.com/psmyth94/biosets
A bioinformatics extension of the 🤗 Datasets library, built for ML applications on biological and omics data, offering easy integration of metadata and low-code data management tools.
big-data bioinfo classification data-preprocessing data-processing data-science datasets genomics high-performance huggingface machine-learning metadata omics open-source pandas polars proteomics pyarrow python regression
Last synced: 20 Jan 2026
https://github.com/d-chris/federleicht
Lightweight function decorators to cache your `pandas.DataFrame` as Feather.
cache pandas pyarrow pypi-package xxhash
Last synced: 23 Mar 2025
https://github.com/sparrow-org/sparrow-rockfinch
The Sparrow PyCapsuleInterface
apache-arrow pyarrow pycaspule python sparrow
Last synced: 14 Feb 2026
https://github.com/milenkovicm/ballista_python
PyArrow UDF support for Ballista clusters
arrow ballista datafusion distributed pyarrow pyo3 python rust rust-lang udf
Last synced: 31 Mar 2025
https://github.com/miraisolutions/apache-arrow-flight-python-example
Code examples / snippets for website news post
Last synced: 04 Sep 2025
https://github.com/anto18671/arrow-datasets
A high-performance Rust utility that converts large image datasets into chunked Apache Arrow files for efficient storage and processing.
arrow datasets huggingface image-dataset preprocessing pyarrow
Last synced: 05 Oct 2025
https://github.com/edisedis777/coffee-shops-analysis
This project analyzes the Foursquare Open Source Places dataset to explore the distribution of coffee shops across the United States, with a special focus on Portland, Oregon.
altair coffee coffee-shop daft folium plotly polars pyarrow python
Last synced: 21 Jun 2025
https://github.com/1blt-archive/dspg22_pyarrow-example
Saving large files on GitHub
Last synced: 24 Aug 2025
https://github.com/hansalemaos/procmondf
Provides a convenient and efficient way to capture and analyze system activity logs with Procmon and convert them to the pandas-compatible Parquet file format (2% of the original PML file size)
dataframe logging microsoft pandas parquet procmon pyarrow windows
Last synced: 30 Oct 2025
https://github.com/agutiernc/data-eng-zoomcamp
Data Engineering Zoomcamp 2024
apache-kafka apache-spark ci-cd data-ingestion data-warehouse dbt dlt docker etl etl-pipeline google-cloud-platform jupyter-notebook mage-ai pandas pipelines postgresql pyarrow python sql terraform
Last synced: 30 Dec 2025
https://github.com/polsm91/acero-delta-lake-streaming
Proof Of Concept to pull news from RSS feeds, and store them in a Data Lake using Delta Lake's "delta-rs" as a writer, and "PyArrow Acero" as the streaming and compute engine.
delta-lake pyarrow rss-feed streaming
Last synced: 01 Apr 2025
https://github.com/amoeba/pyarrow-ipc-example
An example showing how to send compressed RecordBatches over HTTP with PyArrow.
Last synced: 21 Feb 2025
https://github.com/runsascoded/parquet-diff-test
Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows}.
Last synced: 25 Dec 2025
https://github.com/leehuwuj/lake-inspector
Inspect your lakehouse data by using PyArrow
arrow datalake lakehouse pyarrow
Last synced: 03 Apr 2025
https://github.com/treynas/runetick-v1-etl-portfolio
An ETL pipeline for the OSRS Trading App that extracts, transforms, and loads trading data and news from RuneScape Wiki APIs and RSS feeds into structured Parquet files stored in Google Cloud Storage. Deployed on Cloud Run and orchestrated via Cloud Scheduler.
cloud-run cloud-scheduler etl flask google-cloud osrs pandas parquet pyarrow trading-data
Last synced: 07 Mar 2025
https://github.com/itsbigspark/pymetagen
Metadata Generator
cli csv metadata metadata-extraction parquet parquet-tools polars pyarrow python sql-query
Last synced: 20 Jan 2026
https://github.com/d-chris/federleicht-benchmark
small script to benchmark `federleicht`
benchmark federleicht pandas pyarrow
Last synced: 23 Mar 2025
https://github.com/murtaza-arif/all-you-need-to-know-for-data-engineer
This repository is designed to showcase various aspects of data engineering, including tools, frameworks, and end-to-end projects. It covers everything from data ingestion and transformation to data warehousing and cloud-based solutions.
cassandra data data-engineering data-science kafka kafka-consumer kafka-streams pyarrow spark
Last synced: 06 Apr 2025
https://github.com/iljavaleev/arrow_examples
Apache Arrow C++ examples
apache-arrow cpp20 pandas polars pyarrow python3
Last synced: 22 Mar 2025