An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with pyarrow

A curated list of projects in awesome lists tagged with pyarrow.

https://github.com/vaexio/vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

bigdata data-science dataframe hdf5 machine-learning machinelearning memory-mapped-file pyarrow python tabular-data visualization

Last synced: 12 Dec 2025

https://github.com/uber/petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

deep-learning machine-learning parquet parquet-files pyarrow pyspark pytorch sysml tensorflow

Last synced: 10 Apr 2025

https://github.com/narwhals-dev/narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

cudf dask duckdb ibis pandas polars pyarrow pyspark

Last synced: 06 Jan 2026

https://narwhals-dev.github.io/narwhals/

Lightweight and extensible compatibility layer between dataframe libraries!

cudf dask duckdb ibis pandas polars pyarrow pyspark

Last synced: 18 Jul 2025

https://github.com/wheretrue/biobear

Work with bioinformatic files using Arrow, Polars, and/or DuckDB

arrow bioinformatics biology biopython duckdb polars pyarrow python rust-bio samtools

Last synced: 20 Jan 2026

https://github.com/dacort/faker-cli

Command-line interface to quickly generate fake CSV and JSON data

aws csv deltalake faker-provider json parquet pyarrow

Last synced: 31 Oct 2025

https://github.com/vertti/daffy

Lightweight DataFrame validation decorators for Pandas, Polars, Modin, and PyArrow. No custom types required.

data-quality data-validation dataframe dataframe-schema dataframe-validation decorator modin narwhals pandas polars pyarrow pydantic python python-decorator runtime-validation validation

Last synced: 21 Feb 2026

https://github.com/randomfractals/chicago-crimes

Exploring Chicago crimes dataset with Jupyter notebooks, DuckDB, Malloy and new Panel/PyScript data and dashboard tools.

chicago crimes duckdb julia jupyter-notebooks large-csv malloy malloydata parquet polars pyarrow

Last synced: 22 Mar 2025

https://github.com/ashvardanian/stringtape

Apache Arrow-compatible space-efficient "tape" class in pure Rust to be used with StringZilla for GPU, NUMA, and disk transfers of variable length strings

allocator apache-arrow arrow pyarrow string-manipulation tape

Last synced: 09 Mar 2026

https://github.com/kraina-ai/overturemaestro

An open-source tool for reading OvertureMaps data with multiprocessing and additional Quality-of-Life features

geo geospatial open-source openstreetmap overture-maps overturemaps pyarrow python

Last synced: 27 Jul 2025

https://github.com/icaropires/pdf2dataset

Converts a whole subdirectory with a big (or small) volume of PDF documents to a dataset (pandas DataFrame) with error tracking and choice of features

data-science distributed-computing distributed-systems ocr pandas-dataframe parallel parquet pdf pdf2image pdftotext pyarrow pytesseract pytesseract-ocr python python3 ray tesseract tesseract-ocr

Last synced: 13 Apr 2025

https://github.com/milesgranger/flaco

(PoC) A very memory-efficient way to read data from PostgreSQL

arrow postgresql pyarrow python rust

Last synced: 20 Mar 2025

https://github.com/zen-xu/pyarrow-stubs

Type annotations for pyarrow

pyarrow typing

Last synced: 05 Apr 2025

https://github.com/danielavdar/pandas-pyarrow

Seamlessly switch Pandas DataFrame backend to PyArrow.

arrow backend db-dtypes dtypes pandas pandas-arrow pandas-dataframe pandas-pyarrow pyarrow python

Last synced: 14 Dec 2025

https://github.com/DanielAvdar/pandas-pyarrow

Seamlessly switch Pandas DataFrame backend to PyArrow.

arrow backend db-dtypes dtypes pandas pandas-arrow pandas-dataframe pandas-pyarrow pyarrow python

Last synced: 17 Apr 2025

https://github.com/trustedshops-public/schema2pyarrow

Converts AsyncApi and JsonSchema to PyArrow schema

asyncapi data-engineering datacontracts jsonschema pyarrow schema tslibraries

Last synced: 13 May 2025

https://github.com/legout/pydala2

poor man's data lake - Simple API to efficiently query your Parquet datasets using DuckDB or Polars

duckdb fsspec local localcache object-storage pandas polars pyarrow python

Last synced: 11 Oct 2025

https://github.com/lykmapipo/nyc-tlc-trip-data

Python scripts to download, process, and analyze the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset

apache-arrow apache-spark data data-engineering data-extraction data-transformation etl fsspec geopandas joblib jupyterlab lykmapipo metadata nyc nyc-taxi-dataset pandas pyarrow python s3

Last synced: 17 Sep 2025

https://github.com/jaysnm/dremio-arrow

Dremio Arrow Flight Client

dataframe dremio dremio-arrow pandas pyarrow python r

Last synced: 18 Feb 2026

https://github.com/xbrianh/xdlake

A loose implementation of the deltalake protocol, written in Python on top of pyarrow, focused on extensibility, customizability, and distributed data.

databricks delta-lake deltalake deltatables hive parquet pyarrow python spark

Last synced: 27 Jun 2025

https://github.com/kiwi0fruit/featherhelper

Concise interface to cache numpy arrays and pandas dataframes

cache numpy pandas pyarrow python

Last synced: 05 Mar 2025

https://github.com/psmyth94/biosets

A bioinformatics extension of 🤗 Datasets library, built for ML applications on biological and omics data, offering easy integration of metadata and low-code data management tools.

big-data bioinfo classification data-preprocessing data-processing data-science datasets genomics high-performance huggingface machine-learning metadata omics open-source pandas polars proteomics pyarrow python regression

Last synced: 20 Jan 2026

https://github.com/d-chris/federleicht

Lightweight function decorators to cache your `pandas.DataFrame` as feather.

cache pandas pyarrow pypi-package xxhash

Last synced: 23 Mar 2025

https://github.com/sparrow-org/sparrow-rockfinch

The Sparrow PyCapsuleInterface

apache-arrow pyarrow pycaspule python sparrow

Last synced: 14 Feb 2026

https://github.com/miraisolutions/apache-arrow-flight-python-example

Code examples / snippets for website news post

arrow-flight pyarrow python

Last synced: 04 Sep 2025

https://github.com/anto18671/arrow-datasets

A high-performance Rust utility that converts large image datasets into chunked Apache Arrow files for efficient storage and processing.

arrow datasets huggingface image-dataset preprocessing pyarrow

Last synced: 05 Oct 2025

https://github.com/edisedis777/coffee-shops-analysis

This project analyzes the Foursquare Open Source Places dataset to explore the distribution of coffee shops across the United States, with a special focus on Portland, Oregon.

altair coffee coffee-shop daft folium plotly polars pyarrow python

Last synced: 21 Jun 2025

https://github.com/namansnghl/sqlify

Text (biz req) to SQL Semantic Parser with LLMs Transfer Learning. This will help Analysts query DB without knowing SQL.

bart databases nmt-model pyarrow t5-small

Last synced: 20 Jan 2026

https://github.com/1blt-archive/dspg22_pyarrow-example

Saving large files on GitHub

dataset partitioning pyarrow

Last synced: 24 Aug 2025

https://github.com/pwojcieszak/taxidataanalyzer

Analysis of NYC taxi trips data using Ansible, Terraform, GCP, Spark, Hadoop and Kafka

ansible gcp hdfs kafka numpy pandas parquet pyarrow spark terraform

Last synced: 30 Dec 2025

https://github.com/hansalemaos/procmondf

Provides a convenient and efficient solution for capturing and analyzing system activity logs using Procmon and converting them to the pandas-compatible Parquet file format (2% of the original pml file size)

dataframe logging microsoft pandas parquet procmon pyarrow windows

Last synced: 30 Oct 2025

https://github.com/polsm91/acero-delta-lake-streaming

Proof Of Concept to pull news from RSS feeds, and store them in a Data Lake using Delta Lake's "delta-rs" as a writer, and "PyArrow Acero" as the streaming and compute engine.

delta-lake pyarrow rss-feed streaming

Last synced: 01 Apr 2025

https://github.com/derak-isaack/ubereatsanalytics

Analyze Uber Eats Menu big data for various analytics

apache duckdb kaggle olap-database parquet pyarrow python3 seaborn sql uber-eats

Last synced: 14 Mar 2025

https://github.com/amoeba/pyarrow-ipc-example

An example showing how to send compressed RecordBatches over HTTP with PyArrow.

apache-arrow pyarrow

Last synced: 21 Feb 2025

https://github.com/runsascoded/parquet-diff-test

Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows}.

arrow parquet pyarrow

Last synced: 25 Dec 2025

https://github.com/leehuwuj/lake-inspector

Inspect your lakehouse data by using PyArrow

arrow datalake lakehouse pyarrow

Last synced: 03 Apr 2025

https://github.com/treynas/runetick-v1-etl-portfolio

An ETL pipeline for the OSRS Trading App that extracts, transforms, and loads trading data and news from RuneScape Wiki APIs and RSS feeds into structured Parquet files stored in Google Cloud Storage. Deployed on Cloud Run and orchestrated via Cloud Scheduler.

cloud-run cloud-scheduler etl flask google-cloud osrs pandas parquet pyarrow trading-data

Last synced: 07 Mar 2025

https://github.com/d-chris/federleicht-benchmark

small script to benchmark `federleicht`

benchmark federleicht pandas pyarrow

Last synced: 23 Mar 2025

https://github.com/murtaza-arif/all-you-need-to-know-for-data-engineer

This repository is designed to showcase various aspects of data engineering, including tools, frameworks, and end-to-end projects. It covers everything from data ingestion and transformation to data warehousing and cloud-based solutions.

cassandra data data-engineering data-science kafka kafka-consumer kafka-streams pyarrow spark

Last synced: 06 Apr 2025