Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with parquet
A curated list of projects in awesome lists tagged with parquet.
https://github.com/cldellow/parquet-metadata
Dump metadata about a Parquet file.
apache-arrow apache-parquet parquet
Last synced: 02 Nov 2024
https://github.com/dirkster99/pynotes
My notebook on using Python with Jupyter Notebook, PySpark etc
dataframe jupyter-notebook panda pandas-dataframe parquet pyspark python spark spark-sql sparknlp
Last synced: 17 Oct 2024
https://github.com/guru107/hadoop-small-files-merger
A Spark application to merge small files on Hadoop
apache-hadoop apache-spark avro parquet scala text
Last synced: 10 Nov 2024
https://github.com/hyparam/hysnappy
Snappy decompression with WebAssembly
compression parquet snappy wasm webassembly
Last synced: 19 Nov 2024
https://github.com/ryan-williams/next-duckdb-parquet-demo
Example Next.js app using duckdb-wasm to read/fetch Parquet files, in Node and the browser
Last synced: 17 Oct 2024
https://github.com/gordonmurray/apache_flink_and_iceberg
Using Apache Flink to write to s3 in Apache Iceberg format
apache-flink apache-iceberg parquet s3
Last synced: 04 Dec 2024
https://github.com/stoewer/parquet-cli
Command-line tool to analyze Parquet files
Last synced: 29 Oct 2024
https://github.com/exasol/cloud-storage-extension
Exasol Cloud Storage Extension for accessing formatted data (Avro, ORC and Parquet) on public cloud storage systems
avro azure-blob-storage azure-storage cloud-storage exasol exasol-integration gcs orc parquet s3
Last synced: 14 Nov 2024
https://github.com/Errahum/SQLite-data-creator
This application enables users to create and open SQLite databases, create tables, load data from json, csv and Parquet files, display table contents, and drop tables as needed.
csv json parquet sqlite sqlite-database
Last synced: 03 Sep 2024
https://github.com/bioinfo-chru-strasbourg/howard
Highly Open Workflow for Annotation & Ranking toward genomic variant Discovery
annotation annovar duckdb genetic parquet prioritization snpeff variations vcf
Last synced: 17 Nov 2024
https://github.com/ibmstreams/streamsx.parquet
(Incubation) Toolkit providing adapters to Parquet
hadoop ibm-streams parquet stream-processing toolkit
Last synced: 23 Nov 2024
https://github.com/andreax79/airflow-provider-xlsx
Airflow operators for converting XLSX files from/to Parquet/CSV/JSON
airflow apache-airflow excel parquet
Last synced: 11 Nov 2024
https://github.com/nryanov/serializationbenchmark
avro benchmark deserialization java json msgpack orc parquet protobuf scala serialization thrift
Last synced: 03 Nov 2024
https://github.com/hengfeiyang/how-query-engines-work-zh-cn
How Query Engines Work (Chinese edition)
arrow ballista datafusion parquet
Last synced: 12 Nov 2024
https://github.com/hrbrmstr/zeekr
Tools to Make Analyses Using Zeek Easier
cybersecurity parquet pcap r rstats zeek
Last synced: 15 Nov 2024
https://github.com/strader07/py-orderbook
Custom market data archiver into level 2 order books
api-client cryptocurrency cryptocurrency-exchanges level2 market-data orderbook parquet poetry python3
Last synced: 13 Nov 2024
https://github.com/markpflug/data-convert
An experiment in .NET AOT compilation.
aot aot-compilation csv dotnet excel parquet
Last synced: 10 Nov 2024
https://github.com/bayoadejare/lightning-streams
Batch/stream ETL pipeline of NOAA GLM dataset, using Python frameworks: Dagster, PySpark and Parquet storage.
clustering csv data-engineering data-pipeline data-warehousing database etl-pipeline jupyter-notebook k-means-clustering machine-learning noaa-data orchestration parquet pyspark python spark-sql spark-streaming sql streaming
Last synced: 06 Nov 2024
https://github.com/anicolaspp/mapr-data-gen
Data generator for MapR Data Platform
data mapr mapr-db mapr-es mapr-streams maprdb parquet scala spark
Last synced: 16 Nov 2024
https://github.com/exasol/s3-document-files-virtual-schema
Virtual Schema for document files on AWS S3
aws-s3 exasol exasol-integration parquet s3 virtual-schema
Last synced: 14 Nov 2024
https://github.com/luminousmen/data-toolset
Upgrade from avro-tools and parquet-tools jars to a more user-friendly Python package.
avro avro-tools hacktoberfest parquet parquet-tools
Last synced: 12 Nov 2024
https://github.com/exacaster/delta-fetch
HTTP API on Delta Lake tables
big-data delta-lake parquet s3 spark
Last synced: 11 Nov 2024
https://github.com/a-poor/parq
A CLI for examining parquet files.
cli data-science golang parquet
Last synced: 07 Dec 2024
https://github.com/hyparam/hyperparam-cli
Hyperparam local dataset viewer
dataset javascript parquet table viewer
Last synced: 19 Nov 2024
https://github.com/hyparam/hyparquet-compressors
Decompressors for hyparquet
brotli decompress decompression decompressor gzip lz4 parquet zstd
Last synced: 19 Nov 2024
https://github.com/willianantunes/pyfriends
Let's research all the seasons of the Friends sitcom and try to get some insights from them 🕵
html-parser jupyter-notebook pandas parquet postgresql python
Last synced: 22 Nov 2024
https://github.com/tonivade/pq
The objective is to create a tool similar to jq but for Parquet files
Last synced: 17 Dec 2024
https://github.com/heuermh/duckdb-parquet-tools
Apache Parquet format tools for DuckDB.
cli command-line command-line-tool duckdb jdbc parquet
Last synced: 21 Dec 2024
https://github.com/exasol/parquet-edml-generator
Tool that generates EDML definitions for Parquet files
administration-tools-and-libraries exasol-integration parquet virtual-schemas
Last synced: 14 Nov 2024
https://github.com/apache/incubator-parquet-format
Mirror of Apache Parquet
Last synced: 17 Dec 2024
https://github.com/pprzetacznik/datalake
Simple datalake
avro data-engineering kafka parquet schema-registry spark spark-structured-streaming
Last synced: 08 Dec 2024
https://github.com/silvanheller/parquet-demo
Parquet demo project for the Workshop in the Course DIS. Benchmarks Parquet versus ORC, JSON and CSV
benchmark orc parquet r scala spark university-project
Last synced: 28 Nov 2024
https://github.com/louisbrulenaudet/legalkit-pipeline
Publication pipeline for French legal codes on 🤗 Datasets from LegiFrance with concurrent upload and dynamic README.md.
data datasets huggingface huggingface-datasets legal legaltech legifrance open-source parquet piste-api python
Last synced: 23 Nov 2024
https://github.com/abroniewski/idlecompute-data-management-architecture
Implementation of a big data management and analysis backbone architecture using PySpark for distributed and scalable data ingestion and MLlib for machine learning analysis. Part of Big Data Management and Analytics (BDMA) program.
bdma big-data big-data-analytics bigdata dataops hadoop-hdfs machine-learning parquet pipeline pyspark-mllib
Last synced: 12 Nov 2024
https://github.com/jvdsandt/laminate
A Java library to export JDBC ResultSet data to Parquet files
Last synced: 08 Nov 2024
https://github.com/bluegranite/azure-synapse-vcf-analysis
Sample code for analyzing VCF files (converted to Parquet) in Azure Databricks and Synapse.
azure azure-databricks azure-synapse bioinformatics computational-biology databricks genomics glow parquet spark synapse vcf
Last synced: 18 Nov 2024
https://github.com/igor-suhorukov/arrow_to_database
Import data from Arrow Dataset API into relational DB via JDBC
arrow h2-database jdbc-connector orc parquet postgresql questdb
Last synced: 23 Nov 2024
https://github.com/ankushkgupta2/databricks-poc
Proof of Concept (POC) Using Azure Databricks for Automated & Real-Time ETL, Generation of Visualizations, and Pipeline Integration for Various Pathogens
api azure backend blob clarity-lims database databricks dbfs elims elt etl graph-database json livetables metadata mpxv nextflow parquet poc yaml
Last synced: 08 Nov 2024
https://github.com/opengeos/source-coop-readme
Readme file and Jupyter notebook examples for data repositories on Source Cooperative
duckdb geospatial openaccess parquet python vector
Last synced: 11 Nov 2024
https://github.com/cajuncoding/parquetfiles.blobhelpers
A simple library and console application to illustrate how to read and load data into class models from Parquet files saved to Azure Blob Storage using Parquet .Net (parquet-dotnet). This is useful for E-L-T processes whereby you need to load the data into Memory, Sql Server (e.g. Azure SQL), etc. or any other location where there is no built-in or default mechanism for working with Parquet data.
azure-blob azure-blob-storage azure-functions parquet parquet-data parquet-dotnet parquet-files parquet-tools
Last synced: 12 Nov 2024
https://github.com/gordonmurray/apache_flink_and_hudi
Using Apache Flink to store data in S3 using Apache Hudi
apache-flink apache-hudi parquet s3
Last synced: 04 Dec 2024
https://github.com/feliciamarlove/streaming-with-scala-and-spark
Related to Handling Fast Data with Apache Spark SQL and Streaming course on Pluralsight https://app.pluralsight.com/library/courses/apache-spark-sql-fast-data-handling-streaming/exercise-files
data-engineering hive parquet scala spark streaming
Last synced: 18 Dec 2024
https://github.com/emensonlimaa/sample-parquet
Basic example using Parquet with .NET 8
Last synced: 13 Dec 2024
https://github.com/sdspot2034/exploring-parquet
Project to compare write efficiency and memory efficiency of CSV and Parquet files
chunking csv-export data-engineering data-modeling database decorators etl mysql parquet pyspark python3 spark
Last synced: 11 Oct 2024
https://github.com/mahlukedankuranggairah/simple-backup-ci4-clickhouse
MySQL/MariaDB backup, converted to Parquet, using the PHP CodeIgniter4 framework, Python, and ClickHouse.
clickhouse codeigniter parquet python3
Last synced: 28 Nov 2024
https://github.com/meghajit/parquet-generator
A sweet library to generate parquet files as per the required schema
Last synced: 17 Dec 2024
https://github.com/statisticsnorway/parquet-buddy
Utilities for working with the parquet file format
Last synced: 17 Dec 2024
https://github.com/hrmeetsingh/parquetreader
Parquet reader code in Java
java parquet parquet-files parquet-tools parquet-viewer
Last synced: 09 Nov 2024
https://github.com/philipmay/pandas_compression
bz2 compression csv feather gzip pandas parquet zstd
Last synced: 09 Nov 2024
https://github.com/muneeb706/django-samples
Sample implementation of different functions in Django
django django-rest-framework fhir lua parquet python redis
Last synced: 04 Dec 2024
https://github.com/jaime-alv/into-parquet
CLI tool for giving CSV files a schema and transforming them to Parquet format
cli-app csv-parser parquet spark
Last synced: 20 Dec 2024
https://github.com/hackolade/parquet
Hackolade plugin for Apache Parquet schema
columnar-storage data-modeling data-models entity-relationship-diagram er-diagram hadoop nosql parquet parquet-schema schema-design
Last synced: 17 Nov 2024
https://github.com/sipemu/excel-to-parquet
A command-line tool written in Rust that converts Excel (XLSX) files to Parquet format. This tool is designed to be simple and efficient, making it easy to convert Excel data for use with big data tools.
Last synced: 15 Dec 2024
https://github.com/clearhanhui/parquetloader
A Distributed Streaming PyTorch Dataloader for Parquet.
Last synced: 15 Nov 2024
https://github.com/fnu-ankit/uberdataengganalysis
Uber Data Engineering and Analysis. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
csv data-analytics data-engineering parquet python
Last synced: 30 Nov 2024
https://github.com/yo-mah-ya/file_creator
Create files in formats such as "orc", "parquet", "xlsx", and "json" with Python
orcfile pandas parquet parquet-files python3
Last synced: 20 Nov 2024
https://github.com/luminousmen/data-tools
Upgrade from avro-tools and parquet-tools jars to a more user-friendly Python package.
avro avro-tools parquet parquet-tools
Last synced: 12 Nov 2024
https://github.com/vara-co/home_sales
Module 22 challenge: Using Google Colab to work on Big Data queries with PySpark SQL, parquet, and cache partitions
big-data big-data-analytics cache google-colab google-colaboratory parquet pyspark pyspark-sql
Last synced: 07 Dec 2024
https://github.com/tayeva/satellite-kafka-spark-delta-lake-pipeline-example
Demo App - Satellite Produce Consumer App
cpp17 delta-lake docker docker-compose flatbuffers java kafka parquet scala spark spark-streaming
Last synced: 19 Dec 2024
https://github.com/hwywl/business-tools
Business utility classes accumulated during development, for writing business logic quickly.
html parquet parquet-generator parquet-tools
Last synced: 10 Nov 2024
https://github.com/followthefourleafedclover/allen-mouse-brain-regional-marker-identifier
Find region specific markers within the Allen Institute for Brain Science: Mouse Whole Cortex and Hippocampus 10x Dataset
bioinformatics bioinformatics-pipeline pandas parquet python3 streamlit streamlit-webapp
Last synced: 08 Dec 2024
https://github.com/nikoshet/rust-dms-cdc-operator
The rust-dms-cdc-operator is a Rust-based utility for comparing the state of a list of tables in an Amazon RDS database with data stored in Parquet files on Amazon S3, particularly useful for change data capture (CDC) scenarios.
aws cdc data dms parquet pgdatadiff polars postgres rds rust s3 validation
Last synced: 09 Nov 2024
https://github.com/thehouseplant/axum-parquet
A small project to test the feasibility of utilizing parquet as a database for large datasets
Last synced: 28 Nov 2024
https://github.com/thehouseplant/hono-parquet
A small project to test the feasibility of utilizing parquet as a database for large datasets
Last synced: 09 Dec 2024
https://github.com/sravanigodavarthi/automated-elt-pipeline-aws
An Apache Airflow data pipeline designed to perform ELT operations using Amazon S3 and Amazon Redshift Serverless.
airflow aws datamodeling datapipeline dataprocessing dataqualitycheck docker elt-pipeline parquet python redshift-serverless s3-buckets sql
Last synced: 06 Nov 2024
https://github.com/tweedge/parquet2csv
Listen, sometimes you just need a dang CSV file, ok?
Last synced: 05 Nov 2024
https://github.com/johnymontana/hands-on-havasu-geoparquet
Notebook to accompany the "Hands-On With Havasu & GeoParquet" livestream
apache-iceberg apache-sedona geoparquet parquet sedonadb
Last synced: 18 Dec 2024