Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with parquet

A curated list of projects in awesome lists tagged with parquet.

https://github.com/cldellow/parquet-metadata

Dump metadata about a Parquet file.

apache-arrow apache-parquet parquet

Last synced: 02 Nov 2024

https://github.com/dirkster99/pynotes

My notebook on using Python with Jupyter Notebook, PySpark etc

dataframe jupyter-notebook panda pandas-dataframe parquet pyspark python spark spark-sql sparknlp

Last synced: 17 Oct 2024

https://github.com/guru107/hadoop-small-files-merger

A Spark application to merge small files on Hadoop

apache-hadoop apache-spark avro parquet scala text

Last synced: 10 Nov 2024

https://github.com/hyparam/hysnappy

Snappy decompression with WebAssembly

compression parquet snappy wasm webassembly

Last synced: 19 Nov 2024

https://github.com/ryan-williams/next-duckdb-parquet-demo

Example Next.js app using duckdb-wasm to read/fetch Parquet files, in Node and the browser

duckdb next nextjs parquet

Last synced: 17 Oct 2024

https://github.com/datahappy1/csv_to_parquet_converter

CSV-to-Parquet (and vice versa) file converter based on Pandas, written in Python 3

aws-s3 converter csv pandas parquet python3

Last synced: 14 Oct 2024

https://github.com/gordonmurray/apache_flink_and_iceberg

Using Apache Flink to write to S3 in Apache Iceberg format

apache-flink apache-iceberg parquet s3

Last synced: 04 Dec 2024

https://github.com/dmyersturnbull/typed-dfs

Make Pandas DataFrames enforce definitions, self-organize, and correctly serialize in 18 formats.

csv dataframes excel feather hdf5 ini json pandas parquet required toml typed

Last synced: 28 Oct 2024

https://github.com/stoewer/parquet-cli

Command-line tool to analyze Parquet files

cli golang parquet

Last synced: 29 Oct 2024

https://github.com/exasol/cloud-storage-extension

Exasol Cloud Storage Extension for accessing formatted data (Avro, ORC, and Parquet) on public cloud storage systems

avro azure-blob-storage azure-storage cloud-storage exasol exasol-integration gcs orc parquet s3

Last synced: 14 Nov 2024

https://github.com/apache/parquet-site

Apache Parquet Site

apache parquet parquet-site

Last synced: 07 Oct 2024

https://github.com/Errahum/SQLite-data-creator

This application enables users to create and open SQLite databases, create tables, load data from JSON, CSV, and Parquet files, display table contents, and drop tables as needed.

csv json parquet sqlite sqlite-database

Last synced: 03 Sep 2024

https://github.com/samthor/parq

Parquet reader in JS

javascript parquet

Last synced: 12 Oct 2024

https://github.com/bioinfo-chru-strasbourg/howard

Highly Open Workflow for Annotation & Ranking toward genomic variant Discovery

annotation annovar duckdb genetic parquet prioritization snpeff variations vcf

Last synced: 17 Nov 2024

https://github.com/ahuang11/mapnstreets

Have you ever wondered how common (uncreative) some street names are?

duckdb fugue geopandas geoviews maps panel parquet

Last synced: 13 Oct 2024

https://github.com/ibmstreams/streamsx.parquet

(Incubation) Toolkit providing adapters to Parquet

hadoop ibm-streams parquet stream-processing toolkit

Last synced: 23 Nov 2024

https://github.com/andreax79/airflow-provider-xlsx

Airflow operators for converting XLSX files from/to Parquet/CSV/JSON

airflow apache-airflow excel parquet

Last synced: 11 Nov 2024

https://github.com/lnsp/trace-explorer

Toolset to explain and visualize database workload traces and benchmark data points.

database duckdb parquet python traces

Last synced: 31 Oct 2024

https://github.com/wolfeidau/arrow-gh-processor

This project illustrates how to build a data processor using Go and Apache Arrow.

arrow github golang json parquet

Last synced: 12 Oct 2024

https://github.com/hengfeiyang/how-query-engines-work-zh-cn

How Query Engines Work (Chinese edition)

arrow ballista datafusion parquet

Last synced: 12 Nov 2024

https://github.com/hrbrmstr/zeekr

Tools to Make Analyses Using Zeek Easier

cybersecurity parquet pcap r rstats zeek

Last synced: 15 Nov 2024

https://github.com/markpflug/data-convert

An experiment in .NET AOT compilation.

aot aot-compilation csv dotnet excel parquet

Last synced: 10 Nov 2024

https://github.com/anicolaspp/mapr-data-gen

Data generator for MapR Data Platform

data mapr mapr-db mapr-es mapr-streams maprdb parquet scala spark

Last synced: 16 Nov 2024

https://github.com/exasol/s3-document-files-virtual-schema

Virtual Schema for document files on AWS S3

aws-s3 exasol exasol-integration parquet s3 virtual-schema

Last synced: 14 Nov 2024

https://github.com/rohitxsh/sql2parquet_py

Python script to migrate genomic data from MySQL DB to parquet files (with added support to upload output files to AWS S3) | GSoC '22

aws boto3 ensembl genome google-summer-of-code gsoc gsoc-2022 pandas parquet python python3

Last synced: 01 Dec 2024

https://github.com/jean-philippe-martin/lcdio

Lowest Common Denominator IO. Everything is a list of dictionaries!

csv json parquet python3 sqlite toml tsv yaml

Last synced: 13 Oct 2024

https://github.com/jmaupetit/data7

⚡ Open your data in minutes

csv database datasets http parquet sql

Last synced: 27 Oct 2024

https://github.com/luminousmen/data-toolset

Upgrade from avro-tools and parquet-tools jars to a more user-friendly Python package.

avro avro-tools hacktoberfest parquet parquet-tools

Last synced: 12 Nov 2024

https://github.com/exacaster/delta-fetch

HTTP API on Delta Lake tables

big-data delta-lake parquet s3 spark

Last synced: 11 Nov 2024

https://github.com/a-poor/parq

A CLI for examining parquet files.

cli data-science golang parquet

Last synced: 07 Dec 2024

https://github.com/hyparam/hyperparam-cli

Hyperparam local dataset viewer

dataset javascript parquet table viewer

Last synced: 19 Nov 2024

https://github.com/willianantunes/pyfriends

Let's research all the seasons of the Friends sitcom and try to get some insights from them 🕵

html-parser jupyter-notebook pandas parquet postgresql python

Last synced: 22 Nov 2024

https://github.com/josh/wikidatabots

Wikidata bots running under Josh404Bot

bot parquet python wikidata wikimedia

Last synced: 27 Nov 2024

https://github.com/tonivade/pq

The objective is to create a tool similar to jq, but for Parquet files

cli java native-image parquet

Last synced: 17 Dec 2024

https://github.com/heuermh/duckdb-parquet-tools

Apache Parquet format tools for DuckDB.

cli command-line command-line-tool duckdb jdbc parquet

Last synced: 21 Dec 2024

https://github.com/exasol/parquet-edml-generator

Tool that generates EDML definitions for Parquet files

administration-tools-and-libraries exasol-integration parquet virtual-schemas

Last synced: 14 Nov 2024

https://github.com/apache/incubator-parquet-format

Mirror of Apache Parquet

big-data java parquet

Last synced: 17 Dec 2024

https://github.com/psanford/parquet-buddy

Parquet-buddy is a CLI tool for inspecting parquet files written in Go

buddy cli go golang parquet tool

Last synced: 18 Dec 2024

https://github.com/silvanheller/parquet-demo

Parquet demo project for the Workshop in the Course DIS. Benchmarks Parquet versus ORC, JSON and CSV

benchmark orc parquet r scala spark university-project

Last synced: 28 Nov 2024

https://github.com/louisbrulenaudet/legalkit-pipeline

Publication pipeline for French legal codes on 🤗 Datasets from LegiFrance with concurrent upload and dynamic README.md.

data datasets huggingface huggingface-datasets legal legaltech legifrance open-source parquet piste-api python

Last synced: 23 Nov 2024

https://github.com/abroniewski/idlecompute-data-management-architecture

Implementation of a big data management and analysis backbone architecture using PySpark for distributed and scalable data ingestion and MLlib for machine learning analysis. Part of Big Data Management and Analytics (BDMA) program.

bdma big-data big-data-analytics bigdata dataops hadoop-hdfs machine-learning parquet pipeline pyspark-mllib

Last synced: 12 Nov 2024

https://github.com/jvdsandt/laminate

A Java library to export JDBC ResultSet data to Parquet files

java jdbc parquet

Last synced: 08 Nov 2024

https://github.com/bluegranite/azure-synapse-vcf-analysis

Sample code for analyzing VCF files (converted to Parquet) in Azure Databricks and Synapse.

azure azure-databricks azure-synapse bioinformatics computational-biology databricks genomics glow parquet spark synapse vcf

Last synced: 18 Nov 2024

https://github.com/igor-suhorukov/arrow_to_database

Import data from Arrow Dataset API into relational DB via JDBC

arrow h2-database jdbc-connector orc parquet postgresql questdb

Last synced: 23 Nov 2024

https://github.com/tim-hub/parquet-to-json

A script to convert Parquet to JSON

apache jsonl pandas parquet python

Last synced: 23 Nov 2024

https://github.com/apache/incubator-parquet-mr

Mirror of Apache Parquet

big-data java parquet

Last synced: 07 Oct 2024

https://github.com/ankushkgupta2/databricks-poc

Proof of Concept (POC) Using Azure Databricks for Automated & Real-Time ETL, Generation of Visualizations, and Pipeline Integration for Various Pathogens

api azure backend blob clarity-lims database databricks dbfs elims elt etl graph-database json livetables metadata mpxv nextflow parquet poc yaml

Last synced: 08 Nov 2024

https://github.com/opengeos/source-coop-readme

Readme file and Jupyter notebook examples for data repositories on Source Cooperative

duckdb geospatial openaccess parquet python vector

Last synced: 11 Nov 2024

https://github.com/tansen87/insightsql

A tool that can quickly view Excel, CSV, and Parquet files using SQL, based on Tauri.

csv excel parquet polars rust sql tauri

Last synced: 20 Nov 2024

https://github.com/cajuncoding/parquetfiles.blobhelpers

A simple library and console application to illustrate how to read and load data into class models from Parquet files saved to Azure Blob Storage using Parquet .Net (parquet-dotnet). This is useful for E-L-T processes where you need to load the data into memory, SQL Server (e.g. Azure SQL), or any other location that has no built-in or default mechanism for working with Parquet data.

azure-blob azure-blob-storage azure-functions parquet parquet-data parquet-dotnet parquet-files parquet-tools

Last synced: 12 Nov 2024

https://github.com/gordonmurray/apache_flink_and_hudi

Using Apache Flink to store data in S3 using Apache Hudi

apache-flink apache-hudi parquet s3

Last synced: 04 Dec 2024

https://github.com/feliciamarlove/streaming-with-scala-and-spark

Related to the "Handling Fast Data with Apache Spark SQL and Streaming" course on Pluralsight: https://app.pluralsight.com/library/courses/apache-spark-sql-fast-data-handling-streaming/exercise-files

data-engineering hive parquet scala spark streaming

Last synced: 18 Dec 2024

https://github.com/emensonlimaa/sample-parquet

Basic example using Parquet with .NET 8

dotnet parquet

Last synced: 13 Dec 2024

https://github.com/sdspot2034/exploring-parquet

Project to compare write efficiency and memory efficiency of CSV and Parquet files

chunking csv-export data-engineering data-modeling database decorators etl mysql parquet pyspark python3 spark

Last synced: 11 Oct 2024

https://github.com/kenf1/datainterop

Example of data interoperability between Python & R

pandas parquet polars python r

Last synced: 17 Dec 2024

https://github.com/mahlukedankuranggairah/simple-backup-ci4-clickhouse

MySQL/MariaDB backup, converted to Parquet, using the PHP CodeIgniter 4 framework, Python, and ClickHouse.

clickhouse codeigniter parquet python3

Last synced: 28 Nov 2024

https://github.com/meghajit/parquet-generator

A sweet library to generate parquet files as per the required schema

java parquet

Last synced: 17 Dec 2024

https://github.com/statisticsnorway/parquet-buddy

Utilities for working with the parquet file format

azure-pipeline dapla parquet

Last synced: 17 Dec 2024

https://github.com/muneeb706/django-samples

Sample implementations of different functions in Django

django django-rest-framework fhir lua parquet python redis

Last synced: 04 Dec 2024

https://github.com/jaime-alv/into-parquet

CLI tool for giving CSV files a schema and transforming them to Parquet format

cli-app csv-parser parquet spark

Last synced: 20 Dec 2024

https://github.com/derak-isaack/ubereatsanalytics

Analyze Uber Eats Menu big data for various analytics

apache duckdb kaggle olap-database parquet pyarrow python3 seaborn sql uber-eats

Last synced: 20 Nov 2024

https://github.com/sipemu/excel-to-parquet

A command-line tool written in Rust that converts Excel (XLSX) files to Parquet format. This tool is designed to be simple and efficient, making it easy to convert Excel data for use with big data tools.

excel parquet rust

Last synced: 15 Dec 2024

https://github.com/clearhanhui/parquetloader

A Distributed Streaming PyTorch Dataloader for Parquet.

distributed parquet pytorch

Last synced: 15 Nov 2024

https://github.com/fnu-ankit/uberdataengganalysis

Uber Data Engineering and Analysis. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

csv data-analytics data-engineering parquet python

Last synced: 30 Nov 2024

https://github.com/yo-mah-ya/file_creator

Create files in formats such as "orc", "parquet", "xlsx", "json", and so on with Python

orcfile pandas parquet parquet-files python3

Last synced: 20 Nov 2024

https://github.com/valarpirai/1brc

1 billion row challenge in Python

duckdb pandas parquet python

Last synced: 06 Dec 2024

https://github.com/hyparam/pypyrpyram

Hyperparam but for python people

dataset ml parquet viewer

Last synced: 19 Nov 2024

https://github.com/pprzetacznik/datalake-aws

Sample data lake pipeline on AWS implemented using Terraform

aws csv datalake parquet python terraform

Last synced: 08 Dec 2024

https://github.com/spektom/data-formats-samples

Spark-based generator of samples in different data formats

avro json orc parquet spark

Last synced: 19 Nov 2024

https://github.com/luminousmen/data-tools

Upgrade from avro-tools and parquet-tools jars to a more user-friendly Python package.

avro avro-tools parquet parquet-tools

Last synced: 12 Nov 2024

https://github.com/vara-co/home_sales

Module 22 challenge: Using Google Colab to work on Big Data queries with PySpark SQL, parquet, and cache partitions

big-data big-data-analytics cache google-colab google-colaboratory parquet pyspark pyspark-sql

Last synced: 07 Dec 2024

https://github.com/hwywl/business-tools

Business utility classes accumulated during development, for writing business logic quickly.

html parquet parquet-generator parquet-tools

Last synced: 10 Nov 2024

https://github.com/followthefourleafedclover/allen-mouse-brain-regional-marker-identifier

Find region-specific markers within the Allen Institute for Brain Science: Mouse Whole Cortex and Hippocampus 10x Dataset

bioinformatics bioinformatics-pipeline pandas parquet python3 streamlit streamlit-webapp

Last synced: 08 Dec 2024

https://github.com/nikoshet/rust-dms-cdc-operator

The rust-dms-cdc-operator is a Rust-based utility for comparing the state of a list of tables in an Amazon RDS database with data stored in Parquet files on Amazon S3, particularly useful for change data capture (CDC) scenarios.

aws cdc data dms parquet pgdatadiff polars postgres rds rust s3 validation

Last synced: 09 Nov 2024

https://github.com/thehouseplant/axum-parquet

A small project to test the feasibility of utilizing parquet as a database for large datasets

axum parquet rust

Last synced: 28 Nov 2024

https://github.com/thehouseplant/hono-parquet

A small project to test the feasibility of utilizing parquet as a database for large datasets

duckdb hono nodejs parquet

Last synced: 09 Dec 2024

https://github.com/purcellcjp/home_sales

This project demonstrated the usage of SparkSQL to read, query, cache, and analyze home sales data, providing insights into average prices based on various criteria.

big-data cache parquet spark spark-sql sql

Last synced: 03 Dec 2024

https://github.com/sravanigodavarthi/automated-elt-pipeline-aws

An Apache Airflow data pipeline designed to perform ELT operations, utilizing Amazon S3 and Amazon Redshift Serverless.

airflow aws datamodeling datapipeline dataprocessing dataqualitycheck docker elt-pipeline parquet python redshift-serverless s3-buckets sql

Last synced: 06 Nov 2024

https://github.com/tweedge/parquet2csv

Listen, sometimes you just need a dang CSV file, ok?

parquet parquet-tools python3

Last synced: 05 Nov 2024

https://github.com/johnymontana/hands-on-havasu-geoparquet

Notebook to accompany the "Hands-On With Havasu & GeoParquet" livestream

apache-iceberg apache-sedona geoparquet parquet sedonadb

Last synced: 18 Dec 2024