An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-processing

A curated list of projects in awesome lists tagged with data-processing.

https://onceupon.github.io/Bash-Oneliner/

A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.

bash data-processing grep hardware linux linux-administration one-liners oneliner-commands shell shell-oneliner system terminal variables xargs xwindow

Last synced: 16 Nov 2025

https://github.com/onceupon/bash-oneliner

A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.

bash data-processing grep hardware linux linux-administration one-liners oneliner-commands shell shell-oneliner system terminal variables xargs xwindow

Last synced: 14 May 2025

https://github.com/onceupon/Bash-Oneliner

A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.

bash data-processing grep hardware linux linux-administration one-liners oneliner-commands shell shell-oneliner system terminal variables xargs xwindow

Last synced: 26 Mar 2025

https://github.com/tomwright/dasel

Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.

cli config configuration data-processing data-structures data-wrangling devops-tools go golang json json-processing parser query selector toml update xml yaml yaml-processor

Last synced: 26 Dec 2025

https://github.com/TomWright/dasel

Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.

cli config configuration data-processing data-structures data-wrangling devops-tools go golang json json-processing parser query selector toml update xml yaml yaml-processor

Last synced: 12 Mar 2025

https://github.com/nvidia/dali

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

audio-processing data-augmentation data-processing deep-learning fast-data-pipeline gpu gpu-tensorflow image-augmentation image-processing machine-learning mxnet neural-network paddle python pytorch

Last synced: 13 May 2025

https://github.com/NVIDIA/DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

audio-processing data-augmentation data-processing deep-learning fast-data-pipeline gpu gpu-tensorflow image-augmentation image-processing machine-learning mxnet neural-network paddle python pytorch

Last synced: 15 Mar 2025

https://github.com/deepseek-ai/smallpond

A lightweight data processing framework built on DuckDB and 3FS.

data-processing duckdb

Last synced: 16 Jul 2025

https://github.com/dashbitco/broadway

Concurrent and multi-stage data ingestion and data processing with Elixir

broadway concurrent data-ingestion data-processing elixir genstage

Last synced: 14 May 2025

https://github.com/asyml/texar

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

bert casl-project data-processing deep-learning dialog-systems gpt-2 machine-learning machine-translation natural-language-processing python tensorflow texar text-data text-generation xlnet

Last synced: 14 May 2025

https://github.com/numaproj/numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs

data-processing hacktoberfest k8s kubernetes map-reduce pipeline stream-processing

Last synced: 23 Oct 2025

https://github.com/googlecloudplatform/data-science-on-gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

cloud-computing data-analysis data-engineering data-pipeline data-processing data-science data-visualization machine-learning

Last synced: 14 Apr 2025

https://github.com/allenai/dolma

Data and tools for generating and inspecting OLMo pre-training data.

data-processing large-language-models llm machile-learning nlp

Last synced: 13 Oct 2025

https://github.com/GoogleCloudPlatform/data-science-on-gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

cloud-computing data-analysis data-engineering data-pipeline data-processing data-science data-visualization machine-learning

Last synced: 19 Jul 2025

https://github.com/cocoindex-io/cocoindex

ETL framework to make your data AI-ready, with real-time incremental updates and support for custom logic, like Lego.

ai change-data-capture data data-engineering data-indexing data-infrastructure data-processing dataflow etl help-wanted indexing knowledge-graph llm pipeline python rag real-time rust semantic-search streaming

Last synced: 14 May 2025

https://github.com/GoogleCloudPlatform/DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.

big-data data-analysis data-mining data-processing data-science google-cloud-dataflow

Last synced: 01 May 2025

https://github.com/googlecloudplatform/dataflowjavasdk

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.

big-data data-analysis data-mining data-processing data-science google-cloud-dataflow

Last synced: 03 Oct 2025

https://github.com/jofpin/synthBTC

A tool that uses advanced Monte Carlo simulations and Turbit parallel processing to create possible Bitcoin prediction scenarios.

bitcoin data-processing monte-carlo-simulation nodejs prediction synthetic-data turbit

Last synced: 27 Sep 2025

https://github.com/asyml/texar-pytorch

Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/

bert casl-project data-processing deep-learning dialog-systems gpt-2 machine-learning machine-translation natural-language-processing python pytorch roberta texar texar-pytorch text-data text-generation xlnet

Last synced: 08 Oct 2025

https://github.com/chenghaomou/text-dedup

All-in-one text de-duplication

data-processing de-duplication nlp text-processing

Last synced: 14 Dec 2025

https://github.com/hstreamdb/hstream

HStreamDB is an open-source, cloud-native streaming database for IoT and beyond. Modernize your data stack for real-time applications.

data-processing database distributed-database distributed-systems financial-analysis haskell hstreamdb iot iot-database kafka materialized-view real-time realtime-database scale sql stream-processing streaming streaming-data streaming-database

Last synced: 15 May 2025

https://github.com/benibela/xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.

cli command-line css-selector curl data-processing datascraping html http httpie json rest scraper web webscraper webscraping wget xml xmlstarlet xpath xquery

Last synced: 15 May 2025

https://github.com/jofpin/synthbtc

A tool that uses advanced Monte Carlo simulations and Turbit parallel processing to create possible Bitcoin prediction scenarios.

bitcoin data-processing monte-carlo-simulation nodejs prediction synthetic-data turbit

Last synced: 16 May 2025

https://github.com/ChenghaoMou/text-dedup

All-in-one text de-duplication

data-processing de-duplication nlp text-processing

Last synced: 03 Apr 2025

https://github.com/msamogh/nonechucks

Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!

data-cleaning data-pipeline data-preprocessing data-processing machine-learning preprocessing pytorch torch

Last synced: 07 May 2025

https://github.com/flow-php/etl

PHP - ETL (Extract Transform Load) data processing library

data-engineering data-processing etl flow-php

Last synced: 12 Apr 2025

https://github.com/lithops-cloud/lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

big-data big-data-analytics cloud-computing data-processing distributed kubernetes multicloud multiprocessing object-storage parallel python serverless serverless-computing serverless-functions

Last synced: 03 Jan 2026

https://github.com/alttch/rapidtables

Super fast list of dicts to pre-formatted tables conversion library for Python 2/3

data-processing dictionary-data library python python3 text-formatting

Last synced: 05 Apr 2025

https://github.com/yord/pxi

🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.

csv data-processing deserializer dsv json marshaller parser pixie pxi serializer ssv tsv

Last synced: 19 Jun 2025

https://github.com/svenkreiss/pysparkling

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

apache-spark data-processing data-science python

Last synced: 07 Apr 2025

https://github.com/Yord/pxi

🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.

csv data-processing deserializer dsv json marshaller parser pixie pxi serializer ssv tsv

Last synced: 06 Apr 2025

https://github.com/ColasGael/Machine-Learning-for-Solar-Energy-Prediction

Predict the Power Production of a solar panel farm from Weather Measurements using Machine Learning

data-processing machine-learning matlab neural-network python tensorflow

Last synced: 07 May 2025

https://github.com/scramjetorg/scramjet

Public tracker for Scramjet Cloud Platform, a platform that brings data from many environments together.

data-processing data-space data-stream edge-computing event-stream javascript python raspberry-pi reactive-programming transformations virtual-data-environment

Last synced: 07 Apr 2025

https://github.com/asyml/forte

Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/

data-processing deep-learning information-retrieval machine-learning natural-language natural-language-processing pipeline python text-data

Last synced: 04 Apr 2025

https://github.com/airscholar/e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

apache-airflow apache-kafka apache-spark apache-zookeeper big-data cassandra containerization data-engineering data-pipeline data-processing data-storage docker etl-pipeline postgresql real-time-analytics

Last synced: 16 May 2025

https://github.com/colasgael/machine-learning-for-solar-energy-prediction

Predict the Power Production of a solar panel farm from Weather Measurements using Machine Learning

data-processing machine-learning matlab neural-network python tensorflow

Last synced: 09 Apr 2025

https://github.com/senbox-org/snap-engine

ESA Earth Observation Toolbox and Java Development Platform

data-processing data-visualization earth-observation eo linux macos raster-data remote-sensing windows

Last synced: 12 Jul 2025

https://github.com/markus-wa/cq

Clojure Query: A Command-line Data Processor for JSON, YAML, EDN, XML and more

cli clojure command-line csv data-processing data-transformation edn hacktoberfest json msgpack transformation xml yaml

Last synced: 10 May 2025

https://github.com/iam-mhaseeb/skytrax-data-warehouse

A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.

airflow data-analysis data-analytics data-cleaning data-engineering data-orchestration data-processing data-visualization data-warehouse data-warehousing database docker metabase python python3 redshift s3 s3-bucket sql

Last synced: 12 Aug 2025

https://github.com/utdemir/distributed-dataset

A distributed data processing framework in Haskell.

aws-lambda data-processing distributed haskell spark

Last synced: 11 Dec 2025

https://github.com/siteimprove/alfa

♿ Suite of open and standards-based tools for performing reliable accessibility conformance testing at scale

a11y accessibility act aria customer-facing data-processing earl horizon2020 json-ld monorepo sarif testing typescript wcag

Last synced: 08 Apr 2025

https://github.com/Siteimprove/alfa

♿ Suite of open and standards-based tools for performing reliable accessibility conformance testing at scale

a11y accessibility act aria customer-facing data-processing earl horizon2020 json-ld monorepo sarif testing typescript wcag

Last synced: 15 Apr 2025

https://github.com/nvidia/nvimagecodec

The nvImageCodec library of GPU- and CPU-accelerated codecs featuring a unified interface

computer-vision cpp cuda dali data-processing deep-learning fast-data-pipeline gpu image-processing machine-learning nvidia python pytorch

Last synced: 16 May 2025

https://github.com/whoiskatrin/financial-statement-pdf-extractor

Python script to extract as much structured information as possible from annual/quarterly reports.

balance-sheet cash-flow cash-flow-statement data-processing extract financial-analysis financial-statements pdf quarterly-reports

Last synced: 04 Apr 2025

https://github.com/asavinov/prosto

Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

business-intelligence data-preparation data-preprocessing data-processing data-science data-wrangling feature-engineering map-reduce olap pandas python spark workflow

Last synced: 11 Apr 2025

https://github.com/pauliacomi/pygaps

A framework for processing adsorption data and isotherm fitting

adsorption data-processing materials-science

Last synced: 21 Oct 2025

https://github.com/aces/cbrain

CBRAIN is a flexible Ruby on Rails framework for accessing and processing large data on high-performance computing infrastructures.

cbrain cbrain-api cbrain-architecture cbrain-platform cbrain-service data-processing hpc rails-application ruby science

Last synced: 05 Apr 2025

https://github.com/kubeflow/mcp-apache-spark-history-server

MCP Server for Apache Spark History Server. The bridge between Agentic AI and Apache Spark.

apache-spark big-data data-processing kubernetes mcp mcp-server

Last synced: 19 Sep 2025

https://github.com/urbanos-public/smartcitiesdata

The core micro services of UrbanOS as an umbrella project with component documentation

data-analytics data-processing data-visualization elixir elixir-phoenix

Last synced: 06 Apr 2025

https://github.com/unidentifieddeveloper/blaze

A blazing fast exporter for your Elasticsearch data.

data-dump data-export data-processing devops devops-tools elasticsearch libcurl rapidjson

Last synced: 09 Apr 2025

https://github.com/wq/itertable

⇔ IterTable is a Pythonic API for iterating through tabular data formats, including CSV, XLSX, XML, and JSON.

csv data-processing excel export import iterable json openpyxl pandas pythonic spreadsheet tabular-data xml

Last synced: 03 Apr 2025

https://github.com/jqnpm/jqnpm

A package manager built for the command-line JSON processor jq.

command-line-tool data data-processing jq json package-manager

Last synced: 21 Jul 2025

https://github.com/soumyadip007/data-science-using-python-university-course-module

“Data science” is just about as broad of a term as they come. It may be easiest to describe what it is by listing its more concrete components: Data exploration & analysis. Included here: Pandas; NumPy; SciPy; a helping hand from Python's Standard Library.

data-preparation data-preprocessing data-processing data-science data-visualization jupyter-notebook knn numpy panda plotting python

Last synced: 23 Jun 2025

https://github.com/jeffgrunewald/stargate

An Apache Pulsar client written in Elixir

data-processing elixir pulsar-client

Last synced: 11 Jul 2025

https://github.com/jpkli/p4

P4: Portable Parallel Processing Pipeline

data-processing gpu visualizations

Last synced: 02 May 2025

https://github.com/zakarialaoui10/zikomatrix

Arduino library for creating and manipulating matrices of arbitrary size and data type. The library provides a Matrix class that can be used to create matrices, perform basic matrix operations

arduino cpp data-processing esp32 esp8266 hardware library morocco std

Last synced: 09 Apr 2025

https://github.com/getstrm/pace

Data policy IN, dynamic view OUT: PACE is the Policy As Code Engine. It helps you to programmatically create and apply a data policy to a processing platform like Databricks, Snowflake or BigQuery (or plain ol' Postgres, even!) with definitions imported from Collibra, Datahub, ODD and the like.

bigquery data-catalog data-contracts data-governance data-processing databricks policy-enforcement snowflake

Last synced: 13 Oct 2025

https://github.com/greenelab/tdm

R package for normalizing RNA-seq data to make them comparable to microarray data.

data-processing microarray package r rna-seq

Last synced: 11 Jun 2025

https://github.com/m-clark/data-processing-and-visualization

This document forms the basis of several workshops/talks that get into everyday programming with R, but also includes mirrored code in Python as Jupyter notebooks.

data-processing data-science datatable dplyr ggplot2 htmlwidgets jupyter-notebooks machine-learning model-criticism modeling numpy pandas programming programming-exercises python r tidyverse visualization workshop workshops

Last synced: 02 Sep 2025

https://github.com/zazuko/barnard59

An intuitive and flexible RDF pipeline solution designed to simplify and automate ETL processes for efficient data management.

data-integration data-pipeline data-processing etl json-ld linked-data pipeline rdf semantic-web

Last synced: 06 Apr 2025

https://github.com/zakarialaoui10/ZikoMatrix

Arduino library for creating and manipulating matrices of arbitrary size and data type. The library provides a Matrix class that can be used to create matrices, perform basic matrix operations

arduino cpp data-processing esp32 esp8266 hardware library morocco std

Last synced: 29 Apr 2025

https://github.com/wandersoncferreira/meta-schema

Little DSL to make data processing sane with clojure.spec and spec-tools

clojure clojure-spec data-processing dsl edn spec

Last synced: 05 May 2025