Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/apache/bigtop

Bigtop is an Apache Foundation project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components.

big-data bigtop java

Last synced: 03 Jul 2024

https://github.com/lakesoul-io/LakeSoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox

Last synced: 03 Jul 2024

https://github.com/u2i/egis

Egis - a handy Ruby interface for AWS Athena

aws aws-athena big-data big-data-analytics ruby ruby-gem

Last synced: 03 Jul 2024

https://github.com/chitralverma/scala-polars

Polars for Scala & Java projects!

arrow big-data dataframe dataframe-library java jni polars rust scala

Last synced: 29 Jun 2024

https://github.com/maximveksler/awesome-serialization

Data formats useful for API, Big Data, ML, Graph & co

awesome-list big-data data-science serialization-formats

Last synced: 29 Jun 2024

https://github.com/fluid-cloudnative/fluid

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)

ai-framework alluxio big-data data-abstraction distributed-cache kubernetes

Last synced: 27 Jun 2024

https://github.com/yash1994/auto-awesome-list

:zap: An automated list of Machine Learning and Data Science tools from research organizations

artificial-intelligence big-data data-science machine-learning

Last synced: 24 Jun 2024

https://github.com/IntelPython/sdc

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler

big-data compilers machine-learning numpy pandas parallel-computing python

Last synced: 23 Jun 2024

https://intel.github.io/scikit-learn-intelex/

Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

ai-inference ai-machine-learning ai-training analytics big-data data-analysis gpu intel machine-learning machine-learning-algorithms oneapi python scikit-learn swrepo

Last synced: 21 Jun 2024

https://github.com/apache/helix

Mirror of Apache Helix

big-data cloud helix java

Last synced: 20 Jun 2024

https://github.com/CognonicLabs/awesome-AI-kubernetes

:snowflake: :whale: Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc

ai analytics big-data cognitive-science data-science docker kubeflow kubernetes kubernetes-ai kubernetes-analytics kubernetes-data-science kubernetes-ml ml pachyderm python-ml scala seldon-core spark spark-kubernetes spark-ml

Last synced: 20 Jun 2024

https://github.com/GoogleCloudPlatform/DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.

big-data data-analysis data-mining data-processing data-science google-cloud-dataflow

Last synced: 20 Jun 2024

https://github.com/apache/tajo

Mirror of Apache Tajo

big-data java tajo

Last synced: 20 Jun 2024

https://github.com/tspannhw/linkextractorprocessor

Apache NiFi Custom Processor For Link Extracting

apache-nifi big-data java links nifi-processors parser

Last synced: 19 Jun 2024

https://github.com/Chabane/bigdata-playground

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api

Last synced: 17 Jun 2024

https://github.com/ubisoft/mobydq

:whale: Tool to automate data quality checks on data pipelines

big-data data-pipeline data-quality data-quality-checks data-quality-monitoring data-warehouse

Last synced: 17 Jun 2024

https://github.com/thrill/thrill

Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++

big-data c-plus-plus distributed-computing thrill

Last synced: 17 Jun 2024

https://github.com/iamabug/BigDataParty

An all-in-one Dockerfile for big data components

big-data dockerfile hadoop kafka spark

Last synced: 16 Jun 2024

https://github.com/elastic/eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

big-data data-analysis dataframe dataframes eland elasticsearch etl lightgbm machine-learning pandas python scikit-learn time-series-forecasting

Last synced: 16 Jun 2024

https://github.com/talariadb/talaria

TalariaDB is a distributed, highly available, and low latency time-series database for Presto

big-data column-store database prestodb real-time stream-processing time-series

Last synced: 16 Jun 2024

https://github.com/kwai/blaze

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

arrow-datafusion big-data data-engineering execution-engine rust spark sql

Last synced: 13 Jun 2024

https://github.com/yahoo/HaloDB

A fast, log structured key-value store.

big-data embedded-database java key-value-store storage-engine

Last synced: 11 Jun 2024

https://github.com/Qihoo360/poseidon

A search engine which can hold 100 trillion lines of log data.

big-data golang map-reduce poseidon search-engine

Last synced: 11 Jun 2024

https://github.com/apache/kudu

Mirror of Apache Kudu

big-data cplusplus kudu

Last synced: 11 Jun 2024

https://github.com/ropensci-archive/cleanEHR

:warning: ARCHIVED :warning: Essential tools and utility functions to facilitate the data processing pipeline, data cleaning and data analysing of clinical data from CC-HIC

big-data critical-care electronic-health-record healthcare intensive-care r r-package rstats

Last synced: 10 Jun 2024

https://github.com/KlugerLab/FIt-SNE

Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

big-data fast-algorithm t-sne visualization

Last synced: 09 Jun 2024

https://github.com/apache/ambari

Apache Ambari simplifies provisioning, managing, and monitoring of Apache Hadoop clusters.

ambari big-data java javascript python

Last synced: 08 Jun 2024

https://github.com/apache/flink-shaded

Apache Flink shaded artifacts repository

big-data flink java scala

Last synced: 08 Jun 2024

https://github.com/policratus/sparkmage

🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.

big-data computer-vision hadoop image-processing spark

Last synced: 07 Jun 2024

https://github.com/NVIDIA/spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data gpu rapids spark

Last synced: 07 Jun 2024

https://github.com/apache/flink-kubernetes-operator

Apache Flink Kubernetes Operator

big-data flink java

Last synced: 07 Jun 2024

https://github.com/alldatacenter/alldata

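🔥🔥 AllData is a definable data middle platform for big data: it uses a data platform as its base, a data middle platform as the bridge, a machine learning platform as the middle-layer framework, and large-model applications as upstream products, providing a full-chain digitalization solution. New member commercial edition X WeChat group: https://docs.qq.com/doc/DVHlkSEtvVXVCdEFo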

artificial-intelligence big-data chatgpt cloudeon cube-studio datart datasophon dinky dolphinscheduler flink griffin hudi iceberg kong mlops mlrun paimon ranger streampark tis

Last synced: 07 Jun 2024

https://github.com/StarRocks/starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. Winner of InfoWorld’s 2023 BOSSIE Award for best open source software.

analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized

Last synced: 07 Jun 2024

https://github.com/apache/bookkeeper

Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads

apache big-data bookkeeper distributed-log distributed-systems wal

Last synced: 07 Jun 2024

https://github.com/apache/calcite-avatica

Apache Calcite Avatica

big-data calcite geospatial hadoop java sql

Last synced: 07 Jun 2024

https://github.com/ClickHouse/ClickBench

ClickBench: a Benchmark For Analytical Databases

analytics benchmark big-data databases olap sql

Last synced: 07 Jun 2024

https://github.com/bytedance/bitsail

BitSail is a distributed, high-performance data integration engine that supports batch, streaming, and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data entries every day.

big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time

Last synced: 07 Jun 2024

https://github.com/microsoft/hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.

acceleration analytics big-data databases indexing spark

Last synced: 04 Jun 2024

https://github.com/raycad/devops-roadmap

DevOps methodology & roadmap for a devops developer in 2019. Interesting books to learn new technologies.

ai big-data books deep-learning devops experience expert-system machine-learning programming

Last synced: 04 Jun 2024

https://github.com/sigmf/SigMF

The Signal Metadata Format Specification

big-data metadata signals specification standard

Last synced: 02 Jun 2024

https://github.com/ging/fiware-cosmos

The Cosmos Generic Enabler makes big data analysis over context easier and integrates with some of the most popular big data platforms.

analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine

Last synced: 01 Jun 2024

https://github.com/traildb/traildb

TrailDB is an efficient tool for storing and querying series of events

big-data c data-analytics database event-data time-series traildb

Last synced: 01 Jun 2024

https://github.com/foochane/books

A collection of books covering C and C++, git, Java, Keras, Linux, NLP, Python, Scala, TensorFlow, big data, recommender systems, databases, data mining, machine learning, deep learning, algorithms, and more.

big-data c cpp database datamining dl git java keras ml nlp python scala tensorflow

Last synced: 01 Jun 2024

https://github.com/rimolive/mapa-crime-sp

Visualization of crime data for the city of São Paulo

big-data crime sao-paulo

Last synced: 01 Jun 2024

https://github.com/TuiQiao/CBoard

An easy to use, self-service open BI reporting and BI dashboard platform.

big-data business-intelligence cboard dashboard data-visualization metabase olap superset

Last synced: 31 May 2024

https://github.com/apache/calcite

Apache Calcite

big-data calcite geospatial hadoop java sql

Last synced: 31 May 2024

https://github.com/danielbeeke/influence

A little webapp to view relationships of influence between people that are on Wikipedia.

big-data dbpedia influence rdf sparql sparql-query

Last synced: 30 May 2024

https://github.com/man-group/arcticdb

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

big-data data data-analysis data-science database dataframe pandas quantitative-analysis quantitative-finance quantitative-trading

Last synced: 30 May 2024

https://github.com/commsor/titanoboa

Titanoboa makes complex workflows easy. It is a low-code workflow orchestration platform for JVM - distributed, highly scalable and fault tolerant.

big-data distributed distributed-systems esb integrations ipaas jvm low-code service-bus titanoboa workflow workflow-engine workflow-platform

Last synced: 28 May 2024

https://github.com/ExpediaGroup/beekeeper

Service for automatically managing and cleaning up unreferenced data

big-data cleanup hive hive-metastore java maintenance metastore oss-portal-featured s3

Last synced: 26 May 2024

https://github.com/ExpediaGroup/circus-train

Circus Train is a dataset replication tool that copies Hive tables between clusters and clouds.

big-data bigquery hive hive-metastore hive-table replicate-data replication s3

Last synced: 26 May 2024

https://github.com/ExpediaGroup/shunting-yard

Shunting Yard is a real-time data replication tool that copies data between Hive Metastores.

big-data circus-train hive hive-metastore hive-table replicate-data replication

Last synced: 26 May 2024

https://github.com/jadianes/spark-py-notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark

Last synced: 26 May 2024

https://github.com/DrSnowbird/openrefine

OpenRefine Docker for Data ETL/ELT

big-data docker etl-framework openrefine

Last synced: 26 May 2024

https://github.com/yahoo/maha

A framework for rapid reporting API development; with out of the box support for high cardinality dimension lookups with druid.

analytics api-framework big-data druid druid-lookups druid-manager hive oracle postgresql presto scala sql star-schema

Last synced: 26 May 2024

https://github.com/yahoo/fili

Easily make RESTful web services for time series reporting with Big Data analytics engines like Druid and SQL Databases.

analytics big-data druid featured fili restful-api web webservice

Last synced: 26 May 2024

https://github.com/ooni/pipeline

OONI data processing pipeline

big-data data-pipeline open-data

Last synced: 26 May 2024

https://github.com/matanolabs/matano

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS

alerting apache-iceberg aws aws-security big-data cloud cloud-native cloud-security cybersecurity detection-engineering dfir log-analytics log-management rust secops security security-tools serverless siem threat-hunting

Last synced: 26 May 2024

https://github.com/scanner-research/esper-tv

Esper instance for TV news analysis

big-data docker google-cloud video visualization

Last synced: 23 May 2024

https://github.com/Eventual-Inc/Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust

big-data data-engineering data-science dataframe distributed-computing machine-learning python rust

Last synced: 22 May 2024

https://github.com/hugegraph/hugegraph

A graph database that supports more than 100 billion data entries, with high performance and scalability (includes OLTP engine, REST API, and backends)

big-data database graph graph-database graphdb gremlin

Last synced: 21 May 2024

https://github.com/privefl/bigstatsr

R package for statistical tools with big matrices stored on disk.

big-data large-matrices memory-mapped-file parallel-computing r r-package statistical-methods

Last synced: 20 May 2024

https://github.com/databricks/koalas

Koalas: pandas API on Apache Spark

big-data data-science dataframe mlflow pandas pydata spark

Last synced: 18 May 2024

https://github.com/r-barnes/richdem

High-performance Terrain and Hydrology Analysis

big-data digital-elevation-model geosciences geospatial hydrologic-modeling hydrology

Last synced: 16 May 2024

https://github.com/rakam-io/rakam-api

📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)

analytics analytics-platform bi-server big-data java

Last synced: 16 May 2024

https://github.com/Hydrospheredata/mist

Serverless proxy for Spark cluster

apache-spark api big-data serverless

Last synced: 16 May 2024

https://github.com/gchq/Gaffer

A large-scale entity and relation database supporting aggregation of properties

accumulo aggregation big-data graph graph-database hadoop hbase parquet spark

Last synced: 15 May 2024

https://github.com/FeatureBaseDB/featurebase

A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase

big-data bitmap database go index pilosa sql

Last synced: 15 May 2024

https://github.com/datumbox/datumbox-framework

Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

big-data data-science java machine-learning nlp statistics

Last synced: 15 May 2024

https://github.com/apache/hive

Apache Hive

apache big-data database hadoop hive java sql

Last synced: 15 May 2024