Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/apache/bigtop
Bigtop is an Apache Foundation project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components.
Last synced: 03 Jul 2024
https://github.com/lakesoul-io/LakeSoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox
Last synced: 03 Jul 2024
https://github.com/u2i/egis
Egis - a handy Ruby interface for AWS Athena
aws aws-athena big-data big-data-analytics ruby ruby-gem
Last synced: 03 Jul 2024
https://github.com/provectus/kafka-ui
Open-Source Web UI for Apache Kafka Management
apache-kafka big-data cluster-management event-streaming hacktoberfest kafka kafka-brokers kafka-client kafka-cluster kafka-connect kafka-manager kafka-producer kafka-streams kafka-ui opensource streaming-data streams web-ui
Last synced: 01 Jul 2024
https://github.com/ChrisCummins/clgen
Deep learning program generator
benchmarking big-data deep-learning gpu lstm machine-learning neural-network opencl synthetic-programs
Last synced: 01 Jul 2024
https://github.com/chitralverma/scala-polars
Polars for Scala & Java projects!
arrow big-data dataframe dataframe-library java jni polars rust scala
Last synced: 29 Jun 2024
https://github.com/maximveksler/awesome-serialization
Data formats useful for API, Big Data, ML, Graph & co
awesome-list big-data data-science serialization-formats
Last synced: 29 Jun 2024
https://github.com/fluid-cloudnative/fluid
Fluid, elastic data abstraction and acceleration for BigData/AI applications in the cloud. (Project under CNCF)
ai-framework alluxio big-data data-abstraction distributed-cache kubernetes
Last synced: 27 Jun 2024
https://github.com/KennethanCeyer/awesome-data-pipeline
Awesome list for data pipelines
architecture awesome awesome-list big-data bigdata cloud data data-engineering dataeng datalake datapipeline datawarehouse hadoop hive opensource query spark
Last synced: 25 Jun 2024
https://github.com/yash1994/auto-awesome-list
⚡ An automated list of Machine Learning and Data Science tools from research organizations
artificial-intelligence big-data data-science machine-learning
Last synced: 24 Jun 2024
https://github.com/varchar-io/nebula
A distributed block-based data storage and compute engine
access-control analytics big-data data-analysis data-visualization distributed-computing distributed-systems real-time
Last synced: 23 Jun 2024
https://github.com/IntelPython/sdc
Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler
big-data compilers machine-learning numpy pandas parallel-computing python
Last synced: 23 Jun 2024
https://github.com/apache/couchdb-fauxton
Fauxton is the new Web UI for CouchDB
apache big-data cloud couchdb database erlang fauxton http javascript network-client
Last synced: 21 Jun 2024
https://intel.github.io/scikit-learn-intelex/
Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
ai-inference ai-machine-learning ai-training analytics big-data data-analysis gpu intel machine-learning machine-learning-algorithms oneapi python scikit-learn swrepo
Last synced: 21 Jun 2024
https://kantord.github.io/just-dashboard/
📊 📋 Dashboards using YAML or JSON files
big-data business-intelligence chart csv d3 d3js dashboard data data-driven data-engineering data-science data-visualization gist github-gist json just-dashboard yaml
Last synced: 21 Jun 2024
https://github.com/CognonicLabs/awesome-AI-kubernetes
❄️ 🐳 Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc
ai analytics big-data cognitive-science data-science docker kubeflow kubernetes kubernetes-ai kubernetes-analytics kubernetes-data-science kubernetes-ml ml pachyderm python-ml scala seldon-core spark spark-kubernetes spark-ml
Last synced: 20 Jun 2024
https://github.com/GoogleCloudPlatform/DataflowJavaSDK
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
big-data data-analysis data-mining data-processing data-science google-cloud-dataflow
Last synced: 20 Jun 2024
https://github.com/tspannhw/linkextractorprocessor
Apache NiFi Custom Processor For Link Extracting
apache-nifi big-data java links nifi-processors parser
Last synced: 19 Jun 2024
https://microsoft.github.io/SynapseML/
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 19 Jun 2024
https://github.com/joshday/OnlineStats.jl
⚡ Single-pass algorithms for statistics
big-data julia julia-language julialang online-algorithms onlinestats statistics stochastic-approximation streaming-data
Last synced: 18 Jun 2024
https://github.com/Chabane/bigdata-playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api
Last synced: 17 Jun 2024
https://github.com/ubisoft/mobydq
🐳 Tool to automate data quality checks on data pipelines
big-data data-pipeline data-quality data-quality-checks data-quality-monitoring data-warehouse
Last synced: 17 Jun 2024
https://github.com/apache/couchdb-docker
Semi-official Apache CouchDB Docker images
apache big-data cloud couchdb cplusplus database erlang http javascript network-client network-server
Last synced: 17 Jun 2024
https://github.com/thrill/thrill
Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++
big-data c-plus-plus distributed-computing thrill
Last synced: 17 Jun 2024
https://github.com/iamabug/BigDataParty
All-in-One Dockerfile for big data components
big-data dockerfile hadoop kafka spark
Last synced: 16 Jun 2024
https://github.com/elastic/eland
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
big-data data-analysis dataframe dataframes eland elasticsearch etl lightgbm machine-learning pandas python scikit-learn time-series-forecasting
Last synced: 16 Jun 2024
https://github.com/opendatadiscovery/awesome-data-catalogs
📙 Awesome Data Catalogs and Observability Platforms.
awesome awesome-list big-data data-catalog data-discovery data-engineering data-quality datacatalog datadiscovery dataops metadata metadata-management ml observability open-source opendata opensource oss
Last synced: 16 Jun 2024
https://github.com/talariadb/talaria
TalariaDB is a distributed, highly available, and low latency time-series database for Presto
big-data column-store database prestodb real-time stream-processing time-series
Last synced: 16 Jun 2024
https://github.com/kantord/just-dashboard
📊 📋 Dashboards using YAML or JSON files
big-data business-intelligence chart csv d3 d3js dashboard data data-driven data-engineering data-science data-visualization gist github-gist json just-dashboard yaml
Last synced: 14 Jun 2024
https://github.com/LLNL/merlin
Machine Learning for HPC Workflows
big-data celery-workers hpc machine-learning radiuss redis-server simulation workflow workflows
Last synced: 14 Jun 2024
https://github.com/kwai/blaze
A blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.
arrow-datafusion big-data data-engineering execution-engine rust spark sql
Last synced: 13 Jun 2024
https://github.com/apache/iotdb
Apache IoTDB
big-data database iot java nosql timeseries tsdb
Last synced: 11 Jun 2024
https://github.com/yahoo/HaloDB
A fast, log structured key-value store.
big-data embedded-database java key-value-store storage-engine
Last synced: 11 Jun 2024
https://github.com/Qihoo360/poseidon
A search engine which can hold 100 trillion lines of log data.
big-data golang map-reduce poseidon search-engine
Last synced: 11 Jun 2024
https://github.com/ropensci-archive/cleanEHR
⚠️ ARCHIVED ⚠️ Essential tools and utility functions to facilitate the data processing pipeline, data cleaning and data analysis of clinical data from CC-HIC
big-data critical-care electronic-health-record healthcare intensive-care r r-package rstats
Last synced: 10 Jun 2024
https://github.com/kaleidicassociates/lubeck
High level linear algebra library for Dlang
big-data blas dlang hedgefund high-performance linear-algebra matlab native-code ndslice numerical-methods numpy octave quantitative-finance symmetry-investments
Last synced: 10 Jun 2024
https://github.com/KlugerLab/FIt-SNE
Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)
big-data fast-algorithm t-sne visualization
Last synced: 09 Jun 2024
https://github.com/apache/ambari
Apache Ambari simplifies provisioning, managing, and monitoring of Apache Hadoop clusters.
ambari big-data java javascript python
Last synced: 08 Jun 2024
https://github.com/apache/flink-shaded
Apache Flink shaded artifacts repository
Last synced: 08 Jun 2024
https://github.com/policratus/sparkmage
🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.
big-data computer-vision hadoop image-processing spark
Last synced: 07 Jun 2024
https://github.com/NVIDIA/spark-rapids
Spark RAPIDS plugin - accelerate Apache Spark with GPUs
Last synced: 07 Jun 2024
https://github.com/apache/flink-kubernetes-operator
Apache Flink Kubernetes Operator
Last synced: 07 Jun 2024
https://github.com/alldatacenter/alldata
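🔥🔥 AllData is a definable big data middle-platform product: with the data platform as its foundation, the data middle platform as its bridge, the machine learning platform as its mid-layer framework, and large-model applications as its upstream products, it provides full-chain digital solutions. New membership commercial edition X WeChat group: https://docs.qq.com/doc/DVHlkSEtvVXVCdEFo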
artificial-intelligence big-data chatgpt cloudeon cube-studio datart datasophon dinky dolphinscheduler flink griffin hudi iceberg kong mlops mlrun paimon ranger streampark tis
Last synced: 07 Jun 2024
https://github.com/StarRocks/starrocks
StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. InfoWorld’s 2023 BOSSIE Award for best open source software.
analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized
Last synced: 07 Jun 2024
https://github.com/alibaba/fastjson2
🚄 FASTJSON2 is a Java JSON library with excellent performance.
android big-data deserialization fastjson fastjson2 graal graalvm-native-image high-performance java java-json json json-deserialization json-parser json-path json-serialization json-serializer jsonb serialization
Last synced: 07 Jun 2024
https://github.com/apache/bookkeeper
Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads
apache big-data bookkeeper distributed-log distributed-systems wal
Last synced: 07 Jun 2024
https://github.com/apache/calcite-avatica
Apache Calcite Avatica
big-data calcite geospatial hadoop java sql
Last synced: 07 Jun 2024
https://github.com/paradedb/paradedb
Postgres for Search and Analytics
aggregations analytics big-data bm25 database datalake elasticsearch faceting full-text-search htap hybrid-search object-storage olap postgres real-time-analytics similarity-search sparse-vector splade sql
Last synced: 07 Jun 2024
https://github.com/bytedance/bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time
Last synced: 07 Jun 2024
https://github.com/microsoft/hyperspace
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
acceleration analytics big-data databases indexing spark
Last synced: 04 Jun 2024
https://github.com/raycad/devops-roadmap
DevOps methodology & roadmap for a devops developer in 2019. Interesting books to learn new technologies.
ai big-data books deep-learning devops experience expert-system machine-learning programming
Last synced: 04 Jun 2024
https://github.com/privefl/bigsnpr
R package for the analysis of massive SNP arrays.
big-data bioinformatics memory-mapped-file parallel-computing polygenic-scores population-structure-inference r r-package snp-data statistical-methods
Last synced: 04 Jun 2024
https://github.com/AvinashSingh786/RegSmart
Windows Registry Analysis Tool
big-data data-processing forensic-analysis parsing windows-registry
Last synced: 03 Jun 2024
https://github.com/opendatadiscovery/opendatadiscovery-specification
ODD Specification is a universal open standard for collecting metadata.
api big-data big-data-platform data-discovery data-engineering data-governance data-mesh data-platform metadata metadata-management metadata-parser open-source opensource spec specification
Last synced: 02 Jun 2024
https://github.com/sigmf/SigMF
The Signal Metadata Format Specification
big-data metadata signals specification standard
Last synced: 02 Jun 2024
https://github.com/ging/fiware-cosmos
The Cosmos Generic Enabler enables easier Big Data analysis over context, integrated with some of the most popular Big Data platforms.
analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine
Last synced: 01 Jun 2024
https://github.com/traildb/traildb
TrailDB is an efficient tool for storing and querying series of events
big-data c data-analytics database event-data time-series traildb
Last synced: 01 Jun 2024
https://github.com/foochane/books
A collection of books covering C & C++, git, Java, Keras, Linux, NLP, Python, Scala, TensorFlow, big data, recommender systems, databases, data mining, machine learning, deep learning, algorithms, and more.
big-data c cpp database datamining dl git java keras ml nlp python scala tensorflow
Last synced: 01 Jun 2024
https://github.com/rimolive/mapa-crime-sp
Visualization of crime data for the city of São Paulo
Last synced: 01 Jun 2024
https://github.com/TuiQiao/CBoard
An easy to use, self-service open BI reporting and BI dashboard platform.
big-data business-intelligence cboard dashboard data-visualization metabase olap superset
Last synced: 31 May 2024
https://github.com/Moataz-Elmesmary/Data-Science-Roadmap
Data Science Roadmap from A to Z
big-data chatgpt cheatsheet cv-template data-analysis data-engineering data-science data-visualization deep-learning interview-questions linear-algebra llms machine-learning mathematics neural-network nlp probability python sql statistics
Last synced: 31 May 2024
https://github.com/Lonero-Team/Decentralized-Internet
An SDK/library for decentralized web and distributed computing projects
big-data biostatistical-computing blockchain cryptography dapps decentralized decentralized-applications decentralized-internet developer-tools dweb grid-computing iot library mesh-networks offline-first p2p peer-to-peer protocols saas sdk
Last synced: 31 May 2024
https://github.com/apache/calcite
Apache Calcite
big-data calcite geospatial hadoop java sql
Last synced: 31 May 2024
https://github.com/danielbeeke/influence
A little webapp to view relationships of influence between people that are on Wikipedia.
big-data dbpedia influence rdf sparql sparql-query
Last synced: 30 May 2024
https://github.com/man-group/arcticdb
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
big-data data data-analysis data-science database dataframe pandas quantitative-analysis quantitative-finance quantitative-trading
Last synced: 30 May 2024
https://github.com/commsor/titanoboa
Titanoboa makes complex workflows easy. It is a low-code workflow orchestration platform for JVM - distributed, highly scalable and fault tolerant.
big-data distributed distributed-systems esb integrations ipaas jvm low-code service-bus titanoboa workflow workflow-engine workflow-platform
Last synced: 28 May 2024
https://github.com/aronszanto/sLSM-Tree
High-Performance C++ Data System
big-data data-system high-performance high-performance-computing lsm-tree multithreading skiplist
Last synced: 28 May 2024
https://github.com/ExpediaGroup/beekeeper
Service for automatically managing and cleaning up unreferenced data
big-data cleanup hive hive-metastore java maintenance metastore oss-portal-featured s3
Last synced: 26 May 2024
https://github.com/ExpediaGroup/circus-train
Circus Train is a dataset replication tool that copies Hive tables between clusters and clouds.
big-data bigquery hive hive-metastore hive-table replicate-data replication s3
Last synced: 26 May 2024
https://github.com/ExpediaGroup/shunting-yard
Shunting Yard is a real-time data replication tool that copies data between Hive Metastores.
big-data circus-train hive hive-metastore hive-table replicate-data replication
Last synced: 26 May 2024
https://github.com/jadianes/spark-py-notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark
Last synced: 26 May 2024
https://github.com/jostmey/NakedTensor
Bare-bones examples of machine learning in TensorFlow
big-data distributed-computing linear-regression simple tensorflow tensorflow-examples tensorflow-exercises tensorflow-tutorials
Last synced: 26 May 2024
https://github.com/DrSnowbird/openrefine
OpenRefine Docker for Data ETL/ELT
big-data docker etl-framework openrefine
Last synced: 26 May 2024
https://github.com/yahoo/maha
A framework for rapid reporting API development, with out-of-the-box support for high-cardinality dimension lookups with Druid.
analytics api-framework big-data druid druid-lookups druid-manager hive oracle postgresql presto scala sql star-schema
Last synced: 26 May 2024
https://github.com/yahoo/fili
Easily make RESTful web services for time series reporting with Big Data analytics engines like Druid and SQL Databases.
analytics big-data druid featured fili restful-api web webservice
Last synced: 26 May 2024
https://github.com/ooni/pipeline
OONI data processing pipeline
big-data data-pipeline open-data
Last synced: 26 May 2024
https://github.com/matanolabs/matano
Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
alerting apache-iceberg aws aws-security big-data cloud cloud-native cloud-security cybersecurity detection-engineering dfir log-analytics log-management rust secops security security-tools serverless siem threat-hunting
Last synced: 26 May 2024
https://github.com/scanner-research/esper-tv
Esper instance for TV news analysis
big-data docker google-cloud video visualization
Last synced: 23 May 2024
https://github.com/Eventual-Inc/Daft
Distributed DataFrame for Python designed for the cloud, powered by Rust
big-data data-engineering data-science dataframe distributed-computing machine-learning python rust
Last synced: 22 May 2024
https://github.com/hugegraph/hugegraph
A graph database that supports 100+ billion data entries with high performance and scalability (includes OLTP Engine, REST API, and backends)
big-data database graph graph-database graphdb gremlin
Last synced: 21 May 2024
https://github.com/privefl/bigstatsr
R package for statistical tools with big matrices stored on disk.
big-data large-matrices memory-mapped-file parallel-computing r r-package statistical-methods
Last synced: 20 May 2024
https://github.com/mlcraft-io/mlcraft
Synmetrix – open source semantic layer / Boost your LLM precision
big-data bigquery business-intelligence clickhouse cube cubejs data-engineering databricks dremio druid firebolt llm prestodb redshift semantic-layer snowflake vertica
Last synced: 19 May 2024
https://github.com/shainakrumme/open-source-handbook
⭐️ Open source projects for all skill levels
advanced-project android awesome-list awesome-lists beginner-project big-data frameworks gaming intermediate-projects ios languages machine-learning open-source open-source-community open-source-project open-source-software projects security trending web-development
Last synced: 19 May 2024
https://github.com/databricks/koalas
Koalas: pandas API on Apache Spark
big-data data-science dataframe mlflow pandas pydata spark
Last synced: 18 May 2024
https://github.com/r-barnes/richdem
High-performance Terrain and Hydrology Analysis
big-data digital-elevation-model geosciences geospatial hydrologic-modeling hydrology
Last synced: 16 May 2024
https://github.com/rakam-io/rakam-api
📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)
analytics analytics-platform bi-server big-data java
Last synced: 16 May 2024
https://github.com/mtth/avsc
Avro for JavaScript ⚡
avro big-data binary-format encoding javascript schema-evolution serialization typescript
Last synced: 16 May 2024
https://github.com/selinon/selinon
An advanced distributed task flow management on top of Celery
big-data celery distributed-computing flow-management hacktoberfest hacktoberfest-accepted hacktoberfest2021 kubernetes openshift python schedule-tasks selinon task workflow-management workflow-management-system workflow-scheduler
Last synced: 16 May 2024
https://github.com/Hydrospheredata/mist
Serverless proxy for Spark cluster
apache-spark api big-data serverless
Last synced: 16 May 2024
https://github.com/andkret/Cookbook
The Data Engineering Cookbook
best-practices big-data cookbook data-engineer data-engineering
Last synced: 16 May 2024
https://github.com/gchq/Gaffer
A large-scale entity and relation database supporting aggregation of properties
accumulo aggregation big-data graph graph-database hadoop hbase parquet spark
Last synced: 15 May 2024
https://github.com/FeatureBaseDB/featurebase
A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase
big-data bitmap database go index pilosa sql
Last synced: 15 May 2024
https://github.com/datumbox/datumbox-framework
Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.
big-data data-science java machine-learning nlp statistics
Last synced: 15 May 2024