Projects in Awesome Lists tagged with bigdata
A curated list of projects in awesome lists tagged with bigdata.
https://github.com/taosdata/tdengine
High-performance, scalable time-series database designed for Industrial IoT (IIoT) scenarios
bigdata cloud-native cluster connected-vehicles database distributed financial-analysis industrial-iot iot metrics monitoring scalability sql tdengine time-series time-series-database tsdb
Last synced: 09 Sep 2025
https://github.com/taosdata/TDengine
High-performance, scalable time-series database designed for Industrial IoT (IIoT) scenarios
bigdata cloud-native cluster connected-vehicles database distributed financial-analysis industrial-iot iot metrics monitoring scalability sql tdengine time-series time-series-database tsdb
Last synced: 24 Mar 2025
https://github.com/apache/shardingsphere
Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.
bigdata data-encryption data-pipeline database database-cluster database-gateway database-middleware distributed-database distributed-sql-database distributed-transaction encrypt mysql postgresql read-write-splitting shard sql
Last synced: 09 Sep 2025
https://github.com/onurakpolat/awesome-bigdata
A curated list of awesome big data frameworks, resources and other awesomeness.
awesome awesome-list bigdata data data-analytics data-science data-stream data-visualization data-warehouse database distributed-database series-database stream-processing streaming-data visualize-data
Last synced: 22 Mar 2025
https://github.com/dataexpert-io/data-engineer-handbook
This is a repo with links to everything you'd ever want to learn about data engineering
apachespark awesome bigdata data dataengineering sql
Last synced: 28 Sep 2025
https://github.com/DataExpert-io/data-engineer-handbook
This is a repo with links to everything you'd ever want to learn about data engineering
apachespark awesome bigdata data dataengineering sql
Last synced: 04 Apr 2025
https://github.com/juicedata/juicefs
JuiceFS is a distributed POSIX file system built on top of Redis and S3.
bigdata cloud-native distributed-systems filesystem go golang hdfs object-storage posix redis s3 storage
Last synced: 12 May 2025
https://github.com/databendlabs/databend
Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
ai bigdata database lakehouse olap rust serverless snowflake
Last synced: 05 Jan 2026
https://github.com/vaexio/vaex
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
bigdata data-science dataframe hdf5 machine-learning machinelearning memory-mapped-file pyarrow python tabular-data visualization
Last synced: 12 Dec 2025
https://github.com/apache/hudi
Upserts, Deletes And Incremental Processing on Big Data.
apacheflink apachehudi apachespark bigdata data-integration datalake hudi incremental-processing stream-processing
Last synced: 12 May 2025
https://github.com/volcano-sh/volcano
A Cloud Native Batch System (Project under CNCF)
ai batch-systems bigdata gene golang hpc kubernetes machine-learning serving training
Last synced: 03 Oct 2025
https://github.com/dtstack/chunjun
A data integration framework
bigdata data-integration flink framework java
Last synced: 13 May 2025
https://github.com/DTStack/chunjun
A data integration framework
bigdata data-integration flink framework java
Last synced: 14 Mar 2025
https://github.com/igaowei/bigdataview
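100+ sets of eye-catching big data visualization dashboard HTML5 templates. Industries covered include community, property management, government affairs, transportation, finance and banking, and more. The newest, largest, most complete, and flashiest collection of big data visualization templates on the web. Updated continuously.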
bigdata bigdataviewer echarts html-template viewmodel
Last synced: 14 May 2025
https://github.com/liyupi/sql-generator
🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor. The project is simple (heavy on logic, light on UI) and well suited for practice.
ant-design bigdata hive javascript json monaco-editor mysql spark sql typescript vite vue vue3
Last synced: 14 May 2025
https://github.com/douban/dpark
Python clone of Spark, a MapReduce-like framework in Python
bigdata dpark mapreduce python spark stream-processing
Last synced: 29 Oct 2025
https://github.com/griddb/griddb
GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.
bigdata database fast griddb iot newsql nosql sql time-series timeseries
Last synced: 13 May 2025
https://github.com/dotnet/spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
analytics apache-spark azure bigdata csharp databricks dotnet dotnet-core dotnet-standard emr fsharp hdinsight machine-learning microsoft spark spark-sql spark-streaming streaming tpcds tpch
Last synced: 11 May 2025
https://github.com/dtstack/flinkstreamsql
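Based on open-source Flink, this project extends its real-time SQL. It mainly implements joins between streams and dimension tables and supports all native Flink SQL syntax.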
Last synced: 15 May 2025
https://github.com/DTStack/flinkStreamSQL
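Based on open-source Flink, this project extends its real-time SQL. It mainly implements joins between streams and dimension tables and supports all native Flink SQL syntax.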
Last synced: 27 Mar 2025
https://github.com/shzlw/poli
An easy-to-use BI server built for SQL lovers. Power data analysis in SQL and gain faster business insights.
bigdata business-intelligence dashboard data-visualization jdbc reactjs reporting spring-boot sql sql-editor
Last synced: 15 May 2025
https://github.com/byzer-org/byzer-lang
Byzer (formerly MLSQL): A low-code open-source programming language for data pipeline, analytics and AI.
bigdata machine-learning mlsql sql-like-dsl
Last synced: 15 May 2025
https://netflix.github.io/genie/
Distributed Big Data Orchestration Service
big-data bigdata cloud configuration configuration-management distributed-systems java microservice microservices netflix-oss netflixoss orchestration spring-boot
Last synced: 16 Nov 2025
https://github.com/netflix/genie
Distributed Big Data Orchestration Service
big-data bigdata cloud configuration configuration-management distributed-systems java microservice microservices netflix-oss netflixoss orchestration spring-boot
Last synced: 13 May 2025
https://github.com/Netflix/genie
Distributed Big Data Orchestration Service
big-data bigdata cloud configuration configuration-management distributed-systems java microservice microservices netflix-oss netflixoss orchestration spring-boot
Last synced: 04 Apr 2025
https://github.com/yoongikim/autocrawler
Google, Naver multiprocess image web crawler (Selenium)
bigdata chromedriver crawler customizable deep-learning google image-crawler multiprocess python selenium thread
Last synced: 06 Oct 2025
https://github.com/YoongiKim/AutoCrawler
Google, Naver multiprocess image web crawler (Selenium)
bigdata chromedriver crawler customizable deep-learning google image-crawler multiprocess python selenium thread
Last synced: 19 Apr 2025
https://github.com/rustfs/rustfs
🚀 High-performance distributed object storage, an alternative to MinIO.
bigdata cloud-native distributed-systems filesystem minio object-storage oss rust s3
Last synced: 25 Dec 2025
https://github.com/jadianes/spark-py-notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark
Last synced: 15 May 2025
https://github.com/hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 14 May 2025
https://github.com/tensorbase/tensorbase
TensorBase is a new big data warehouse built with modern efforts.
analytics bigdata data data-infrastructure data-warehouse database engineering high-performance infrastructure modern rust rust-lang warehouse
Last synced: 06 Apr 2025
https://github.com/opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
alerting bigdata data-catalog data-discovery data-engineering data-exploration data-governance data-lineage data-observability data-pipelines data-platform data-profiling data-quality data-science datacatalog lineage metadata metadata-management observability oss
Last synced: 15 May 2025
https://github.com/kubernetes-retired/kube-batch
A batch scheduler for Kubernetes for high-performance workloads, e.g. AI/ML, BigData, HPC
bigdata hpc k8s-sig-scheduling kubernetes machine-learning
Last synced: 29 Sep 2025
https://github.com/josonle/coding-now
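Notes from my studies, along with e-books and video resources I have gone through, and blogs, websites, and tools I have collected and consider worthwhile. Covers the major big data components, Python machine learning and data analysis, Linux, operating systems, algorithms, networking, and more.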
bigdata coding ebook-collection hadoop-hive java linux notes spark
Last synced: 16 May 2025
https://github.com/zeromicro/cds
Data syncing in golang for ClickHouse.
bigdata clickhouse go golang kafka-consumer
Last synced: 16 May 2025
https://github.com/apache/amoro
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
Last synced: 14 May 2025
https://github.com/apache/celeborn
Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
Last synced: 14 May 2025
https://github.com/microsoft/mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 14 May 2025
https://github.com/microsoft/Mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 08 Apr 2025
https://github.com/Microsoft/Mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 14 Mar 2025
https://github.com/apache/incubator-livy
Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.
Last synced: 12 May 2025
https://github.com/visualpython/visualpython
GUI-based Python code generator for data science, extension to Jupyter Lab, Jupyter Notebook and Google Colab.
bigdata chrome-extension code-generator data-analysis jupyter-lab-extension jupyter-notebook-extension jupyterlab-extension pandas python visual-coding
Last synced: 15 May 2025
https://github.com/pingcap/tispark
TiSpark is built for running Apache Spark on top of TiDB/TiKV
Last synced: 14 May 2025
https://github.com/jadianes/spark-movie-lens
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
big-data bigdata flask movie-recommendation movielens-dataset python spark
Last synced: 12 Apr 2025
https://github.com/fdv/running-elasticsearch-fun-profit
A book about running Elasticsearch
bigdata documentation ebook elasticsearch sysadmin
Last synced: 17 Nov 2025
https://github.com/intsmaze/flink-boot
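The Lazy Squirrel Flink-Boot scaffold lets Flink fully embrace the Spring ecosystem, so developers can write distributed stream-processing programs using the Java web development model. Lazy Squirrel makes crossing domains simpler. It aims to let developers quickly write business code at a lower onboarding cost, without needing to understand distributed computing theory or the details of the Flink framework. To further improve agility when building large projects with the scaffold, it integrates the Spring framework for bean management by default, and it also integrates frameworks commonly used in microservice and web development to speed up development further, such as the MyBatis ORM framework, the Hibernate Validator validation framework, and the Spring Retry framework. See the scaffold features below for details.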
bigdata flink flink-boot java java-flink mcv mybatis sping spring-boot spring-retry
Last synced: 27 Mar 2025
https://github.com/gearpump/gearpump
Lightweight real-time big data streaming engine over Akka
akka bigdata scala stream-processing
Last synced: 16 Dec 2025
https://github.com/bigartm/bigartm
Fast topic modeling platform
bigartm bigdata c-plus-plus machine-learning python python-api regularizer text-mining topic-modeling
Last synced: 08 Apr 2025
https://github.com/WeBankFinTech/WeDataSphere
WeDataSphere is a financial grade, one-stop big data platform suite.
analytics bigdata data-analysis datafabric datagovernance dataspherestudio exchangis flink hadoop hive ide linkis prophecis qualitis schedulis scriptis spark streamis visualis
Last synced: 27 Mar 2025
https://github.com/Seagate/cortx
CORTX Community Object Storage is 100% open source object storage uniquely optimized for mass capacity storage devices.
big-data bigdata cortx-community distributed-storage distributed-systems hackathons hacktoberfest hacktoberfest2020 inclusivity object-storage object-storage-service objectstorage objectstore open-source opensource s3 s3-storage software-defined-storage storage storage-api
Last synced: 30 Mar 2025
https://github.com/absaoss/spline
Data Lineage Tracking And Visualization Solution
bigdata hadoop lineage scala spark tracking visualization
Last synced: 16 May 2025
https://github.com/nationalsecurityagency/datawave
DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.
Last synced: 15 May 2025
https://github.com/unum-cloud/ustore
Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️
acid apache-arrow arrow big-data bigdata database dataloader document-database graph-database iouring json key-value-store knn-search networkx nosql pandas python search spdk vector-search
Last synced: 11 Apr 2025
https://github.com/simbafl/datawarehouse
From data warehouse to user profiling, from data construction to data application
bigdata datawarehouse olap presto sql userprofile
Last synced: 23 Apr 2025
https://github.com/AbsaOSS/spline
Data Lineage Tracking And Visualization Solution
bigdata hadoop lineage scala spark tracking visualization
Last synced: 04 Apr 2025
https://github.com/apconw/sanic-web
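A lightweight large-model application project (Large Model Data Assistant) that supports the full pipeline and is easy to extend. It supports large models such as DeepSeek/Qwen2.5. A one-stop large-model application development project built on Dify, Ollama&Vllm, Sanic, and Text2SQL 📊, with a modern UI built with Vue3, TypeScript, and Vite 5. It supports large-model-driven graphical data Q&A via ECharts 📈 and can handle table Q&A over CSV files 📂. It can also easily connect to third-party open-source RAG retrieval systems 🌐 to support broad general-knowledge Q&A.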
ai bigdata chat chatgpt deepseek-r1 dify echarts large-model-data-assistant llm ollama python qwen rag sanic text2sql vllm vue3
Last synced: 16 May 2025
https://github.com/mvillarrealb/docker-spark-cluster
A simple Spark standalone cluster for your testing environment purposes
bigdata developer-tools docker-compose spark
Last synced: 16 May 2025
https://github.com/simbafl/DataWarehouse
From data warehouse to user profiling, from data construction to data application
bigdata datawarehouse olap presto sql userprofile
Last synced: 27 Mar 2025
https://github.com/minio/sidekick
High Performance HTTP Sidecar Load Balancer
bigdata kubernetes load-balancer minio-servers proxy sidecar sidekick spark
Last synced: 20 Jun 2025
https://github.com/NationalSecurityAgency/datawave
DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.
Last synced: 01 Apr 2025
https://github.com/grailbio/bigslice
A serverless cluster computing system for the Go programming language
bigdata cluster computing etl go golang machinelearning mapreduce
Last synced: 21 Apr 2025
https://github.com/leesf/hudi-resources
A collection of Apache Hudi-related resources
apache apachehudi bigdata data-integration datalake hudi hudi-resources incremental-processing stream-processing
Last synced: 27 Mar 2025
https://github.com/nicgirault/circosjs
d3 library to build circular graphs
big-data bigdata bioinformatics bioinformatics-data circos circos-graphs circular d3js javascript
Last synced: 09 Apr 2025
https://github.com/nicgirault/circosJS
d3 library to build circular graphs
big-data bigdata bioinformatics bioinformatics-data circos circos-graphs circular d3js javascript
Last synced: 07 May 2025
https://github.com/rdkmaster/jigsaw
Jigsaw (七巧板) provides a set of web components based on Angular5/8/9+. Its main purpose is to help application developers build complex, highly interactive, user-friendly web pages. Jigsaw supports the development of all applications of ZTE's Big Data Product.
angular bigdata component jigsaw jigsaw-seed typescript webui zte
Last synced: 16 May 2025
https://github.com/Kotlin/kotlin-spark-api
This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x
bigdata kotlin nullability scala spark
Last synced: 13 May 2025
https://github.com/kotlin/kotlin-spark-api
This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x
bigdata kotlin nullability scala spark
Last synced: 12 Apr 2025
https://github.com/dromara/cloudeon
CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.
bigdata cloudnative doris hadoop hdfs kubernetes yarn
Last synced: 15 May 2025
https://github.com/dromara/CloudEon
CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.
bigdata cloudnative doris hadoop hdfs kubernetes yarn
Last synced: 04 Apr 2025
https://github.com/zhaoyachao/zdh_web
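A big data collection and extraction platform. zdh_web is the visual management platform for the zdh family of services, with modules for data collection, scheduling, permissions, approval workflows, private-domain marketing, and more.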
bigdata collection data data-collection datapipeline datax-web etl pipline scheduler spark sparketl
Last synced: 04 Apr 2025
https://github.com/pierre94/flink-notes
Flink study notes
bigdata flink flink-notes flinkx
Last synced: 04 Apr 2025
https://github.com/binghe001/bingheguide
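🔥🔥🔥 📚 This repository is a technical summary of the author Binghe's years of learning while working in development and architecture at major internet companies. It aims to provide a clear, detailed learning tutorial, with an emphasis on core Java content, underlying principles, architecture knowledge, and penetration-testing techniques. If this repository helps you, please show your support (follow, like, share)!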
aop asm bigdata bytecode ddd dubbo hacker ioc java javafx javassist mybatis mysql spring springboot springcloud springcloudalibaba
Last synced: 16 May 2025
https://github.com/davidesantangelo/api.rss
RSS as RESTful. This service allows you to transform an RSS feed into an awesome API.
api bigdata dandelion-api elasticsearch feed machine-learning rails rest-api rss rss-feed ruby semantic-web sidekiq
Last synced: 20 Apr 2025
https://github.com/dtstack/dt-sql-parser
SQL Parsers for BigData, built with antlr4.
antlr4 autocompletion bigdata flink hive impala mysql parser postgresql spark sql sql-validation trino
Last synced: 14 May 2025
https://github.com/sderosiaux/every-single-day-i-tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
akka architecture bigdata category-theory data-engineering ddd googlecloudplatform java javascript kafka kubernetes microservices reactjs scala spark technology watch
Last synced: 16 May 2025
https://github.com/sirkon/ldetool
Code generator for fast log file parsers
bigdata datamining log-parsing logs-analysis logs-parsing parsing parsing-csv
Last synced: 07 Apr 2025
https://github.com/kkyon/Simple-IT-English
Simple-IT-English: smart wordbook from community for community
bigdata dictonary english-learning english-word simple-it-english site
Last synced: 28 Mar 2025
https://github.com/curvineio/curvine
High-performance distributed cache system. Built with Rust.
ai ai-infra bigdata cache-storage cloud-native hdfs high-performance-computing io rust s3 shuffle spark train-acceleration
Last synced: 11 Aug 2025
https://github.com/DTStack/dt-sql-parser
SQL Parsers for BigData, built with antlr4.
antlr4 autocompletion bigdata flink hive impala mysql parser postgresql spark sql sql-validation trino
Last synced: 01 Apr 2025