Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/huangyueranbbc/SparkDemo

Comprehensive Spark example code demos (Java, Scala)

bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp

Last synced: 03 Jul 2024

https://github.com/intsmaze/flink-boot

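The Lazy Squirrel (懒松鼠) Flink-Boot scaffold brings Flink fully into the Spring ecosystem, so developers can build distributed stream-processing programs using the familiar Java web development model. It aims to lower the cost of getting started (no need to understand distributed computing theory or the details of the Flink framework) so that developers can quickly write business code. To further improve agility on large projects, the scaffold integrates the Spring framework by default for bean management, and also bundles frameworks commonly used in microservice and web development, such as the MyBatis ORM framework, the Hibernate Validator validation framework, and the Spring Retry framework. See the scaffold's feature list for details.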

bigdata flink flink-boot java java-flink mcv mybatis sping spring-boot spring-retry

Last synced: 03 Jul 2024

https://github.com/raystack/meteor

Meteor is an easy-to-use, plugin-driven metadata collection framework to extract data from different sources and sink to any data catalog.

bigdata collector data-catalog data-management dataops extractors metadata scraper sinks

Last synced: 01 Jul 2024

https://github.com/byzer-org/byzer-lang

Byzer (former MLSQL): A low-code open-source programming language for data pipeline, analytics and AI.

bigdata machine-learning mlsql sql-like-dsl

Last synced: 28 Jun 2024

https://github.com/pbreheny/biglasso

biglasso: Extending Lasso Model Fitting to Big Data in R

bigdata lasso out-of-core parallel-computing r

Last synced: 27 Jun 2024

https://github.com/pushshift/reddit_sse_stream

A Server Side Event stream to deliver Reddit comments and submissions in near real-time to a client.

bigdata flask reddit server-side-events sse stream

Last synced: 24 Jun 2024

https://github.com/AbsaOSS/spline

Data Lineage Tracking And Visualization Solution

bigdata hadoop lineage scala spark tracking visualization

Last synced: 21 Jun 2024

https://github.com/VulknData/vulkn

Love your Data. Love the Environment. Love VULKИ.

bigdata clickhouse dataops pandas python vulkn vulkndata

Last synced: 17 Jun 2024

https://github.com/labex-labs/bigdata-free-tutorials

[Practice 16 Big Data Free Tutorials] This repository collects 16 free tutorials for Big Data. It offers comprehensive tutorials and hands-on labs tailored for learners of all levels, from students to professionals and enthusiasts.

awesome awesome-list bigdata education free free-tutorials hands-on labex programming tutorials

Last synced: 13 Jun 2024

https://github.com/labex-labs/practice-bigdata-programming-courses

[Big Data Programming Courses] This repository collects 2 programming courses for Big Data.

awesome awesome-list bigdata courses education hands-on labex programming

Last synced: 13 Jun 2024

https://github.com/labex-labs/awesome-programming-projects

[Practice 844 Programming Projects] Awesome Programming Projects collects 844 programming projects for different tech stacks.

alibabacloud ansible awesome awesome-list bigdata cysec data-science devops docker education git hands-on kubernetes labex linux ml programming projects python web-development

Last synced: 13 Jun 2024

https://github.com/collabH/bigdata-growth

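A big data knowledge repository covering data warehouse modeling, real-time computing, big data, data middle platforms, system design, Java, algorithms, and more.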

bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 08 Jun 2024

https://github.com/pierre94/flink-notes

Flink study notes

bigdata flink flink-notes flinkx

Last synced: 07 Jun 2024

https://github.com/wangzhiwubigdata/God-Of-BigData

Focused on big data learning and interview preparation; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...

azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 07 Jun 2024

https://github.com/LinMingQiang/flink-learn

Learning Flink: Flink CEP, Flink Core, Flink SQL

bigdata flink sql stream

Last synced: 07 Jun 2024

https://github.com/DTStack/flinkStreamSQL

Extends the real-time SQL of open-source Flink. It mainly implements joins between streams and dimension tables, and supports all native Flink SQL syntax.

bigdata flink sql stream

Last synced: 07 Jun 2024

https://github.com/MoRan1607/BigDataGuide

Learn big data from scratch; includes study videos and interview materials for each stage of learning big data.

bigdata flink flume hadoop hbase hive javase kafka scala spark zookeeper

Last synced: 07 Jun 2024

https://github.com/zhaoyachao/zdh_web

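A big data collection and extraction platform. zdh_web is the visual management platform for the zdh family of services, with modules for data collection, scheduling, permissions, approval workflows, private-domain marketing, and more.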

bigdata collection data data-collection datapipeline datax-web etl pipline scheduler spark sparketl

Last synced: 07 Jun 2024

https://github.com/dromara/CloudEon

CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.

bigdata cloudnative doris hadoop hdfs kubernetes yarn

Last synced: 07 Jun 2024

https://github.com/martymac/fpart

Sort files and pack them into partitions

bigdata cpio data migration packing parallel rsync tar

Last synced: 07 Jun 2024

https://github.com/cubefs/compass

Compass is a task diagnosis platform for bigdata

airflow bigdata diagnose dolphinscheduler flink hadoop mapreduce scheduler spark sql

Last synced: 07 Jun 2024

https://github.com/grailbio/bigslice

A serverless cluster computing system for the Go programming language

bigdata cluster computing etl go golang machinelearning mapreduce

Last synced: 05 Jun 2024

https://github.com/fmarotta/fplyr

Apply Functions to Blocks of Files

bigdata cran r rstats

Last synced: 03 Jun 2024

https://github.com/minio/sidekick

High Performance HTTP Sidecar Load Balancer

bigdata kubernetes load-balancer minio-servers proxy sidecar sidekick spark

Last synced: 02 Jun 2024

https://github.com/iGaoWei/BigDataView

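100+ eye-catching big-screen HTML5 templates for big data visualization, covering industries such as community services, property management, government affairs, transportation, and finance and banking. Described by the author as the newest, largest, most complete, and coolest collection of big data visualization templates on the web. Updated continuously.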

bigdata bigdataviewer echarts html-template viewmodel

Last synced: 31 May 2024

https://github.com/www-zerocode-net-cn/ERD-Online

ERD Online is an online collaborative data warehouse design tool. It requires no local installation and lets you work with databases online. It is an excellent alternative to desktop data modeling tools.

bigdata collaborative data database design erd java lowcode metadata nocode online sql

Last synced: 31 May 2024

https://github.com/water8394/BigData-Interview

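:dart: :star2: [Big data interview questions] Big data interview questions collected from around the web, shared together with the author's own summarized answers. Currently covers the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.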

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 31 May 2024

https://github.com/liyupi/sql-generator

🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor. The project is simple (heavy on logic, light on pages) and well suited for practice.

ant-design bigdata hive javascript json monaco-editor mysql spark sql typescript vite vue vue3

Last synced: 30 May 2024

https://github.com/ricardolsmendes/datacatalog-custom-model-manager

Python package to load user-specified metadata models into Google Cloud Data Catalog, comprising Custom Entries, Tag Templates, and Tags

bigdata csv-import dataanalytics datacatalog datagovernance gcp gcp-datacatalog google-cloud python

Last synced: 27 May 2024

https://github.com/gangly/datafaker

Datafaker is a large-scale test data and flow test data generation tool. Datafaker fakes data and inserts it into various data sources.

bigdata datafaker fakedata faker hbase hive kafka mysql oracle postgresql python testing

Last synced: 26 May 2024

https://github.com/jadianes/spark-py-notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark

Last synced: 26 May 2024

https://github.com/NationalSecurityAgency/datawave

DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.

accumulo bigdata hacktoberfest java

Last synced: 18 May 2024

https://github.com/juicedata/juicefs

JuiceFS is a distributed POSIX file system built on top of Redis and S3.

bigdata cloud-native distributed-systems filesystem go golang hdfs object-storage posix redis s3 storage

Last synced: 13 May 2024

https://github.com/taosdata/TDengine

TDengine is an open source, high-performance, cloud native time-series database optimized for Internet of Things (IoT), Connected Cars, Industrial IoT and DevOps.

bigdata cloud-native cluster connected-vehicles database distributed financial-analysis industrial-iot iot metrics monitoring scalability sql tdengine time-series time-series-database tsdb

Last synced: 13 May 2024

https://github.com/rdkmaster/jigsaw

Jigsaw (七巧板) provides a set of web components based on Angular5/8/9+. Its main purpose is to help application developers build complex, interaction-intensive, user-friendly web pages. Jigsaw supports the development of all applications in ZTE's Big Data Product.

angular bigdata component jigsaw jigsaw-seed typescript webui zte

Last synced: 12 May 2024

https://github.com/davidesantangelo/api.rss

RSS as RESTful. This service allows you to transform an RSS feed into an awesome API.

api bigdata dandelion-api elasticsearch feed machine-learning rails rest-api rss rss-feed ruby semantic-web sidekiq

Last synced: 11 May 2024

https://github.com/volcano-sh/volcano

A Cloud Native Batch System (Project under CNCF)

batch-systems bigdata gene golang hpc kubernetes machine-learning

Last synced: 08 May 2024

https://github.com/ConservationInternational/resilienceatlas

Resilience Atlas - Evidence-based decision-making around resilience

bigdata climate-change conservation resilience sustainability

Last synced: 08 May 2024

https://github.com/griddb/griddb

GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.

bigdata database fast griddb iot newsql nosql sql time-series timeseries

Last synced: 07 May 2024

https://github.com/arvados/arvados

An open source platform for managing and analyzing biomedical big data

arvados aws azure bigdata bioinformatics cloud cluster cwl docker gcp genomics go python ruby workflow workflow-engine

Last synced: 07 May 2024

https://github.com/kubernetes-retired/kube-batch

A batch scheduler for Kubernetes for high-performance workloads, e.g. AI/ML, BigData, HPC

bigdata hpc k8s-sig-scheduling kubernetes machine-learning

Last synced: 07 May 2024

https://github.com/chatnoir-eu/chatnoir-resiliparse

A robust web archive analytics toolkit

bigdata cpp cython extraction htmlparser python warc web webarchive

Last synced: 07 May 2024

https://github.com/oldratlee/big-data-study

:whale: big data study

awesome big-data bigdata study

Last synced: 02 May 2024

https://github.com/Clustering4Ever/Clustering4Ever

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark

Last synced: 30 Apr 2024

https://github.com/datafuselabs/databend

Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com

ai bigdata database rust serverless snowflake

Last synced: 29 Apr 2024

https://github.com/vaexio/vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

bigdata data-science dataframe hdf5 machine-learning machinelearning memory-mapped-file pyarrow python tabular-data visualization

Last synced: 28 Apr 2024

https://github.com/apache/avro

Apache Avro is a data serialization system.

avro bigdata c cplusplus csharp dotnet java perl php python ruby rust

Last synced: 27 Apr 2024

https://github.com/alkihis/twitter-archive-reader

Full featured TypeScript Twitter archive reader and browser

big-data bigdata tweets twitter twitter-archives

Last synced: 27 Apr 2024

https://github.com/unum-cloud/ustore

Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️

acid apache-arrow arrow big-data bigdata database dataloader document-database graph-database iouring json key-value-store knn-search networkx nosql pandas python search spdk vector-search

Last synced: 26 Apr 2024

https://github.com/dbcli/athenacli

AthenaCLI is a CLI tool for AWS Athena service that can do auto-completion and syntax highlighting.

athena-cli autocompletion aws-athena aws-cli bigdata cli command-line python syntax-highlighting

Last synced: 25 Apr 2024

https://github.com/shouc/daudit

🌲 Configuration flaws detector for Hadoop, MongoDB, MySQL, and more!

auditing bigdata hadoop-spark mongodb redis security

Last synced: 22 Apr 2024

https://github.com/visualpython/visualpython

GUI-based Python code generator for data science, extension to Jupyter Lab, Jupyter Notebook and Google Colab.

bigdata chrome-extension code-generator data-analysis jupyter-lab-extension jupyter-notebook-extension jupyterlab-extension pandas python visual-coding

Last synced: 22 Apr 2024

https://github.com/rapiddweller/rapiddweller-benerator-ce

BENERATOR is a leading software solution to generate, obfuscate, pseudonymize and migrate data for development, testing, and training purposes with a model-driven approach.

anonymization benerator big-data bigdata data data-generation data-masking data-modelling database databene faker java masking mass-data-migration migration model-driven obfuscate performance-testing synthetic-data testdata

Last synced: 14 Apr 2024

https://github.com/shzlw/poli

An easy-to-use BI server built for SQL lovers. Power data analysis in SQL and gain faster business insights.

bigdata business-intelligence dashboard data-visualization jdbc reactjs reporting spring-boot sql sql-editor

Last synced: 14 Apr 2024

https://github.com/apache/amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.

bigdata datalake lakehouse

Last synced: 11 Apr 2024

https://github.com/apache/incubator-livy

Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

apachelivy bigdata livy spark

Last synced: 11 Apr 2024

https://github.com/Kotlin/kotlin-spark-api

This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x

bigdata kotlin nullability scala spark

Last synced: 11 Apr 2024

https://github.com/gearpump/gearpump

Lightweight real-time big data streaming engine over Akka

akka bigdata scala stream-processing

Last synced: 11 Apr 2024

https://github.com/wavestone-cdt/hadoop-attack-library

A collection of pentest tools and resources targeting Hadoop environments

bigdata hadoop pentest

Last synced: 05 Apr 2024

https://github.com/apache/incubator-amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.

bigdata datalake lakehouse

Last synced: 02 Apr 2024

https://github.com/Xmader/musescore-dataset

The dataset of all music sheets and users on musescore.com (unmaintained/discontinued since Sep 30, 2021)

bigdata dataset

Last synced: 01 Apr 2024

https://github.com/YaohuiZeng/biglasso

biglasso: Extending Lasso Model Fitting to Big Data in R

bigdata lasso out-of-core parallel-computing r

Last synced: 31 Mar 2024

https://github.com/apache/hudi

Upserts, Deletes And Incremental Processing on Big Data.

apacheflink apachehudi apachespark bigdata data-integration datalake hudi incremental-processing stream-processing

Last synced: 31 Mar 2024

https://github.com/sirkon/ldetool

Code generator for fast log file parsers

bigdata datamining log-parsing logs-analysis logs-parsing parsing parsing-csv

Last synced: 30 Mar 2024

https://github.com/apache/shardingsphere

Distributed SQL transaction & query engine for data sharding, scaling, encryption, and more - on any database.

bigdata database database-cluster database-plus dba distributed-database distributed-sql-database distributed-transactions encrypt hacktoberfest mysql oltp postgresql rdbms shard sql

Last synced: 28 Mar 2024

https://github.com/apache/celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.

bigdata shuffle spark

Last synced: 27 Mar 2024

https://github.com/douban/dpark

Python clone of Spark, a MapReduce-like framework in Python

bigdata dpark mapreduce python spark stream-processing

Last synced: 26 Mar 2024

https://github.com/zeromicro/cds

Data syncing in golang for ClickHouse.

bigdata clickhouse go golang kafka-consumer

Last synced: 22 Mar 2024

https://github.com/BaseMax/LaravelBigDataTest

PHP Laravel: Develop a test environment in Laravel with more than 20 million user rows. (One project in Laravel Blade and another SPA in Vue.js with infinite scroll)

bigdata database-testing laravel laravel-test laravel8 php php8

Last synced: 14 Mar 2024

https://github.com/will-che/BigData-Interview

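:dart: :star2: [Big data interview questions] Big data interview questions collected from around the web, shared together with the author's own summarized answers. Currently covers the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.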

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 13 Mar 2024

https://github.com/simbafl/DataWarehouse

From data warehouse to user profiling, from data construction to data application

bigdata datawarehouse olap presto sql userprofile

Last synced: 13 Mar 2024

https://github.com/WeBankFinTech/WeBank-all-Project

A collection of links to all the projects that WeBank has established or participated in.

ai bigdata blockchain could dpr fate finance frontend linkis spark

Last synced: 13 Mar 2024

https://github.com/kkyon/Simple-IT-English

Simple-IT-English: smart wordbook from community for community

bigdata dictonary english-learning english-word simple-it-english site

Last synced: 13 Mar 2024

https://github.com/DTStack/chunjun

A data integration framework

bigdata data-integration flink framework java

Last synced: 13 Mar 2024