Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/huangyueranbbc/SparkDemo
Comprehensive Spark example code (Java, Scala)
bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp
Last synced: 03 Jul 2024
https://github.com/intsmaze/flink-boot
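The Lazy Squirrel Flink-Boot scaffold lets Flink fully embrace the Spring ecosystem, so developers can build distributed stream-processing programs in the style of Java web development. Lazy Squirrel aims to lower the onboarding cost: developers can quickly write business code without understanding distributed computing theory or the details of the Flink framework. To make large projects more agile to develop, the scaffold integrates the Spring framework for bean management by default and bundles frameworks commonly used in microservice and web development, such as the MyBatis ORM framework, the Hibernate Validator validation framework, and the Spring Retry framework. See the scaffold's feature list for details.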
bigdata flink flink-boot java java-flink mcv mybatis sping spring-boot spring-retry
Last synced: 03 Jul 2024
https://github.com/raystack/meteor
Meteor is an easy-to-use, plugin-driven metadata collection framework to extract data from different sources and sink to any data catalog.
bigdata collector data-catalog data-management dataops extractors metadata scraper sinks
Last synced: 01 Jul 2024
https://github.com/byzer-org/byzer-lang
Byzer (formerly MLSQL): A low-code open-source programming language for data pipelines, analytics and AI.
bigdata machine-learning mlsql sql-like-dsl
Last synced: 28 Jun 2024
https://github.com/pbreheny/biglasso
biglasso: Extending Lasso Model Fitting to Big Data in R
bigdata lasso out-of-core parallel-computing r
Last synced: 27 Jun 2024
https://github.com/KennethanCeyer/awesome-data-pipeline
Awesome list for data pipelines
architecture awesome awesome-list big-data bigdata cloud data data-engineering dataeng datalake datapipeline datawarehouse hadoop hive opensource query spark
Last synced: 25 Jun 2024
https://github.com/pushshift/reddit_sse_stream
A Server Side Event stream to deliver Reddit comments and submissions in near real-time to a client.
bigdata flask reddit server-side-events sse stream
Last synced: 24 Jun 2024
https://github.com/AbsaOSS/spline
Data Lineage Tracking And Visualization Solution
bigdata hadoop lineage scala spark tracking visualization
Last synced: 21 Jun 2024
https://github.com/VulknData/vulkn
Love your Data. Love the Environment. Love VULKИ.
bigdata clickhouse dataops pandas python vulkn vulkndata
Last synced: 17 Jun 2024
https://github.com/godaai/flink-book-zh
Flink Tutorial Project
bigdata flink flink-examples flink-stream-processing
Last synced: 16 Jun 2024
https://github.com/meiyulee/MathAI
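Free data-driven AI for mathematical modeling | Build mathematical models for the patterns in your numbers | Portable, no-install software written in C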
ai artifical-intelligence bigdata chatgpt data-science dataanalytics datadriven math-ai mathai mathematical-modelling mathematics mathgpt numerical-computation numerical-methods portable regression regression-analysis regression-models science statistics-modeling
Last synced: 14 Jun 2024
https://github.com/YoongiKim/AutoCrawler
Google, Naver multiprocess image web crawler (Selenium)
bigdata chromedriver crawler customizable deep-learning google image-crawler multiprocess python selenium thread
Last synced: 14 Jun 2024
https://github.com/labex-labs/bigdata-free-tutorials
[Practice 16 Big Data Free Tutorials] This repository collects 16 free tutorials for Big Data. It offers comprehensive tutorials and hands-on labs tailored for learners of all levels, from students to professionals and enthusiasts.
awesome awesome-list bigdata education free free-tutorials hands-on labex programming tutorials
Last synced: 13 Jun 2024
https://github.com/labex-labs/practice-bigdata-programming-courses
[Big Data Programming Courses] This repository collects 2 programming courses for Big Data.
awesome awesome-list bigdata courses education hands-on labex programming
Last synced: 13 Jun 2024
https://github.com/labex-labs/awesome-programming-projects
[Practice 844 Programming Projects] Awesome Programming Projects collects 844 programming projects for different tech stacks.
alibabacloud ansible awesome awesome-list bigdata cysec data-science devops docker education git hands-on kubernetes labex linux ml programming projects python web-development
Last synced: 13 Jun 2024
https://github.com/tensorbase/tensorbase
TensorBase is a new big data warehouse built with a modern approach.
analytics bigdata data data-infrastructure data-warehouse database engineering high-performance infrastructure modern rust rust-lang warehouse
Last synced: 11 Jun 2024
https://github.com/pierre94/flink-notes
Flink study notes
bigdata flink flink-notes flinkx
Last synced: 07 Jun 2024
https://github.com/leesf/hudi-resources
A collection of Apache Hudi resources
apache apachehudi bigdata data-integration datalake hudi hudi-resources incremental-processing stream-processing
Last synced: 07 Jun 2024
https://github.com/LinMingQiang/flink-learn
Learning Flink: Flink CEP, Flink Core, Flink SQL
Last synced: 07 Jun 2024
https://github.com/DTStack/flinkStreamSQL
Extends the real-time SQL of open-source Flink. It mainly implements joins between streams and dimension tables, and supports all native Flink SQL syntax.
Last synced: 07 Jun 2024
https://github.com/zhaoyachao/zdh_web
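Big data collection and extraction platform. zdh_web is the visual management platform for the zdh family of services, with modules for data collection, scheduling, permissions, approval workflows, and private-domain marketing.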
bigdata collection data data-collection datapipeline datax-web etl pipline scheduler spark sparketl
Last synced: 07 Jun 2024
https://github.com/luweizheng/flink-tutorials
Flink Tutorial Project
bigdata flink flink-examples flink-stream-processing
Last synced: 07 Jun 2024
https://github.com/dromara/CloudEon
CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underlying resource management and maintenance.
bigdata cloudnative doris hadoop hdfs kubernetes yarn
Last synced: 07 Jun 2024
https://github.com/grailbio/bigslice
A serverless cluster computing system for the Go programming language
bigdata cluster computing etl go golang machinelearning mapreduce
Last synced: 05 Jun 2024
https://github.com/DTStack/dt-sql-parser
SQL Parsers for BigData, built with antlr4.
antlr4 autocompletion bigdata flink hive impala mysql parser postgresql spark sql sql-validation trino
Last synced: 05 Jun 2024
https://github.com/opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
alerting bigdata data-catalog data-discovery data-engineering data-exploration data-governance data-lineage data-observability data-pipelines data-platform data-profiling data-quality data-science datacatalog lineage metadata metadata-management observability oss
Last synced: 02 Jun 2024
https://github.com/minio/sidekick
High Performance HTTP Sidecar Load Balancer
bigdata kubernetes load-balancer minio-servers proxy sidecar sidekick spark
Last synced: 02 Jun 2024
https://github.com/iGaoWei/BigDataView
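100+ eye-catching HTML5 big-screen templates for big data visualization, covering industries such as community, property management, government affairs, transportation, and finance and banking. The newest, largest, most complete, coolest, and flashiest collection of big data visualization templates on the web. Updated continuously.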
bigdata bigdataviewer echarts html-template viewmodel
Last synced: 31 May 2024
https://github.com/www-zerocode-net-cn/ERD-Online
ERD Online is an online collaborative data warehouse design tool. It requires no local installation and lets you work with databases online. It is an excellent alternative to desktop data modeling tools.
bigdata collaborative data database design erd java lowcode metadata nocode online sql
Last synced: 31 May 2024
https://github.com/liyupi/sql-generator
🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor. A simple project (heavy on logic, light on UI) that is well suited for practice.
ant-design bigdata hive javascript json monaco-editor mysql spark sql typescript vite vue vue3
Last synced: 30 May 2024
https://github.com/ricardolsmendes/datacatalog-custom-model-manager
Python package to load user-specified metadata models into Google Cloud Data Catalog, comprising Custom Entries, Tag Templates, and Tags
bigdata csv-import dataanalytics datacatalog datagovernance gcp gcp-datacatalog google-cloud python
Last synced: 27 May 2024
https://github.com/jadianes/spark-py-notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark
Last synced: 26 May 2024
https://github.com/NationalSecurityAgency/datawave
DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.
accumulo bigdata hacktoberfest java
Last synced: 18 May 2024
https://github.com/onurakpolat/awesome-bigdata
A curated list of awesome big data frameworks, resources and other awesomeness.
awesome awesome-list bigdata data data-analytics data-science data-stream data-visualization data-warehouse database distributed-database series-database stream-processing streaming-data visualize-data
Last synced: 18 May 2024
https://github.com/exajobs/data-engineering-collection
A collection of awesome software, libraries, Learning Tutorials, documents, books, resources and interesting stuff about Big Data Science & Engineering
awesome-list big-data big-data-analytics bigdata bigdata-module data-science data-scientists data-structures data-visualization database-deployment database-design database-development database-migrations databases engineering hadoop query series-data streaming-data
Last synced: 14 May 2024
https://github.com/mikeroyal/Apache-Kafka-Guide
Apache Kafka Guide
awesome awesome-kafka awesome-list awesome-readme big-data bigdata data-engineering kafka kafka-connect kafka-consumer kafka-producer kafka-streams
Last synced: 14 May 2024
https://github.com/juicedata/juicefs
JuiceFS is a distributed POSIX file system built on top of Redis and S3.
bigdata cloud-native distributed-systems filesystem go golang hdfs object-storage posix redis s3 storage
Last synced: 13 May 2024
https://netflix.github.io/genie/
Distributed Big Data Orchestration Service
big-data bigdata cloud configuration configuration-management distributed-systems java microservice microservices netflix-oss netflixoss orchestration spring-boot
Last synced: 13 May 2024
https://github.com/taosdata/TDengine
TDengine is an open source, high-performance, cloud native time-series database optimized for Internet of Things (IoT), Connected Cars, Industrial IoT and DevOps.
bigdata cloud-native cluster connected-vehicles database distributed financial-analysis industrial-iot iot metrics monitoring scalability sql tdengine time-series time-series-database tsdb
Last synced: 13 May 2024
https://github.com/rdkmaster/jigsaw
Jigsaw (七巧板) provides a set of web components based on Angular5/8/9+. Its main purpose is to help application developers build complex, highly interactive, user-friendly web pages. Jigsaw supports the development of all applications in ZTE's Big Data Product.
angular bigdata component jigsaw jigsaw-seed typescript webui zte
Last synced: 12 May 2024
https://github.com/bigartm/bigartm
Fast topic modeling platform
bigartm bigdata c-plus-plus machine-learning python python-api regularizer text-mining topic-modeling
Last synced: 12 May 2024
https://github.com/davidesantangelo/api.rss
RSS as RESTful. This service allows you to transform an RSS feed into an awesome API.
api bigdata dandelion-api elasticsearch feed machine-learning rails rest-api rss rss-feed ruby semantic-web sidekiq
Last synced: 11 May 2024
https://github.com/volcano-sh/volcano
A Cloud Native Batch System (Project under CNCF)
batch-systems bigdata gene golang hpc kubernetes machine-learning
Last synced: 08 May 2024
https://github.com/nicgirault/circosJS
d3 library to build circular graphs
big-data bigdata bioinformatics bioinformatics-data circos circos-graphs circular d3js javascript
Last synced: 08 May 2024
https://github.com/ConservationInternational/resilienceatlas
Resilience Atlas - Evidence-based decision-making around resilience
bigdata climate-change conservation resilience sustainability
Last synced: 08 May 2024
https://github.com/griddb/griddb
GridDB is a next-generation open source database that makes time series IoT and big data fast and easy.
bigdata database fast griddb iot newsql nosql sql time-series timeseries
Last synced: 07 May 2024
https://github.com/kubernetes-retired/kube-batch
A batch scheduler for Kubernetes for high-performance workloads, e.g. AI/ML, BigData, HPC
bigdata hpc k8s-sig-scheduling kubernetes machine-learning
Last synced: 07 May 2024
https://github.com/chatnoir-eu/chatnoir-resiliparse
A robust web archive analytics toolkit
bigdata cpp cython extraction htmlparser python warc web webarchive
Last synced: 07 May 2024
https://github.com/Microsoft/Mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 05 May 2024
https://github.com/oldratlee/big-data-study
:whale: big data study
awesome big-data bigdata study
Last synced: 02 May 2024
https://github.com/Clustering4Ever/Clustering4Ever
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark
Last synced: 30 Apr 2024
https://github.com/datafuselabs/databend
Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
ai bigdata database rust serverless snowflake
Last synced: 29 Apr 2024
https://github.com/vaexio/vaex
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
bigdata data-science dataframe hdf5 machine-learning machinelearning memory-mapped-file pyarrow python tabular-data visualization
Last synced: 28 Apr 2024
https://github.com/hi-primus/optimus
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 28 Apr 2024
https://github.com/alkihis/twitter-archive-reader
Full featured TypeScript Twitter archive reader and browser
big-data bigdata tweets twitter twitter-archives
Last synced: 27 Apr 2024
https://github.com/Seagate/cortx
CORTX Community Object Storage is 100% open source object storage uniquely optimized for mass capacity storage devices.
big-data bigdata cortx-community distributed-storage distributed-systems hackathons hacktoberfest hacktoberfest2020 inclusivity object-storage object-storage-service objectstorage objectstore open-source opensource s3 s3-storage software-defined-storage storage storage-api
Last synced: 26 Apr 2024
https://github.com/unum-cloud/ustore
Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️
acid apache-arrow arrow big-data bigdata database dataloader document-database graph-database iouring json key-value-store knn-search networkx nosql pandas python search spdk vector-search
Last synced: 26 Apr 2024
https://github.com/dbcli/athenacli
AthenaCLI is a CLI tool for AWS Athena service that can do auto-completion and syntax highlighting.
athena-cli autocompletion aws-athena aws-cli bigdata cli command-line python syntax-highlighting
Last synced: 25 Apr 2024
https://github.com/shouc/daudit
🌲 Configuration flaws detector for Hadoop, MongoDB, MySQL, and more!
auditing bigdata hadoop-spark mongodb redis security
Last synced: 22 Apr 2024
https://github.com/visualpython/visualpython
GUI-based Python code generator for data science, extension to Jupyter Lab, Jupyter Notebook and Google Colab.
bigdata chrome-extension code-generator data-analysis jupyter-lab-extension jupyter-notebook-extension jupyterlab-extension pandas python visual-coding
Last synced: 22 Apr 2024
https://github.com/rapiddweller/rapiddweller-benerator-ce
BENERATOR is a leading software solution to generate, obfuscate, pseudonymize and migrate data for development, testing, and training purposes with a model-driven approach.
anonymization benerator big-data bigdata data data-generation data-masking data-modelling database databene faker java masking mass-data-migration migration model-driven obfuscate performance-testing synthetic-data testdata
Last synced: 14 Apr 2024
https://github.com/shzlw/poli
An easy-to-use BI server built for SQL lovers. Power data analysis in SQL and gain faster business insights.
bigdata business-intelligence dashboard data-visualization jdbc reactjs reporting spring-boot sql sql-editor
Last synced: 14 Apr 2024
https://github.com/apache/amoro
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
Last synced: 11 Apr 2024
https://github.com/apache/incubator-livy
Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.
Last synced: 11 Apr 2024
https://github.com/dotnet/spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
analytics apache-spark azure bigdata csharp databricks dotnet dotnet-core dotnet-standard emr fsharp hdinsight machine-learning microsoft spark spark-sql spark-streaming streaming tpcds tpch
Last synced: 11 Apr 2024
https://github.com/Kotlin/kotlin-spark-api
This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x
bigdata kotlin nullability scala spark
Last synced: 11 Apr 2024
https://github.com/gearpump/gearpump
Lightweight real-time big data streaming engine over Akka
akka bigdata scala stream-processing
Last synced: 11 Apr 2024
https://github.com/wavestone-cdt/hadoop-attack-library
A collection of pentest tools and resources targeting Hadoop environments
Last synced: 05 Apr 2024
https://github.com/apache/incubator-amoro
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
Last synced: 02 Apr 2024
https://github.com/Xmader/musescore-dataset
The dataset of all music sheets and users on musescore.com (unmaintained/discontinued since Sep 30, 2021)
Last synced: 01 Apr 2024
https://github.com/YaohuiZeng/biglasso
biglasso: Extending Lasso Model Fitting to Big Data in R
bigdata lasso out-of-core parallel-computing r
Last synced: 31 Mar 2024
https://github.com/apache/hudi
Upserts, Deletes And Incremental Processing on Big Data.
apacheflink apachehudi apachespark bigdata data-integration datalake hudi incremental-processing stream-processing
Last synced: 31 Mar 2024
https://github.com/sirkon/ldetool
Code generator for fast log file parsers
bigdata datamining log-parsing logs-analysis logs-parsing parsing parsing-csv
Last synced: 30 Mar 2024
https://github.com/apache/shardingsphere
Distributed SQL transaction & query engine for data sharding, scaling, encryption, and more - on any database.
bigdata database database-cluster database-plus dba distributed-database distributed-sql-database distributed-transactions encrypt hacktoberfest mysql oltp postgresql rdbms shard sql
Last synced: 28 Mar 2024
https://github.com/apache/celeborn
Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
Last synced: 27 Mar 2024
https://github.com/douban/dpark
Python clone of Spark, a MapReduce-like framework in Python
bigdata dpark mapreduce python spark stream-processing
Last synced: 26 Mar 2024
https://github.com/Netflix/genie
Distributed Big Data Orchestration Service
big-data bigdata cloud configuration configuration-management distributed-systems java microservice microservices netflix-oss netflixoss orchestration spring-boot
Last synced: 23 Mar 2024
https://github.com/zeromicro/cds
Data syncing in golang for ClickHouse.
bigdata clickhouse go golang kafka-consumer
Last synced: 22 Mar 2024
https://github.com/AivanF/Lemuras
A small Python library to deal with big tables
bigdata data-analysis html ipython-notebook join-tables json jupyter-notebook pandas pivot-tables python sql table
Last synced: 18 Mar 2024
https://github.com/BaseMax/LaravelBigDataTest
PHP Laravel: Develop a test environment in Laravel with more than 20 million user rows. (One project in Laravel Blade and another SPA in Vue.js with infinite scroll)
bigdata database-testing laravel laravel-test laravel8 php php8
Last synced: 14 Mar 2024
https://github.com/WeBankFinTech/WeDataSphere
WeDataSphere is a financial grade, one-stop big data platform suite.
analytics bigdata data-analysis datafabric datagovernance dataspherestudio exchangis flink hadoop hive ide linkis prophecis qualitis schedulis scriptis spark streamis visualis
Last synced: 13 Mar 2024
https://github.com/simbafl/DataWarehouse
From data warehouse to user profiling, from data construction to data application
bigdata datawarehouse olap presto sql userprofile
Last synced: 13 Mar 2024
https://github.com/kkyon/Simple-IT-English
Simple-IT-English: smart wordbook from community for community
bigdata dictonary english-learning english-word simple-it-english site
Last synced: 13 Mar 2024
https://github.com/GZTipDM/TipDM
TipDM modeling platform, an open-source data mining tool.
bigdata data-analysis data-analysis-python data-mining graph-schedule machine-learning tensorflow workflow
Last synced: 13 Mar 2024
https://github.com/DTStack/chunjun
A data integration framework
bigdata data-integration flink framework java
Last synced: 13 Mar 2024