
An open API service indexing awesome lists of open source software.


java spark springboot-spark

Last synced: 03 Jul 2024

Complete Spark example code (Java, Scala)

bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp

Last synced: 03 Jul 2024

Web UI for Trino, Hive and SparkSQL

elasticsearch hive spark trino

Last synced: 03 Jul 2024

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox

Last synced: 03 Jul 2024

Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display

azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system

Last synced: 03 Jul 2024

Book recommender system using collaborative filtering based on Spark

collaborative-filtering python-flask recommendation-system spark

Last synced: 02 Jul 2024

:ship: Docker image for Apache Spark

docker hadoop java scala spark

Last synced: 02 Jul 2024

New Generation Opensource Data Stack Demo

cube dagster datahub dbt iceberg metabase python spark spark-sql trino trinodb

Last synced: 02 Jul 2024

Generic Data Ingestion & Dispersal Library for Hadoop

avro-schema data-lake hadoop ingest-data schema-format spark

Last synced: 02 Jul 2024

Ytk-learn is a distributed machine learning library that implements most popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).

distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark

Last synced: 27 Jun 2024

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci

Last synced: 27 Jun 2024

Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

apache-spark feature-engineering machinelearning ml pyspark scala spark

Last synced: 24 Jun 2024

Official code repository for GATK versions 4 and up

bioinformatics dna gatk genome genomics ngs science sequencing spark

Last synced: 23 Jun 2024

Data Lineage Tracking And Visualization Solution

bigdata hadoop lineage scala spark tracking visualization

Last synced: 21 Jun 2024

Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch

elasticsearch ibmcode jupyter python spark

Last synced: 20 Jun 2024

:snowflake: :whale: Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc

ai analytics big-data cognitive-science data-science docker kubeflow kubernetes kubernetes-ai kubernetes-analytics kubernetes-data-science kubernetes-ml ml pachyderm python-ml scala seldon-core spark spark-kubernetes spark-ml

Last synced: 20 Jun 2024

Run spark calculations from Ammonite

ammonite scala spark

Last synced: 17 Jun 2024

Data Quality Monitoring Tool

data-quality monitoring scala spark

Last synced: 17 Jun 2024

Scalable knowledge graph construction from unstructured text

corenlp neo4j spark

Last synced: 17 Jun 2024

This is an ETL application on AWS that works with general open sales and customer data, distributed as a zipped file of .csv files to which we apply transformations.

airflow argocd athena aws catalog data data-engineer database emr emr-cluster etl glue kubernetes pipeline postgres pyspark rds spark

Last synced: 16 Jun 2024

All-in-one Dockerfile for big data components

big-data dockerfile hadoop kafka spark

Last synced: 16 Jun 2024

A concise resource repository for machine learning

data-analysis data-science deep-learning kaggle machine-learning python spark

Last synced: 16 Jun 2024

An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR

airflow docker hadoop spark

Last synced: 15 Jun 2024

Demonstrating Confidential Analytics on Azure SGX VMs with Apache Spark and SCONE.

azure azure-sql-database docker kubernetes sgx spark

Last synced: 15 Jun 2024

An encrypted data analytics platform

analytics enclave machine-learning privacy security spark spark-sql

Last synced: 15 Jun 2024

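JDK 8-based AI chatbot! WeChat official account: Midjourney image generation and card-key redemption. Web: supports ChatGPT, Midjourney image generation, Stable Diffusion image generation, card-key redemption, EasyPay (易支付), official-account traffic referral, and email registration 🔥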

bard-api chatgpt google midjourney-api spark stable-diffusion

Last synced: 14 Jun 2024

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

data-lake hadoop kylo nifi spark teradata

Last synced: 14 Jun 2024

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

arrow-datafusion big-data data-engineering execution-engine rust spark sql

Last synced: 13 Jun 2024

The Internals of Apache Spark

apache-spark book internals spark

Last synced: 13 Jun 2024

Distributed Transaction Settlement System

hadoop kafka se347 spark zookeeper

Last synced: 12 Jun 2024

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS

aws-eks eks jupyterhub kubeflow kubernetes ml mlflow ray spark terraform

Last synced: 12 Jun 2024

Different ways to process data into Cassandra in realtime with technologies such as Kafka, Spark, Akka, Flink

akka cassandra flink flink-stream-processing flink-streaming kafka kafka-connect spark spark-streaming

Last synced: 12 Jun 2024

A free tutorial for Apache Spark.

jupyter scala spark tutorial

Last synced: 12 Jun 2024

Learning Apache Spark, including code and data. Most parts can run locally.

learning ml spark sparkcore

Last synced: 11 Jun 2024

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak

apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper

Last synced: 11 Jun 2024

Step-by-step guide on how to install Zeppelin 0.7.3 on the Linux subsystem (WSL) for Windows 10

hadoop linux-subsystem lxss spark wsl zeppelin

Last synced: 10 Jun 2024

Overlapping community detection in large-scale networks using the BigCLAM model, built on Apache Spark

apache-spark bigclam bigclam-model community-detection graph-mining graphx large-scale latex machine-learning scala scale-networks spark

Last synced: 09 Jun 2024


bigdata bigdatalearning debezium flink hadoop hbase hdfs hive hudi kafka kudu mapreduce olap spark

Last synced: 08 Jun 2024

🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.

big-data computer-vision hadoop image-processing spark

Last synced: 07 Jun 2024

An RPC framework leveraging the Spark RPC module

rpc spark

Last synced: 07 Jun 2024

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data gpu rapids spark

Last synced: 07 Jun 2024


flink hadoop hbase hive presto spark

Last synced: 07 Jun 2024


azkaban bigdata flink flume hadoop hbase hdfs hive kafka spark zookeeper

Last synced: 07 Jun 2024

The Internals of Spark on Kubernetes

apache-spark book internals kubernetes spark

Last synced: 07 Jun 2024

REST job server for Apache Spark

hadoop hive java kerberos kubernetes spark yarn

Last synced: 07 Jun 2024

webln browser extension for lightningd/eclair/ptarmigan

c-lightning eclair lightning-network lightningd ptarmigan spark web-extension webln

Last synced: 07 Jun 2024


bigdata flink flume hadoop hbase hive javase kafka scala spark zookeeper

Last synced: 07 Jun 2024


bigdata collection data data-collection datapipeline datax-web etl pipline scheduler spark sparketl

Last synced: 07 Jun 2024

Cloud Shuffle Service(CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.

flink hadoop-mapreduce spark

Last synced: 07 Jun 2024

Uniffle is a high performance, general purpose Remote Shuffle Service.

mapreduce remote-shuffle-service rss shuffle spark tez

Last synced: 07 Jun 2024

The next generation of cloud-native big data management expert, which aims to help users rapidly build stable, efficient, and scalable cloud-native platforms for big data.

cloudnative doris easy-to-use kubernetes spark yarn

Last synced: 07 Jun 2024

Data lineage: supports lineage for Spark SQL, Hive SQL, PostgreSQL SQL, Presto SQL, MySQL SQL, TiDB SQL, Flink SQL, and DataX, plus lineage parsing of Spark/Flink JAR run commands; supports WITH syntax

datax flink hive mysql postgresql presto spark tidb trino

Last synced: 07 Jun 2024

Compass is a task diagnosis platform for bigdata

airflow bigdata diagnose dolphinscheduler flink hadoop mapreduce scheduler spark sql

Last synced: 07 Jun 2024

Hyper performant kNN using Annoy for Apache Spark.

ann annoy apache-spark k-nearest-neighbors k-nearest-neighbours knn ml spark

Last synced: 07 Jun 2024

This is a solution accelerator for creating personalized content recommendations based on user activity.

azure-synapse-analytics power-bi spark

Last synced: 04 Jun 2024

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.

acceleration analytics big-data databases indexing spark

Last synced: 04 Jun 2024

Node.js bindings for Apache Spark DataFrame APIs

data-frame node spark

Last synced: 03 Jun 2024

ApacheCN open source organization: announcements, introduction, members, activities, and contact channels

dl ml python pytorch solidity spark

Last synced: 03 Jun 2024

Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

automation data data-science extensible flyte flyte-tasks hacktoberfest mlops pypi python sdk spark workflows

Last synced: 02 Jun 2024

Nessie: Transactional Catalog for Data Lakes with Git-like semantics

aws-lambda data git iceberg java spark

Last synced: 02 Jun 2024

High Performance HTTP Sidecar Load Balancer

bigdata kubernetes load-balancer minio-servers proxy sidecar sidekick spark

Last synced: 02 Jun 2024

The Cosmos Generic Enabler enables easier big data analysis over context, integrated with some of the most popular big data platforms.

analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine

Last synced: 01 Jun 2024

Distributed compute platform implemented in Rust, and powered by Apache Arrow.

arrow dataframe datafusion distributed java jvm kotlin kubernetes rust scala spark

Last synced: 01 Jun 2024

Sequential and distributed implementations of Apriori and FP-Growth algorithms using Scala and Spark.

apriori dfps fp-growth rapriori scala spark yafim

Last synced: 01 Jun 2024

Analyse a Twitter dataset with Spark and visualize the results on a React dashboard.

java reactjs scala spark

Last synced: 31 May 2024

A Flexible and Powerful Parameter Server for large-scale machine learning

high-dimensional machine-learning model online-learning parameter-server scala spark spark-streaming

Last synced: 31 May 2024

Implementation of Static mining part of "Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach" Information Sciences, Volume 432, March 2018, Pages 278-300

data-mining data-stream frequent-pattern-mining java maximal-frequent-pattern spark structured-streaming

Last synced: 31 May 2024

Teaching Materials for Distributed Statistical Computing (teaching materials for big data distributed computing)

hadoop mapreduce pyspark-tutorial spark spark-teaching statistical-models

Last synced: 31 May 2024

Summary of a full-stack backend (Java, Golang) knowledge architecture

emsp go java keepalived mongodb mqtt mysql netty redis rocketmq spark spring springboot springcloud tomcat tutorial

Last synced: 31 May 2024


dubbo es6 git hbase java kafka mycat spark spring springboot

Last synced: 31 May 2024

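Flink learning blog. Covers Flink getting started, concepts, principles, hands-on practice, performance tuning, and source code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, and more, plus case studies of large production Flink projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). Support for my column "Big Data Real-Time Computing Engine: Flink in Practice and Performance Optimization" (《大数据实时计算引擎 Flink 实战与性能优化》) is welcome.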

clickhouse elasticsearch flink hbase influxdb kafka loki mysql opentsdb rabbitmq redis rocketmq spark stream-processing streaming

Last synced: 31 May 2024

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery database dbt delta-lake elt etl hadoop hive hudi iceberg lakehouse olap query-engine real-time redshift snowflake spark sql

Last synced: 31 May 2024

✨Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you to monitor and control all your devices anytime, anywhere.

dashboard go golang rat remote-access-tool remote-admin-tool remote-administration-tool remote-control server-monitoring shell spark webshell

Last synced: 31 May 2024

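:dart: :star2: [Big data interview questions] Big data interview questions I collected online, together with my own summarized answers. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/HBase/Kafka/ZooKeeper frameworks.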

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 31 May 2024

🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor. The project is simple (logic-heavy, UI-light) and well suited for practice.

ant-design bigdata hive javascript json monaco-editor mysql spark sql typescript vite vue vue3

Last synced: 30 May 2024

Extract, transform, and load data for analytic processing using AWS Glue

cdk cloud-development-kit etl glue glue-catalog glue-job miztiik-automation s3-to-rds spark

Last synced: 27 May 2024

A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs

airflow aws cassandra data-engineering data-lake data-warehouse docker emr etl-pipeline infrastructure-as-code infrastructure-setup postgresql python redshift s3 spark

Last synced: 27 May 2024