Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vector4wang/quick-spark-process
:star2::star2::star2: Examples for learning Spark
Last synced: 03 Jul 2024
https://github.com/huangyueranbbc/SparkDemo
Complete Spark example code (Java, Scala)
bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp
Last synced: 03 Jul 2024
https://github.com/yanagishima/yanagishima
Web UI for Trino, Hive and SparkSQL
elasticsearch hive spark trino
Last synced: 03 Jul 2024
https://github.com/lakesoul-io/LakeSoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox
Last synced: 03 Jul 2024
https://github.com/DTStack/Taier
Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display
azkaban chunjun cronjob-scheduler dag data-schedule distributed-schedule-system flink hadoop hive job-scheduler scheduler spark task-schedule workflow-scheduling-system
Last synced: 03 Jul 2024
https://github.com/XuefengHuang/RecommendationSystem
Book recommender system using collaborative filtering based on Spark
collaborative-filtering python-flask recommendation-system spark
Last synced: 02 Jul 2024
https://github.com/uber/marmaray
Generic Data Ingestion & Dispersal Library for Hadoop
avro-schema data-lake hadoop ingest-data schema-format spark
Last synced: 02 Jul 2024
https://ibm-cds-labs.github.io/pixiedust
Python Helper library for Jupyter Notebooks
data-science jupyter-notebook pixiedust python python-notebook scala-notebooks spark visualization
Last synced: 30 Jun 2024
https://github.com/Hydrospheredata/hydro-serving
MLOps Platform
machine-learning models pipelines realtime scikit-learn scoring serverless serving spark tensorflow
Last synced: 29 Jun 2024
https://github.com/kanyun-inc/ytk-learn
Ytk-learn is a distributed machine learning library that implements most of the popular machine learning algorithms (GBDT, GBRT, Mixture Logistic Regression, Gradient Boosting Soft Tree, Factorization Machines, Field-aware Factorization Machines, Logistic Regression, Softmax).
distributed factorization-machines gbdt gbm hadoop logistic-regression machine-learning spark
Last synced: 27 Jun 2024
https://github.com/HariSekhon/DevOps-Python-tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
avro aws cloudformation devops docker dockerhub elasticsearch gcf gcp hadoop hbase hdfs json linux parquet pyspark python solr spark travis-ci
Last synced: 27 Jun 2024
https://github.com/streamnative/pulsar-spark
Spark Connector to read and write with Pulsar
apache-pulsar apache-spark batch-processing data-processing data-science flink spark spark-sql stream-processing structured-streaming
Last synced: 26 Jun 2024
https://github.com/KennethanCeyer/awesome-data-pipeline
Awesome list for data pipelines
architecture awesome awesome-list big-data bigdata cloud data data-engineering dataeng datalake datapipeline datawarehouse hadoop hive opensource query spark
Last synced: 25 Jun 2024
https://github.com/databrickslabs/automl-toolkit
Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.
apache-spark feature-engineering machinelearning ml pyspark scala spark
Last synced: 24 Jun 2024
https://github.com/giantcroc/featuretoolsOnSpark
A simplified version of featuretools for Spark
automated-feature-engineering automated-machine-learning automl deep-feature-synthesis feature-engineering featuretools machine-learning python spark
Last synced: 24 Jun 2024
https://github.com/broadinstitute/gatk
Official code repository for GATK versions 4 and up
bioinformatics dna gatk genome genomics ngs science sequencing spark
Last synced: 23 Jun 2024
https://github.com/zinggAI/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics analytics-engineering data-science data-transformation data-transformations dataengineering datalake dataquality dedupe deduplication entity-resolution etl fuzzy-matching fuzzymatch identity identity-resolution masterdata ml modern-data-stack spark
Last synced: 22 Jun 2024
https://github.com/webysther/aws-glue-docker
🐋 Docker image for AWS Glue Spark/Python
apache-arrow aws aws-cli aws-glue aws-glue-docker cdk data-engineering development docker docker-image dockerfile etl glue-catalog glue-pyspark pandas pytest python python-poetry sam spark
Last synced: 21 Jun 2024
https://github.com/AbsaOSS/spline
Data Lineage Tracking And Visualization Solution
bigdata hadoop lineage scala spark tracking visualization
Last synced: 21 Jun 2024
https://github.com/IBM/elasticsearch-spark-recommender
Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch
elasticsearch ibmcode jupyter python spark
Last synced: 20 Jun 2024
https://github.com/CognonicLabs/awesome-AI-kubernetes
:snowflake: :whale: Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc
ai analytics big-data cognitive-science data-science docker kubeflow kubernetes kubernetes-ai kubernetes-analytics kubernetes-data-science kubernetes-ml ml pachyderm python-ml scala seldon-core spark spark-kubernetes spark-ml
Last synced: 20 Jun 2024
https://microsoft.github.io/SynapseML/
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 19 Jun 2024
https://github.com/alexarchambault/ammonite-spark
Run spark calculations from Ammonite
Last synced: 17 Jun 2024
https://github.com/piotr-kalanski/data-quality-monitoring
Data Quality Monitoring Tool
data-quality monitoring scala spark
Last synced: 17 Jun 2024
https://github.com/dstlry/dstlr
scalable knowledge graph construction from unstructured text
Last synced: 17 Jun 2024
https://github.com/camposvinicius/aws-etl
This is an ETL application on AWS that uses open sales and customer data, available at https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. The zipped file contains several CSV files to which transformations are applied.
airflow argocd athena aws catalog data data-engineer database emr emr-cluster etl glue kubernetes pipeline postgres pyspark rds spark
Last synced: 16 Jun 2024
https://github.com/iamabug/BigDataParty
All-in-One Dockerfile for big data components
big-data dockerfile hadoop kafka spark
Last synced: 16 Jun 2024
https://github.com/innat/ML-Resource
A concise resource repository for machine learning
data-analysis data-science deep-learning kaggle machine-learning python spark
Last synced: 16 Jun 2024
https://github.com/angadsingh/airflow-ditto
An airflow DAG transformation framework
airflow airflow-dag aws azure dataflow emr extensible framework graph-algorithms graph-manipulation hdinsight isomorphism livy networkx spark yarn
Last synced: 15 Jun 2024
https://github.com/dsaidgovsg/airflow-pipeline
An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR
Last synced: 15 Jun 2024
https://github.com/mdrakiburrahman/sgx-pyspark-sql-demo
Demonstrating Confidential Analytics on Azure SGX VMs with Apache Spark and SCONE.
azure azure-sql-database docker kubernetes sgx spark
Last synced: 15 Jun 2024
https://github.com/mc2-project/opaque-sql
An encrypted data analytics platform
analytics enclave machine-learning privacy security spark spark-sql
Last synced: 15 Jun 2024
https://github.com/a616567126/gpt-web-java
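AI chatbot based on JDK 8! WeChat official account: Midjourney image generation and card-key redemption. Web: supports ChatGPT, Midjourney image generation, SD image generation, card-key redemption, EasyPay, official-account traffic acquisition, and email registration 🔥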
bard-api chatgpt google midjourney-api spark stable-diffusion
Last synced: 14 Jun 2024
https://github.com/Teradata/kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
data-lake hadoop kylo nifi spark teradata
Last synced: 14 Jun 2024
https://github.com/kwai/blaze
Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.
arrow-datafusion big-data data-engineering execution-engine rust spark sql
Last synced: 13 Jun 2024
https://github.com/japila-books/apache-spark-internals
The Internals of Apache Spark
apache-spark book internals spark
Last synced: 13 Jun 2024
https://github.com/san089/goodreads_etl_pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
airflow airflow-dag apache-airflow apache-spark data-engineering data-engineering-pipeline data-lake data-migration emr-cluster etl-framework etl-job etl-pipeline goodreads-data-pipeline livy python redshift s3 scheduler spark warehouse
Last synced: 13 Jun 2024
https://github.com/alanchn31/Data-Engineering-Projects
Personal Data Engineering Projects
airflow aws-redshift cassandra data-engineering data-engineering-nanodegree data-lake data-modeling data-warehouse ingest-data mongodb postgres scrapy spark star-schema
Last synced: 13 Jun 2024
https://github.com/awslabs/data-on-eks
DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
aws-eks eks jupyterhub kubeflow kubernetes ml mlflow ray spark terraform
Last synced: 12 Jun 2024
https://github.com/Anant/Cassandra.Realtime
Different ways to process data into Cassandra in realtime with technologies such as Kafka, Spark, Akka, Flink
akka cassandra flink flink-stream-processing flink-streaming kafka kafka-connect spark spark-streaming
Last synced: 12 Jun 2024
https://github.com/deanwampler/spark-scala-tutorial
A free tutorial for Apache Spark.
Last synced: 12 Jun 2024
https://github.com/xubo245/SparkLearning
Learning Apache Spark, including code and data. Most parts can run locally.
Last synced: 11 Jun 2024
https://github.com/abhishek-ch/around-dataengineering
A Data Engineering & Machine Learning Knowledge Hub
airflow data-engineering datascience devops infrastructure machine-learning mlops spark
Last synced: 11 Jun 2024
https://github.com/HariSekhon/Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak
apache-drill cassandra consul devops docker dockerhub hacktoberfest hadoop hbase kafka kubernetes linux nagios-plugins presto rabbitmq rabbitmq-cluster solr solrcloud spark zookeeper
Last synced: 11 Jun 2024
https://github.com/x4ax/lxss-install-zeppelin
Step by step guide on how to install Zeppelin 0.7.3 on Linux subsystem (WSL) for Windows 10
hadoop linux-subsystem lxss spark wsl zeppelin
Last synced: 10 Jun 2024
https://github.com/thangdnsf/BigCLAM-ApacheSpark
Overlapping community detection in large-scale networks using the BigCLAM model, built on Apache Spark
apache-spark bigclam bigclam-model community-detection graph-mining graphx large-scale latex machine-learning scala scale-networks spark
Last synced: 09 Jun 2024
https://github.com/policratus/sparkmage
🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.
big-data computer-vision hadoop image-processing spark
Last synced: 07 Jun 2024
https://github.com/neoremind/kraps-rpc
An RPC framework leveraging the Spark RPC module
Last synced: 07 Jun 2024
https://github.com/NVIDIA/spark-rapids
Spark RAPIDS plugin - accelerate Apache Spark with GPUs
Last synced: 07 Jun 2024
https://github.com/japila-books/spark-sql-internals
The Internals of Spark SQL
apache-spark book internals mkdocs-material spark spark-sql
Last synced: 07 Jun 2024
https://github.com/jaceklaskowski/spark-kubernetes-book
The Internals of Spark on Kubernetes
apache-spark book internals kubernetes spark
Last synced: 07 Jun 2024
https://github.com/melin/spark-jobserver
REST job server for Apache Spark
hadoop hive java kerberos kubernetes spark yarn
Last synced: 07 Jun 2024
https://github.com/fiatjaf/kwh
webln browser extension for lightningd/eclair/ptarmigan
c-lightning eclair lightning-network lightningd ptarmigan spark web-extension webln
Last synced: 07 Jun 2024
https://github.com/zhaoyachao/zdh_web
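Big data collection and extraction platform. zdh_web is the visual management platform for the zdh family of services, with modules for data collection, scheduling, permissions, approval workflows, and private-domain marketing.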
bigdata collection data data-collection datapipeline datax-web etl pipline scheduler spark sparketl
Last synced: 07 Jun 2024
https://github.com/bytedance/CloudShuffleService
Cloud Shuffle Service (CSS) is a general purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.
Last synced: 07 Jun 2024
https://github.com/apache/incubator-uniffle
Uniffle is a high performance, general purpose Remote Shuffle Service.
mapreduce remote-shuffle-service rss shuffle spark tez
Last synced: 07 Jun 2024
https://github.com/datavane/datasophon
The next generation of cloud-native big data management expert, aiming to help users rapidly build stable, efficient, and scalable cloud-native platforms for big data.
cloudnative doris easy-to-use kubernetes spark yarn
Last synced: 07 Jun 2024
https://github.com/duhanmin/bigdata-sql-parser
Data lineage: supports lineage for Spark SQL, Hive SQL, PostgreSQL, Presto SQL, MySQL, TiDB SQL, Flink SQL, and DataX, plus lineage parsing of spark/flink jar run commands; supports WITH syntax
datax flink hive mysql postgresql presto spark tidb trino
Last synced: 07 Jun 2024
https://github.com/jaceklaskowski/spark-workshop
Apache Spark™ and Scala Workshops
apache-spark spark spark-mllib spark-sql spark-structured-streaming spark-workshops workshop
Last synced: 07 Jun 2024
https://github.com/Nosto/spartann
Hyper performant kNN using Annoy for Apache Spark.
ann annoy apache-spark k-nearest-neighbors k-nearest-neighbours knn ml spark
Last synced: 07 Jun 2024
https://github.com/DTStack/dt-sql-parser
SQL Parsers for BigData, built with antlr4.
antlr4 autocompletion bigdata flink hive impala mysql parser postgresql spark sql sql-validation trino
Last synced: 05 Jun 2024
https://github.com/microsoft/Azure-Synapse-Content-Recommendations-Solution-Accelerator
This is a solution accelerator for creating personalized content recommendations based on user activity.
azure-synapse-analytics power-bi spark
Last synced: 04 Jun 2024
https://github.com/microsoft/hyperspace
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
acceleration analytics big-data databases indexing spark
Last synced: 04 Jun 2024
https://github.com/henridf/apache-spark-node
Node.js bindings for Apache Spark DataFrame APIs
Last synced: 03 Jun 2024
https://github.com/geekyouth/SZT-bigdata
Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟
cdh6 clickhouse docker elasticsearch flink hadoop hbase hive kafka kibana kylin mongodb mysql phoenix redis scala spark springboot szt-bigdata zookeeper
Last synced: 03 Jun 2024
https://github.com/flyteorg/flytekit
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.
automation data data-science extensible flyte flyte-tasks hacktoberfest mlops pypi python sdk spark workflows
Last synced: 02 Jun 2024
https://github.com/projectnessie/nessie
Nessie: Transactional Catalog for Data Lakes with Git-like semantics
aws-lambda data git iceberg java spark
Last synced: 02 Jun 2024
https://github.com/minio/sidekick
High Performance HTTP Sidecar Load Balancer
bigdata kubernetes load-balancer minio-servers proxy sidecar sidekick spark
Last synced: 02 Jun 2024
https://github.com/ging/fiware-cosmos
The Cosmos Generic Enabler enables easier big data analysis over context, integrated with some of the most popular big data platforms.
analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine
Last synced: 01 Jun 2024
https://github.com/polyaxon/mloperator
Machine Learning Operator & Controller for Kubernetes
dask deep-learning k8s keras kubernetes kubernetes-operator machine-learning mlops mpi mxnet notebook pytorch scikit-learn spark tensorboard tensorflow xgboost
Last synced: 01 Jun 2024
https://github.com/ballista-compute/ballista
Distributed compute platform implemented in Rust, and powered by Apache Arrow.
arrow dataframe datafusion distributed java jvm kotlin kubernetes rust scala spark
Last synced: 01 Jun 2024
https://github.com/xqnwang/darima
Distributed ARIMA Models
arima distributed-computing spark time-series-forecasting
Last synced: 31 May 2024
https://github.com/felipekunzler/spark-twitter-analysis
Analyse a Twitter dataset with Spark and visualize the results on a React dashboard.
Last synced: 31 May 2024
https://github.com/Angel-ML/angel
A Flexible and Powerful Parameter Server for large-scale machine learning
high-dimensional machine-learning model online-learning parameter-server scala spark spark-streaming
Last synced: 31 May 2024
https://github.com/rezacsedu/Mining-Maximal-Frequent-Pattern-Spark
Implementation of Static mining part of "Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach" Information Sciences, Volume 432, March 2018, Pages 278-300
data-mining data-stream frequent-pattern-mining java maximal-frequent-pattern spark structured-streaming
Last synced: 31 May 2024
https://github.com/feng-li/Distributed-Statistical-Computing
Teaching Materials for Distributed Statistical Computing (teaching materials for big data distributed computing)
hadoop mapreduce pyspark-tutorial spark spark-teaching statistical-models
Last synced: 31 May 2024
https://github.com/zhonghuasheng/Tutorial
Summary of a full-stack knowledge architecture for back-end development (Java, Golang)
emsp go java keepalived mongodb mqtt mysql netty redis rocketmq spark spring springboot springcloud tomcat tutorial
Last synced: 31 May 2024
https://github.com/zhisheng17/flink-learning
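Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, principles, hands-on practice, performance tuning, and source code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, and Table API & SQL, as well as case studies of large production Flink projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). Support for the author's column "Big Data Real-Time Computing Engine: Flink in Practice and Performance Optimization" is welcome.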
clickhouse elasticsearch flink hbase influxdb kafka loki mysql opentsdb rabbitmq redis rocketmq spark stream-processing streaming
Last synced: 31 May 2024
https://github.com/XZB-1248/Spark
✨ Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you to monitor and control all your devices from anywhere, at any time.
dashboard go golang rat remote-access-tool remote-admin-tool remote-administration-tool remote-control server-monitoring shell spark webshell
Last synced: 31 May 2024
https://github.com/liyupi/sql-generator
🔨 Generate structured SQL statements from JSON, built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor. A simple project (heavy on logic, light on UI) that is good for practice.
ant-design bigdata hive javascript json monaco-editor mysql spark sql typescript vite vue vue3
Last synced: 30 May 2024
https://github.com/FirelyTeam/spark
Firely and Incendi's open source FHIR server
c-sharp docker dstu2 fhir fhir-api fhir-server fhir-spec fhir-specification r4 spark spark-fhir-server stu3
Last synced: 30 May 2024
https://github.com/miztiik/s3-to-rds-with-glue
Extract, transform, and load data for analytic processing using AWS Glue
cdk cloud-development-kit etl glue glue-catalog glue-job miztiik-automation s3-to-rds spark
Last synced: 27 May 2024
https://github.com/alanchn31/Movalytics-Data-Warehouse
Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
airflow analytics aws-redshift aws-s3 data-engineer-nanodegree data-engineering data-engineering-pipeline data-modelling data-warehouse-cloud docker movie-database movie-recommendation movie-reviews pyspark python3 redshift spark sql udacity
Last synced: 27 May 2024
https://github.com/AuFeld/Data_Engineering_Projects
A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs
airflow aws cassandra data-engineering data-lake data-warehouse docker emr etl-pipeline infrastructure-as-code infrastructure-setup postgresql python redshift s3 spark
Last synced: 27 May 2024