Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/dimajix/docker-jupyter-spark

Docker image for Jupyter notebooks with PySpark

docker hadoop jupyter pyspark python spark

Last synced: 09 Nov 2024

https://github.com/absaoss/pramen

Resilient data pipeline framework running on Apache Spark

big-data data-pipeline etl hacktoberfest scala spark

Last synced: 19 Dec 2024

https://github.com/ibm-cloud/biginsights-on-apache-hadoop

Example projects for 'BigInsights for Apache Hadoop' on IBM Bluemix

ambari biginsights bigsql hadoop hbase hive ibm-bluemix knox oozie spark spark-streaming webhdfs zeppelin

Last synced: 17 Nov 2024

https://github.com/zhengxs2018/ai

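Integrates the APIs of large models such as Baidu ERNIE Bot (Wenxin Yiyan), Alibaba Tongyi Qianwen, Tencent Hunyuan Assistant, and iFlytek Spark, and adapts them to OpenAI's input and output formats.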

ai aigc erniebot hunyuan minimax openai qwen spark

Last synced: 10 Nov 2024

https://github.com/trainingbypackt/big-data-analysis-with-python

Combine Spark and Python to process large datasets and unlock the power of parallel computing and machine learning

combine-spark dataset machine-learning python spark

Last synced: 14 Nov 2024

https://github.com/netease/spark-alarm

Alerting and monitoring tool for Apache Spark

alert monitoring monitoring-tool scala spark

Last synced: 16 Nov 2024

https://github.com/lynnlangit/spark-scala-eks

Spark Scala docker container sample for AWS testing - EKS & S3

docker-image scala spark spark-ml

Last synced: 28 Oct 2024

https://github.com/spektom/realtime-dashboard-example

This is a real-time dashboard example using Spark Streaming and Node.js

dashboard-application flink kafka meetup rethinkdb spark spark-streaming

Last synced: 19 Nov 2024

https://github.com/san089/cloudera_material

Cloudera_Material: Study Material to help people preparing for Cloudera CCA Spark and Hadoop Developer Exam (CCA175). Feel free to collaborate.

big-data bigdata cca cca175 certification cloudera flume hadoop hive hive-metastore pyspark spark sqoop sqoop-export sqoop-import sqoop-session

Last synced: 12 Oct 2024

https://github.com/moritzkoerber/covid-19-data-engineering-pipeline

A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.

apache-airflow apache-spark api aws aws-cdk aws-cloudformation aws-ecr aws-glue aws-lambda aws-redshift aws-s3 docker great-expectations pyspark spark

Last synced: 11 Nov 2024

https://github.com/archivesunleashed/notebooks

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

juypter-notebook notebooks pyspark-notebook python3 spark web-archives

Last synced: 11 Nov 2024

https://github.com/pdsuwwz/chatgpt-vue3-light-mvp

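💭 A Chat Bot conversational web MVP prototype template open to secondary development, built with mainstream technologies such as Vue3, Vite 5, TypeScript, Naive UI, and UnoCSS. 🧤 Simple integration of large-model APIs, using a single-turn AI Q&A mode in which each question is answered independently with no context required. Supports typewriter-style streaming output and integrates markdown-it preview. 💼 Easy to customize for quickly building chat-style large language model products (example screenshots included).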

ai chat chatbot deepseek event glm gpt llm ollama openai qwen siliconcloud siliconflow source spark stream ts

Last synced: 11 Oct 2024

https://github.com/medmes/twitterstreamingsparkkafkademo

A demo project that analyzes the most popular Twitter hashtags using Java 8, Spring Boot, Spark Streaming, Kafka, and Docker.

apache docker java-8 kafka spark spark-streaming spring-boot twitter twitter-streaming-api zookeeper

Last synced: 08 Nov 2024

https://github.com/Componolit/gneiss

Framework for platform-independent SPARK components

ada component-based embedded formal-methods formal-verification spark

Last synced: 25 Oct 2024

https://github.com/Componolit/SXML

Formally verified, bounded-stack XML library

ada formal-methods formal-verification parser spark xml

Last synced: 26 Oct 2024

https://github.com/crflynn/pbspark

protobuf pyspark conversion

dataframe protobuf protocol-buffers pyspark spark

Last synced: 08 Nov 2024

https://github.com/ember-sparks/ember-sparks

✨ Ambitious UI components for your Ember app.

addon ember ember-css-modules javascript spark ui ui-components

Last synced: 21 Nov 2024

https://github.com/maropu/datasketches-spark

Data Sketches for Apache Spark

approximate-computing spark

Last synced: 08 Nov 2024

https://github.com/jgperrin/net.jgp.books.spark.ch02

Spark in Action, 2nd edition - chapter 2

apache-spark java java8 manning spark sparkwithjava

Last synced: 09 Nov 2024

https://github.com/hoangsonww/moodify-emotion-music-app

🎹 Moodify - an emotion-based music recommendation system that uses AI/ML models to analyze text, speech, and facial expressions, providing personalized music recommendations across web and mobile platforms.

artificial-intelligence django django-rest-framework emotion fullstack-development hadoop kubernetes machine-learning mobile-development mongodb music python pytorch react-native reactjs redis restful-api spark tensorflow torch

Last synced: 01 Nov 2024

https://github.com/ysh329/link-prediction

[UNMAINTAINED] Link prediction for complex networks based on PySpark and MySQL.

link-prediction network pyspark spark

Last synced: 23 Oct 2024

https://github.com/mgubaidullin/infinity

Prototype of forecast service that uses machine learning to deliver forecasts

camel cassandra chartjs java kafka quarkus spark vuejs

Last synced: 14 Oct 2024

https://github.com/mlr-org/mlr3db

Data Backends to let mlr3 work transparently with (remote) data bases

bigquery data-backend database duckdb machine-learning mariadb mlr3 mysql odbc postgresql r r-package spark sqlite

Last synced: 14 Oct 2024

https://github.com/opensearch-project/opensearch-spark

Spark Accelerator framework; it enables secondary indices to remote data stores.

compute opensearch secondary-index spark

Last synced: 11 Nov 2024

https://github.com/oracle-quickstart/oci-cloudera

Terraform module to deploy Cloudera on Oracle Cloud Infrastructure (OCI)

cdh cdp cloud cloudera dsw edh hadoop oci oracle partner-led spark terraform

Last synced: 07 Nov 2024

https://github.com/vmitchell85/spark-kiosk-notify

Adds a notification panel to your Laravel Spark Kiosk, allowing you to send notifications to users.

laravel notifications spark

Last synced: 12 Oct 2024

https://github.com/cognitedata/cdp-spark-datasource

Spark data source for Cognite Data Fusion

cognite datasource scala spark

Last synced: 31 Oct 2024

https://github.com/geotrellis/geotrellis-netcdf

Scala/Spark Project For Reading NetCDF

geotrellis netcdf scala spark

Last synced: 11 Nov 2024

https://github.com/hortonworks-spark/cloud-integration

Spark cloud integration: tests, cloud committers and more

apache-spark aws-s3 azure gcs spark

Last synced: 14 Nov 2024

https://github.com/theajack/spark-node

Node.js SDK for the iFlytek Spark cognitive large model

nodejs spark xun-fei

Last synced: 08 Nov 2024

https://github.com/microsoft/masc

Microsoft's contributions for Spark with Apache Accumulo

accumulo apache big-data machine-learning spark

Last synced: 22 Jan 2025

https://github.com/fqaiser94/mse

Make Structs Easy (MSE)

nested pyspark python scala spark struct

Last synced: 10 Oct 2024

https://github.com/nashtech-labs/spark-graphx-twitter

An example of Spark and GraphX with Twitter as sample

apache-spark graph knoldus sbt spark spark-graphx twitter

Last synced: 05 Nov 2024

https://github.com/snowplow/dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR

amazon-emr flink golang-application hadoop spark

Last synced: 09 Nov 2024

https://github.com/NashTech-Labs/spark-graphx-twitter

An example of Spark and GraphX with Twitter as sample

apache-spark graph knoldus sbt spark spark-graphx twitter

Last synced: 23 Oct 2024

https://github.com/miraisolutions/sparkbq

Sparklyr extension package to connect to Google BigQuery

bigquery r spark sparklyr

Last synced: 18 Nov 2024

https://github.com/aphp/spark-etl

A better bridge between Apache Spark and PostgreSQL

etl postgresql spark

Last synced: 25 Nov 2024

https://github.com/vemonet/setup-spark

:octocat:✨ Set up Apache Spark in GitHub Action workflows

apache-spark github-actions setup spark

Last synced: 11 Nov 2024

https://github.com/yj8023xx/xiwenlejian

A deep-learning-based book recommendation system that makes personalized recommendations based on user behavior

deep-learning java python recommender-system spark springcloud vue

Last synced: 14 Nov 2024

https://github.com/bluejoe2008/spark-http-stream

Spark Structured Streaming via HTTP communication

http spark spark-structured-streaming

Last synced: 23 Oct 2024

https://github.com/gilbitron/spark-create-stripe-plans

A simple Laravel artisan command to create Spark plans in Stripe

laravel laravel-artisan-command spark stripe

Last synced: 14 Oct 2024

https://github.com/jplane/pyspark-devcontainer

A simple VS Code devcontainer setup for local PySpark development

devcontainer devcontainers jupyter jupyter-notebooks pyspark pyspark-notebook python spark vscode

Last synced: 17 Oct 2024

https://github.com/longnguyen010203/youtube-recommend-master-etl-pipeline

💜🌈📊 A Data Engineering Project that implements an ETL data pipeline using Dagster, Apache Spark, Streamlit, MinIO, Metabase, Dbt, Polars, Docker. Data from kaggle and youtube-api 🌺

cleaning-data dagster data-engineering data-engineering-pipeline dbt docker docker-compose dockerfile etl-pipeline metabase minio mysql polars postgresql processing pyspark spark streamlit youtube youtube-api

Last synced: 22 Nov 2024

https://github.com/romans-weapon/spear-framework

Rapid ETL/ELT-connectors/pipeline development leveraged on top of Apache Spark

docker-compose hadoop kafka scala shell-script spark

Last synced: 10 Oct 2024

https://github.com/chen0040/spring-boot-spark-integration-demo

Demo on how to integrate Spring Data JPA, Apache Spark and GraphX with mixed Java and Scala code

graphx spark spring-boot spring-jpa

Last synced: 16 Dec 2024

https://github.com/mohamedhmini/d-pandisim

Distributed pandemic simulator that uses the power of Spark to generate large volumes of contact-tracing data.

big-data distributed-programming epidemic-simulations epidemics graph-algorithms markov-chain pandemic-simulator pyspark spark

Last synced: 15 Nov 2024

https://github.com/lovenui/etl-with-aws-emr-and-mwaa

Create a data pipeline on AWS to execute batch processing in a Spark cluster provisioned by Amazon EMR. ETL using managed Airflow: extract data from S3, transform it using Spark, and load the transformed data back to S3.

airflow aws-ec2 aws-s3 data-engineering etl spark

Last synced: 19 Jan 2025

https://github.com/qubole/streaminglens

Qubole Streaminglens tool for tuning Spark Structured Streaming Pipelines

cluster-management micro-batches scala sla spark spark-streaming sparklens streaming streaming-pipeline structured-streaming

Last synced: 21 Nov 2024

https://github.com/miztiik/s3-to-rds-with-glue

Extract, transform, and load data for analytic processing using AWS Glue

cdk cloud-development-kit etl glue glue-catalog glue-job miztiik-automation s3-to-rds spark

Last synced: 04 Dec 2024

https://github.com/tonyz0x0/football-manager

Data Analysis as a Football Manager

numpy pandas python spark

Last synced: 01 Jan 2025

https://github.com/flint-bot/sparky

Cisco Spark API for NodeJS (deprecated in favor of https://github.com/webex/webex-bot-node-framework)

cisco spark

Last synced: 27 Oct 2024

https://github.com/Componolit/jwx

JSON/JWK/JWS/JWT/Base64 library in SPARK

ada base64 jose json json-web-signature jwk jws jwt jwt-authentication jwt-token spark

Last synced: 26 Oct 2024

https://github.com/woltapp/spark-osm-datasource

Native Spark OSM PBF data source

osm pbf spark

Last synced: 11 Oct 2024

https://github.com/zoltan-nz/kafka-spark-project

Distributed System in Docker with Apache Kafka and Spark for big data streaming and visualisation (NodeJS, TypeScript, React, NestJS, Java)

java javascript kafka nodejs spark typescript

Last synced: 12 Oct 2024

https://github.com/hibayesian/spark-lof

A parallel implementation of local outlier factor based on Spark

local-outlier-factor machine-learning outlier-detection spark

Last synced: 23 Nov 2024

https://github.com/alvertogit/bigdata_docker

Big Data Docker Data Science Spark Spark3 Hadoop HDFS Scala Python Artificial Intelligence Machine Learning Jupyter Lab Notebook

big-data data-science docker jupyter-lab jupyter-notebook machine-learning python scala spark spark3

Last synced: 23 Nov 2024

https://github.com/qubole/s3-sqs-connector

A library for reading data from Amazon S3 with optimised listing via Amazon SQS, using Spark SQL Streaming (or Structured Streaming).

s3 scala spark spark-streaming sqs streaming structured-streaming

Last synced: 21 Nov 2024

https://github.com/singgel/bigdata-skilltree

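Demos of Spark, Flink, HBase, Hive, and Flume integrated with some of Hadoop's native APIs (such as HDFS and MapReduce; currently just these two), along with tests of some exception scenarios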

hadoop hbase hdfs hive kylin mapreduce scala spark

Last synced: 14 Oct 2024

https://github.com/wazzabeee/pyspark-etl-twitter

Implementation of an ETL process for real-time sentiment analysis of tweets with Docker, Apache Kafka, Spark Streaming, MongoDB and Delta Lake

delta-lake docker etl etl-pipeline etl-process kafka kafka-consumer kafka-producer kafka-streams mongodb nlp pyspark python sentiment-analysis spark spark-streaming tweet-analysis tweet-classification twitter twitter-sentiment-analysis

Last synced: 13 Nov 2024

https://github.com/jgperrin/net.jgp.books.spark.ch03

Spark in Action, 2nd edition - chapter 3

apache-spark dataframe java java8 manning spark sparkwithjava

Last synced: 09 Nov 2024

https://github.com/hammerlab/spark-util

low-level helpers for Apache Spark libraries and tests

hadoop kryo scala spark

Last synced: 12 Oct 2024

https://github.com/dvgodoy/yelpdatasetchallenge

Restaurant recommendations and review text-based quality predictions

dataset lstm-sentiment-analysis recommender-systems sentiment-analysis spark spark-ml yelp-dataset

Last synced: 13 Oct 2024

https://github.com/luckyzxl2016/spark-example

Examples for Spark 1.6 and Spark 2.2, covering Kafka, Flume, Structured Streaming, Jedis, Elasticsearch, MySQL, and DataFrame

dataframe elasticsearch jedis kafka mysql spark spark-example spark-sql spark-streaming spark-structured-streaming

Last synced: 28 Oct 2024

https://github.com/radanalyticsio/oshinko-s2i

This is a place to put s2i images and utilities for Spark application builders for OpenShift

java openshift oshinko-s2i pyspark s2i-image scala spark

Last synced: 05 Nov 2024

https://github.com/laravel/spark-next-docs

The Spark documentation.

laravel paddle php spark stripe

Last synced: 07 Oct 2024

https://github.com/felipekunzler/frequent-itemset-mining-spark

Sequential and distributed implementations of Apriori and FP-Growth algorithms using Scala and Spark.

apriori dfps fp-growth rapriori scala spark yafim

Last synced: 30 Oct 2024

https://github.com/camposvinicius/aws-etl

This is an ETL application on AWS using general open sales and customer data, available at https://github.com/camposvinicius/data/blob/main/AdventureWorks.zip. It is a zipped file containing several .csv files to which transformations are applied.

airflow argocd athena aws catalog data data-engineer database emr emr-cluster etl glue kubernetes pipeline postgres pyspark rds spark

Last synced: 04 Dec 2024

https://github.com/nikoshet/spark-cherry-shuffle-service

Code for the "Cherry: A Distributed Task-Aware Shuffle Service for Serverless Analytics" paper for 2021 IEEE International Conference on Big Data

ansible apache-spark bigdata devops distributed docker ieee kubernetes papers-with-code serverless shuffling spark

Last synced: 09 Nov 2024

https://github.com/lifeomic/spark-vcf

Spark VCF data source implementation for Dataframes

dataframe genomics genotype lifeomic spark spark-sql team-clinical-intelligence variants vcf vcf-files

Last synced: 12 Nov 2024

https://github.com/steven-matison/HDP3-Hue-Service

A continuation of Ambari Hue Service for HDP 3.x and Hue 4.6.0

ambari ambari-hue-service hbase hdp3 hive hue spark

Last synced: 31 Oct 2024

https://github.com/hashload/freeza-offset

Spark stream consumption commit in kafka consumer group

databricks kafka kafka-commit kafka-offset-commits spark spark-streaming

Last synced: 12 Oct 2024

https://github.com/qiushisun/distributed-computing-systems

2021 Spring (Distributed Computing Systems): Distributed Systems and Programming

distributed-computing distributed-systems ecnu-dase flink hadoop-mapreduce spark

Last synced: 19 Dec 2024

https://github.com/absaoss/spark-hofs

Scala API for Apache Spark SQL high-order functions

high-order-functions scala spark sql

Last synced: 10 Oct 2024

https://github.com/ehsanmok/sparkling-titanic

Training models with Apache Spark, PySpark for Titanic Kaggle competition

kaggle-titanic pyspark spark

Last synced: 10 Jan 2025

https://github.com/wang1365/spark-traffic

Batch processing of offline traffic big data with Spark

spark traffic

Last synced: 07 Nov 2024

https://github.com/bluegranite/databrickstraining

Repository for Microsoft Databricks Training Events - Hosted by BlueGranite

apache-spark azure azure-databricks databricks distributed-computing machine-learning pyspark spark spark-streaming

Last synced: 18 Nov 2024

https://github.com/qxzzxq/faker

Generate fake data for Scala and Spark :tophat:

fake fake-data faker faker4s scala spark spark-data-generator test-data test-data-generator testing

Last synced: 18 Dec 2024