Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/getyourguide/TypedPyspark

Type-annotate your spark dataframes and validate them

pyspark python spark typing

Last synced: 11 Nov 2024

https://github.com/franzdiebold/docker-datascience-ultimate

Customized Jupyter Spark Docker images with everything you need

docker jupyter jupyterlab polars pyspark python spark

Last synced: 05 Nov 2024

https://github.com/getyourguide/typedpyspark

Type-annotate your spark dataframes and validate them

pyspark python spark typing

Last synced: 14 Nov 2024

https://github.com/innfactory/akka-lift-ml

akka http service for serving spark machine learning models

akka akka-http data-engineering fast-data machine-learning scala spark

Last synced: 28 Nov 2024

https://github.com/qxzzxq/faker

Generate fake data for Scala and Spark 🎩

fake fake-data faker faker4s scala spark spark-data-generator test-data test-data-generator testing

Last synced: 18 Dec 2024

https://github.com/mach-kernel/databricks-kube-operator

A Kubernetes operator to enable GitOps style deploys for Databricks resources

ci cicd databricks gitops helm kubernetes operators rust spark

Last synced: 11 Nov 2024

https://github.com/asuiu/sparkorm

ORM for Apache Spark and DataFrames schema manager

orm pyspark pyspark-python python python3 spark spark-orm spark-sql sparkql sqlalchemy sqlalchemy-orm

Last synced: 27 Dec 2024

https://github.com/dazheng/SparkETL

Implement a complete data warehouse ETL using Spark SQL

datawarehouse etl spark sparksql

Last synced: 13 Nov 2024

https://github.com/DataEval/dingo

Dingo: A Comprehensive Data Quality Evaluation Tool

data-evaluation data-quality data-science data-validation gpt llm spark vlm

Last synced: 06 Jan 2025

https://github.com/azavea/geotrellis-collections-api-research

A research project to investigate using GeoTrellis as a REST service

akka-http geotrellis leaflet react react-leaflet redux scala spark victory

Last synced: 10 Nov 2024

https://github.com/renoki-co/thunder

Thunder is an advanced Laravel tool to track user consumption using Cashier's Metered Billing for Stripe. ⚡

billing cashier laravel saas spark stripe thunder

Last synced: 14 Nov 2024

https://github.com/qyu-ai/reina

PySpark-based causal inference package.

big-data causal-inference machine-learning spark

Last synced: 02 Nov 2024

https://github.com/lovenui/weblogs-analysis-system

A big data platform for analyzing web access logs

hbase javascript log-analysis python scala spark

Last synced: 19 Jan 2025

https://github.com/fscm/terraform-module-aws-spark

Terraform module to create an Apache Spark cluster on AWS

aws spark terraform

Last synced: 07 Nov 2024

https://github.com/zuinnote/spark-hadoopcryptoledger-ds

A Spark datasource for the HadoopCryptoLedger library

altcoin auxpow bitcoin cryptoledger datasource ethereum hadoopcryptoledger read spark

Last synced: 03 Dec 2024

https://github.com/collabh/reasearch-bigdata

Reading books, reading source code, and watching third-party tutorial videos

flink hadoop hive spark

Last synced: 28 Oct 2024

https://github.com/chezou/sparkavro

Load Avro data into Spark with sparklyr

avro r spark sparklyr

Last synced: 18 Nov 2024

https://github.com/mlverse/pysparklyr

Extension to {sparklyr} that allows you to interact with Spark & Databricks Connect

databricks pyspark r spark spark-connect

Last synced: 22 Nov 2024

https://github.com/AuFeld/Data_Engineering_Projects

A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs

airflow aws cassandra data-engineering data-lake data-warehouse docker emr etl-pipeline infrastructure-as-code infrastructure-setup postgresql python redshift s3 spark

Last synced: 04 Dec 2024

https://github.com/dmwm/cmsspark

General purpose framework to run CMS experiment workflows on HDFS/Spark platform

analytics bigdata cms-framework hdfs spark

Last synced: 11 Dec 2024

https://github.com/daniel-acuna/pyspark_pipes

Helper functions for building complex Spark ML pipelines

python3 spark sparkml

Last synced: 30 Oct 2024

https://github.com/microsoft/azure-synapse-content-recommendations-solution-accelerator

This is a solution accelerator for creating personalized content recommendations based on user activity.

azure-synapse-analytics power-bi spark

Last synced: 02 Nov 2024

https://github.com/analyticalmonk/pyspark_nlp_workshop

Instructions and code for the workshop "From Big Data to NLP Insights: Unlocking the Power of PySpark and Spark NLP"

databricks databricks-notebooks distributed-computing nlp pyspark spark spark-nlp workshop

Last synced: 08 Nov 2024

https://github.com/maropu/spark-data-repair-plugin

Provide functionality to build statistical models to repair dirty tabular data in Spark

data-repairing distributed-computing error-detection parallel-computing spark

Last synced: 08 Nov 2024

https://github.com/ravi72munde/scala-spark-cab-rides-predictions

A big data project for predicting prices of Uber/Lyft rides depending on the weather

predict-prices scala spark spark-streaming streaming uber weather

Last synced: 16 Dec 2024

https://github.com/microsoft/Azure-Synapse-Content-Recommendations-Solution-Accelerator

This is a solution accelerator for creating personalized content recommendations based on user activity.

azure-synapse-analytics power-bi spark

Last synced: 01 Nov 2024

https://github.com/archivesunleashed/docker-aut

Docker image for the Archives Unleashed Toolkit

archives-unleashed aut docker docker-image spark webarchives

Last synced: 11 Nov 2024

https://github.com/allegro/camus-compressor

Camus Compressor merges files created by Camus and saves them in a compressed format.

avro etl hadoop kafka spark

Last synced: 06 Nov 2024

https://github.com/r-spark/sparkhail

A sparklyr extension for Hail

hail r spark sparklyr

Last synced: 13 Nov 2024

https://github.com/data-tools/big-data-types

A library to transform Scala product types and Schemes from different systems into other Schemes. Any implemented type automatically gets methods to convert it into the rest of the types and vice versa. E.g: a Spark Schema can be transformed into a BigQuery table.

apache-spark bigquery bigquery-tables cassandra circe database-types scala schemas spark typeclass typeclass-derivation typesafe

Last synced: 12 Oct 2024

https://github.com/exasol/spark-connector

A connector for Apache Spark to access Exasol

apache-spark connector exasol exasol-integration spark streaming

Last synced: 02 Nov 2024

https://github.com/sysgears/akka-spark-pipeline

An example project that implements a data pipeline using Scala, Akka, and Spark and works with document-oriented and graph databases to let you find out how frequently a specific technology is used with different technology stacks.

akka akka-http akka-streams mongodb neo4j scala spark spark-graphx

Last synced: 16 Nov 2024

https://github.com/allwefantasy/mlsql

New Repo: https://github.com/byzer-org/kolo-lang

mlsql ray spark sql

Last synced: 11 Oct 2024

https://github.com/blaze-init/spark-blaze-extension

A blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

arrow datafusion spark

Last synced: 01 Nov 2024

https://github.com/xd-deng/diy-a-cluster

How to Do-It-Yourself A Cluster for Spark & Hadoop

cluster-computing hadoop spark

Last synced: 16 Oct 2024

https://github.com/apache/kyuubi-docker

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

data-lake hadoop hive jdbc kubernetes spark spark-sql sql thrift

Last synced: 07 Oct 2024

https://github.com/vsouza/spark-kinesis-redshift

Example project for consuming AWS Kinesis streams and saving the data to Amazon Redshift using Apache Spark

aws aws-kinesis aws-kinesis-stream aws-redshift etl etl-pipeline python shell spark spark-streaming

Last synced: 17 Nov 2024

https://github.com/hadesarchitect/caspark

Cassandra + Spark = ❤️ Machine Learning with Apache Spark & Cassandra

cassandra jupyter machine-learning spark

Last synced: 12 Oct 2024

https://github.com/chaokunyang/bigdata-examples

Big data examples for Spark and Flink

bigdata flink hadoop monitor python samples spark spark-sql sparkml

Last synced: 19 Nov 2024

https://github.com/hbutani/icebergsql

Integration of Iceberg table management into Spark SQL

iceberg spark sql

Last synced: 31 Oct 2024

https://github.com/manuparra/masterdatcom_bdcc_practice

Practice and Workshop on BigData and Cloud Computing using Docker Containers and OpenNebula. HDFS, hadoop and spark+R

bigdata cloudcomputing containers docker hadoop hdfs linux opennebula practices spark sparkr

Last synced: 07 Nov 2024

https://github.com/jhleeeme/fake-data-pipeline

Data Generators -> Kafka -> Spark Streaming -> PostgreSQL -> Grafana

data-engineering data-pipeline docker docker-compose grafana kafka postgresql scala spark

Last synced: 17 Jan 2025

https://github.com/anskarl/parsimonious

Parsimonious is a helper library for encoding/decoding Apache Thrift and Twitter Scrooge classes to Spark Dataframes and Jackson JSON.

deserialization jackson json serialization spark thrift

Last synced: 31 Oct 2024

https://github.com/selimhorri/spark-application

A Java application that uses Apache Spark to handle both batch and stream processing

dataframes-api java mysql spark spark-batch spark-sql spark-streaming

Last synced: 14 Oct 2024

https://github.com/baghelamit/video-stream-classification

Video Stream Classification

java kafka opencv spark tensorflow

Last synced: 10 Nov 2024

https://github.com/tupol/spark-tools

Executable Apache Spark Tools: Format Converter & SQL Processor

apache-spark converts format-converter scala spark sql tools

Last synced: 12 Oct 2024

https://github.com/newfront/odsc-west-streaming-trends

All Data, Relevant Information, Scripts, and Applications for the Open Data Science Conference (2018)

ml spark spark-streaming

Last synced: 02 Dec 2024

https://github.com/minzhang-1/PointHop-PointHop2_Spark

A fast and low memory requirement version of PointHop and PointHop++, which is built upon Apache Spark.

3d 3d-classification classification feature-extraction knn least-square-regression pca point-cloud pyspark python spark

Last synced: 28 Oct 2024

https://github.com/eto-ai/spark-video

Processing videos on Apache Spark

ffmpeg opencv spark

Last synced: 24 Nov 2024

https://github.com/fabianmurariu/website-categories-nn

Build a deep learning model predicting categories from dmoz datasource

deep-learning deep-neural-networks keras spark tensorflow

Last synced: 18 Jan 2025

https://github.com/manuzhang/jupyterlab_spark

Spark Application UI extension for JupyterLab

jupyterlab jupyterlab-extension spark typescript

Last synced: 17 Nov 2024

https://github.com/stefen-taime/etl-data-pipeline-rdbms-to-hdfs-using-airflow-apache-sqoop-spark-postgres-and-hive

This project aims to move data from a relational database management system (RDBMS) to the Hadoop Distributed File System (HDFS)

airflow big-data data docker-compose etl-pipeline hdfs hive infrastructure-as-code rdbms spark sql sqoop

Last synced: 17 Jan 2025

https://github.com/miquido/datascience

Useful scripts and notebooks for Data Science. The project was made by Miquido. https://www.miquido.com/

aws-s3 docker machine-learning pipeline pyspark pyspark-mllib pyspark-notebook pyspark-tutorial spark

Last synced: 09 Nov 2024

https://github.com/zhaytam/realtimesentimentanalysis

A real-time sentiment analysis of Youtube comments using Python, Spark and Kafka

kafka python sentiment-analysis spark video webserver youtube

Last synced: 19 Dec 2024

https://github.com/jgperrin/net.jgp.books.spark.ch04

Spark in Action, 2nd edition - chapter 4

java manning spark sparkjava sparkwithjava

Last synced: 09 Nov 2024

https://github.com/jgperrin/net.jgp.books.spark.ch09

Spark in Action, 2e - chapter 9 - Advanced ingestion: finding data sources and building your own

apache-spark ingestion java java8 manning spark sparkwithjava

Last synced: 09 Nov 2024

https://github.com/sashgorokhov/pyspark-spy

Collect and aggregate on spark events for profitz

bigdata pyspark spark

Last synced: 27 Oct 2024

https://github.com/jgperrin/net.jgp.books.spark.ch17

Spark in Action, 2nd edition - chapter 17 - exporting data, using Delta Lake

apache-spark delta-lake java java8 manning spark sparkwithjava

Last synced: 09 Nov 2024

https://github.com/gvcgo/gogpt

A GPT TUI client with proxy supported.

chatgpt client go golang iflytek iflytek-spark openai proxy spark tui xf-spark xunfei xunfei-spark

Last synced: 11 Nov 2024

https://github.com/andrewpalumbo/mahout-samsara-book

Accompanying code examples for Apache Mahout: Beyond MapReduce. Distributed Algorithm Design.

distributed-algorithm mahout mahout-samsara-book spark spark-mllib-naivebayes

Last synced: 08 Nov 2024

https://github.com/sandervanhooft/vaporize-spark-mollie

Run Spark for Mollie on Laravel Vapor

laravel mollie saas spark vapor

Last synced: 07 Nov 2024

https://github.com/aamend/pathogen

The rooster crows immediately before sunrise, the rooster causes the sun to rise

big-data bigdata causation contagion correlation datascience fcm graph graphx machine-learning spark

Last synced: 08 Nov 2024

https://github.com/getyourguide/ddataflow

A tool to help you to test and develop pyspark code with sampled and local data

machine-learning python spark

Last synced: 14 Nov 2024

https://github.com/aphp/uimaonspark

Way to run Uima Pipelines on Apache Spark

spark uima

Last synced: 25 Nov 2024

https://github.com/hsiehshujeng/cdk-emrserverless-with-delta-lake

This construct builds some elements for you to quickly launch an EMR Serverless application. After submitting the EMR Serverless job, you can also launch an EMR notebook via a cluster template to check the outcome from the EMR Serverless application.

aws aws-cloudformation aws-service-catalog cdk-constructs delta-lake dotnet emr-notebooks emr-serverless emr-studio golang java javacript projen python serverless spark

Last synced: 16 Nov 2024

https://github.com/googlecloudplatform/dataproc-scala-examples

Dataproc Scala Examples is an effort to assist in the creation of Spark jobs written in Scala to run on Dataproc.

airflow composer dataproc gcp scala spark

Last synced: 19 Dec 2024

https://github.com/xianwill/spark-boilerplate

A boilerplate for spark projects with docker support for local development and scripts for emr support.

apache-spark boilerplate docker emr emr-cluster spark

Last synced: 14 Oct 2024

https://github.com/anant/cassandra.lunch

Resources from weekly Zoom lunches revolving around Apache Cassandra and Apache Cassandra-related topics. Hosted by Anant Corporation.

airflow akka astra cassandra datastax elk kafka nosql scylladb spark

Last synced: 18 Nov 2024

https://github.com/dirkster99/pynotes

My notebook on using Python with Jupyter Notebook, PySpark etc

dataframe jupyter-notebook panda pandas-dataframe parquet pyspark python spark spark-sql sparknlp

Last synced: 01 Jan 2025

https://github.com/miraisolutions/sparkgeo

Sparklyr extension package providing geospatial analytics capabilities

geospatial-analytics r spark sparklyr udf

Last synced: 18 Nov 2024

https://github.com/zekeriyyaa/traffic-data-analysis-with-apache-spark-based-on-mobile-robot-data

Mobile robot data were analyzed with Apache Spark to produce five statistical results: travel time, waiting time, average speed, occupancy, and density.

agv apache-spark big-data data-analysis data-visualization industrial-robot mobile-robot mongodb mssql pyqt5 pyspark python spark

Last synced: 09 Nov 2024

https://github.com/codingcat/kittenwhisker

Debugging performance issues for Spark applications

apache-spark debugging flamegraph jvm jvm-performance performance spark

Last synced: 13 Oct 2024

https://github.com/jgperrin/net.jgp.books.spark.ch11

Spark in Action, 2nd edition - chapter 11 - Working with SQL

apache-spark java java8 manning spark spark-sql sparkwithjava sql

Last synced: 09 Nov 2024

https://github.com/duhanmin/bigdata-sql-parser

Data lineage: supports lineage parsing for Spark SQL, Hive SQL, PostgreSQL SQL, Presto SQL, MySQL SQL, TiDB SQL, Flink SQL, DataX, and Spark/Flink jar run commands; supports WITH syntax

datax flink hive mysql postgresql presto spark tidb trino

Last synced: 05 Nov 2024

https://github.com/brooksian/churnbabychurn

Telco Churn - Ensemble and Stacked Classifier Models

cdsw machine-learning spark

Last synced: 18 Nov 2024

https://github.com/x4ax/lxss-install-zeppelin

Step by step guide on how to install Zeppelin 0.7.3 on Linux subsystem (WSL) for Windows 10

hadoop linux-subsystem lxss spark wsl zeppelin

Last synced: 04 Dec 2024

https://github.com/unosd/sparksharp

C# Livy client to submit Spark jobs to HDInsight and other Spark clusters

azure cosmos cosmos-db cosmosdb csharp hdinsight livy spark

Last synced: 16 Nov 2024

https://github.com/asvyatkovskiy/scabillmatch

Policy diffusion in the US legislature

data-frame graph policy-diffusion spark tf-idf

Last synced: 18 Oct 2024

https://github.com/newfront/odsc-east-2020-decision-intelligence

This is the home of the 2020 Open Data Science Conference workshop (Creating Streaming Predictive Analytics and Decision Intelligence Systems with Apache Spark)

decision-intelligence-systems odsc odsc-east-2020 spark

Last synced: 02 Dec 2024

https://github.com/archivesunleashed/twut

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

apache-spark spark spark-packages tweets twitter-data twitter-json

Last synced: 12 Oct 2024

https://github.com/airscholar/sparkingflow

This project demonstrates how to use Apache Airflow to submit jobs to an Apache Spark cluster in different programming languages, using Python, Scala, and Java as examples.

apache-airflow dataengineering docker java pyspark scala spark

Last synced: 14 Nov 2024

https://github.com/da91666/daph

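Daph is a general-purpose, platform-level tool for data synchronization and data processing. It offers rich data synchronization capabilities and powerful data processing capabilities, meets all data development needs in one place, and can be used to build visual, configuration-driven data synchronization and processing platforms.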

bigdata etl flink spark

Last synced: 11 Oct 2024

https://github.com/fancellu/graphx-citymap

CityMap coding test plus 3 solutions, 1 with Spark/GraphX

graphx scalatest spark

Last synced: 10 Nov 2024

https://github.com/hibayesian/spark-fim

A library of scalable frequent itemset mining algorithms based on Spark

frequent-itemset-mining machine-learning spark

Last synced: 23 Nov 2024