Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/flint-bot/flint

Webex Bot SDK for Node.js (deprecated in favor of https://github.com/webex/webex-bot-node-framework)

cisco spark

Last synced: 19 Dec 2024

https://github.com/snowch/movie-recommender-demo

This project walks through how you can create recommendations using Apache Spark machine learning. There are a number of Jupyter notebooks that you can run on IBM Data Science Experience, and there is a live demo of a movie recommendation web application you can interact with. The demo also uses IBM Message Hub (Kafka) to push application events to a topic, where they are consumed by a Spark Streaming job running on IBM BigInsights (Hadoop).

alternating-least-squares biginsights bluemix bokeh cloudant collaborative-filtering dsx hadoop hive ibm-biginsights ibm-bluemix jupyter-notebook kafka machine-learning messagehub notebook python-flask-application redis spark spark-streaming

Last synced: 17 Nov 2024

https://github.com/rogaha/data-processing-pipeline

Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra

cassandra digital-ocean docker-machine kafka spark twitter twitter-streaming-api visualization

Last synced: 06 Nov 2024

https://github.com/malexer/pytest-spark

pytest plugin for running tests with PySpark support

pytest python spark unit-test unittest

Last synced: 19 Jan 2025

https://github.com/apache/doris-spark-connector

Spark Connector for Apache Doris

apache connector data-warehousing dbms doris mpp olap spark

Last synced: 19 Jan 2025

https://github.com/kakao/cuesheet

A framework for writing Spark 2.x applications in a pretty way

apache-spark magic mango scala spark yarn

Last synced: 22 Jan 2025

https://github.com/apache/spark-kubernetes-operator

Apache Spark Kubernetes Operator

java kubernetes spark

Last synced: 21 Jan 2025

https://github.com/oeljeklaus-you/sparkcore

Spark source code analysis, mainly covering the SparkContext source, Executor process startup, Stage division, Task execution, and new features in Spark 2.0

scala spark spark-learning sparkcore

Last synced: 12 Nov 2024

https://github.com/huangyueranbbc/SparkDemo

Complete Spark example code (Java, Scala)

bigdata hadoop operator spark spark-sql spark-streaming sparkfun-products sparkjava sparkline sparkp

Last synced: 30 Oct 2024

https://github.com/chabane/generator-mitosis

A micro-service infrastructure generator based on Yeoman/Chatbot, Kubernetes/Docker Swarm, Traefik, Ansible, Jenkins, Spark, Hadoop, Kafka, etc.

ansible chatbot docker elasticsearch golang jenkins kafka kibana kubernetes logstash machine-learning rust sonarqube spark swarm traefik vagrant yeoman-generator

Last synced: 01 Nov 2024

https://github.com/Chabane/generator-mitosis

A micro-service infrastructure generator based on Yeoman/Chatbot, Kubernetes/Docker Swarm, Traefik, Ansible, Jenkins, Spark, Hadoop, Kafka, etc.

ansible chatbot docker elasticsearch golang jenkins kafka kibana kubernetes logstash machine-learning rust sonarqube spark swarm traefik vagrant yeoman-generator

Last synced: 04 Nov 2024

https://github.com/docandrew/CuBit

General-purpose, formally-verified, 64-bit operating system in SPARK/Ada for x86-64

ada os spark x86-64

Last synced: 25 Oct 2024

https://github.com/simplexspatial/osm4scala

Scala and Spark library focused on reading OpenStreetMap Pbf files.

gis openstreetmap openstreetmap-pbf-files osm pbf scala spark

Last synced: 11 Oct 2024

https://github.com/P7h/docker-spark

🚢 Docker image for Apache Spark

docker hadoop java scala spark

Last synced: 29 Oct 2024

https://github.com/starlake-ai/starlake

Declarative text-based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.

bigquery data-engineering data-integration data-pipeline etl hdfs redshift snowflake spark synapse

Last synced: 22 Jan 2025

https://github.com/azure/azure-kusto-spark

Apache Spark Connector for Azure Kusto

azure kusto scala spark

Last synced: 21 Jan 2025

https://github.com/ehsanmok/spark-lp

Distributed Linear Programming Solver on top of Apache Spark

distributed-computing distributed-optimization high-performance linear-programming scala spark

Last synced: 10 Jan 2025

https://github.com/coxautomotivedatasolutions/waimak

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.

data-engineering hadoop scala spark

Last synced: 12 Oct 2024

https://github.com/yokawasa/databricks-notebooks

Collection of sample Databricks Spark notebooks (mostly for Azure Databricks)

azure azuredatabricks databricks elt python spark streaming

Last synced: 30 Oct 2024

https://github.com/cbilgili/zemberek-nlp-server

REST Docker server built on the Zemberek Turkish NLP Java library

docker javascript nlp part-of-speech-tagger rest sentence-tokenizer spark turkish turkish-language zemberek

Last synced: 12 Nov 2024

https://github.com/hibayesian/spark-fm

A parallel implementation of factorization machines based on Spark

factorization-machines machine-learning spark

Last synced: 23 Nov 2024

https://github.com/wey-gu/nebulagraph-ai

(Pre-alpha) NebulaGraph AI high-level API: run graph algorithms and analytics in 4 lines of code.

graph graph-algorithms hacktoberfest nebulagraph networkx spark

Last synced: 19 Jan 2025

https://github.com/swoop-inc/spark-records

Bulletproof Apache Spark jobs with fast root cause analysis of failures.

apache-spark big-data scala spark spark-records sparksql swoop

Last synced: 12 Oct 2024

https://github.com/wookey-project/ewok-kernel

A secure, high-performance microkernel for building secure MCU-based IoT devices

ada arm armv7m embedded ewok ewok-kernel microcontroller microcontroller-firmware microkernel security spark

Last synced: 25 Oct 2024

https://github.com/zaratsian/spark

Apache Spark (Scala, PySpark, SparkR) Code, Tricks, and References

machine-learning nlp pyspark spark text-analysis

Last synced: 07 Nov 2024

https://github.com/jaceklaskowski/spark-kubernetes-book

The Internals of Spark on Kubernetes

apache-spark book internals kubernetes spark

Last synced: 12 Oct 2024

https://github.com/vesoft-inc/nebula-algorithm

Nebula-Algorithm is a Spark application based on GraphX that enables state-of-the-art graph algorithms to run on top of NebulaGraph and write results back to NebulaGraph.

graph-algorithm graph-database graphx hacktoberfest nebulagraph spark

Last synced: 21 Jan 2025

https://github.com/src-d/jgit-spark-connector

jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.

datasource git pyspark python scala spark

Last synced: 16 Dec 2024

https://github.com/samelamin/spark-bigquery

Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.

bigquery data-frame schema spark

Last synced: 12 Oct 2024

https://github.com/ibm/sparksql-for-hbase

Learn how to use Spark SQL and the HSpark connector package to create and query data tables that reside in HBase region servers

apache-spark hadoop-hdfs hbase ibmcode nosql spark sql

Last synced: 12 Oct 2024

https://github.com/wanghan0501/usersessionbehaviorofflineanalysis

Sichuan University and 拓思爱诺 project for offline analysis of user session behavior data

spark

Last synced: 11 Nov 2024

https://github.com/jwplayer/sparksteps

⭐ CLI tool to launch Spark jobs on AWS EMR

aws aws-emr python spark

Last synced: 05 Nov 2024

https://github.com/makelove/python_master_courses

Life is short, I use Python

course python scrapy spark

Last synced: 19 Nov 2024

https://github.com/radanalyticsio/silex

something to help you spark

scala silex spark

Last synced: 05 Nov 2024

https://github.com/zuinnote/hadoopoffice

HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)

analyze-office-documents bigdata excel flink hadoop hadoop-ecosystem hadoopoffice hive office poi spark

Last synced: 14 Oct 2024

https://github.com/ansrivas/spark-structured-streaming

Spark structured streaming with Kafka data source and writing to Cassandra

cassandra kafka kafka-topic spark

Last synced: 14 Oct 2024

https://github.com/scylladb/scylla-migrator

Migrate data to Scylla using Spark, typically from Cassandra or Parquet files. Alternatively, migrate from DynamoDB to Scylla Alternator.

alternator dynamodb migration scylladb spark

Last synced: 21 Jan 2025

https://github.com/ing-bank/entitymatchingmodel

Entity Matching Model solves the problem of matching company names between two possibly very large datasets.

entity-matching pandas spark

Last synced: 20 Jan 2025

https://github.com/garystafford/kafka-connect-msk-demo

For a series of posts on Amazon MSK, Amazon EKS, and Amazon EMR

aws kafka kafka-connect kubernetes spark spark-streaming

Last synced: 06 Dec 2024

https://github.com/Tubular/sparkly

Helpers & syntactic sugar for PySpark.

pyspark python spark

Last synced: 18 Nov 2024

https://github.com/spatialx-project/geolake

Universal solution for geospatial data tailored to data lakehouse systems for the first time in the industry

geospatial geospatial-analysis geospatial-processing iceberg spark spatial spatial-data

Last synced: 22 Jan 2025

https://github.com/vivek-bombatkar/mylearningnotes

Because it's never too late to start taking notes and make them public...

blockchain hadoop hive pandas python spark sparkml

Last synced: 21 Jan 2025

https://github.com/rubenafo/docker-spark-cluster

A Spark cluster setup running on Docker containers

big-data docker docker-image hadoop openjdk scala spark

Last synced: 13 Oct 2024

https://github.com/zhuyuqing/bestconf

A tool automatically improving the performance of large-scale systems by finding better configuration settings

benchmark cassandra configuration hadoop hive mysql optimization performance spark tomcat tuning

Last synced: 05 Nov 2024

https://github.com/potix2/spark-google-spreadsheets

Google Spreadsheets datasource for SparkSQL and DataFrames

data-frame scala spark sparksql spreadsheet

Last synced: 14 Oct 2024

https://github.com/jaceklaskowski/spark-streaming-notebook

Notes about Spark Streaming in Apache Spark

apache-spark notebook spark spark-streaming

Last synced: 08 Nov 2024

https://github.com/wordpress/openverse-catalog

Identifies and collects data on CC-licensed content across web crawl data and public APIs.

airflow apache-airflow creative-commons hacktoberfest openverse pytest python search-engine spark

Last synced: 19 Jan 2025

https://github.com/turboway/pybigdata

Working with various big data components using Python

elasticsearch hadoop hbase hive impala kafka mapreduce spark

Last synced: 15 Nov 2024

https://github.com/googlecloudplatform/serverless-spark-workshop

Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service

apache-spark autoscaling bigdata dataproc hadoop serverless solution-accelerator spark usecases

Last synced: 07 Oct 2024

https://github.com/kislerdm/data-engineering-interviews

Data engineering interview Q&A for the data community, by the data community

dataengineering interview-questions kafka linux opensource python spark sql

Last synced: 11 Nov 2024

https://github.com/fancellu/zio-restful-webservice

ZIO 2.0 Restful webservice example using zio, zio-http, zio-json, quill, H2, twirl, zio-streams, zio-cache, zio-logging, zio-actors, zio-spark, openai, DallE

dalle2 h2-database openai quill scala spark twirl zio zio-actors zio-cache zio-http zio-logging zio-spark zio-streams

Last synced: 10 Nov 2024

https://github.com/goldmansachs/tablasco

Tablasco is a JUnit rule for comparing tables and a Spark module for comparing large data sets

avro integration java junit regression spark tablasco testing

Last synced: 07 Nov 2024

https://github.com/myamafuj/hadoop-hive-spark-docker

Hadoop-Hive-Spark cluster + Jupyter on Docker

docker hadoop hive jupyter jupyter-notebook pyspark spark

Last synced: 11 Nov 2024

https://github.com/yaooqinn/itachi

A library that brings useful functions from various modern database management systems to Apache Spark

hive postgres presto spark trino

Last synced: 15 Oct 2024

https://github.com/felixcheung/spark-notebook-examples

Some notebook examples related to Apache Spark, IPython / Jupyter, Zeppelin

ipython jupyter notebook spark zeppelin

Last synced: 12 Oct 2024

https://github.com/particle-iot/spark-sdk-ios

DEPRECATED Particle iOS Cloud SDK. Use -->

carthage cocoapods ios ipad iphone particle sdk spark

Last synced: 10 Nov 2024

https://github.com/aws/sagemaker-sparkml-serving-container

This code is used to build & run a Docker container for performing predictions against a Spark ML Pipeline.

inference inference-pipeline machine-learning mleap mleap-serialized-spark pipeline sagemaker serving spark sparkml

Last synced: 07 Oct 2024

https://github.com/iobruno/data-engineering-zoomcamp

Data Engineering examples for Airflow, Prefect, and Mage.ai; dbt for BigQuery, Redshift, ClickHouse, PostgreSQL; Spark/PySpark for Batch processing; and Kafka for Stream processing

airflow airflow-dags dbt-bigquery dbt-clickhouse dbt-postgres dbt-redshift kafka ksqldb mageai prefect pyspark spark typer-cli

Last synced: 14 Dec 2024

https://github.com/vigneshss-07/cloud-ai-analytics

This repo contains details about data engineering tech stacks on GCP

apachebeam bigdata bigquery clouddataflow cloudsql datalab google-cloud-platform spark

Last synced: 20 Jan 2025

https://github.com/distributedsystemsgroup/zoe

Zoe: Container Analytics as a Service -- mirror of https://gitlab.eurecom.fr/zoe/main/

analytics containers data jupyter python spark

Last synced: 13 Nov 2024

https://github.com/yaooqinn/spark-ranger

Merged into apache/incubator-kyuubi. ACL management for Apache Spark SQL with Apache Ranger.

acl authorization data-masking ranger row-level-security spark sparksql

Last synced: 01 Oct 2024

https://github.com/squashql/squashql

Official repository of SquashQL, the SQL query engine for multi-dimensional and hierarchical analysis that empowers your SQL database

bigquery clickhouse database duckdb java jdbc query querybuilder snowflake spark sql typescript

Last synced: 14 Dec 2024

https://github.com/logicalclocks/feature-store-api

Python - Java/Scala API for the Hopsworks feature store

feature-store hopsworks hsfs python scala spark

Last synced: 21 Jan 2025

https://github.com/maicius/weblogsanalysissystem

A big data platform for analyzing web access logs

echarts hadoop hbase scala spark

Last synced: 11 Nov 2024

https://github.com/paypal/PPExtensions

Set of IPython and Jupyter extensions to improve user experience

gimel hive ipython-magic jupyer jupyter-extension magics notebooks spark tableau teradata

Last synced: 07 Nov 2024

https://github.com/zaleslaw/spark-tutorial

How to build your first Spark application with MLlib, Structured Streaming, GraphFrames, Datasets and so on? The answer is here!

kafka spark streaming structured-streaming

Last synced: 17 Nov 2024

https://github.com/geotrellis/geotrellis-chatta-demo

Demo of GeoTrellis - weighted overlay and zonal summary for University of Tennessee at Chattanooga.

chattanooga geodocker-cluster geotrellis s3 spark

Last synced: 11 Nov 2024

https://github.com/uwplse/casper

A compiler for automatically re-targeting sequential Java code to Apache Spark.

compiler dafny java spark synthesis z3

Last synced: 25 Nov 2024

https://github.com/zhonghuasheng/java_love_go

Focused on Java and Golang! Java basics, Java Core, JVM, the Spring family, the Golang language, and various middleware (such as rabbitmq, netty, mybatis, redis, mongodb, Spark, etc.)

java java-8 jdbc mybatis spark spring springboot springmvc

Last synced: 16 Jan 2025

https://github.com/Merck/rdf2x

RDF2X converts big RDF datasets to the relational database model, CSV, JSON and ElasticSearch.

conversion json linked-data postgresql rdf spark sparql sql

Last synced: 18 Jan 2025

https://github.com/contiamo/rhombic

SQL parsing, lineage extraction and manipulation

lineage parser postgresql spark sql sql-lineage

Last synced: 07 Nov 2024

https://github.com/hydrospheredata/spark-ml-serving

Spark ML Lib serving library

inference scoring serving spark

Last synced: 28 Nov 2024

https://github.com/scalad/book

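This project collects some good books I have read or heard about over the years. I came across them while tidying up my files; deleting them seemed a pity, yet keeping them wasted space, so in the spirit of sharing I am making them available, both to offer these books to readers who need them and to build up something like a knowledge base.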

java linux mysql pdf scala spark spring

Last synced: 05 Nov 2024

https://github.com/jonathandinu/spark-ray-data-science

Supporting content (slides and exercises) for the Pearson video series covering best practices for developing scalable applications with Spark and Ray in the context of a data scientist's standard workflow.

artificial-intelligence data-science distributed-computing machine-learning python ray spark

Last synced: 15 Nov 2024

https://github.com/dcfjs/dcf

Yet another distributed compute framework

distributed-computing nodejs spark

Last synced: 22 Jan 2025

https://github.com/wzdnzd/bigdata-notes

BigData Learning Notes

bigdata hadoop spark

Last synced: 16 Jan 2025

https://github.com/mozilla/emr-bootstrap-spark

AWS bootstrap scripts for Mozilla's flavoured Spark setup.

aws jupyter spark zeppelin

Last synced: 29 Sep 2024

https://github.com/merck/rdf2x

RDF2X converts big RDF datasets to the relational database model, CSV, JSON and ElasticSearch.

conversion json linked-data postgresql rdf spark sparql sql

Last synced: 16 Nov 2024

https://github.com/rstudio/sparkxgb

R interface for XGBoost on Spark

apache-spark machine-learning r rstats spark xgboost

Last synced: 10 Nov 2024

https://github.com/xd-deng/spark-ml-intro

PySpark Machine Learning Examples

machine-learning spark

Last synced: 16 Oct 2024