Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/felipekunzler/spark-twitter-analysis

Analyse a Twitter dataset with Spark and visualize the results on a React dashboard.

java reactjs scala spark

Last synced: 31 May 2024

https://github.com/Angel-ML/angel

A Flexible and Powerful Parameter Server for large-scale machine learning

high-dimensional machine-learning model online-learning parameter-server scala spark spark-streaming

Last synced: 31 May 2024

https://github.com/rezacsedu/Mining-Maximal-Frequent-Pattern-Spark

Implementation of the static mining part of "Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach", Information Sciences, Volume 432, March 2018, Pages 278-300

data-mining data-stream frequent-pattern-mining java maximal-frequent-pattern spark structured-streaming

Last synced: 31 May 2024

https://github.com/feng-li/Distributed-Statistical-Computing

Teaching Materials for Distributed Statistical Computing (teaching materials for big data distributed computing)

hadoop mapreduce pyspark-tutorial spark spark-teaching statistical-models

Last synced: 31 May 2024

https://github.com/zhonghuasheng/Tutorial

A summary of the full-stack knowledge architecture for back-end development (Java, Golang)

emsp go java keepalived mongodb mqtt mysql netty redis rocketmq spark spring springboot springcloud tomcat tutorial

Last synced: 31 May 2024

https://github.com/aalansehaiyang/technology-talk

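[Big-tech interview column] A technical guide for Java programmers, covering interview questions, system architecture, career tips, mainstream middleware, and more, to help you become a better version of yourself!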

dubbo es6 git hbase java kafka mycat spark spring springboot

Last synced: 31 May 2024

https://github.com/zhisheng17/flink-learning

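Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, internals, hands-on practice, performance tuning, and source code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, and more, plus case studies of large Flink production projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). Support for my column "Big Data Real-Time Computing Engine: Flink in Action and Performance Optimization" is welcome.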

clickhouse elasticsearch flink hbase influxdb kafka loki mysql opentsdb rabbitmq redis rocketmq spark stream-processing streaming

Last synced: 31 May 2024

https://github.com/apache/doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery database dbt delta-lake elt etl hadoop hive hudi iceberg lakehouse olap query-engine real-time redshift snowflake spark sql

Last synced: 31 May 2024

https://github.com/XZB-1248/Spark

✨ Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you to control and monitor all your devices from anywhere.

dashboard go golang rat remote-access-tool remote-admin-tool remote-administration-tool remote-control server-monitoring shell spark webshell

Last synced: 31 May 2024

https://github.com/water8394/BigData-Interview

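:dart: :star2: [Big data interview questions] A collection of big data interview questions gathered from around the web, with my own summarized answers. Currently covers interview knowledge for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.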

bigdata flink hadoop hbase hdfs interview interview-questions kafka mapreduce spark yarn

Last synced: 31 May 2024

https://github.com/liyupi/sql-generator

🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor. A simple project (heavy on logic, light on UI) that is well suited for practice.

ant-design bigdata hive javascript json monaco-editor mysql spark sql typescript vite vue vue3

Last synced: 30 May 2024

https://github.com/miztiik/s3-to-rds-with-glue

Extract, transform, and load data for analytic processing using AWS Glue

cdk cloud-development-kit etl glue glue-catalog glue-job miztiik-automation s3-to-rds spark

Last synced: 27 May 2024

https://github.com/AuFeld/Data_Engineering_Projects

A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs

airflow aws cassandra data-engineering data-lake data-warehouse docker emr etl-pipeline infrastructure-as-code infrastructure-setup postgresql python redshift s3 spark

Last synced: 27 May 2024

https://github.com/Qihoo360/Quicksql

A Flexible, Fast, Federated(3F) SQL Analysis Middleware for Multiple Data Sources

flink hive spark sql

Last synced: 26 May 2024

https://github.com/dharmeshkakadia/tpch-hdinsight

TPCH benchmark for various engines

benchmarking hive llap presto spark tpch

Last synced: 26 May 2024

https://github.com/dharmeshkakadia/tpcds-hdinsight

TPCDS benchmark for various engines

benchmarking hive llap presto spark tpcds

Last synced: 26 May 2024

https://github.com/lw-lin/CoolplaySpark

Coolplay Spark: Spark source code analysis, Spark libraries, and more

apache-spark spark spark-streaming sparkcore structured-streaming

Last synced: 26 May 2024

https://github.com/jadianes/spark-py-notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark

Last synced: 26 May 2024

https://github.com/lightbend/cloudflow

Cloudflow enables users to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes.

akka cloudflow flink kubernetes microservices-architectures spark streaming-applications streaming-data streaming-runtimes

Last synced: 26 May 2024

https://github.com/angelotc/MacroDAG

A Dockerized Airflow ETL pipeline that processes macroeconomic indicators from the Federal Reserve.

airflow docker spark

Last synced: 26 May 2024

https://github.com/elasticluster/elasticluster

Create clusters of VMs on the cloud and configure them with Ansible.

ansible azure cloud cluster clustering ec2 gcp gridengine hadoop hpc python slurm spark

Last synced: 26 May 2024

https://github.com/GaiZhenbiao/ChuanhuChatGPT

GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.

chatbot chatglm chatgpt-api claude dalle3 ernie gemini gemma inspurai llama midjourney minimax moss ollama qwen spark stablelm

Last synced: 25 May 2024

https://github.com/yahoo/lopq

Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark.

clustering lopq nearest-neighbor-search product-quantization spark

Last synced: 24 May 2024

https://github.com/radanalyticsio/spark-operator

Operator for managing the Spark clusters on Kubernetes and OpenShift.

apache-spark kubernetes kubernetes-operator openshift spark

Last synced: 22 May 2024

https://github.com/vmitchell85/spark-kiosk-notify

Adds a notification panel to your Laravel Spark Kiosk, allowing you to send notifications to users.

laravel notifications spark

Last synced: 21 May 2024

https://github.com/gilbitron/spark-create-stripe-plans

A simple Laravel artisan command to create Spark plans in Stripe

laravel laravel-artisan-command spark stripe

Last synced: 21 May 2024

https://github.com/cretueusebiu/laravel-spark-camera

Profile Photo Camera support for Laravel Spark

camera laravel laravel-spark php spark

Last synced: 21 May 2024

https://github.com/cretueusebiu/laravel-spark-google2fa

Google Authenticator support for Laravel Spark

authenticator laravel laravel-spark php spark

Last synced: 21 May 2024

https://github.com/leobenkel/Zparkio

Boilerplate framework for using Spark and ZIO together.

boiler-plate functional-programming helpers scala spark template zio

Last synced: 20 May 2024

https://github.com/rstudio-conf-2020/big-data

:wrench: Use dplyr to analyze Big Data :elephant:

databases dplyr r rstudio spark sparklyr workshop

Last synced: 20 May 2024

https://github.com/databricks/koalas

Koalas: pandas API on Apache Spark

big-data data-science dataframe mlflow pandas pydata spark

Last synced: 18 May 2024

https://simplexspatial.github.io/osm4scala/

Scala and Spark library focused on reading OpenStreetMap Pbf files.

gis openstreetmap openstreetmap-pbf-files osm pbf scala spark

Last synced: 17 May 2024

https://github.com/lucidworks/spark-solr

Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.

solr spark

Last synced: 17 May 2024

https://github.com/WeBankFinTech/DataSphereStudio

DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

airflow atlas azkaban dataworks davinci dolphinscheduler flink governance griffin hadoop hive hue kettle linkis spark supperset tableau visualis workflow zeppelin

Last synced: 16 May 2024

https://github.com/getredash/redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

analytics athena bi bigquery business-intelligence dashboard databricks hacktoberfest javascript mysql postgresql python redash redshift spark spark-sql visualization

Last synced: 16 May 2024

https://github.com/TIBCOSoftware/snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster

analytics memory-database scale snappydata spark stream transaction

Last synced: 16 May 2024

https://github.com/apache/linkis

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

application-manager context-service engine hive hive-table impala jdbc jobserver linkis livy presto pyspark resource-manager rest-api scriptis spark sql storage thrift-server udf

Last synced: 16 May 2024

https://github.com/gchq/Gaffer

A large-scale entity and relation database supporting aggregation of properties

accumulo aggregation big-data graph graph-database hadoop hbase parquet spark

Last synced: 15 May 2024

https://github.com/Alluxio/alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud

alluxio data-analysis data-orchestration hadoop memory-speed presto spark tensorflow virtual-distributed-filesystem

Last synced: 15 May 2024

https://github.com/tfayyaz/awesome-azure-databricks

Awesome content all about Azure Databricks

awesome awesome-list azure azure-databricks delta-lake spark

Last synced: 14 May 2024

https://github.com/jonathandinu/spark-ray-data-science

Supporting content (slides and exercises) for the Pearson video series covering best practices for developing scalable applications with Spark and Ray in the context of a data scientist's standard workflow.

artificial-intelligence data-science distributed-computing machine-learning python ray spark

Last synced: 14 May 2024

https://github.com/eto-ai/rikai

Parquet-based ML data format optimized for working with unstructured data

deep-learning machine-learning pytorch spark tensorflow

Last synced: 14 May 2024

https://github.com/purduedb/knowledgecubes

Efficient RDF Data Management over Spark

data-management filtering rdf-data spark

Last synced: 14 May 2024

https://github.com/delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

acid analytics big-data delta-lake spark

Last synced: 14 May 2024

https://github.com/yeasy/docker_practice

Learn and understand Docker & container technologies, with real DevOps practice!

book cloud-computing container devops docker kubernetes linux mesos spark swarm

Last synced: 13 May 2024

https://github.com/SANSA-Stack/SANSA-Stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

apache-jena apache-spark distributed-computing flink rdf semantic-web spark

Last synced: 13 May 2024

https://github.com/ytsaurus/ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform.

big-data clickhouse distributed-database lakehouse olap-database spark sql ytsaurus

Last synced: 13 May 2024

https://github.com/h2oai/sparkling-water

Sparkling Water provides H2O functionality inside a Spark cluster

big-data h2o integration machine-learning pyspark pysparkling rsparkling scala spark

Last synced: 13 May 2024

https://github.com/bigdatagenomics/adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

avro big-data bioinformatics genomics java parquet python r scala spark

Last synced: 13 May 2024

https://github.com/LB-Yu/data-systems-learning

Learning summary and examples about data systems.

big-data distributed-systems flink hbase spark

Last synced: 11 May 2024

https://github.com/datamechanics/delight

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui

Last synced: 11 May 2024

https://github.com/FavioVazquez/ds-cheatsheets

List of Data Science Cheatsheets to rule the world

cheatsheet datascience jupyter programming python r spark

Last synced: 10 May 2024

https://github.com/salesforce/TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning

ai automated-machine-learning automl dsl einstein estimators feature-engineering features machine-learning ml pipelines salesforce scala spark sparkml structured-data transformations transformers transmogrification transmogrify

Last synced: 09 May 2024

https://github.com/uni-openai/uniai-maas

An opensource AI & model as a service platform.

ai chatglm chatgpt gpt kimichat midjourney moonshot spark stability-ai uniai

Last synced: 08 May 2024

https://github.com/archivesunleashed/twut

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

apache-spark spark spark-packages tweets twitter-data twitter-json

Last synced: 07 May 2024

https://github.com/archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 07 May 2024

https://github.com/archivesunleashed/notebooks

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

juypter-notebook notebooks pyspark-notebook python3 spark web-archives

Last synced: 07 May 2024

https://github.com/helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

archivespark internet-archive spark spark-framework warc web-archiving webarchive

Last synced: 07 May 2024

https://github.com/helgeho/HadoopConcatGz

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

hadoop spark warc web-archiving webarchive

Last synced: 07 May 2024

https://github.com/ohenley/awesome-ada

A curated list of awesome resources related to the Ada and SPARK programming languages

ada ada-binding ada-framework ada-language ada-library ada-programs awesome awesome-list gnat spark spark-ada

Last synced: 05 May 2024

https://github.com/blaze-init/spark-blaze-extension

A blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

arrow datafusion spark

Last synced: 03 May 2024

https://github.com/atalii/adage

Ada privilege escalation

ada security spark sudo

Last synced: 03 May 2024

https://github.com/WeBankFinTech/Scriptis

Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.

errorcode hive hive-table hql hue ide linkis pyspark resouce-management scala spark sql udf zeppelin

Last synced: 02 May 2024

https://github.com/AdaCore/RecordFlux

Formal specification and generation of verifiable binary parsers, message generators and protocol state machines

ada binary-parser communication-protocol formal-methods formal-specification formal-verification parser protocol-parser protocol-specification python spark

Last synced: 02 May 2024

https://github.com/docandrew/CuBit

General-purpose, formally-verified, 64-bit operating system in SPARK/Ada for x86-64

ada os spark x86-64

Last synced: 02 May 2024

https://github.com/RoaringBitmap/RoaringBitmap

A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others

bitset druid java lucene roaring-bitmaps roaringbitmap spark

Last synced: 02 May 2024

https://github.com/deeplearning4j/deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.

artificial-intelligence clojure deeplearning deeplearning4j dl4j gpu hadoop intellij java linear-algebra matrix-library neural-nets python scala spark

Last synced: 01 May 2024

https://github.com/simplexspatial/osm4scala

Scala and Spark library focused on reading OpenStreetMap Pbf files.

gis openstreetmap openstreetmap-pbf-files osm pbf scala spark

Last synced: 30 Apr 2024

https://github.com/jacksu/utils4s

A collection of test cases and related materials gathered while using Scala and Spark

akka breeze json4s scala scala-demo scala-spark spark spark-streaming

Last synced: 30 Apr 2024

https://github.com/apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

big-data java jdbc python r scala spark sql

Last synced: 30 Apr 2024

https://github.com/frees-io/freestyle

A cohesive & pragmatic framework of FP centric Scala libraries

architectural-patterns cassandra free-monads freestyle functional-programming kafka monads redis rpc scala spark tagless-final

Last synced: 30 Apr 2024

https://github.com/apache/zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

big-data database flink java javascript nosql scala spark zeppelin

Last synced: 30 Apr 2024

https://github.com/spark-notebook/spark-notebook

Interactive and Reactive Data Science using Scala and Spark.

apache-spark data-science notebook reactive scala spark

Last synced: 30 Apr 2024

https://github.com/indix/sparkplug

Spark package to "plug" holes in data using SQL based rules ⚡️ 🔌

datapipeline spark spark-sql

Last synced: 30 Apr 2024

https://github.com/indix/schemer

Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

avro graphql-api json parquet schema-inference schema-registry spark tsv

Last synced: 30 Apr 2024

https://github.com/Clustering4Ever/Clustering4Ever

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

ai artificial-intelligence big-data bigdata clustering clustering-algorithm clustering-evaluation scala scalability spark

Last synced: 30 Apr 2024

https://github.com/zio/zio-quill

Compile-time Language Integrated Queries for Scala

cassandra database jdbc linq mysql postgres scala scalajs spark sparksql

Last synced: 30 Apr 2024

https://github.com/Anant/Cassandra.Lunch

Resources from weekly Zoom lunches revolving around Apache Cassandra and Apache Cassandra-related topics. Hosted by Anant Corporation.

airflow akka astra cassandra datastax elk kafka nosql scylladb spark

Last synced: 30 Apr 2024

https://github.com/academyofdata/cassandra-zeppelin

Docker-Compose script for Cassandra + Zeppelin setup

cassandra spark zeppelin

Last synced: 30 Apr 2024

https://github.com/nmarus/node-red-contrib-spark

Node-RED Nodes to integrate with the Cisco Webex Teams API

cisco node-red spark

Last synced: 29 Apr 2024

https://github.com/brh55/generator-spark-bot

:zap: Yeoman generator that scaffolds out a Cisco Spark bot with usability and simplicity in mind

cisco cisco-spark flint nodejs scaffold spark yeoman

Last synced: 29 Apr 2024

https://github.com/flint-bot/sparky

Cisco Spark API for NodeJS (deprecated in favor of https://github.com/webex/webex-bot-node-framework)

cisco spark

Last synced: 29 Apr 2024