Projects in Awesome Lists tagged with big-data

https://github.com/man-group/arcticdb

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

big-data data data-analysis data-science database dataframe pandas quantitative-analysis quantitative-finance quantitative-trading

Last synced: 30 Sep 2024

https://github.com/man-group/ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

big-data data data-analysis data-science database dataframe pandas quantitative-analysis quantitative-finance quantitative-trading

Last synced: 30 Jul 2024

https://github.com/hazelcast/hazelcast-jet

Distributed Stream and Batch Processing

batch-processing big-data cdc event-processing hacktoberfest java kafka low-latency stream-processing

Last synced: 26 Sep 2024

https://github.com/traildb/traildb

TrailDB is an efficient tool for storing and querying series of events

big-data c data-analytics database event-data time-series traildb

Last synced: 31 Jul 2024

https://github.com/datumbox/datumbox-framework

Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.

big-data data-science java machine-learning nlp statistics

Last synced: 30 Sep 2024

https://github.com/apache/accumulo

Apache Accumulo

accumulo big-data hacktoberfest

Last synced: 30 Sep 2024

https://github.com/bigdatagenomics/adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

avro big-data bioinformatics genomics java parquet python r scala spark

Last synced: 28 Sep 2024

https://github.com/h2oai/sparkling-water

Sparkling Water provides H2O functionality inside a Spark cluster

big-data h2o integration machine-learning pyspark pysparkling rsparkling scala spark

Last synced: 28 Sep 2024

https://github.com/commsor/titanoboa

Titanoboa makes complex workflows easy. It is a low-code workflow orchestration platform for JVM - distributed, highly scalable and fault tolerant.

big-data distributed distributed-systems esb integrations ipaas jvm low-code service-bus titanoboa workflow workflow-engine workflow-platform

Last synced: 29 Sep 2024

https://github.com/firmai/data-science-career

Career resources repository for Data Science, Machine Learning, Big Data and Business Analytics

analytics big-data business-analytics business-intelligence career data-science machine-learning resources

Last synced: 07 Aug 2024

https://github.com/kwai/blaze

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

arrow-datafusion big-data data-engineering execution-engine rust spark sql

Last synced: 28 Sep 2024

https://github.com/GoogleCloudPlatform/DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.

big-data data-analysis data-mining data-processing data-science google-cloud-dataflow

Last synced: 02 Aug 2024

https://github.com/joshday/onlinestats.jl

⚡ Single-pass algorithms for statistics

big-data julia julia-language julialang online-algorithms onlinestats statistics stochastic-approximation streaming-data

Last synced: 29 Sep 2024

https://github.com/joshday/OnlineStats.jl

⚡ Single-pass algorithms for statistics

big-data julia julia-language julialang online-algorithms onlinestats statistics stochastic-approximation streaming-data

Last synced: 31 Jul 2024

https://github.com/nodefluent/kafka-streams

Equivalent to kafka-streams 🐙 for Node.js ✨🐢🚀✨

big-data kafka kafka-streams node nodejs stream-processing streams

Last synced: 01 Aug 2024

https://github.com/jadianes/spark-movie-lens

An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

big-data bigdata flask movie-recommendation movielens-dataset python spark

Last synced: 14 Aug 2024

https://github.com/apache/samza

Mirror of Apache Samza

big-data samza scala

Last synced: 30 Sep 2024

https://github.com/rakam-io/rakam-api

📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)

analytics analytics-platform bi-server big-data java

Last synced: 06 Aug 2024

https://github.com/apache/ozone

Scalable, redundant, and distributed object store for Apache Hadoop

big-data hadoop kubernetes object-store s3 storage

Last synced: 30 Sep 2024

https://github.com/miguelgfierro/ai_projects

AI projects

analytics artificial-intelligence big-data code-examples data-science deep-learning examples machine-learning neural-networks programming-exercise

Last synced: 31 Jul 2024

https://github.com/NVIDIA/spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data gpu rapids spark

Last synced: 01 Aug 2024

https://github.com/nvidia/spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data gpu rapids spark

Last synced: 28 Sep 2024

https://github.com/mukunku/parquetviewer

Simple Windows desktop application for viewing & querying Apache Parquet files

apache-parquet big-data dot-net parquet windows-desktop

Last synced: 28 Sep 2024

https://github.com/foochane/books

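A collection of books covering C & C++, Git, Java, Keras, Linux, NLP, Python, Scala, TensorFlow, big data, recommender systems, databases, data mining, machine learning, deep learning, algorithms, and more.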

big-data c cpp database datamining dl git java keras ml nlp python scala tensorflow

Last synced: 01 Aug 2024

https://github.com/delta-io/delta-sharing

An open protocol for secure data sharing

big-data data-sharing delta-lake pandas spark

Last synced: 07 Aug 2024

https://github.com/nipy/nipype

Workflows and interfaces for neuroimaging packages

big-data brain-imaging brainweb data-science dataflow dataflow-programming neuroimaging python workflow-engine

Last synced: 06 Aug 2024

https://github.com/apache/flink-kubernetes-operator

Apache Flink Kubernetes Operator

big-data flink java

Last synced: 01 Aug 2024

https://github.com/apache/oozie

Mirror of Apache Oozie

big-data java javascript oozie

Last synced: 30 Sep 2024

https://github.com/raycad/devops-roadmap

DevOps methodology & roadmap for a devops developer in 2019. Interesting books to learn new technologies.

ai big-data books deep-learning devops experience expert-system machine-learning programming

Last synced: 01 Aug 2024

https://github.com/apache/orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

apache big-data cpp java orc

Last synced: 31 Jul 2024

https://github.com/cernopendata/opendata.cern.ch

Source code for the CERN Open Data portal

big-data digital-library digital-repository flask invenio inveniosoftware json-schema open-data open-research-data open-science python research-data research-data-management research-data-repository

Last synced: 30 Sep 2024

https://github.com/clickhouse/clickbench

ClickBench: a Benchmark For Analytical Databases

analytics benchmark big-data databases olap sql

Last synced: 01 Oct 2024

https://github.com/IntelPython/sdc

Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler

big-data compilers machine-learning numpy pandas parallel-computing python

Last synced: 03 Aug 2024

https://github.com/metabrainz/listenbrainz-server

Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

big-data database listenbrainz-server music python react spark typescript web

Last synced: 01 Aug 2024

https://github.com/elastic/eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

big-data data-analysis dataframe dataframes eland elasticsearch etl lightgbm machine-learning pandas python scikit-learn time-series-forecasting

Last synced: 27 Sep 2024

https://github.com/Seagate/cortx

CORTX Community Object Storage is 100% open source object storage uniquely optimized for mass capacity storage devices.

big-data bigdata cortx-community distributed-storage distributed-systems hackathons hacktoberfest hacktoberfest2020 inclusivity object-storage object-storage-service objectstorage objectstore open-source opensource s3 s3-storage software-defined-storage storage storage-api

Last synced: 01 Aug 2024

https://github.com/TouK/nussknacker

Low-code tool for automating actions on real time data | Stream processing for the users.

apache-flink automation big-data data-streaming decision-engine decision-making decisioning flink flink-kafka gui kafka low-code lowcode real-time rules-engine scala stream-processing streaming touk

Last synced: 31 Jul 2024

https://github.com/ClickHouse/ClickBench

ClickBench: a Benchmark For Analytical Databases

analytics benchmark big-data databases olap sql

Last synced: 01 Aug 2024

https://github.com/apache/giraph

Mirror of Apache Giraph

big-data giraph java

Last synced: 01 Oct 2024

https://github.com/KlugerLab/FIt-SNE

Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

big-data fast-algorithm t-sne visualization

Last synced: 31 Jul 2024

https://github.com/thrill/thrill

Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++

big-data c-plus-plus distributed-computing thrill

Last synced: 30 Jul 2024

https://github.com/yahoo/redislite

Redis in a Python module.

big-data database key-value python redis redis-bindings redis-server redislite screwdriver

Last synced: 31 Jul 2024

https://github.com/apache/bigtop

Bigtop is an Apache Foundation project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components.

big-data bigtop java

Last synced: 31 Jul 2024

https://github.com/austinksmith/Hamsters.js

100% Vanilla Javascript Multithreading & Parallel Execution Library

big-data concurrent-programming cross-platform future-proofing high-performance-computing innovation javascript-library multithreaded multithreading nodejs-server open-source optimization parallel-processing react-native-app scalability task-processor threadpool web-application

Last synced: 01 Aug 2024

https://github.com/austinksmith/hamsters.js

100% Vanilla Javascript Multithreading & Parallel Execution Library

big-data concurrent-programming cross-platform future-proofing high-performance-computing innovation javascript-library multithreaded multithreading nodejs-server open-source optimization parallel-processing react-native-app scalability task-processor threadpool web-application

Last synced: 27 Sep 2024

https://github.com/harsha2010/magellan

Geo Spatial Data Analytics on Spark

big-data geojson geometric-algorithms geospatial geospatial-analysis geospatial-analytics geospatial-processing magellan shapefile spark sparksql

Last synced: 02 Aug 2024

https://github.com/nomemory/mockneat

MockNeat - the modern faker lib.

arbitrary-data big-data csv data-generation data-generator fake-data faker faker-generator faker-library java java-8 lorem-ipsum mocking random-generation random-number-generators randomization randomizer sample-data sample-data-generator sql-insert

Last synced: 28 Sep 2024

https://github.com/nicgirault/circosJS

d3 library to build circular graphs

big-data bigdata bioinformatics bioinformatics-data circos circos-graphs circular d3js javascript

Last synced: 03 Aug 2024

https://github.com/nicgirault/circosjs

d3 library to build circular graphs

big-data bigdata bioinformatics bioinformatics-data circos circos-graphs circular d3js javascript

Last synced: 03 Oct 2024

https://github.com/yahoo/HaloDB

A fast, log structured key-value store.

big-data embedded-database java key-value-store storage-engine

Last synced: 01 Aug 2024

https://github.com/synmetrix/synmetrix

Synmetrix – production-ready open source semantic layer on Cube

big-data bigquery business-intelligence clickhouse cube cubejs data-engineering databricks dremio druid firebolt llm prestodb redshift semantic-layer snowflake vertica

Last synced: 29 Sep 2024

https://github.com/unum-cloud/ustore

Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️

acid apache-arrow arrow big-data bigdata database dataloader document-database graph-database iouring json key-value-store knn-search networkx nosql pandas python search spdk vector-search

Last synced: 01 Aug 2024

https://github.com/Lonero-Team/Decentralized-Internet

An SDK/library for decentralized web and distributed computing projects

big-data biostatistical-computing blockchain cryptography dapps decentralized decentralized-applications decentralized-internet developer-tools dweb grid-computing iot library mesh-networks offline-first p2p peer-to-peer protocols saas sdk

Last synced: 31 Jul 2024

https://github.com/apache/tez

Apache Tez

apache big-data hadoop java tez

Last synced: 30 Sep 2024

https://github.com/CogComp/cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

big-data cogcomp data-mining dependency-parsing lemmatization lemmatizer named-entity-recognition natural-language-processing natural-language-understanding ner nlp parts-of-speech-tagging pos pos-tagging relation-extraction similarity tokenizer transliteration

Last synced: 31 Jul 2024

https://github.com/apache/helix

Mirror of Apache Helix

big-data cloud helix java

Last synced: 30 Sep 2024

https://github.com/conjure-up/conjure-up

Deploying complex solutions, magically.

big-data big-software conjure conjure-up juju kubernetes macos openstack ubuntu

Last synced: 01 Aug 2024

https://github.com/apache/parquet-cpp

Apache Parquet

big-data java parquet

Last synced: 01 Oct 2024

https://github.com/microsoft/hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.

acceleration analytics big-data databases indexing spark

Last synced: 26 Sep 2024

https://github.com/USCDataScience/sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

big-data distributed-systems information-retrieval nutch search search-engine solr spark tika web-crawler

Last synced: 31 Jul 2024

https://github.com/cartershanklin/pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

apache-spark big-data pyspark spark

Last synced: 28 Sep 2024

https://github.com/smooks/smooks

Extensible data integration Java framework for building XML and non-XML fragment-based applications

analytics big-data enterprise-integration etl event-driven java pipelines sax smooks stream-processing xml

Last synced: 31 Jul 2024

https://github.com/apache/couchdb-fauxton

Fauxton is the new Web UI for CouchDB

apache big-data cloud couchdb database erlang fauxton http javascript network-client

Last synced: 30 Sep 2024

https://github.com/apache/apex-core

Mirror of Apache Apex core

apex big-data java

Last synced: 01 Oct 2024

https://github.com/hortonworks/cloudbreak

CDP Public Cloud is an integrated analytics and data management platform deployed on cloud services. It offers broad data analytics and artificial intelligence functionality along with secure user access and data governance features.

big-data cloud cloudera deployment hacktoberfest hadoop java

Last synced: 03 Aug 2024

https://github.com/zutianbiao/baize

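Baize automated operations system: configuration management, network probing, asset management, business management, CMDB, CD, DevOps, job orchestration, task orchestration, and other features. Monitoring, alerting, log analysis, and big data analysis are planned for the future.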

analyse ansible big-data cmdb crontab data devops django log monitor ops python

Last synced: 03 Aug 2024

https://github.com/opencypher/morpheus

Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.

apache-spark apache2 big-data cypher graph scala

Last synced: 28 Sep 2024

https://github.com/opencypher/cypher-for-apache-spark

Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.

apache-spark apache2 big-data cypher graph scala

Last synced: 07 Aug 2024

https://github.com/sigmf/SigMF

The Signal Metadata Format Specification

big-data metadata signals specification standard

Last synced: 01 Aug 2024

https://github.com/tirthajyoti/spark-with-python

Fundamentals of Spark with Python (using PySpark), code examples

analytics apache apache-spark big-data database dataframe distributed-computing hadoop hdfs machine-learning map-reduce mlib parallel-computing pyspark python spark sql

Last synced: 28 Sep 2024

https://github.com/Hydrospheredata/mist

Serverless proxy for Spark cluster

apache-spark api big-data serverless

Last synced: 31 Jul 2024

https://github.com/hydrospheredata/mist

Serverless proxy for Spark cluster

apache-spark api big-data serverless

Last synced: 28 Sep 2024

https://github.com/rom1504/cc2dataset

Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...

big-data dataset multimodal

Last synced: 03 Oct 2024

https://github.com/selinon/selinon

Advanced distributed task flow management on top of Celery

big-data celery distributed-computing flow-management hacktoberfest hacktoberfest-accepted hacktoberfest2021 kubernetes openshift python schedule-tasks selinon task workflow-management workflow-management-system workflow-scheduler

Last synced: 03 Aug 2024

https://github.com/helicalinsight/helicalinsight

Helical Insight software is the world’s first Open Source Business Intelligence framework, which helps you make sense of your data and make well-informed decisions.

amazon-redshift big-data business-intelligence dashboard data-analysis data-visualization druid graph-database hive mongodb mysql neo4j nosql oracle-database postgresql rdbms reporting sql-editor sqllite

Last synced: 26 Sep 2024

https://github.com/microsoft/data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.

apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data

Last synced: 28 Sep 2024

https://github.com/linkedin/openhouse

Open Control Plane for Tables in Data Lakehouse

big-data catalog datalake datalakehouse declarative iceberg management tables

Last synced: 28 Sep 2024

https://github.com/zero-one-group/geni

A Clojure dataframe library that runs on Spark

big-data clojure clojure-library clojure-repl data-engineering data-science dataframe distributed-computing high-performance-computing machine-learning parallel-computing spark

Last synced: 31 Jul 2024

https://github.com/tonbo-io/tonbo

A portable embedded database using Arrow.

arrow big-data embedded-database olap rust store-engine

Last synced: 17 Aug 2024

https://github.com/apache/predictionio-sdk-php

PredictionIO PHP SDK

big-data predictionio scala

Last synced: 01 Oct 2024

https://github.com/apache/couchdb-docker

Semi-official Apache CouchDB Docker images

apache big-data cloud couchdb cplusplus database erlang http javascript network-client network-server

Last synced: 30 Sep 2024

https://github.com/apache/trafodion

Apache Trafodion

big-data cplusplus trafodion

Last synced: 29 Sep 2024

https://github.com/apache/calcite-avatica

Apache Calcite Avatica

big-data calcite geospatial hadoop java sql

Last synced: 30 Sep 2024

https://github.com/ubisoft/mobydq

🐳 Tool to automate data quality checks on data pipelines

big-data data-pipeline data-quality data-quality-checks data-quality-monitoring data-warehouse

Last synced: 02 Aug 2024

https://github.com/paypal/gimel

Big Data Processing Framework - Unified Data API or SQL on Any Storage

aerospike big-data cassandra data-api elasticsearch gimel hbase jdbc kafka paypal pyspark python restapi scala spark spark-streaming streaming-sql teradata

Last synced: 29 Sep 2024

https://github.com/awslabs/amazon-s3-find-and-forget

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

amazon-s3 aws big-data ccpa data data-erasure data-lake gdpr parquet privacy right-to-be-forgotten s3

Last synced: 01 Aug 2024

https://github.com/vertica/VerticaPy

VerticaPy is a Python library that exposes scikit-like functionality to conduct data science projects on data stored in Vertica, thus taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.

big-data data-science data-visualization machine-learning preparation python python-library vertica

Last synced: 02 Aug 2024

https://github.com/scikit-hep/awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.

analysis apache-arrow arrow big-data columnar columnar-storage hdf5 numpy parquet python python3 root root-cern scikit-hep

Last synced: 28 Sep 2024

https://github.com/talariadb/talaria

TalariaDB is a distributed, highly available, and low latency time-series database for Presto

big-data column-store database prestodb real-time stream-processing time-series

Last synced: 02 Aug 2024

https://github.com/Chabane/bigdata-playground

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api

Last synced: 02 Aug 2024

https://github.com/apache/predictionio-sdk-python

PredictionIO Python SDK

big-data predictionio scala

Last synced: 01 Oct 2024

https://github.com/apache/predictionio-sdk-ruby

PredictionIO Ruby SDK

big-data predictionio scala

Last synced: 31 Jul 2024

https://github.com/apache/knox

Mirror of Apache Knox

big-data java knox

Last synced: 30 Sep 2024

https://github.com/privefl/bigsnpr

R package for the analysis of massive SNP arrays.

big-data bioinformatics memory-mapped-file parallel-computing polygenic-scores population-structure-inference r r-package snp-data statistical-methods

Last synced: 13 Aug 2024

https://github.com/apache/incubator-wayang

Apache Wayang (incubating) is the first cross-platform data processing system.

apache big-data cross-platform data-management-platform data-processing distributed-system hadoop java jdbc middleware open-source performance scala spark

Last synced: 29 Sep 2024

https://github.com/locationtech-labs/geopyspark

GeoTrellis for PySpark

big-data geospatial geotrellis python spark tile-server

Last synced: 07 Aug 2024

https://github.com/privefl/bigstatsr

R package for statistical tools with big matrices stored on disk.

big-data large-matrices memory-mapped-file parallel-computing r r-package statistical-methods

Last synced: 05 Aug 2024

https://github.com/SETL-Framework/setl

A simple Spark-powered ETL framework that just works 🍺

big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark

Last synced: 01 Aug 2024

https://github.com/airscholar/e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

apache-airflow apache-kafka apache-spark apache-zookeeper big-data cassandra containerization data-engineering data-pipeline data-processing data-storage docker etl-pipeline postgresql real-time-analytics

Last synced: 28 Sep 2024

https://github.com/setl-framework/setl

A simple Spark-powered ETL framework that just works 🍺

big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark

Last synced: 28 Sep 2024