Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with big-data
A curated list of projects in awesome lists tagged with big-data.
https://github.com/man-group/arcticdb
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
big-data data data-analysis data-science database dataframe pandas quantitative-analysis quantitative-finance quantitative-trading
Last synced: 30 Sep 2024
https://github.com/man-group/ArcticDB
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
big-data data data-analysis data-science database dataframe pandas quantitative-analysis quantitative-finance quantitative-trading
Last synced: 30 Jul 2024
https://github.com/hazelcast/hazelcast-jet
Distributed Stream and Batch Processing
batch-processing big-data cdc event-processing hacktoberfest java kafka low-latency stream-processing
Last synced: 26 Sep 2024
https://github.com/traildb/traildb
TrailDB is an efficient tool for storing and querying series of events
big-data c data-analytics database event-data time-series traildb
Last synced: 31 Jul 2024
https://github.com/datumbox/datumbox-framework
Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.
big-data data-science java machine-learning nlp statistics
Last synced: 30 Sep 2024
https://github.com/apache/accumulo
Apache Accumulo
accumulo big-data hacktoberfest
Last synced: 30 Sep 2024
https://github.com/h2oai/sparkling-water
Sparkling Water provides H2O functionality inside a Spark cluster
big-data h2o integration machine-learning pyspark pysparkling rsparkling scala spark
Last synced: 28 Sep 2024
https://github.com/commsor/titanoboa
Titanoboa makes complex workflows easy. It is a low-code workflow orchestration platform for JVM - distributed, highly scalable and fault tolerant.
big-data distributed distributed-systems esb integrations ipaas jvm low-code service-bus titanoboa workflow workflow-engine workflow-platform
Last synced: 29 Sep 2024
https://github.com/firmai/data-science-career
Career resources for Data Science, Machine Learning, Big Data and Business Analytics
analytics big-data business-analytics business-intelligence career data-science machine-learning resources
Last synced: 07 Aug 2024
https://github.com/kwai/blaze
Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.
arrow-datafusion big-data data-engineering execution-engine rust spark sql
Last synced: 28 Sep 2024
https://github.com/GoogleCloudPlatform/DataflowJavaSDK
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
big-data data-analysis data-mining data-processing data-science google-cloud-dataflow
Last synced: 02 Aug 2024
https://github.com/joshday/onlinestats.jl
⚡ Single-pass algorithms for statistics
big-data julia julia-language julialang online-algorithms onlinestats statistics stochastic-approximation streaming-data
Last synced: 29 Sep 2024
https://github.com/joshday/OnlineStats.jl
⚡ Single-pass algorithms for statistics
big-data julia julia-language julialang online-algorithms onlinestats statistics stochastic-approximation streaming-data
Last synced: 31 Jul 2024
https://github.com/nodefluent/kafka-streams
equivalent to kafka-streams 🐙 for nodejs ✨🐢🚀✨
big-data kafka kafka-streams node nodejs stream-processing streams
Last synced: 01 Aug 2024
https://github.com/jadianes/spark-movie-lens
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
big-data bigdata flask movie-recommendation movielens-dataset python spark
Last synced: 14 Aug 2024
https://github.com/rakam-io/rakam-api
📈 Collect customer event data from your apps. (Note that this project only includes the API collector, not the visualization platform)
analytics analytics-platform bi-server big-data java
Last synced: 06 Aug 2024
https://github.com/apache/ozone
Scalable, redundant, and distributed object store for Apache Hadoop
big-data hadoop kubernetes object-store s3 storage
Last synced: 30 Sep 2024
https://github.com/miguelgfierro/ai_projects
AI projects
analytics artificial-intelligence big-data code-examples data-science deep-learning examples machine-learning neural-networks programming-exercise
Last synced: 31 Jul 2024
https://github.com/NVIDIA/spark-rapids
Spark RAPIDS plugin - accelerate Apache Spark with GPUs
Last synced: 01 Aug 2024
https://github.com/nvidia/spark-rapids
Spark RAPIDS plugin - accelerate Apache Spark with GPUs
Last synced: 28 Sep 2024
https://github.com/mukunku/parquetviewer
Simple Windows desktop application for viewing & querying Apache Parquet files
apache-parquet big-data dot-net parquet windows-desktop
Last synced: 28 Sep 2024
https://github.com/foochane/books
A collection of books covering C & C++, Git, Java, Keras, Linux, NLP, Python, Scala, TensorFlow, big data, recommender systems, databases, data mining, machine learning, deep learning, algorithms, and more.
big-data c cpp database datamining dl git java keras ml nlp python scala tensorflow
Last synced: 01 Aug 2024
https://github.com/delta-io/delta-sharing
An open protocol for secure data sharing
big-data data-sharing delta-lake pandas spark
Last synced: 07 Aug 2024
https://github.com/nipy/nipype
Workflows and interfaces for neuroimaging packages
big-data brain-imaging brainweb data-science dataflow dataflow-programming neuroimaging python workflow-engine
Last synced: 06 Aug 2024
https://github.com/apache/flink-kubernetes-operator
Apache Flink Kubernetes Operator
Last synced: 01 Aug 2024
https://github.com/apache/oozie
Mirror of Apache Oozie
big-data java javascript oozie
Last synced: 30 Sep 2024
https://github.com/raycad/devops-roadmap
DevOps methodology & roadmap for a devops developer in 2019. Interesting books to learn new technologies.
ai big-data books deep-learning devops experience expert-system machine-learning programming
Last synced: 01 Aug 2024
https://github.com/cernopendata/opendata.cern.ch
Source code for the CERN Open Data portal
big-data digital-library digital-repository flask invenio inveniosoftware json-schema open-data open-research-data open-science python research-data research-data-management research-data-repository
Last synced: 30 Sep 2024
https://github.com/IntelPython/sdc
Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler
big-data compilers machine-learning numpy pandas parallel-computing python
Last synced: 03 Aug 2024
https://github.com/metabrainz/listenbrainz-server
Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.
big-data database listenbrainz-server music python react spark typescript web
Last synced: 01 Aug 2024
https://github.com/elastic/eland
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
big-data data-analysis dataframe dataframes eland elasticsearch etl lightgbm machine-learning pandas python scikit-learn time-series-forecasting
Last synced: 27 Sep 2024
https://github.com/Seagate/cortx
CORTX Community Object Storage is 100% open source object storage uniquely optimized for mass capacity storage devices.
big-data bigdata cortx-community distributed-storage distributed-systems hackathons hacktoberfest hacktoberfest2020 inclusivity object-storage object-storage-service objectstorage objectstore open-source opensource s3 s3-storage software-defined-storage storage storage-api
Last synced: 01 Aug 2024
https://github.com/TouK/nussknacker
Low-code tool for automating actions on real time data | Stream processing for the users.
apache-flink automation big-data data-streaming decision-engine decision-making decisioning flink flink-kafka gui kafka low-code lowcode real-time rules-engine scala stream-processing streaming touk
Last synced: 31 Jul 2024
https://github.com/KlugerLab/FIt-SNE
Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)
big-data fast-algorithm t-sne visualization
Last synced: 31 Jul 2024
https://github.com/thrill/thrill
Thrill - An EXPERIMENTAL Algorithmic Distributed Big Data Batch Processing Framework in C++
big-data c-plus-plus distributed-computing thrill
Last synced: 30 Jul 2024
https://github.com/yahoo/redislite
Redis in a python module.
big-data database key-value python redis redis-bindings redis-server redislite screwdriver
Last synced: 31 Jul 2024
https://github.com/apache/bigtop
Bigtop is an Apache Foundation project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components.
Last synced: 31 Jul 2024
https://github.com/austinksmith/Hamsters.js
100% Vanilla Javascript Multithreading & Parallel Execution Library
big-data concurrent-programming cross-platform future-proofing high-performance-computing innovation javascript-library multithreaded multithreading nodejs-server open-source optimization parallel-processing react-native-app scalability task-processor threadpool web-application
Last synced: 01 Aug 2024
https://github.com/austinksmith/hamsters.js
100% Vanilla Javascript Multithreading & Parallel Execution Library
big-data concurrent-programming cross-platform future-proofing high-performance-computing innovation javascript-library multithreaded multithreading nodejs-server open-source optimization parallel-processing react-native-app scalability task-processor threadpool web-application
Last synced: 27 Sep 2024
https://github.com/harsha2010/magellan
Geo Spatial Data Analytics on Spark
big-data geojson geometric-algorithms geospatial geospatial-analysis geospatial-analytics geospatial-processing magellan shapefile spark sparksql
Last synced: 02 Aug 2024
https://github.com/nomemory/mockneat
MockNeat - the modern faker lib.
arbitrary-data big-data csv data-generation data-generator fake-data faker faker-generator faker-library java java-8 lorem-ipsum mocking random-generation random-number-generators randomization randomizer sample-data sample-data-generator sql-insert
Last synced: 28 Sep 2024
https://github.com/nicgirault/circosJS
d3 library to build circular graphs
big-data bigdata bioinformatics bioinformatics-data circos circos-graphs circular d3js javascript
Last synced: 03 Aug 2024
https://github.com/nicgirault/circosjs
d3 library to build circular graphs
big-data bigdata bioinformatics bioinformatics-data circos circos-graphs circular d3js javascript
Last synced: 03 Oct 2024
https://github.com/yahoo/HaloDB
A fast, log structured key-value store.
big-data embedded-database java key-value-store storage-engine
Last synced: 01 Aug 2024
https://github.com/synmetrix/synmetrix
Synmetrix – production-ready open source semantic layer on Cube
big-data bigquery business-intelligence clickhouse cube cubejs data-engineering databricks dremio druid firebolt llm prestodb redshift semantic-layer snowflake vertica
Last synced: 29 Sep 2024
https://github.com/unum-cloud/ustore
Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️
acid apache-arrow arrow big-data bigdata database dataloader document-database graph-database iouring json key-value-store knn-search networkx nosql pandas python search spdk vector-search
Last synced: 01 Aug 2024
https://github.com/Lonero-Team/Decentralized-Internet
An SDK/library for decentralized web and distributed computing projects
big-data biostatistical-computing blockchain cryptography dapps decentralized decentralized-applications decentralized-internet developer-tools dweb grid-computing iot library mesh-networks offline-first p2p peer-to-peer protocols saas sdk
Last synced: 31 Jul 2024
https://github.com/CogComp/cogcomp-nlp
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
big-data cogcomp data-mining dependency-parsing lemmatization lemmatizer named-entity-recognition natural-language-processing natural-language-understanding ner nlp parts-of-speech-tagging pos pos-tagging relation-extraction similarity tokenizer transliteration
Last synced: 31 Jul 2024
https://github.com/conjure-up/conjure-up
Deploying complex solutions, magically.
big-data big-software conjure conjure-up juju kubernetes macos openstack ubuntu
Last synced: 01 Aug 2024
https://github.com/microsoft/hyperspace
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
acceleration analytics big-data databases indexing spark
Last synced: 26 Sep 2024
https://github.com/USCDataScience/sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
big-data distributed-systems information-retrieval nutch search search-engine solr spark tika web-crawler
Last synced: 31 Jul 2024
https://github.com/cartershanklin/pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
apache-spark big-data pyspark spark
Last synced: 28 Sep 2024
https://github.com/smooks/smooks
Extensible data integration Java framework for building XML and non-XML fragment-based applications
analytics big-data enterprise-integration etl event-driven java pipelines sax smooks stream-processing xml
Last synced: 31 Jul 2024
https://github.com/apache/couchdb-fauxton
Fauxton is the new Web UI for CouchDB
apache big-data cloud couchdb database erlang fauxton http javascript network-client
Last synced: 30 Sep 2024
https://github.com/hortonworks/cloudbreak
CDP Public Cloud is an integrated analytics and data management platform deployed on cloud services. It offers broad data analytics and artificial intelligence functionality along with secure user access and data governance features.
big-data cloud cloudera deployment hacktoberfest hadoop java
Last synced: 03 Aug 2024
https://github.com/opencypher/morpheus
Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.
apache-spark apache2 big-data cypher graph scala
Last synced: 28 Sep 2024
https://github.com/opencypher/cypher-for-apache-spark
Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.
apache-spark apache2 big-data cypher graph scala
Last synced: 07 Aug 2024
https://github.com/sigmf/SigMF
The Signal Metadata Format Specification
big-data metadata signals specification standard
Last synced: 01 Aug 2024
https://github.com/tirthajyoti/spark-with-python
Fundamentals of Spark with Python (using PySpark), code examples
analytics apache apache-spark big-data database dataframe distributed-computing hadoop hdfs machine-learning map-reduce mlib parallel-computing pyspark python spark sql
Last synced: 28 Sep 2024
https://github.com/Hydrospheredata/mist
Serverless proxy for Spark cluster
apache-spark api big-data serverless
Last synced: 31 Jul 2024
https://github.com/hydrospheredata/mist
Serverless proxy for Spark cluster
apache-spark api big-data serverless
Last synced: 28 Sep 2024
https://github.com/rom1504/cc2dataset
Easily convert Common Crawl to a dataset of captions and documents. Image/text, audio/text, video/text, ...
Last synced: 03 Oct 2024
https://github.com/selinon/selinon
Advanced distributed task flow management on top of Celery
big-data celery distributed-computing flow-management hacktoberfest hacktoberfest-accepted hacktoberfest2021 kubernetes openshift python schedule-tasks selinon task workflow-management workflow-management-system workflow-scheduler
Last synced: 03 Aug 2024
https://github.com/helicalinsight/helicalinsight
Helical Insight software is the world’s first Open Source Business Intelligence framework, which helps you make sense of your data and make well-informed decisions.
amazon-redshift big-data business-intelligence dashboard data-analysis data-visualization druid graph-database hive mongodb mysql neo4j nosql oracle-database postgresql rdbms reporting sql-editor sqllite
Last synced: 26 Sep 2024
https://github.com/microsoft/data-accelerator
Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.
apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data
Last synced: 28 Sep 2024
https://github.com/linkedin/openhouse
Open Control Plane for Tables in Data Lakehouse
big-data catalog datalake datalakehouse declarative iceberg management tables
Last synced: 28 Sep 2024
https://github.com/zero-one-group/geni
A Clojure dataframe library that runs on Spark
big-data clojure clojure-library clojure-repl data-engineering data-science dataframe distributed-computing high-performance-computing machine-learning parallel-computing spark
Last synced: 31 Jul 2024
https://github.com/tonbo-io/tonbo
A portable embedded database using Arrow.
arrow big-data embedded-database olap rust store-engine
Last synced: 17 Aug 2024
https://github.com/apache/couchdb-docker
Semi-official Apache CouchDB Docker images
apache big-data cloud couchdb cplusplus database erlang http javascript network-client network-server
Last synced: 30 Sep 2024
https://github.com/apache/calcite-avatica
Apache Calcite Avatica
big-data calcite geospatial hadoop java sql
Last synced: 30 Sep 2024
https://github.com/ubisoft/mobydq
🐳 Tool to automate data quality checks on data pipelines
big-data data-pipeline data-quality data-quality-checks data-quality-monitoring data-warehouse
Last synced: 02 Aug 2024
https://github.com/paypal/gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
aerospike big-data cassandra data-api elasticsearch gimel hbase jdbc kafka paypal pyspark python restapi scala spark spark-streaming streaming-sql teradata
Last synced: 29 Sep 2024
https://github.com/awslabs/amazon-s3-find-and-forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
amazon-s3 aws big-data ccpa data data-erasure data-lake gdpr parquet privacy right-to-be-forgotten s3
Last synced: 01 Aug 2024
https://github.com/vertica/VerticaPy
VerticaPy is a Python library that exposes scikit-like functionality to conduct data science projects on data stored in Vertica, thus taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.
big-data data-science data-visualization machine-learning preparation python python-library vertica
Last synced: 02 Aug 2024
https://github.com/scikit-hep/awkward-0.x
Manipulate arrays of complex data structures as easily as Numpy.
analysis apache-arrow arrow big-data columnar columnar-storage hdf5 numpy parquet python python3 root root-cern scikit-hep
Last synced: 28 Sep 2024
https://github.com/talariadb/talaria
TalariaDB is a distributed, highly available, and low latency time-series database for Presto
big-data column-store database prestodb real-time stream-processing time-series
Last synced: 02 Aug 2024
https://github.com/Chabane/bigdata-playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api
Last synced: 02 Aug 2024
https://github.com/privefl/bigsnpr
R package for the analysis of massive SNP arrays.
big-data bioinformatics memory-mapped-file parallel-computing polygenic-scores population-structure-inference r r-package snp-data statistical-methods
Last synced: 13 Aug 2024
https://github.com/apache/incubator-wayang
Apache Wayang (incubating) is the first cross-platform data processing system.
apache big-data cross-platform data-management-platform data-processing distributed-system hadoop java jdbc middleware open-source performance scala spark
Last synced: 29 Sep 2024
https://github.com/locationtech-labs/geopyspark
GeoTrellis for PySpark
big-data geospatial geotrellis python spark tile-server
Last synced: 07 Aug 2024
https://github.com/privefl/bigstatsr
R package for statistical tools with big matrices stored on disk.
big-data large-matrices memory-mapped-file parallel-computing r r-package statistical-methods
Last synced: 05 Aug 2024
https://github.com/SETL-Framework/setl
A simple Spark-powered ETL framework that just works 🍺
big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark
Last synced: 01 Aug 2024
https://github.com/airscholar/e2e-data-engineering
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
apache-airflow apache-kafka apache-spark apache-zookeeper big-data cassandra containerization data-engineering data-pipeline data-processing data-storage docker etl-pipeline postgresql real-time-analytics
Last synced: 28 Sep 2024
https://github.com/setl-framework/setl
A simple Spark-powered ETL framework that just works 🍺
big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark
Last synced: 28 Sep 2024