https://github.com/huawei-noah/streamDM

Stream Data Mining Library for Spark Streaming

Host: GitHub
URL: https://github.com/huawei-noah/streamDM
Owner: huawei-noah
License: apache-2.0
Created: 2015-06-08T01:28:42.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2023-04-16T14:47:47.000Z (over 2 years ago)
Last Synced: 2024-08-01T17:31:38.933Z (about 1 year ago)
Language: Scala
Homepage: http://streamdm.noahlab.com.hk/
Size: 3.14 MB
Stars: 491
Watchers: 67
Forks: 147
Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

awesome-streaming - streamDM - mining Big Data streams using Spark Streaming from Huawei. (Table of Contents / Online Machine Learning)

README

# streamDM for Spark Streaming

streamDM is open-source software for mining big data streams using
[Spark Streaming](https://spark.apache.org/streaming/), started at
[Huawei Noah's Ark Lab](http://www.noahlab.com.hk/). streamDM is licensed under
the Apache License, Version 2.0.

## Big Data Stream Learning

Big data stream learning is more challenging than batch or offline learning,
since the data distribution may change over the lifetime of the stream.
Moreover, each example arriving in a stream can be processed only once, or must
be summarized with a small memory footprint, so the learning algorithms must be
very efficient.
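
The single-pass, small-memory constraint can be illustrated with a minimal sketch in plain Scala. This is not streamDM's implementation: the `OnlinePerceptron` and `OnePassDemo` names and the toy stream are hypothetical, chosen only to show a learner whose state is a fixed-size weight vector and which sees each example once, predicting before it trains.

```scala
/** Hypothetical one-pass perceptron: its whole state is a weight vector plus a bias. */
final class OnlinePerceptron(numFeatures: Int, learningRate: Double = 1.0) {
  private val weights = Array.fill(numFeatures)(0.0)
  private var bias = 0.0

  /** Predicts 1 if the linear score is positive, otherwise 0. */
  def predict(x: Array[Double]): Int = {
    var score = bias
    var i = 0
    while (i < numFeatures) { score += weights(i) * x(i); i += 1 }
    if (score > 0.0) 1 else 0
  }

  /** Updates the model from one labeled example; the example is not stored. */
  def update(x: Array[Double], label: Int): Unit = {
    val error = label - predict(x) // -1, 0, or 1
    if (error != 0) {
      var i = 0
      while (i < numFeatures) { weights(i) += learningRate * error * x(i); i += 1 }
      bias += learningRate * error
    }
  }
}

object OnePassDemo {
  /** Test-then-train over the stream; counts predictions that were wrong before the update. */
  def mistakes(model: OnlinePerceptron, stream: Iterator[(Array[Double], Int)]): Int = {
    var wrong = 0
    for ((x, label) <- stream) {
      if (model.predict(x) != label) wrong += 1
      model.update(x, label)
    }
    wrong
  }

  // Toy stream (an assumption for illustration): label is 1 when feature 0 exceeds feature 1.
  def toyStream: Iterator[(Array[Double], Int)] = Iterator(
    (Array(2.0, 1.0), 1),
    (Array(1.0, 2.0), 0),
    (Array(3.0, 0.0), 1),
    (Array(0.0, 3.0), 0)
  )
}
```

Because the stream is an `Iterator`, each example is consumed exactly once, and memory use stays constant no matter how long the stream runs.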

### Spark Streaming

[Spark Streaming](https://spark.apache.org/streaming/) is an extension of the
core [Spark](https://spark.apache.org) API that enables stream processing from
a variety of sources. Spark is an extensible and programmable framework for
massive distributed processing of datasets, which it represents as Resilient
Distributed Datasets (RDDs). Spark Streaming receives input data streams and
divides the data into batches, which are then processed by the Spark engine to
generate the results.

Spark Streaming data is organized into DStreams (discretized streams), each
represented internally as a sequence of RDDs.
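
The batching behavior can be observed with a small sketch that uses Spark Streaming's queue-backed input stream. It assumes Spark Streaming 2.3.x is on the classpath and runs in local mode. The `MicroBatchSketch` object, the one-second batch interval, and the five-second timeout are illustrative choices, not part of streamDM.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  /** Feeds three RDDs through a queue-backed DStream; returns the size of each non-empty batch. */
  def run(): Seq[Long] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchSketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // one micro-batch per second

    // With queueStream's default settings, each queued RDD becomes one batch of the DStream.
    val queue = new mutable.Queue[RDD[Int]]()
    for (i <- 1 to 3) queue += ssc.sparkContext.makeRDD(1 to 10 * i)
    val stream = ssc.queueStream(queue)

    // foreachRDD exposes the RDD behind every batch interval; its body runs on the driver.
    val batchSizes = mutable.ArrayBuffer.empty[Long]
    stream.foreachRDD { rdd =>
      val n = rdd.count()
      if (n > 0) batchSizes.synchronized { batchSizes += n } // skip empty intervals
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000L) // let a few batch intervals elapse
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    batchSizes.synchronized { batchSizes.toList }
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

The queue stands in for a real source such as a socket or a message broker. It is convenient for experiments because the contents of every batch are known in advance.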

### Included Methods

The current release, streamDM v0.2, implements:

* [SGD Learner](http://huawei-noah.github.io/streamDM/docs/SGD.html) and [Perceptron](http://huawei-noah.github.io/streamDM/docs/SGD.html#perceptron)
* [Naive Bayes](http://huawei-noah.github.io/streamDM/docs/NB.html)
* [CluStream](http://huawei-noah.github.io/streamDM/docs/CluStream.html)
* [Hoeffding Decision Trees](http://huawei-noah.github.io/streamDM/docs/HDT.html)
* [Bagging](http://huawei-noah.github.io/streamDM/docs/Bagging.html)
* [Stream KM++](http://huawei-noah.github.io/streamDM/docs/StreamKM.html)

We also implemented the following [data generators](http://huawei-noah.github.io/streamDM/docs/generators.html):

* HyperplaneGenerator
* RandomTreeGenerator
* RandomRBFGenerator
* RandomRBFEventsGenerator

We have also implemented [SampleDataWriter](http://huawei-noah.github.io/streamDM/docs/SampleDataWriter.html), which can call data generators
to create sample data for simulation or testing.

In the next release of streamDM, we are going to add:

* Classification: Random Forests
* Multi-label: Hoeffding Tree ML, Random Forests ML
* Frequent Itemset Miner: IncMine

For future work, we are considering:

* Regression: Hoeffding Regression Tree, Bagging, Random Forests
* Clustering: Clustree, DenStream
* Frequent Itemset Miner: IncSecMine

## Going Further

For a quick introduction to running streamDM, refer to the [Getting
Started](http://huawei-noah.github.io/streamDM/docs/GettingStarted.html) document. The streamDM [Programming
Guide](http://huawei-noah.github.io/streamDM/docs/Programming.html) presents a detailed view of streamDM. The full API
documentation is available [here](http://huawei-noah.github.io/streamDM/api/index.html).

## Environment
* Spark 2.3.2
* Scala 2.11
* SBT 0.13
* Java 8+
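
For a separate project that targets the same environment, the versions above might translate into a `build.sbt` fragment like the following. This is a sketch, not streamDM's own build file: the patch version of Scala and the `provided` scope are assumptions.

```scala
// build.sbt (sketch for a hypothetical downstream project)
scalaVersion := "2.11.12" // any 2.11.x release should match; the patch version is an assumption

libraryDependencies ++= Seq(
  // "provided" assumes the Spark runtime supplies these jars when the job is submitted
  "org.apache.spark" %% "spark-core"      % "2.3.2" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.3.2" % "provided"
)
```

The SBT version itself is normally pinned in `project/build.properties` through the `sbt.version` key.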

## Mailing lists
### User support and questions mailing list:
streamdm-user@googlegroups.com
### Development related discussions:
streamdm-dev@googlegroups.com
