https://github.com/ertis-research/kafka-ml

# Kafka-ML: connecting the data stream with ML/AI frameworks

Kafka-ML is a framework to manage the pipeline of TensorFlow/Keras and PyTorch
(Ignite) machine learning (ML) models on Kubernetes. The pipeline allows the
design, training, and inference of ML models. The training and inference
datasets for the ML models can be fed through Apache Kafka, so they can be
directly connected to data streams such as those provided by the IoT.

ML models can be easily defined in the Web UI with no need for external
libraries and executions, providing an accessible tool for both experts and
non-experts on ML/AI.



You can find more information about Kafka-ML and its architecture in the
open-access publication below:

> _C. Martín, P. Langendoerfer, P. Zarrin, M. Díaz and B. Rubio
>
> **Kafka-ML: connecting the data stream with ML/AI frameworks**
>
> Future Generation Computer Systems, 2022, vol. 126, p. 15-33
>
> [10.1016/j.future.2021.07.037](https://www.sciencedirect.com/science/article/pii/S0167739X21002995)_

If you wish to reuse Kafka-ML, please cite the above-mentioned paper. Below you
can find a BibTeX reference:

```bibtex
@article{martin2022kafka,
  title={Kafka-ML: connecting the data stream with ML/AI frameworks},
  author={Mart{\'\i}n, Cristian and Langendoerfer, Peter and Zarrin, Pouya Soltani and D{\'\i}az, Manuel and Rubio, Bartolom{\'e}},
  journal={Future Generation Computer Systems},
  volume={126},
  pages={15--33},
  year={2022},
  publisher={Elsevier}
}
```

The Kafka-ML article was selected as the
[Spring 2022 Editor’s Choice Paper at Future Generation Computer Systems](https://www.sciencedirect.com/journal/future-generation-computer-systems/about/editors-choice)!
:blush: :book: :rocket:

## Table of Contents

- [Changelog](#changelog)
- [Deploy Kafka-ML in a fast way](#Deploy-Kafka-ML-in-a-fast-way)
- [Requirements](#Requirements)
- [Steps to run Kafka-ML](#Steps-to-run-Kafka-ML)
- [Troubleshooting](#Troubleshooting)
- [Usage](#usage)
- [Single models](#Single-models)
- [Distributed models](#Distributed-models)
- [Semi-supervised learning](#Semi-supervised-learning)
- [Incremental training](#Incremental-training)
- [Federated learning](#Federated-learning)
- [Installation and development](#Installation-and-development)
- [Requirements to build locally](#Requirements-to-build-locally)
- [Steps to build Kafka-ML](#Steps-to-build-Kafka-ML)
- [GPU configuration](#GPU-configuration)
- [Publications](#publications)
- [License](#license)

## Changelog

- [29/04/2021] Integration of distributed models.
- [05/11/2021] Automation of data types and reshapes for the training module.
- [20/01/2022] Added GPU support. ML code has been taken out of the backend.
- [04/03/2022] Added PyTorch ML Framework support!
- [08/04/2022] Added support for learning curve visualization, confusion matrix
generation and small changes to metrics visualization. Now datasets can be
split into training, validation and test.
- [26/05/2022] Included support for visualization of prediction data. Now you
can easily prototype and visualize your ML/AI application. You can train
models, deploy them for inference, and visualize your prediction data just
with data streams.
- [14/07/2022] Added incremental training support and configuration of training
parameters for the deployment of distributed models.
- [02/09/2022] Added real-time display of training parameters.
- [26/12/2022] Added indefinite incremental training support.
- [07/07/2023] Added federated training support (currently only for Tensorflow/Keras models).
- [28/09/2023] Federated learning enabled for distributed neural networks and incremental training.
- [05/07/2024] Added semi-supervised learning support.

## Deploy Kafka-ML in a fast way

### Requirements

- [Docker](https://www.docker.com/)
- [kubernetes>=v1.15.5](https://kubernetes.io/)

### Steps to run Kafka-ML

For a basic local installation, we recommend using Docker Desktop with
Kubernetes enabled. Please follow the installation guide on
[Docker's website](https://docs.docker.com/desktop/). To enable Kubernetes,
refer to
[Enable Kubernetes](https://docs.docker.com/desktop/kubernetes/#enable-kubernetes)

Once Kubernetes is running, open a terminal and run the following command:

```sh
# Uncomment only if you are running Kafka-ML on Apple Silicon
# export DOCKER_DEFAULT_PLATFORM=linux/amd64
kubectl apply -k "github.com/ertis-research/kafka-ml/kustomize/local?ref=v1.2"
```

This will install all the required components of Kafka-ML, plus Kafka, in the
namespace `kafkaml`. The UI will be available at http://localhost/. You can
continue with the [Usage](#usage) section to see how you can use Kafka-ML!

For a more advanced installation on Kubernetes, please refer to the
[kustomization guide](kustomize/README.md).

## Usage

To follow this tutorial, please deploy Kafka-ML as indicated in
[Deploy Kafka-ML in a fast way](#Deploy-Kafka-ML-in-a-fast-way) or
[Installation and development](#Installation-and-development).

### Single models

#### 1. Define an ML/AI model in Kafka-ML

Create a model in the Models tab with just a TF/Keras model source code and some
imports/functions if needed. This simple model for the MNIST dataset is an easy
way to start:

```py
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    # from_logits=False: the last layer already applies a softmax
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
```

Something similar should be done in case you wish to use PyTorch:

```py
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

    def loss_fn(self):
        return nn.CrossEntropyLoss()

    def optimizer(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

    def metrics(self):
        val_metrics = {
            "accuracy": Accuracy(),
            "loss": Loss(self.loss_fn())
        }
        return val_metrics

model = NeuralNetwork()
```

Note that the functions `loss_fn`, `optimizer`, and `metrics` must be defined.

Insert the ML code into the Kafka-ML UI.


#### 2. Define a configuration

A configuration is a set of models that can be grouped for training. This can be
useful when you want to evaluate and compare the metrics (e.g., loss and
accuracy) of a set of models, or just to define a group of them that can be
trained with the same data stream in parallel. A configuration can also contain
a single ML model.

#### 3. Deploy a configuration of models for training

Change the batch size, training and validation parameters in the Deployment
form. Use the same format and parameters as the TensorFlow methods _fit_ and
_evaluate_, respectively. Validation parameters are optional (they are only used
if _validation_rate>0 or test_rate>0_ in the stream data received).
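As an illustration of that format (a hypothetical helper, not the actual Kafka-ML parser), a string such as `batch_size=32, epochs=5, shuffle=True` maps onto _fit_/_evaluate_ keyword arguments like this:

```python
import ast

def parse_form_params(params: str) -> dict:
    """Parse a comma-separated parameter string from the Deployment form
    (e.g. "batch_size=32, epochs=5, shuffle=True") into keyword arguments
    that could be forwarded to fit/evaluate.

    Hypothetical helper for illustration; Kafka-ML's real parsing may differ.
    """
    call = ast.parse(f"f({params})", mode="eval").body
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}

# parse_form_params("batch_size=32, epochs=5, shuffle=True")
# -> {'batch_size': 32, 'epochs': 5, 'shuffle': True}
```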

Note: If you do not have the GPU(s) properly tuned, **set the "GPU Memory usage
estimation" parameter to 0**. Otherwise, the training component will be
deployed, but left in a pending state waiting to allocate GPU memory. If the pod
is described, it will show an `aliyun.com/gpu-mem` related warning. If you wish,
you can mark the last field to generate the confusion matrix at the end of
training.

Once the configuration is deployed, you will see one training result per model
in the configuration. Models are now ready to be trained and receive stream
data.

![](./images/training-results.png)

#### 4. Stream data for model training

Now, it is time to ingest the model(s) with your data stream for training and
maybe evaluation.

If you have used the MNIST model, you can use the example
`mnist_dataset_training_example.py`. You only need to set the _deployment_id_
attribute to the one generated in Kafka-ML (it may still be 1). This is how data
streams are matched with configurations and models during training. You may need
to install the Python libraries listed in `datasources/requirements.txt`.

If so, please execute the MNIST example for training:

```bash
python examples/MNIST_RAW_format/mnist_dataset_training_example.py
```

You can write your own example using the AvroSink (for Apache Avro types) and
RawSink (for simple types) sink libraries to send training and evaluation data
to Kafka. Remember, you always have to set the _deployment_id_ attribute to the
one generated in Kafka-ML.
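As a rough sketch of what such a sink does (an illustrative byte format, not the actual RawSink implementation in `datasources/`), each training sample can be serialized to raw bytes before being sent to Kafka:

```python
import struct

import numpy as np

def encode_raw_sample(image: np.ndarray, label: int) -> bytes:
    """Pack a label byte followed by the flattened uint8 image.

    Illustrative only: the real RawSink format is defined in datasources/.
    """
    return struct.pack("B", label) + image.astype(np.uint8).tobytes()

# Sending it to Kafka would then look like (requires a running broker and
# the kafka-python library; topic name is illustrative):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9094")
# producer.send("mnist-training", encode_raw_sample(image, label))
```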

#### 5. Model metrics visualization

Once the data stream has been sent and the models deployed and trained, you will
see the model metrics and results in Kafka-ML. You can now download the trained
models, or continue the ML pipeline and deploy a model for inference.

![](./images/training-metrics.svg)

If you wish to visualise the generated confusion matrix (in case it has been
requested) or some training and validation metrics (if any) per epoch, you can
access the following view for each training result.

![](./images/plot-view.svg)

In addition, from this view you can access this data in JSON format, allowing
you to generate new plots and other information for your reports.

#### 6. Deploy a trained model for inference

When deploying a model for inference, the parameters for the input data stream
will be automatically configured based on the data streams previously received,
although you can change them. Mostly, you will have to configure the number of
replicas you want to deploy for inference and the Kafka topics for input data
(values to predict) and output data (predictions).

Note: If you do not have the GPU(s) properly tuned, **set the "GPU Memory usage
estimation" parameter to 0**. Otherwise, the inference component will be
deployed, but left in a pending state waiting to allocate GPU memory. If the pod
is described, it will show an `aliyun.com/gpu-mem` related warning.

#### 7. Stream data for inference

Finally, test the inference deployed using the MNIST example for inference in
the topics deployed:

```bash
python examples/MNIST_RAW_format/mnist_dataset_inference_example.py
```

#### 8. Prediction (classification or regression) visualization

In the visualization tab, you can easily visualize your deployed models. First,
you need to configure how your model's prediction data will be visualized. Here
is the example for the MNIST dataset:

```json
{
  "average_updated": false,
  "average_window": 10000,
  "type": "classification",
  "labels": [
    { "id": 0, "color": "#fff100", "label": "Zero" },
    { "id": 1, "color": "#ff8c00", "label": "One" },
    { "id": 2, "color": "#e81123", "label": "Two" },
    { "id": 3, "color": "#ec008c", "label": "Three" },
    { "id": 4, "color": "#68217a", "label": "Four" },
    { "id": 5, "color": "#00188f", "label": "Five" },
    { "id": 6, "color": "#00bcf2", "label": "Six" },
    { "id": 7, "color": "#00b294", "label": "Seven" },
    { "id": 8, "color": "#009e49", "label": "Eight" },
    { "id": 9, "color": "#bad80a", "label": "Nine" }
  ]
}
```

You can specify one of two types of visualization: 'regression' and
'classification'. In classification mode, 'average_updated' determines whether
the current status displayed is based on the highest average score, and
'average_window' determines the window for calculating the average.
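The windowed-average logic can be sketched like this (a hypothetical helper, not Kafka-ML code; the window is assumed to use the same unit as the prediction timestamps):

```python
def displayed_label(history, now, window, num_labels):
    """Return the label id with the highest average score over the window.

    history: iterable of (timestamp, scores) pairs, where scores is a list
    of per-label prediction scores indexed by label id.
    Hypothetical helper illustrating 'average_updated'/'average_window'.
    """
    recent = [scores for t, scores in history if now - t <= window]
    if not recent:
        return None
    averages = [sum(s[i] for s in recent) / len(recent) for i in range(num_labels)]
    return max(range(num_labels), key=averages.__getitem__)
```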

For each output of your model, you have to define a label. 'id' represents the
position of the parameter in the model output (e.g., a temperature output that
is the second parameter of your model would have id 1), and with 'color' and
'label' you can set a color and a label to display for that parameter.

Once you set the configuration, you must also set the output topic where the
model is deployed, 'mnist-out' in our last example. After this, visualization
displays your data.

Here is an example in classification mode:

And in regression mode:

### Distributed models

#### 1. Define an ML/AI distributed model in Kafka-ML

Create a distributed model with just a TF/Keras model source code and some
imports/functions if needed. This distributed model consisting of three
sub-models for the MNIST dataset is an easy way to start:

```py
edge_input = keras.Input(shape=(28,28,1), name='input_img')
x = tf.keras.layers.Conv2D(28, kernel_size=(3,3), name='conv2d')(edge_input)
x = tf.keras.layers.MaxPooling2D(pool_size=(2,2), name='maxpooling')(x)
x = tf.keras.layers.Flatten(name='flatten')(x)
output_to_fog = tf.keras.layers.Dense(64, activation=tf.nn.relu, name='output_to_fog')(x)
edge_output = tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='edge_output')(output_to_fog)
edge_model = keras.Model(inputs=[edge_input], outputs=[output_to_fog, edge_output], name='edge_model')

fog_input = keras.Input(shape=64, name='fog_input')
output_to_cloud = tf.keras.layers.Dense(64, activation=tf.nn.relu, name='output_to_cloud')(fog_input)
fog_output = tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='fog_output')(output_to_cloud)
fog_model = keras.Model(inputs=[fog_input], outputs=[output_to_cloud, fog_output], name='fog_model')

cloud_input = keras.Input(shape=64, name='cloud_input')
x = tf.keras.layers.Dense(64, activation=tf.nn.relu, name='relu1')(cloud_input)
x = tf.keras.layers.Dense(128, activation=tf.nn.relu, name='relu2')(x)
x = tf.keras.layers.Dropout(0.2)(x)
cloud_output = tf.keras.layers.Dense(10, activation=tf.nn.softmax, name='cloud_output')(x)
cloud_model = keras.Model(inputs=cloud_input, outputs=[cloud_output], name='cloud_model')
```

Insert the ML code of each sub-model into the Kafka-ML UI separately. You will
have to specify the hierarchical relationships between the sub-models through
the "Upper model" field of the form (after checking the distributed box). In the
proposed example, the following relationships have to be defined: the upper
model of the Edge sub-model is the Fog model, and the upper model of the Fog
sub-model is the Cloud model (the Cloud sub-model is placed at the top of the
distributed chain, so it does not have any upper model).

#### 2. Define a configuration

Kafka-ML will only show those sub-models which are at the top of the distributed
chain. Choosing one of them will add its corresponding full distributed model to
the configuration.

#### 3. Deploy a configuration of distributed models for training

Deploy the configuration of distributed sub-models in Kubernetes for training.

Change the optimizer, learning rate, loss function, metrics, batch size,
training and validation parameters in the Deployment form. Use the same format
and parameters as the TensorFlow methods _fit_ and _evaluate_, respectively.
Optimizer, learning rate, loss function and metrics parameters are optional; if
not specified, the default values are taken: _adam_, _0.001_,
_sparse_categorical_crossentropy_ and _sparse_categorical_accuracy_,
respectively. Validation parameters are also optional (they are only used if
_validation_rate>0 or test_rate>0_ in the stream data received).

Once the configuration is deployed, you will see one training result per
sub-model in the configuration. The full distributed model is now ready to be
trained and to receive stream data.

![](./images/distributed-training-results.png)

#### 4. Stream data for model training

Now, it is time to ingest the distributed model with your data stream for
training and maybe evaluation.

If you have used the MNIST distributed model, you can use the example
`mnist_dataset_training_example.py`. You only need to set the _deployment_id_
attribute to the one generated in Kafka-ML (it may still be 1). This is how data
streams are matched with configurations and models during training. You may need
to install the Python libraries listed in `datasources/requirements.txt`.

If so, please execute the MNIST example for training:

```bash
python examples/MNIST_RAW_format/mnist_dataset_training_example.py
```

#### 5. Model metrics visualization

Once the data stream has been sent and the full distributed model deployed and
trained, you will see the sub-model metrics and results in Kafka-ML. You can now
download the trained sub-models, or continue the ML pipeline and deploy a model
for inference.

![](./images/distributed-training-metrics.png)

#### 6. Deploy a trained sub-model for inference

When deploying a sub-model for inference, the parameters for the input data
stream will be automatically configured based on the data streams previously
received, although you can change them. Mostly, you will have to configure the
number of replicas you want to deploy for inference and the Kafka topics for
input data (values to predict) and output data (predictions). Lastly, in case
you are deploying a sub-model for inference which is not the last one in the
distributed chain, you will also have to specify one more topic for upper data
(partial predictions) and a limit number (between 0 and 1). These two fields
work as follows: if the deployed inference gets prediction values lower than the
limit, it sends partial predictions to its upper model using the upper data
topic in order to continue the data processing there; if it gets prediction
values higher than the limit, it sends these final results to the output topic.
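The routing decision described above can be sketched as follows (hypothetical callbacks stand in for the actual Kafka sends):

```python
def route_prediction(scores, limit, send_output, send_upper, partial):
    """Route a sub-model's result along the distributed inference chain.

    scores: this sub-model's output probabilities; partial: the intermediate
    tensor destined for the upper model. send_output/send_upper are
    hypothetical callbacks standing in for Kafka producers.
    """
    if max(scores) >= limit:
        # Confident enough: publish the final result on the output topic.
        send_output(scores)
        return "output"
    # Not confident: forward the partial prediction to the upper sub-model.
    send_upper(partial)
    return "upper"
```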

#### 7. Stream data for inference

Finally, test the inference deployed using the MNIST example for inference in
the topics deployed:

```bash
python examples/MNIST_RAW_format/mnist_dataset_inference_example.py
```

### Semi-supervised learning

Semi-supervised learning is a type of machine learning that falls between supervised
and unsupervised learning. In supervised learning, the model is trained on a labeled
dataset, where each example is associated with a correct output or label. In unsupervised
learning, the model is trained on an unlabeled dataset, and it must learn to identify
patterns or structure in the data without any explicit guidance. Semi-supervised learning,
on the other hand, involves training a machine learning model on a dataset that contains
both labeled and unlabeled examples. The idea behind semi-supervised learning is to use
the small amount of labeled data to guide the learning process, while also leveraging
the much larger amount of unlabeled data to improve the model's performance.

Currently, the only framework that supports semi-supervised training is TensorFlow.
In this case, the usage example will be the same as the one presented for the
single models, only the configuration deployment form will change and will now
contain more fields.

As before, change the fields as desired. The new semi-supervised fields are
_unsupervised_rounds_ and _confidence_. Unsupervised rounds define the number of
rounds to iterate through the so-far unlabelled data. Confidence specifies the
minimum confidence the model must have in a prediction for an unlabelled sample
in order to assign that label to it. Both are optional; if not specified, the
default values are _5_ and _0.9_, respectively.
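The confidence mechanism can be sketched as a minimal pseudo-labelling helper (assuming a predict function that returns per-class probabilities; not the actual Kafka-ML implementation):

```python
import numpy as np

def pseudo_label(predict, unlabeled, confidence=0.9):
    """Split unlabeled samples into confidently pseudo-labeled and remaining.

    predict(x) is assumed to return per-class probabilities, one row per
    sample. Samples whose top probability reaches `confidence` receive that
    class as a pseudo-label; the rest stay unlabeled for the next round.
    Illustrative sketch, not Kafka-ML code.
    """
    probs = predict(unlabeled)
    confident = probs.max(axis=1) >= confidence
    labels = probs.argmax(axis=1)
    return unlabeled[confident], labels[confident], unlabeled[~confident]
```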

Once the configuration is deployed, you will see one training result per model
in the configuration. Models are now ready to be trained and receive stream
data. Now, it is time to ingest the model(s) with your data stream for training.

If you have used the MNIST model you can use the example
`mnist_dataset_unsupervised_training_example.py`. You may need to install the Python
libraries listed in datasources/requirements.txt.

If so, please execute the semi-supervised MNIST example for training:

```bash
python examples/MNIST_RAW_format/mnist_dataset_unsupervised_training_example.py
```

### Incremental training

Incremental training is a machine learning method in which input data is
continuously used to extend the existing model's knowledge, i.e., to further
train the model. It is a dynamic learning technique that can be applied when
training data becomes available gradually over time or when its size exceeds
system memory limits.

Currently, the only framework that supports incremental training is TensorFlow.
In this case, the usage example will be the same as the one presented for the
single models, only the configuration deployment form will change and will now
contain more fields.

As before, change the fields as desired. For this case, there are two types of
deployments: time-limited and indefinite. The new time-limited incremental field
is the stream timeout. This parameter configures how long the dataset will block
waiting for new messages before timing out. It is optional; if not specified,
the default value of _60000_ is taken.

The new indefinite incremental fields are: monitoring metric, direction, and
improvement. The monitoring metric is used to keep track of a specific metric
(of the user's choice) during the validation phase of the model's training. The
direction tells Kafka-ML in which direction the monitoring metric improves (as
it is configurable). Finally, the improvement establishes the margin beyond
which an automatic deployment of the model for inference is carried out, since
this training is indefinite in time. Monitoring metric and direction must be
specified. Improvement is optional; if not specified, the default value of
_0.05_ is taken.
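The redeployment decision can be sketched like this (a hypothetical helper; the direction values and exact semantics are illustrative, not the actual Kafka-ML logic):

```python
def should_redeploy(best_value, current_value, direction, improvement=0.05):
    """Decide whether an indefinitely trained model should be redeployed.

    direction: 'higher' if larger metric values are better (e.g. accuracy),
    'lower' if smaller ones are (e.g. loss). Redeploy when the monitored
    metric beats the best value seen so far by at least `improvement`.
    Hypothetical helper illustrating the monitoring metric / direction /
    improvement fields.
    """
    if direction == "higher":
        return current_value >= best_value + improvement
    return current_value <= best_value - improvement
```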

Once the configuration is deployed, you will see one training result per model
in the configuration. Models are now ready to be trained and receive stream
data. Now, it is time to ingest the model(s) with your data stream for training.

If you have used the MNIST model you can use the example
`mnist_dataset_online_training_example.py`. You may need to install the Python
libraries listed in datasources/requirements.txt.

If so, please execute the incremental MNIST example for training:

```bash
python examples/MNIST_RAW_format/mnist_dataset_online_training_example.py
```

### Federated learning

Federated learning is a privacy-preserving machine learning approach that enables
collaborative model training across multiple decentralized devices without the
need to transfer sensitive data to a central location. In federated learning,
instead of sending raw data to a central server for training, local devices
perform the training on their own data and only share model updates or gradients
with the central server. These updates are then aggregated to create an improved
global model, which is sent back to the devices for further training.
This distributed learning approach allows for the benefits of collective intelligence
while ensuring data privacy and reducing the need for large-scale data transfers.
Federated learning has gained popularity in scenarios where data is sensitive or
resides in diverse locations, such as mobile devices, healthcare systems, and IoT networks.

Currently, the only framework that supports federated learning is TensorFlow.
In this case, the usage example will be the same as the one presented for the
single models, only the configuration deployment form will change and will now
contain more fields.

As before, change the fields as desired. The new federated fields are:
aggregation_rounds, minimun_data, data_restriction and aggregation strategy. The
aggregation_rounds parameter configures the number of rounds the model will be
aggregated (an aggregation round is a round in which the model is trained with
the data of the devices and then aggregated with the other models). The
minimun_data parameter is the minimum amount of data that a device must have to
be able to participate in the training. The data_restriction parameter is the
data pattern (such as input shape, labels, etc.) that the data must match to be
able to participate. Finally, the aggregation strategy parameter is the strategy
used to aggregate the models. Currently, the only strategy available is the
average strategy, which consists of averaging the weights of the models.
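The average strategy can be sketched as follows (a minimal FedAvg-style sketch, assuming each client reports its model weights as a list of NumPy arrays; not the actual Kafka-ML implementation):

```python
import numpy as np

def average_weights(client_weights):
    """Average strategy: element-wise mean of the clients' weights, layer by
    layer. client_weights is a list (one entry per client) of lists of
    NumPy arrays (one array per layer). Minimal sketch, not Kafka-ML code.
    """
    return [np.mean(np.stack(layers), axis=0) for layers in zip(*client_weights)]
```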

Once the configuration is deployed, you will see one training result per model
in the configuration. Models are now ready to be aggregated and they are sent
to the devices for training. Now, if the devices have data that meets the
requirements, they will train the model and send the weights to the server for
aggregation. Once the aggregation is finished, the new model will be sent to
the devices for training again. This process will be repeated until the number
of rounds specified in the configuration is reached.

If you have used the MNIST model you can use the example
`mnist_dataset_federated_training_example.py`. You only need to configure the
_deployment_id_ attribute to the one generated in Kafka-ML, maybe it is still 1.
This is the way to match data streams with configurations and models during
training. You may need to install the Python libraries listed in
datasources/requirements.txt.

If so, please execute the federated MNIST example for training:

```bash
python examples/FEDERATED_MNIST_RAW_format/mnist_dataset_federated_training_example.py
```

## Installation and development

### Requirements to build locally

- [Python versions supported by TensorFlow (3.5–3.7) and PyTorch 1.10](https://www.python.org/)
- [Node.js](https://nodejs.org/)
- [Docker](https://www.docker.com/)
- [kubernetes>=v1.15.5](https://kubernetes.io/)

### Steps to build Kafka-ML

In this repository you can find files to build Kafka-ML in case you want to
contribute.

If you want to build Kafka-ML quickly, set the variable `LOCAL_BUILD` to `true`
in the build scripts and modify the deployment files to use the local images.
Once that is done, you can run the build scripts.

By default, Kafka-ML will be built using CPU-only images. If you desire to build
Kafka-ML with images enabled for GPU acceleration, the `Dockerfile` and
`requirements.txt` files of `mlcode_executor`, `model_inference` and
`model_training` modules must be modified as indicated in those files.

If you want to build Kafka-ML step by step, follow these steps:

1. You may need to deploy a local registry to upload your Docker images. You can
deploy it on port 5000:

```bash
docker run -d -p 5000:5000 --restart=always --name registry registry:2
```

2. Build the backend and push the image into the local registry:

```bash
cd backend
docker build --tag localhost:5000/backend .
docker push localhost:5000/backend
```

3. Build the ML Code Executors and push the images into the local registry:

3.1. Build the TensorFlow Code Executor and push the image into the local
registry:

```bash
cd mlcode_executor/tfexecutor
docker build --tag localhost:5000/tfexecutor .
docker push localhost:5000/tfexecutor
```

3.2. Build the PyTorch Code Executor and push the image into the local
registry:

```bash
cd mlcode_executor/pthexecutor
docker build --tag localhost:5000/pthexecutor .
docker push localhost:5000/pthexecutor
```

4. Build the model_training components and push the images into the local
registry:

```bash
cd model_training/tensorflow
docker build --tag localhost:5000/tensorflow_model_training .
docker push localhost:5000/tensorflow_model_training

cd ../pytorch
docker build --tag localhost:5000/pytorch_model_training .
docker push localhost:5000/pytorch_model_training
```

5. Build the kafka_control_logger component and push the image into the local
registry:

```bash
cd kafka_control_logger
docker build --tag localhost:5000/kafka_control_logger .
docker push localhost:5000/kafka_control_logger
```

6. Build the model_inference component and push the image into the local
registry:

```bash
cd model_inference/tensorflow
docker build --tag localhost:5000/tensorflow_model_inference .
docker push localhost:5000/tensorflow_model_inference

cd ../pytorch
docker build --tag localhost:5000/pytorch_model_inference .
docker push localhost:5000/pytorch_model_inference
```

7. Install the libraries, build the frontend and push the image into the local
registry:
```bash
cd frontend
npm install # nvm install 10 & nvm use 10.24.1
npm i -g @angular/cli@9.1.15
ng build -c production
docker build --tag localhost:5000/frontend .
docker push localhost:5000/frontend
```

### Deploying Kafka-ML in a single node Kubernetes cluster (e.g., minikube, Docker desktop)

Once the images are built, you can deploy the system components in Kubernetes in
the following order:

```bash
kubectl apply -f zookeeper-pod.yaml
kubectl apply -f zookeeper-service.yaml

kubectl apply -f kafka-pod.yaml
kubectl apply -f kafka-service.yaml

kubectl apply -f backend-deployment.yaml
kubectl apply -f backend-service.yaml

kubectl apply -f frontend-deployment.yaml
kubectl apply -f frontend-service.yaml

kubectl apply -f tf-executor-deployment.yaml
kubectl apply -f tf-executor-service.yaml

kubectl apply -f pth-executor-deployment.yaml
kubectl apply -f pth-executor-service.yaml

kubectl apply -f kafka-control-logger-deployment.yaml
```

Finally, you will be able to access the Kafka-ML Web UI: http://localhost/

### Deploying Kafka-ML in a distributed Kubernetes cluster

#### Configuring the back-end

The first thing to keep in mind is that the images we built earlier were
intended for a single-node cluster (localhost) and cannot be downloaded from a
distributed Kubernetes cluster. Therefore, assuming that we are going to upload
them into a registry as before, on a node with IP x.x.x.x, we would have to do
the same for all the images as in the following backend example:

```bash
cd backend
docker build --tag x.x.x.x:5000/backend .
docker push x.x.x.x:5000/backend
```

Now, we have to update the location of these images in the
`backend-deployment.yaml` file:

```yaml
containers:
- - image: localhost:5000/backend
+ - image: x.x.x.x:5000/backend

- name: BOOTSTRAP_SERVERS
value: kafka-cluster:9092 # You can specify all the Kafka Bootstrap Servers that you have. e.g.: kafka-cluster-2:9092,kafka-cluster-3:9092,kafka-cluster-4:9092,kafka-cluster-5:9092,kafka-cluster-6:9092,kafka-cluster-7:9092

- name: TRAINING_MODEL_IMAGE
- value: localhost:5000/model_training
+ value: x.x.x.x:5000/model_training
- name: INFERENCE_MODEL_IMAGE
- value: localhost:5000/model_inference
+ value: x.x.x.x:5000/model_inference
- name: FRONTEND_URL
- value: http://localhost
+ value: http://x.x.x.x
```

The same should be done in the `frontend-deployment.yaml` file:

```yaml
containers:
- - image: localhost:5000/frontend
+ - image: x.x.x.x:5000/frontend

- name: BACKEND_URL
- value: http://localhost:8000
+ value: http://x.x.x.x:8000
```

To be able to deploy components in a Kubernetes cluster, we need to create a
service account, give access to that account and generate a token:

```bash
$ sudo kubectl create serviceaccount k8sadmin -n kube-system

$ sudo kubectl create clusterrolebinding k8sadmin --clusterrole=cluster-admin --serviceaccount=kube-system:k8sadmin

$ sudo kubectl -n kube-system describe secret $(sudo kubectl -n kube-system get secret | (grep k8sadmin || echo "$_") | awk '{print $1}') | grep token: | awk '{print $2}'
```

With the obtained token in the last step, we have to change the **KUBE_TOKEN**
env var to include it, and the **KUBE_HOST** var to include the URL of the
Kubernetes master (e.g., https://IP_MASTER:6443) in the
`backend-deployment.yaml` file:

```yaml
- name: KUBE_TOKEN
  value: # include token here (and remove #)
- name: KUBE_HOST
  value: # include kubernetes master URL here
```

Finally, to allow access to the back-end from outside Kubernetes, we can assign
an available cluster node IP to the back-end service in Kubernetes. For example,
given the IP y.y.y.y of a node in the cluster, we could include it in the
`backend-service.yaml` file:

```yaml
type: LoadBalancer
+ externalIPs:
+ - y.y.y.y
```

Add this IP also to the **ALLOWED_HOSTS** env var in the
`backend-deployment.yaml` file:

```yaml
- name: ALLOWED_HOSTS
  value: y.y.y.y, localhost
```

### GPU configuration

The following steps are required in order to use GPU acceleration in Kafka-ML
and Kubernetes. They must be performed on all the Kubernetes nodes.

1. GPU Driver installation

```bash
# SSH into the worker machine with GPU
$ ssh USERNAME@EXTERNAL_IP

# Verify ubuntu driver
$ sudo apt install ubuntu-drivers-common
$ ubuntu-drivers devices

# Install the recommended driver
$ sudo ubuntu-drivers autoinstall

# Reboot the machine
$ sudo reboot

# After the reboot, test if the driver is installed correctly
$ nvidia-smi
```

2. Nvidia Docker installation

```bash
# SSH into the worker machine with GPU
$ ssh USERNAME@EXTERNAL_IP

# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker
```

3. Modify the following file

```bash
# SSH into the worker machine with GPU
$ ssh USERNAME@EXTERNAL_IP
$ sudo tee /etc/docker/daemon.json <