Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/daleonpz/stwin_ai_vowel_recognition

Recognition of air-handwritten vowels in the air using the IMU of the STWIN module.
https://github.com/daleonpz/stwin_ai_vowel_recognition
air-writing imu machine-learning tinyml vowels
Last synced: 5 days ago
JSON representation
Recognition of air-handwritten vowels in the air using the IMU of the STWIN module.
Host: GitHub
URL: https://github.com/daleonpz/stwin_ai_vowel_recognition
Owner: daleonpz
License: mit
Created: 2023-01-05T13:37:58.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2023-05-13T07:32:08.000Z (over 1 year ago)
Last Synced: 2023-05-13T08:23:15.959Z (over 1 year ago)
Topics: air-writing, imu, machine-learning, tinyml, vowels
Language: C
Homepage:
Size: 26 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        # Clone repo

```

git clone [email protected]:daleonpz/stwin_AI_vowel_recognition.git

dvc remote add origin https://dagshub.com/daleonpz/stwin_AI_vowel_recognition.dvc

dvc pull -r origin

```

# Introduction

The objective of this project was to train a neural network for **vowel recognition** based on **Inertial Movement Unit (IMU)** measurements and deploy it on the dev board [STEVAL-STWINKT1](https://www.st.com/en/evaluation-tools/steval-stwinkt1.html) from STM32.

The neural network is based on two layers of CNN + BN + ReLU, a Global Average Pooling layers and a Fully Connected + Softmax layer. The size of the neural network is **13KB**, and has **95% accuracy on the testset**. 

Workflow overview:

![Project Workflow](/docs/_images/workflow.png)

## Used tools

- Pytorch 1.10.2+cu102

- ONNIX-tensorflow 1.10.0

- STM32 CUBE-AI 7.1.0

- Sensor ISM330DHCX

- 18.04.1-Ubuntu

# Problem definition 

Vowel recognition is a classification problem, so I defined the following setup for the neural network.

* Number of classes: 5 (A, E, I, O, U)

* Data:

    * 200 samples/class = 1000 samples

    * Split 80/10/10

* Input: a 20x20x6 matrix based on IMU data.

* Output: a 5x1 probability vector.

* Architecture: Convolutional Neural Network + Fully Connected + Softmax. 

* Loss function: Cross Entropy, which is typical in multiclass classification. 

#  Data collection and preparation

I collected **200 samples for each class** and each sample corresponds to **2sec** of data from the **ISM330DHCX sensor** at a **sampling rate of 200Hz**. I decided to collect data from the accelerometer and gyroscope, because it has been proven that using multiple sensors improves gesture characterization. [1](https://www.mdpi.com/1424-8220/21/17/5713), [2](https://www.mdpi.com/2076-3417/10/18/6507), [3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6603535/).

![Class pattern](/docs/_images/vowels_pattern.png)

That means that each data sample have the following form. $s^T = [ a_x, a_y, a_z, g_x, g_y, g_z]$, where $a_i$ and $g_i$ are the accelerometer and gyroscope measurements along i-axis. 

```math

\mathbf{F} = 

\begin{bmatrix}

 s^T_1 \\

 \vdots \\ 

 s^T_{200}  

\end{bmatrix} = \begin{bmatrix} \mathbf{a_x} & \mathbf{a_y} & \mathbf{a_z} &\mathbf{g_x} & \mathbf{g_y} & \mathbf{g_z}  \end{bmatrix}

= \begin{bmatrix} \mathbf{f_1} & \mathbf{f_2} & \mathbf{f_3} &\mathbf{f_4} & \mathbf{f_5} & \mathbf{f_6}  \end{bmatrix}

``` 

such that 

```math

\mathbf{a_x} = \begin{bmatrix} a^{(1)}_x \\ \vdots \\  a^{(200)}_x  \end{bmatrix},

\mathbf{g_x} = \begin{bmatrix} g^{(1)}_x \\ \vdots \\  g^{(200)}_x  \end{bmatrix}

```

Before inputting data into the model, I conducted the normalization operation. Since the IMU signals differ in value and range, thus I normalize each signal $\mathbf{f_i}$ between (1, 0) with the following function.

```math

\mathbf{f_n(i)} = \frac{\mathbf{f_i} - a_{min}}{  a_{max} - a_{min} }, i = 1,2,3

```

```math

\mathbf{f_n(i)} = \frac{\mathbf{f_i} - g_{min}}{  g_{max} - g_{min} }, i = 4,5,6

```

where

```math

a_{min} = \min_{ a_i \in \{ \mathbf{a_x}, \mathbf{a_y},\mathbf{a_z} \} } a_i^{(j)} , j = 1,\cdots , 200 \\

```

```math

a_{max} = \max_{ a_i \in \{ \mathbf{a_x}, \mathbf{a_y},\mathbf{a_z} \} } a_i^{(j)} , j = 1,\cdots , 200 \\

```

```math

g_{min} = \min_{ a_i \in \{ \mathbf{a_x}, \mathbf{a_y},\mathbf{a_z} \} } a_i^{(j)} , j = 1,\cdots , 200 \\ 

```

```math

g_{max} = \max_{ a_i \in \{ \mathbf{a_x}, \mathbf{a_y},\mathbf{a_z} \} } a_i^{(j)} , j = 1,\cdots , 200

```

Thus the normalized feature matrix $\mathbf{F_n}$ is defined as follows: 

```math

\mathbf{F_n} = \begin{bmatrix} \mathbf{f_n(1)} & \mathbf{f_n(2)} & \mathbf{f_n(3)} &\mathbf{f_n(4)} & \mathbf{f_n(5)} & \mathbf{f_n(6)}  \end{bmatrix}

``` 

[4](https://arxiv.org/vc/arxiv/papers/1803/1803.09052v1.pdf) showed that encoding feature vectors or matrices as images could take advantage of the performance of CNN on images. Thus the shape of input image $\mathbf{I_f}$ is 20x20x6. 

![Input image](/docs/_images/input.png)

For the data collection I wrote some python scripts and C code for the microcontroller. All the relevant code for this task can be found [here](/datacollector).  The main script is [collect.sh](/datacollector/pythonscripts/collect.sh), where I defined the labels for each class. 

# Neural Network Architecture

I tried two architectures based on Convolution neural and fully connected networks. The papers I used as a reference are the following:

- Jianjie, Lu & Raymond, Tong. (2018). Encoding Accelerometer Signals as Images for Activity Recognition Using Residual Neural Network. 

- Jiang, Yujian & Song, Lin & Zhang, Junming & Song, Yang & Yan, Ming. (2022). Multi-Category Gesture Recognition Modeling Based on sEMG and IMU Signals. Sensors. 22. 5855. 10.3390/s22155855. 

## First architecture

The first architecture was the following. 

![first architecture](/docs/_images/CNN1.png)

* Number of parameters: 

   * Conv2D + Batch + ReLU: $24*6*3*3 + 24 + 24 + 24 + 0 = 1368$

   * Conv2D + Batch + ReLU: $48*24*3*3 + 48 + 48 + 48 + 0 = 10512$

   * FC + ReLU:  $512*4800 + 512 + 0 = 2458112$

   * FC + ReLU:  $512*32 + 32 + 0 = 16416$

   * FC:    $5*32 + 5 = 165$

   * Total number of parameters: 2486573 

* Number of elements per parameter: 4 (float)

* Size of the model:  $2486573 * 4 = 9713.754 KB = 9.48 MB$ 

As you can see, the model size is too big 9.48 MB for the STWINK devboard, which has ARM Cortex-M4 MCU with 2048 kbytes of flash. But I still trained the model to see if I was on a right direction.

Here are the results:

| Model        | n_samples | num_epochs | learning_rate | criterion    | optimizer | batch_size | train_acc | train_loss | val_acc | val_loss | test_acc | test_loss | size(float32) |

| ------------ | --------- | ---------- | ------------- | ------------ | --------- | ---------- | --------- | ---------- | ------- | -------- | -------- | --------- | ------------- |

| cnn (512,32) | 200       | 10         | 0.00001       | CrossEntropy | Adam      | 16         | 91.67     | 0.514      | 83.33   | 0.544    | 82.5     | 0.5727    | 9MB           |

| cnn (512,32) | 200       | 50         | 0.00001       | CrossEntropy | Adam      | 16         | 100       | 0.328      | 95      | 0.376    | 97.5     | 0.3458    | 9MB           |

| cnn (512,32) | 200       | 20         | 0.00001       | CrossEntropy | Adam      | 16         | 96.67     | 0.402      | 86.67   | 0.469    | 92.5     | 0.4352    | 9MB           |

As you can see, the accuracy in all datasets (train, validation and test) is above 90% which is a good indication. Although I plotted the curves "Accuracy vs epoch" and "Loss vs epoch", as well as the "confusion matrix" that validate performance of this model, I do not shown them because this model was not deployed.

## Model 

The model I ended up deploying is the following

![Architecture](/docs/_images/CNN2.png)

* Number of parameters: 

   * Conv2D + Batch + ReLU: $12*6*3*3 + 12 + 12 + 12 + 0  = 684$

   * Conv2D + Batch + ReLU: $24*12*3*3 + 24 + 24 + 24 + 0 = 2664$

   * Global Average Pooling:  $0$. With this layer I could reduce the number of parameters and code size

   * FC:    $5*24 + 5 = 125$

   * Total number of parameters: 3473 

* Number of elements per parameter: 4 (float)

* Size of the model:  $3473 * 4 = 13.863 KB$ 

This model is small enough to be deployed on the microcontroller, it uses only 0.68% of the available flash memory.

The results of training are the following:

![Results](/docs/_images/results.png)

The overall performance is above 90%. 

For the training I wrote python scripts and C files for the microcontroller. All the relevant code for this task can be found [here](/trainer).  Run `python3 train_model.py  --dataset_path ../data --num_epochs 200 --learning_rate 0.0005 --batch_size 128` to train the model. 

```sh

Ep 199/200: Accuracy : Train:98.25       Val:83.00 || Loss: Train 0.956          Val 1.101:  99%|█████████████████████████████████████████████████████▍| 

Ep 199/200: Accuracy : Train:98.25       Val:83.00 || Loss: Train 0.956          Val 1.101: 100%|██████████████████████████████████████████████████████| 

Ep 199/200: Accuracy : Train:98.25       Val:83.00 || Loss: Train 0.956          Val 1.101: 100%|██████████████████████████████████████████████████████| 

              precision    recall  f1-score   support                                

                                                                                                                             

           0      1.000     1.000     1.000        23                                              

           1      0.966     1.000     0.982        28       

           2      0.955     0.875     0.913        24                                

           3      0.800     0.857     0.828        14                                     

           4      0.818     0.818     0.818        11                      

                                                                                        

    accuracy                          0.930       100                                        

   macro avg      0.908     0.910     0.908       100

weighted avg      0.931     0.930     0.930       100

Test: acc 93.0    loss 0.9912329316139221

```

# Deployment on the microcontroller

The CUBE-AI tool from STMicroelectronics doesn't support pytorch models, so I had two options. Either train the whole model again in tensorflow and then export it to tensorflow lite, or convert the pytorch model to ONNX (Open Neural Network Exchange). I chose the latter.

To export from pytorch to ONNX is straighforward:

```python

from models.cnn_2    import CNN

# Load pytorch model

loadedmodel     = CNN(fc_num_output=5, fc_hidden_size=[8]).to(DEVICE) # my model

loadedmodel.load_state_dict(torch.load('results/model.pth'))

loadedmodel.eval()

# Fuse some modules. it may save on memory access, make the model run faster, and improve its accuracy.

# https://pytorch.org/tutorials/recipes/fuse.html

torch.quantization.fuse_modules(loadedmodel,

                                [['conv1', 'bn1','relu1'], 

                                 ['conv2', 'bn2','relu2']],

                                inplace=True)

# Convert to ONNX. 

# Explanation on why we need a dummy input

# https://github.com/onnx/tutorials/issues/158

dummy_input = torch.randn(1, 6, 20, 20) 

torch.onnx.export(loadedmodel,

                  dummy_input, 

                  'model.onnx', 

                  input_names=['input'], 

                  output_names=['output'])

```

Once the the model is exported in ONNX format, it's time to import it into cube ai 7.1.0. STM has a CLI tool `stm32ai` that imports the ONNX model and generates the corresponding C files. 

```sh

$ stm32ai generate -m /model.onnx --type onnx -o  --name 

```

I wrote a script for that ([link](/trainer/create_C_files_for_stm32.sh)).  The script generates the following files:

```sh

$  ll

drwxrwxr-x 2 me me   4096 13.01.2023 22:56 stm32ai_ws/

-rw-rw-r-- 1 me me  24356 13.01.2023 22:56 model.c

-rw-rw-r-- 1 me me   1500 13.01.2023 22:56 model_config.h

-rw-rw-r-- 1 me me  90718 13.01.2023 22:56 model_data.c

-rw-rw-r-- 1 me me   2624 13.01.2023 22:56 model_data.h

-rw-rw-r-- 1 me me  17926 13.01.2023 22:56 model_generate_report.txt

-rw-rw-r-- 1 me me   8766 13.01.2023 22:56 model.h

```

[Here](/docs/embedded_client_api.html) you can find the documentation in HTML of the API for the CUBE-AI framework.  You could also see the function definitions in my code in the following links:

- [Tensors definition](/deployment/Core/Src/main.c#L63)

- [API initialization](/deployment/Core/Src/main.c#L118)

- [Inference](/deployment/Core/Src/main.c#L159)

## C Implementation Details

For the deployment the following modules were required:

- Ring buffer of size 600x6, because the sampling rate of the sensor is 200Hz and each sample is 2 seconds that means I needed at least an array of (200x2x6) elements. [link](/deployment/Core/Src/ring_buffer.c)

- Normalization of the samples between (0,1) before feeding the model. [link](/deployment/Core/Src/main.c#L224)

- Inference module with a threshold to detect the vowels. I set the threshold at 0.8.  [link](/deployment/Core/Src/main.c#L247)

Since the model is small and the quantization model from Pytorch is not as good as from Tensorflow, I decide to use floats for my inference. In the future, I would build a quantized model using tensorflow and see how it performs. 

The C code can be found under [deployment](/deployment). The main is defined in the [main.c](/deployment/Core/Src/main.c) file. 

# Results

Here are the results of the inference on the microcontroller.  

  

```sh

Welcome to minicom 2.7.1

OPTIONS: I18n

Compiled on Aug 13 2017, 15:25:34.

Port /dev/ttyACM0, 19:45:52

Press CTRL-A Z for help on special keys

        (HAL 1.13.0_0)

        Compiled Jan 17 2023 19:34:40 (STM32CubeIDE)

        Send Every    5mS Acc/Gyro/Magneto

system clock ----> 120000000

Testing label: 0

model output:

0.966853

0.004895

0.027935

0.000016

0.000300

inference result: 0

gesture: A

Testing label: 1

model output:

0.000000

0.999976

0.000006

0.000018

0.000000

inference result: 1

gesture: E

Testing label: 2

model output:

0.001132

0.000371

0.990676

0.000235

0.007587

inference result: 2

gesture: I

Testing label: 3

model output:

0.000021

0.007765

0.000039

0.990865

0.001310

inference result: 3

gesture: O

Testing label: 4

model output:

0.000081

0.000070

0.000067

0.005766

0.994015

model result: 4

gesture: U

```

# TODO

I still having troubles with the timing between sensor reading and model inference. For that reason, I wanted to make sure that the inference on the microcontroller works, so I fed the model with some samples (`ai float` arrays) from the Testset and run the inference, which was successful.  

# Update

I fixed the timing problem and the model is working, but it is still difficult for the model to differentiate between A and I, as the patterns are similar.