https://github.com/eric-canas/drums-app

Play drums in your browser with your webcam
browser-game computer-vision deep-learning keras music-generation neural-network tensorflow-js

# Drums-app
Play Drums in your Browser.

Drums-app lets you simulate any percussion instrument in your browser, using only your webcam. All machine learning models run locally, so no user information is sent to the server.

Check the demo at drums-app.com

### Quick Start

Simply serve src/index.html from a local web server, or visit drums-app.com.

Select **Set Template** for building your own drums template by uploading some images and attaching your sounds to them.

Set Template

Turn on your **webcam** and enjoy it!

Play!
*No cats were harmed during this recording

# Implementation Details

This web application is built with MediaPipe and TensorFlow.js.
The pipeline uses two Machine Learning models.


  • Hands Model: A Computer Vision model offered by MediaPipe for detecting 21 landmarks for each hand (x, y, z).

  • HitNet: An LSTM model developed in Keras for this application and then converted to TensorFlow.js. It takes the last N positions of a hand and predicts the probability that this sequence corresponds to a hit.

## HitNet Details

### Building the Dataset

The dataset used for training has been built in the following way:


  1. A representative landmark (Index Finger Dip [Y]) of each detected hand is plotted in an interactive chart, using Chart.js.

  2. Any time that a key is pressed, a grey mark is plotted on the same chart.

  3. I start to play drums with one hand while pressing a key on the keyboard (with the other hand) every time that I beat an imaginary drum. [Gif Left]

  4. I use the mouse for selecting in the chart those points that should be considered as a hit. [Gif Right]

  5. When the "Save Dataset" button is clicked, all hand positions, together with their corresponding tags (1 if the frame was considered a hit, 0 otherwise), are downloaded as a JSON file.


[Gif Left: DatasetGeneration]
[Gif Right: DataTag]
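
As a rough illustration of step 5, the following Python sketch pairs each recorded frame with a 0/1 tag and serializes the result as JSON. The helper `build_dataset` and the field names `landmarks` and `tag` are hypothetical; the actual layout of the file produced by the app is not described here.

```python
import json

def build_dataset(frames, hit_indices):
    """Pair every frame with a tag: 1 if its index was selected as a hit, else 0.

    frames: list of per-frame hand positions (any JSON-serializable value).
    hit_indices: indices of the frames selected as hits in the chart.
    """
    hits = set(hit_indices)
    return [
        {"landmarks": frame, "tag": 1 if i in hits else 0}
        for i, frame in enumerate(frames)
    ]

# Toy example: 5 frames with a single (x, y, z) landmark each; frame 3 is a hit.
frames = [[[0.5, 0.10 * i, 0.0]] for i in range(5)]
dataset = build_dataset(frames, hit_indices=[3])
serialized = json.dumps(dataset)
```

Keeping the tag next to each frame makes it easy to slice the file into fixed-length sequences later, since every sequence can read its label from its last frame.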

### Defining the Architecture

HitNet has been built in Python, using Keras, and then exported to TensorFlow.js. To avoid any perceptible lag between the hit on the drum and the sound it produces, **HitNet** must run as fast as possible. For this reason, it uses an extremely simple architecture.

HitNet Architecture

It takes as input the last 4 detections of a hand (a flattened version of its 21 landmarks (x, y, z), i.e. 63 values per detection) and outputs the probability that this sequence corresponds to a hit. It consists only of an LSTM layer followed by a ReLU activation (using dropout with p = 0.25) and a Dense output layer with a single unit, followed by a sigmoid activation.
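
The input layout described above can be sketched in plain Python as follows. This only illustrates how 4 detections of 21 (x, y, z) landmarks become a 4 x 63 input; the class `HandWindow` and its method names are hypothetical and are not taken from the app's code.

```python
from collections import deque

NUM_LANDMARKS = 21   # landmarks per hand reported by the Hands model
SEQ_LEN = 4          # number of past detections fed to HitNet

def flatten_hand(landmarks):
    """Flatten 21 (x, y, z) landmarks into one list of 63 values."""
    assert len(landmarks) == NUM_LANDMARKS
    return [coord for point in landmarks for coord in point]

class HandWindow:
    """Keeps the last SEQ_LEN flattened detections of one hand."""

    def __init__(self):
        self.frames = deque(maxlen=SEQ_LEN)  # old frames drop out automatically

    def push(self, landmarks):
        self.frames.append(flatten_hand(landmarks))

    def ready(self):
        return len(self.frames) == SEQ_LEN

    def as_input(self):
        """Return a SEQ_LEN x 63 nested list, oldest frame first."""
        return [list(f) for f in self.frames]

window = HandWindow()
for t in range(6):  # push 6 synthetic detections; only the last 4 are kept
    window.push([(0.1 * t, 0.2, 0.0)] * NUM_LANDMARKS)
model_input = window.as_input()
```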

### Training the model

HitNet has been trained in Keras, using the following parameterization:


  • Epochs: 3000.

  • Optimizer: Adam.

  • Loss: Weighted Binary Cross Entropy*.

  • Training/Val Split: 0.85-0.15.

  • Data Augmentation:


    • Mirroring: X axis.

    • Shift: Shift applied in block for the whole sequence.


      • X Shift: ±0.3.

      • Y Shift: ±0.3.

      • Z Shift: ±0.5.


    • Interframe Noise: Small shift applied independently to each frame of the sequence.


      • Interframe Noise X: ±0.01.

      • Interframe Noise Y: ±0.01.

      • Interframe Noise Z: ±0.0025.


    • Intraframe Noise: Extremely small shift applied independently to each single part of a hand.


      • Intraframe Noise X: ±0.0025.

      • Intraframe Noise Y: ±0.0025.

      • Intraframe Noise Z: ±0.0001.
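
A minimal sketch of the three augmentation levels listed above, using only the Python standard library. The ranges come from the list; everything else is an assumption made for illustration: offsets are drawn uniformly, x is taken to be normalized to [0, 1] so that mirroring is `x -> 1 - x`, and `mirror_prob` is a hypothetical parameter.

```python
import random

SHIFT = (0.3, 0.3, 0.5)                 # block shift range per axis (x, y, z)
INTERFRAME = (0.01, 0.01, 0.0025)       # per-frame noise range
INTRAFRAME = (0.0025, 0.0025, 0.0001)   # per-landmark noise range

def _uniform3(ranges, rng):
    """Draw one offset per axis, uniformly in [-r, +r] (assumed distribution)."""
    return [rng.uniform(-r, r) for r in ranges]

def augment(sequence, rng, mirror_prob=0.5):
    """Return an augmented copy of a sequence of frames.

    sequence: list of frames, each a list of [x, y, z] landmarks.
    Assumes x is normalized to [0, 1], so mirroring is x -> 1 - x.
    """
    mirror = rng.random() < mirror_prob
    block = _uniform3(SHIFT, rng)               # same shift for the whole sequence
    out = []
    for frame in sequence:
        inter = _uniform3(INTERFRAME, rng)      # new offset for each frame
        new_frame = []
        for x, y, z in frame:
            intra = _uniform3(INTRAFRAME, rng)  # new offset for each landmark
            if mirror:
                x = 1.0 - x
            point = [x, y, z]
            new_frame.append(
                [p + block[a] + inter[a] + intra[a] for a, p in enumerate(point)]
            )
        out.append(new_frame)
    return out

rng = random.Random(0)
sequence = [[[0.5, 0.5, 0.0] for _ in range(21)] for _ in range(4)]
augmented = augment(sequence, rng, mirror_prob=0.0)
```

Applying the large shift once per sequence moves the whole gesture without changing its shape, while the two smaller noise levels perturb the trajectory and the hand pose slightly.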






The weights exported to TensorFlow.js are not those of the last epoch, but those that minimized the validation loss at any intermediate epoch.

*Loss is weighted since the positive class is extremely underrepresented in the training set.
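
To make the footnote concrete, here is a small sketch of a weighted binary cross-entropy in plain Python. The exact weighting used for HitNet is not stated; weighting the positive class by the ratio of negatives to positives, as `balanced_pos_weight` does, is only one common choice and is an assumption here.

```python
import math

def weighted_bce(y_true, y_pred, pos_weight, eps=1e-7):
    """Mean binary cross-entropy where positive samples count pos_weight times."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def balanced_pos_weight(y_true):
    """Assumed weighting: ratio of negatives to positives."""
    positives = sum(y_true)
    return (len(y_true) - positives) / positives

labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # hits are rare
preds = [0.5] * 10
w = balanced_pos_weight(labels)
loss = weighted_bce(labels, preds, w)
```

With this weighting, the single positive sample contributes as much to the loss as the nine negatives combined, so the model cannot reach a low loss by always predicting "no hit".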

### Analyzing Results

Confusion matrices show that most samples of both classes are classified correctly when the confidence threshold is set at 0.5.

Train Confusion Matrix
Validation Confusion Matrix
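
For reference, a confusion matrix at a given confidence threshold can be computed as in the sketch below. The labels and probabilities are toy values for illustration only, not results from HitNet.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Return counts (tn, fp, fn, tp) for binary labels and predicted probabilities."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0  # probabilities at the threshold count as hits
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Toy values: three non-hits and two hits.
y_true = [0, 0, 1, 1, 0]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.2]
tn, fp, fn, tp = confusion_matrix(y_true, y_prob)
```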

Although these False Positives and False Negatives could worsen the user experience in a network that is executed several times each second, they do not really affect playtime in a real situation. This is due to three factors:


  1. Most False Positives come from the frames immediately before or after the hit. In practice, this is solved by emptying the sequence buffers every time a hit is detected.

  2. The small number of False Negatives in the train set come either from Data Augmentation or from hits that are detected on the previous or the following frame. In real use, these one-frame displacements do not affect the experience.

  3. The remaining False Positives do not usually appear in real use since, during playtime, only the sequences that include detections entering the predefined drums are analyzed. In practice, this works as a double check for the positive cases.
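
The first and third factors can be sketched as a small detection loop. This is an illustration under stated assumptions, not the app's code: `predict` is a stand-in for HitNet, drums are modelled as hypothetical axis-aligned rectangles, each position is a single (x, y) point rather than a full hand, and the region check is applied to the latest detection only.

```python
from collections import deque

SEQ_LEN = 4

def inside(point, region):
    """True if the (x, y) point falls inside an axis-aligned drum region."""
    x, y = point
    x0, y0, x1, y1 = region
    return x0 <= x <= x1 and y0 <= y <= y1

def detect_hits(positions, predict, drums, threshold=0.5):
    """Yield (frame_index, drum_name) for every detected hit.

    positions: iterable of (x, y) positions of one hand, one per frame.
    predict: callable mapping a list of SEQ_LEN positions to a hit probability
             (stands in for HitNet here).
    drums: dict mapping drum name to (x0, y0, x1, y1).
    """
    buffer = deque(maxlen=SEQ_LEN)
    for i, pos in enumerate(positions):
        buffer.append(pos)
        if len(buffer) < SEQ_LEN:
            continue
        # Only analyze sequences whose latest detection enters a drum.
        drum = next((n for n, r in drums.items() if inside(pos, r)), None)
        if drum is None:
            continue
        if predict(list(buffer)) >= threshold:
            yield i, drum
            buffer.clear()  # avoid re-triggering on neighbouring frames

def fake_predict(seq):
    """Toy predictor: a fast downward move (y grows) counts as a hit."""
    return 0.9 if seq[-1][1] - seq[0][1] > 0.25 else 0.1

ys = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9]
positions = [(0.5, y) for y in ys]
drums = {"snare": (0.3, 0.6, 0.7, 1.0)}
hits = list(detect_hits(positions, fake_predict, drums))
```

Without the `buffer.clear()` call, the frames right after the hit would fire again in this toy example, which is the kind of neighbouring-frame False Positive described in the first factor.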

Evolution of the Train/Validation Loss during training confirms that there has been no overfitting.

Loss