{"id":13715576,"url":"https://github.com/filippogiruzzi/voice_activity_detection","last_synced_at":"2025-05-07T04:30:54.157Z","repository":{"id":37616648,"uuid":"227330159","full_name":"filippogiruzzi/voice_activity_detection","owner":"filippogiruzzi","description":"Voice Activity Detection based on Deep Learning \u0026 TensorFlow","archived":false,"fork":false,"pushed_at":"2023-03-24T22:24:40.000Z","size":244,"stargazers_count":355,"open_issues_count":11,"forks_count":69,"subscribers_count":12,"default_branch":"master","last_synced_at":"2024-11-14T03:34:31.781Z","etag":null,"topics":["artificial-intelligence","deep-learning","deep-neural-networks","deeplearning","librispeech","librispeech-dataset","machine-learning","mfcc-features","python","resnet","speech","speech-detection","speech-recognition","tensorflow","time-series","time-series-classification","vad","voice-activity-detection"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/filippogiruzzi.png","metadata":{"files":{"readme":"ReadMe.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-12-11T09:46:37.000Z","updated_at":"2024-11-10T07:30:24.000Z","dependencies_parsed_at":"2024-01-14T22:03:30.085Z","dependency_job_id":null,"html_url":"https://github.com/filippogiruzzi/voice_activity_detection","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/filippogiruzzi%2Fvoice_activity_detection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/filippogiruzzi%2Fvoice_a
ctivity_detection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/filippogiruzzi%2Fvoice_activity_detection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/filippogiruzzi%2Fvoice_activity_detection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/filippogiruzzi","download_url":"https://codeload.github.com/filippogiruzzi/voice_activity_detection/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252813693,"owners_count":21808372,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-learning","deep-neural-networks","deeplearning","librispeech","librispeech-dataset","machine-learning","mfcc-features","python","resnet","speech","speech-detection","speech-recognition","tensorflow","time-series","time-series-classification","vad","voice-activity-detection"],"created_at":"2024-08-03T00:01:00.706Z","updated_at":"2025-05-07T04:30:53.713Z","avatar_url":"https://github.com/filippogiruzzi.png","language":"Python","funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"readme":"# Voice Activity Detection project\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/filippogiruzzi/voice_activity_detection/actions/workflows/ci.yml\" alt=\"CI\"\u003e\n        \u003cimg src=\"https://github.com/filippogiruzzi/voice_activity_detection/actions/workflows/ci.yml/badge.svg\" /\u003e\u003c/a\u003e\n    \u003ca 
href=\"https://github.com/filippogiruzzi/voice_activity_detection/actions/workflows/cd.yml\" alt=\"CD\"\u003e\n        \u003cimg src=\"https://github.com/filippogiruzzi/voice_activity_detection/actions/workflows/cd.yml/badge.svg\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/filippogiruzzi/voice_activity_detection\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/filippogiruzzi/voice_activity_detection?logo=github\" alt=\"GitHub stars\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/filippogiruzzi/voice_activity_detection\"\u003e\u003cimg src=\"https://img.shields.io/github/forks/filippogiruzzi/voice_activity_detection?logo=github\" alt=\"GitHub forks\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://hub.docker.com/repository/docker/filippogrz/tf-vad\"\u003e\u003cimg src=\"https://img.shields.io/docker/pulls/filippogrz/tf-vad?logo=docker\" alt=\"Docker Pulls\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003ccenter\u003eKeywords: Python, TensorFlow, Deep Learning, Time Series classification\u003c/center\u003e\n\n\n## Table of contents\n\n1. [ Installation ](#1-installation)  \n    1.1 [ Basic installation ](#11-basic-installation)  \n    1.2 [ Virtual environment installation ](#12-virtual-environment-installation)  \n    1.3 [ Docker installation ](#13-docker-installation)\n2. [ Introduction ](#2-introduction)  \n    2.1 [ Goal ](#21-goal)  \n    2.2 [ Results ](#22-results)  \n3. [ Project structure ](#3-project-structure)\n4. [ Dataset ](#4-dataset)\n5. 
[ Project usage ](#5-project-usage)  \n    5.1 [ Dataset automatic labeling ](#51-dataset-automatic-labeling)  \n    5.2 [ Record raw data to .tfrecord format ](#52-record-raw-data-to-tfrecord-format)  \n    5.3 [ Train a CNN to classify Speech \u0026 Noise signals ](#53-train-a-cnn-to-classify-speech--noise-signals)  \n    5.4 [ Export trained model \u0026 run inference on Test set ](#54-export-trained-model--run-inference-on-test-set)  \n6. [ Todo ](#6-todo)\n7. [ Resources ](#7-resources)\n\n## 1. Installation\n\nThis project was designed for:\n* Ubuntu 20.04\n* Python 3.7.3\n* TensorFlow 1.15.4\n\n```bash\n$ cd /path/to/project/\n$ git clone https://github.com/filippogiruzzi/voice_activity_detection.git\n$ cd voice_activity_detection/\n```\n\n### 1.1 Basic installation\n\n:warning: It is recommended to use virtual environments!\n\n```bash\n$ pyenv install 3.7.3\n$ pyenv virtualenv 3.7.3 vad-venv\n$ pyenv activate vad-venv\n```\n\n```bash\n$ pip install -r requirements.txt\n$ pip install -e .\n```\n\n### 1.2 Virtual environment installation\n\n### 1.3 Docker installation\n\nYou can pull the latest image from DockerHub and run Python commands inside the container:\n```bash\n$ docker pull filippogrz/tf-vad:latest\n$ docker run --rm --gpus all -v /var/run/docker.sock:/var/run/docker.sock -it --entrypoint /bin/bash -e TF_FORCE_GPU_ALLOW_GROWTH=true filippogrz/tf-vad\n```\n\nIf you want to build the docker image and run the container from scratch, run the following commands.\n\nBuild the docker image:\n```bash\n$ make build\n```\n(This might take a while.)\n\nRun the docker image:\n```bash\n$ make local-nobuild\n```\n\n## 2. Introduction\n\n### 2.1 Goal\n\nThe purpose of this project is to design and implement \na real-time Voice Activity Detection algorithm based on Deep Learning.\n\nThe designed solution is based on MFCC feature extraction and \na 1D-Resnet model that classifies whether an audio signal is \nspeech or noise.\n\n### 2.2 Results\n\n| Model | Train acc. 
| Val acc. | Test acc. |\n| :---: |:---:| :---:| :---: |\n| 1D-Resnet | 99 % | 98 % | 97 % |\n\nRaw and post-processed inference results on a test audio signal are shown below.\n\n![alt text](pics/inference_raw.png \"Raw VAD inference\")\n![alt text](pics/inference_smooth.png \"VAD inference with post-processing\")\n\n## 3. Project structure\n\nThe project `voice_activity_detection/` has the following structure:\n* `vad/data_processing/`: raw data labeling, processing, \nrecording \u0026 visualization\n* `vad/training/`: data, input pipeline, model \n\u0026 training / evaluation / prediction\n* `vad/inference/`: exporting trained model \u0026 inference\n\n## 4. Dataset\n\nPlease download the LibriSpeech ASR corpus dataset from https://openslr.org/12/, \nand extract all files to `/path/to/LibriSpeech/`.\n\nThe dataset contains approximately 1000 hours of 16 kHz read English speech \nfrom audiobooks, and is well suited for Voice Activity Detection.\n\nI automatically annotated the `test-clean` set of the dataset with a \npre-trained VAD model.\n\nPlease feel free to use the `labels/` folder and the pre-trained VAD model (only for inference) from this \n[ link ](https://drive.google.com/open?id=1ZPQ6wnMhHeE7XP5dqpAEmBAryFzESlin).\n\n## 5. 
Project usage\n\n```bash\n$ cd /path/to/project/voice_activity_detection/vad/\n```\n\n### 5.1 Dataset automatic labeling\n\nSkip this subsection if you already have the `labels/` folder, which contains annotations \nfrom a different pre-trained model.\n\n```bash\n$ python data_processing/librispeech_label_data.py --data-dir /path/to/LibriSpeech/test-clean/ --exported-model /path/to/pretrained/model/\n```\n\nThis will record the annotations into `/path/to/LibriSpeech/labels/` as \n`.json` files.\n\n### 5.2 Record raw data to .tfrecord format\n\n```bash\n$ python data_processing/data_to_tfrecords.py --data-dir /path/to/LibriSpeech/\n```\n\nThis will record the split data to `.tfrecord` format in `/path/to/LibriSpeech/tfrecords/`.\n\n### 5.3 Train a CNN to classify Speech \u0026 Noise signals\n\n```bash\n$ python training/train.py --data-dir /path/to/LibriSpeech/tfrecords/\n```\n\n### 5.4 Export trained model \u0026 run inference on Test set\n\n```bash\n$ python inference/export_model.py --model-dir /path/to/trained/model/dir/\n$ python inference/inference.py --data-dir /path/to/LibriSpeech/ --exported-model /path/to/exported/model/ --smoothing\n```\n\nThe trained model will be recorded in `/path/to/LibriSpeech/tfrecords/models/resnet1d/`. \nThe exported model will be recorded inside this directory.\n\n## 6. Todo\n\n- [ ] Compare Deep Learning model to a simple baseline\n- [ ] Train on full dataset\n- [ ] Improve data balancing\n- [ ] Add time series data augmentation\n- [ ] Study ROC curve \u0026 classification threshold\n- [ ] Add online inference\n- [ ] Evaluate quantitatively post-processing methods on the Test set\n- [ ] Add model description \u0026 training graphs\n- [ ] Add Google Colab demo\n\n## 7. 
Resources\n\n* _Voice Activity Detection for Voice User Interface_, \n[Medium](https://medium.com/linagoralabs/voice-activity-detection-for-voice-user-interface-2d4bb5600ee3)\n* _Deep learning for time series classification: a review_,\nFawaz et al., 2018, [Arxiv](https://arxiv.org/abs/1809.04356)\n* _Time Series Classification from Scratch \nwith Deep Neural Networks: A Strong Baseline_, Wang et al., 2016,\n[Arxiv](https://arxiv.org/abs/1611.06455)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffilippogiruzzi%2Fvoice_activity_detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffilippogiruzzi%2Fvoice_activity_detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffilippogiruzzi%2Fvoice_activity_detection/lists"}