{"id":13416258,"url":"https://github.com/astorfi/lip-reading-deeplearning","last_synced_at":"2025-04-08T04:14:36.464Z","repository":{"id":41168892,"uuid":"94938983","full_name":"astorfi/lip-reading-deeplearning","owner":"astorfi","description":":unlock: Lip Reading - Cross Audio-Visual Recognition using 3D Architectures","archived":false,"fork":false,"pushed_at":"2022-11-07T13:58:59.000Z","size":113239,"stargazers_count":1861,"open_issues_count":7,"forks_count":327,"subscribers_count":55,"default_branch":"master","last_synced_at":"2025-04-01T03:34:34.218Z","etag":null,"topics":["3d-convolutional-network","computer-vision","deep-learning","speech-recognition","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/astorfi.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":"CONTRIBUTING.rst","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null},"funding":{"github":["astorfi"]}},"created_at":"2017-06-20T22:05:34.000Z","updated_at":"2025-03-24T19:38:10.000Z","dependencies_parsed_at":"2023-01-21T11:16:12.111Z","dependency_job_id":null,"html_url":"https://github.com/astorfi/lip-reading-deeplearning","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2Flip-reading-deeplearning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2Flip-reading-deeplearning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2Flip-reading-deeplearning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2Flip-reading-deeplearning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/astorfi","download_url":"https://codeload.github.com/astorfi/lip-reading-deeplearning/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247773719,"owners_count":20993639,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3d-convolutional-network","computer-vision","deep-learning","speech-recognition","tensorflow"],"created_at":"2024-07-30T21:00:56.116Z","updated_at":"2025-04-08T04:14:36.436Z","avatar_url":"https://github.com/astorfi.png","language":"Python","readme":"===========================================================================================================================\nLip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks - `Official Project Page`_\n===========================================================================================================================\n\n.. 

.. image:: https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat
    :target: https://github.com/astorfi/3D-convolutional-Audio-Visual/pulls
.. image:: https://badges.frapsoft.com/os/v2/open-source.svg?v=102
    :target: https://github.com/ellerbrock/open-source-badge/
.. image:: https://coveralls.io/repos/github/astorfi/3D-convolutional-Audio-Visual/badge.svg?branch=master
    :target: https://coveralls.io/github/astorfi/3D-convolutional-Audio-Visual?branch=master
.. image:: https://zenodo.org/badge/94938983.svg
    :target: https://zenodo.org/badge/latestdoi/94938983
.. image:: https://img.shields.io/twitter/follow/amirsinatorfi.svg?label=Follow&style=social
    :target: https://twitter.com/amirsinatorfi

This repository contains the code, developed with TensorFlow_, for the following paper:

| `3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition`_,
| by: `Amirsina Torfi`_

.. _3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition: http://ieeexplore.ieee.org/document/8063416/
.. _TensorFlow: https://www.tensorflow.org/
.. _Official Project Page: https://codeocean.com/2017/07/14/3d-convolutional-neural-networks-for-audio-visual-recognition/code
.. _Amirsina Torfi: https://astorfi.github.io/
.. _Seyed Mehdi Iranmanesh: http://community.wvu.edu/~seiranmanesh/
.. _Nasser M. Nasrabadi: http://nassernasrabadi.wixsite.com/mysite

.. |im1| image:: readme_images/1.gif
.. |im2| image:: readme_images/2.gif
.. |im3| image:: readme_images/3.gif

|im1| |im2| |im3|

The input pipeline must be prepared by the user. This code provides the implementation of **Coupled 3D Convolutional Neural Networks** for
audio-visual matching; **lip reading** is one specific application of this work.

If you use this code, please consider citing the following paper:

.. code:: bibtex

  @article{torfi20173d,
    title={3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition},
    author={Torfi, Amirsina and Iranmanesh, Seyed Mehdi and Nasrabadi, Nasser and Dawson, Jeremy},
    journal={IEEE Access},
    year={2017},
    publisher={IEEE}
  }

#################
Table of Contents
#################
.. contents::
  :local:
  :depth: 3


-----
DEMO
-----

~~~~~~~~~~~~~~~~~~~~~~~~
Training/Evaluation DEMO
~~~~~~~~~~~~~~~~~~~~~~~~

|training|

.. |training| image:: readme_images/liptrackingdemo.png
    :target: https://asciinema.org/a/kXIDzZt1UzRioL1gDPzOy9VkZ

~~~~~~~~~~~~~~~~~
Lip Tracking DEMO
~~~~~~~~~~~~~~~~~

|liptrackingdemo|

.. |liptrackingdemo| image:: readme_images/liptrackingdemo.png
    :target: https://asciinema.org/a/RiZtscEJscrjLUIhZKkoG3GVm

--------------
General View
--------------

*Audio-visual recognition* (AVR) has been considered a
solution for speech recognition tasks when the audio is
corrupted, as well as a visual recognition method used
for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the
information extracted from one modality to improve the recognition ability of
the other modality by complementing the missing information.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Problem and the Approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The essential problem is to find the correspondence between the audio and visual streams, which is the goal
of this work.
**We propose a coupled 3D Convolutional Neural Network (CNN) architecture that maps
both modalities into a representation space and evaluates the correspondence of audio-visual streams using the learned
multimodal features**.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to leverage 3D Convolutional Neural Networks?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The proposed architecture incorporates spatial and temporal information jointly to
effectively find the correlation between the temporal information of the different modalities. Using a relatively small network architecture and a much
smaller dataset, our proposed
method surpasses the performance of existing similar
methods for audio-visual matching that use CNNs for
feature representation. We also demonstrate that an effective
pair selection method can significantly increase the performance.


--------------------
Code Implementation
--------------------

The input pipeline must be provided by the user. The rest of the implementation assumes a dataset
that contains the utterance-based extracted features.

~~~~~~~~~~~~~
Lip Tracking
~~~~~~~~~~~~~

For lip tracking, the desired video must be fed as the input. First, cd to the
corresponding directory:

.. code:: shell

    cd code/lip_tracking

Then run the dedicated ``python file`` as below:

.. code:: shell

    python VisualizeLip.py --input input_video_file_name.ext --output output_video_file_name.ext

Running the aforementioned script extracts the lip motions by saving the mouth
area of each frame and creates the output video with a rectangle around the
mouth area for better visualization.

The required ``arguments`` are defined in the ``VisualizeLip.py`` file as follows:

.. code:: python

  import argparse

  # Command-line arguments for the lip-tracking script.
  ap = argparse.ArgumentParser()
  ap.add_argument("-i", "--input", required=True,
                  help="path to input video file")
  ap.add_argument("-o", "--output", required=True,
                  help="path to output video file")
  ap.add_argument("-f", "--fps", type=int, default=30,
                  help="FPS of output video")
  ap.add_argument("-c", "--codec", type=str, default="MJPG",
                  help="codec of output video")
  args = vars(ap.parse_args())

Some of the arguments have default values and require no further action.


~~~~~~~~~~~
Processing
~~~~~~~~~~~

In the visual section, the videos are post-processed to have a uniform frame rate of 30 f/s. Then, face tracking and mouth area extraction are performed on the videos using the
dlib library [dlib]_. Finally, all mouth areas are resized to the same size and concatenated to form the input feature
cube. The dataset does not contain any audio files; the audio is extracted from the
videos using the FFmpeg framework [ffmpeg]_. The processing pipeline is shown in the figure below.

.. image:: readme_images/processing.gif
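
Below is a minimal sketch of the visual half of this pipeline, assuming OpenCV, dlib's 68-point landmark model (the local file ``shape_predictor_68_face_landmarks.dat``, not shipped with this repository), and a 30 f/s input video. It crops a gray-scale mouth region from each frame and stacks the crops into a cube; the exact margins, smoothing, and frame selection in ``VisualizeLip.py`` may differ.

.. code:: python

  import cv2
  import dlib
  import numpy as np

  # dlib's 68-point landmark model; points 48-67 cover the mouth region.
  detector = dlib.get_frontal_face_detector()
  predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

  def mouth_cube(video_path, num_frames=9, size=(100, 60)):
      """Return a (num_frames, 60, 100) gray-scale cube of mouth crops (sketch only)."""
      cap = cv2.VideoCapture(video_path)
      crops = []
      while len(crops) < num_frames:
          ok, frame = cap.read()
          if not ok:
              break
          gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
          faces = detector(gray, 1)
          if not faces:
              continue
          landmarks = predictor(gray, faces[0])
          points = np.array([(landmarks.part(i).x, landmarks.part(i).y)
                             for i in range(48, 68)], dtype=np.int32)
          x, y, w, h = cv2.boundingRect(points)
          mouth = gray[y:y + h, x:x + w]
          crops.append(cv2.resize(mouth, size))  # cv2.resize takes (width, height)
      cap.release()
      return np.stack(crops)  # (9, 60, 100) for a 0.3 second clip at 30 f/s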
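
The audio half can be sketched in a similar way, assuming FFmpeg is available on the ``PATH`` and using SpeechPy's ``lmfe`` and ``extract_derivative_feature`` functions (the exact parameters in ``code/speech_input/input_feature.py`` may differ): the audio track is pulled out of the video, 40 MFEC features are computed over non-overlapping 20 ms windows, and the first and second derivatives are stacked as channels, which yields a 15 x 40 x 3 cube for a 0.3 second clip.

.. code:: python

  import subprocess

  import scipy.io.wavfile as wav
  import speechpy

  def speech_cube(video_path, wav_path="audio.wav"):
      """Return a (frames, 40, 3) MFEC feature cube (sketch only)."""
      # Extract a mono 16 kHz audio track from the video with FFmpeg.
      subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", wav_path],
                     check=True)
      rate, signal = wav.read(wav_path)

      # 40 log Mel-frequency energy coefficients over non-overlapping 20 ms windows.
      mfec = speechpy.feature.lmfe(signal, sampling_frequency=rate,
                                   frame_length=0.020, frame_stride=0.020,
                                   num_filters=40)

      # Stack the features with their first and second derivatives -> (frames, 40, 3).
      return speechpy.feature.extract_derivative_feature(mfec)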

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Input Pipeline for this work
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The proposed architecture utilizes two non-identical ConvNets which take a pair of speech and video
streams. The network input is a pair of features that represent lip movement and
speech features extracted from 0.3 seconds of a video clip. The main task is to determine whether a
stream of audio corresponds to a lip motion clip within the desired stream duration. In the next two sub-sections,
we explain the inputs for the speech and visual streams.


**Speech Net**


On the time axis, the temporal features are non-overlapping
20ms windows which are used for the generation of spectrum features
that possess a local characteristic.
The input speech feature map, which is represented as an image cube,
corresponds to the spectrogram
as well as the first and second order derivatives of the
MFEC features. These three channels correspond to the image depth. Collectively, from a 0.3 second
clip, 15 temporal feature sets (each
formed of 40 MFEC features) can be derived, which form a
speech feature cube. Each input feature map for a single audio stream thus has a dimensionality of 15 × 40 × 3.
This representation is depicted in the following figure:

.. image:: readme_images/Speech_GIF.gif

The **speech features** have been extracted using the [SpeechPy]_ package.

**Please refer to** ``code/speech_input/input_feature.py`` **to get an idea of how the input pipeline works.**

**Visual Net**

The frame rate of each video clip used in this effort is 30 f/s.
Consequently, 9 successive image frames form the 0.3 second visual stream.
The input of the visual stream of the network is a cube of size 9x60x100,
where 9 is the number of frames that represent the temporal information. Each
channel is a 60x100 gray-scale image of the mouth region.

.. image:: readme_images/lip_motion.jpg


~~~~~~~~~~~~
Architecture
~~~~~~~~~~~~

The architecture is a **coupled 3D convolutional neural network** in which *two
different networks with different sets of weights must be trained*.
For the visual network, the spatial information of the lip motions alongside the temporal information is
incorporated jointly and fused for exploiting the temporal
correlation. For the audio network, the extracted energy features are
considered as a spatial dimension, and the stacked audio frames form the
temporal dimension. In the proposed 3D CNN architecture, the convolutional operations
are performed on successive temporal frames for both audio-visual streams.

.. image:: readme_images/DNN-Coupled.png
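
For orientation only, here is a minimal sketch of how such a coupled pair of towers could be wired up in TensorFlow/Keras: two networks with separate weights embed the 15 × 40 × 3 speech cube and the 9 × 60 × 100 visual cube into a common representation space, and a contrastive loss on the pair distance scores the correspondence. The layer sizes, the margin, and the helper names are illustrative assumptions rather than the configuration used in ``train.py``; for brevity the speech tower below uses 2D convolutions, whereas the paper applies 3D convolutions to both streams.

.. code:: python

  import tensorflow as tf

  def speech_tower():
      # Speech cube: 15 temporal windows x 40 MFEC features x 3 channels (feature, delta, delta-delta).
      return tf.keras.Sequential([
          tf.keras.Input(shape=(15, 40, 3)),
          tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
          tf.keras.layers.MaxPooling2D(2),
          tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
          tf.keras.layers.GlobalAveragePooling2D(),
          tf.keras.layers.Dense(128),
      ])

  def visual_tower():
      # Visual cube: 9 frames x 60 x 100 gray-scale mouth crops, with an explicit channel axis.
      return tf.keras.Sequential([
          tf.keras.Input(shape=(9, 60, 100, 1)),
          tf.keras.layers.Conv3D(32, 3, padding="same", activation="relu"),
          tf.keras.layers.MaxPooling3D((1, 2, 2)),
          tf.keras.layers.Conv3D(64, 3, padding="same", activation="relu"),
          tf.keras.layers.GlobalAveragePooling3D(),
          tf.keras.layers.Dense(128),
      ])

  def contrastive_loss(label, distance, margin=1.0):
      # label = 1 for a genuine (matching) audio-visual pair, 0 for an impostor pair.
      positive = label * tf.square(distance)
      negative = (1.0 - label) * tf.square(tf.maximum(margin - distance, 0.0))
      return tf.reduce_mean(positive + negative)

  # Embed a batch of pairs and score their correspondence by Euclidean distance.
  speech_net, visual_net = speech_tower(), visual_tower()
  speech_emb = speech_net(tf.random.normal([8, 15, 40, 3]))
  visual_emb = visual_net(tf.random.normal([8, 9, 60, 100, 1]))
  distance = tf.norm(speech_emb - visual_emb, axis=1)
  loss = contrastive_loss(tf.ones([8]), distance)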

----------------------
Training / Evaluation
----------------------

At first, clone the repository. Then, cd to the dedicated directory:

.. code:: shell

    cd code/training_evaluation

Finally, the ``train.py`` file must be executed:

.. code:: shell

    python train.py

For the evaluation phase, a similar script must be executed:

.. code:: shell

    python test.py


--------
Results
--------

The results below demonstrate the effect of the proposed method on the accuracy
and the speed of convergence.

.. |accuracy| image:: readme_images/accuracy-bar-pairselection.png

.. |converge| image:: readme_images/convergence-speed.png

|accuracy|

The best result, which is the right-most one, belongs to our proposed method.

|converge|

The effect of the proposed **Online Pair Selection** method is shown in the figure.

-------------
Disclaimer
-------------

The current version of the code does not contain the adaptive pair selection method proposed in the `3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition`_ paper. Only a simple pair selection with hard thresholding is included at the moment.


-------------
Contribution
-------------

We look forward to your feedback. Please help us improve the code and make
our work better. To contribute, please create a pull request and we will review it promptly.
Once again, we appreciate your feedback and code inspections.


.. rubric:: References

.. [SpeechPy] Amirsina Torfi. astorfi/speech_feature_extraction: SpeechPy. June 2017. doi: 10.5281/zenodo.810392. https://doi.org/10.5281/zenodo.810391
.. [dlib] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755-1758, 2009.
.. [ffmpeg] FFmpeg Developers. FFmpeg tool (version be1d324) [software], 2016.