{"id":25560658,"url":"https://github.com/loubnabnl/sign-segmentation-with-transformers","last_synced_at":"2025-10-20T05:28:07.699Z","repository":{"id":41880972,"uuid":"452880413","full_name":"loubnabnl/Sign-Segmentation-with-Transformers","owner":"loubnabnl","description":"Detection of temporal boundaries in sign language videos, as part of the Object Recognition \u0026 Computer Vision course in the MVA master program.","archived":false,"fork":false,"pushed_at":"2022-11-10T18:57:53.000Z","size":10054,"stargazers_count":8,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-12T05:10:26.351Z","etag":null,"topics":["cnn","computer-vision","i3d-inception-architecture","sign-language-segmentation","temporal-cnn","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/loubnabnl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-01-27T23:34:33.000Z","updated_at":"2024-01-13T16:45:30.000Z","dependencies_parsed_at":"2022-07-09T15:01:33.050Z","dependency_job_id":null,"html_url":"https://github.com/loubnabnl/Sign-Segmentation-with-Transformers","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/loubnabnl/Sign-Segmentation-with-Transformers","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/loubnabnl%2FSign-Segmentation-with-Transformers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/loubnabnl%2FSign-Segmentation-with-Transformers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/loubnabnl%2FSign-Segmentation-with-Transformers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/loubnabnl%2FSign-Segmentation-with-Transformers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/loubnabnl","download_url":"https://codeload.github.com/loubnabnl/Sign-Segmentation-with-Transformers/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/loubnabnl%2FSign-Segmentation-with-Transformers/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280018310,"owners_count":26259362,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-20T02:00:06.978Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cnn","computer-vision","i3d-inception-architecture","sign-language-segmentation","temporal-cnn","transformers"],"created_at":"2025-02-20T17:35:33.933Z","updated_at":"2025-10-20T05:28:07.649Z","avatar_url":"https://github.com/loubnabnl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Detecting Temporal Boundaries in Sign Language videos (MVA RecVis project)\n\n## Motivations \n\nSign language automatic indexing is an important challenge to develop better communication tools for the deaf community. 
However, annotated datasets for sign language are limited, and few people have the skills to annotate such data, which makes it hard to train accurate machine learning models. An important challenge is therefore to: \n\n*  Increase available training datasets. \n*  Make labeling easier for professionals to reduce the risk of bad annotations. \n\nIn this context, techniques have emerged to perform automatic sign segmentation, which marks the boundaries between individual signs in sign language videos. The development of such tools offers the potential to alleviate the limited supply of labelled datasets currently available for sign research. \n\n![demo](demo/results/demo.gif)\n\n\n## Previous work and personal contribution \n\nThis repository provides code for the final project of the Object Recognition \u0026 Computer Vision (RecVis) course. For more details, please refer to the project report `report.pdf`.\nIn this project, we first reproduced the results reported in the following paper (using the code from this [repository](https://github.com/RenzKa/sign-segmentation)):  \n\n- [Katrin Renz](https://www.katrinrenz.de), [Nicolaj C. Stache](https://www.hs-heilbronn.de/nicolaj.stache), [Samuel Albanie](https://www.robots.ox.ac.uk/~albanie/) and [Gül Varol](https://www.robots.ox.ac.uk/~gul),\n*Sign language segmentation with temporal convolutional networks*, ICASSP 2021.  [[arXiv]](https://arxiv.org/abs/2011.12986)\n\nWe used the pre-extracted frame-level features obtained by applying the I3D model to the videos to retrain the MS-TCN architecture for frame-level binary classification and reproduce the paper's results. The `tests` folder contains a notebook for reproducing the original paper's results, with a meanF1B = 68.68 on the evaluation set of the BSL Corpus. \n\nWe further implemented new models in order to improve this result. We wanted to try attention-based models, as they have recently attracted considerable interest in the vision research community. 
We first tried to train a Vanilla Transformer Encoder from scratch, but the results were not satisfactory. \n\n- [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762), Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). \n\nWe then implemented the ASFormer model (Transformer for Action Segmentation) using this [code](https://github.com/ChinaYi/ASFormer). ASFormer is a hybrid transformer model that borrows some ideas from the MS-TCN architecture. The motivations behind the model and its architecture are detailed in the following paper: \n\n- [ASFormer: Transformer for Action Segmentation](https://arxiv.org/abs/2110.08568), Fangqiu Yi, Hongyu Wen, Tingting Jiang (2021).\n\n\nWe trained this model on the I3D extracted features and obtained an improvement over the MS-TCN architecture. The results are given in the following table: \n\n| ID | Model | mF1B | mF1S |\n| --- | --- | --- | --- |\n| 1 | MS-TCN | 68.68\u003csub\u003e±0.6\u003c/sub\u003e | 47.71\u003csub\u003e±0.8\u003c/sub\u003e |\n| 2 | Transformer Encoder | 60.28\u003csub\u003e±0.3\u003c/sub\u003e | 42.70\u003csub\u003e±0.2\u003c/sub\u003e |\n| 3 | ASFormer | **69.79\u003csub\u003e±0.2\u003c/sub\u003e** | **49.23\u003csub\u003e±1.2\u003c/sub\u003e** |\n\n## Contents\n* [Setup](#setup)\n* [Data and models](#data-and-models)\n* [Demo](#demo)\n* [Training](#training)\n* [Citation](#citation)\n* [License](#license)\n* [Acknowledgements](#acknowledgements)\n\n## Setup\n\n``` bash\n# Clone this repository\ngit clone https://github.com/loubnabnl/Sign-Segmentation-with-Transformers.git\ncd Sign-Segmentation-with-Transformers/\n# Create signseg_env environment\nconda env create -f environment.yml\nconda activate signseg_env\n```\n\n\n## Data and models\nYou can download the pretrained models (I3D and MS-TCN) (`models.zip [302MB]`) and data (`data.zip [5.5GB]`) used in the experiments 
[here](https://drive.google.com/drive/folders/17DaatdfD4GRnLJJ0RX5TcSfHGMxMS0Lm?usp=sharing) or by executing `download/download_*.sh`. The unzipped `data/` and `models/` folders should be located in the root directory of the repository (for the demo, downloading the `models` folder is sufficient).\n\nYou can download our best pretrained ASFormer model weights [here](https://drive.google.com/file/d/1WZ3PR05BMbj54SAmK-TsZ1YmfuEH6DxT/view?usp=sharing).\n\n\n### Data\nPlease cite the original datasets when using the data: [BSL Corpus](https://bslcorpusproject.org/cava/acknowledgements-and-citation/).\n\nThe authors of [github.com/RenzKa/sign-segmentation](https://github.com/RenzKa/sign-segmentation) provided the pre-extracted features and metadata. See [here](data/README.md) for a detailed description of the data files. \n- Features: `data/features/*/*/features.mat`\n- Metadata: `data/info/*/info.pkl`\n\n### Models\n- I3D weights, trained for sign classification: `models/i3d/*.pth.tar`\n- MS-TCN weights for the demo (see tables below for links to the other models): `models/ms-tcn/*.model`\n- ASFormer weights of our best model: `models/asformer/*.model`\n\nThe folder structure should be as below:\n```\nsign-segmentation/models/\n  i3d/\n    i3d_kinetics_bslcp.pth.tar\n  ms-tcn/\n    mstcn_bslcp_i3d_bslcp.model\n  asformer/\n    best_asformer_bslcp.model\n```\n## Demo\nThe demo folder contains a sample script to estimate the segments of a given sign language video. Run `demo.py` to get a visualization on a sample video.\n\n```\ncd demo\npython demo.py\n```\n\nThe demo will: \n1. use the `models/i3d/i3d_kinetics_bslcp.pth.tar` pretrained I3D model to extract features,\n2. use the `models/asformer/best_asformer_model.model` pretrained ASFormer model to predict the segments from the features,\n3. save the results.\n\n## Training\nTo train I3D please refer to [github.com/RenzKa/sign-segmentation](https://github.com/RenzKa/sign-segmentation). 
To train ASFormer on the pre-extracted I3D features, run `main.py`; you can change hyperparameters through the arguments defined inside the file. Alternatively, you can run the notebook in the `test_asformer` folder.\n\n## Citation\nIf you use this code and data, please cite the following original papers:\n\n```\n@inproceedings{Renz2021signsegmentation_a,\n    author       = \"Katrin Renz and Nicolaj C. Stache and Samuel Albanie and G{\\\"u}l Varol\",\n    title        = \"Sign Language Segmentation with Temporal Convolutional Networks\",\n    booktitle    = \"ICASSP\",\n    year         = \"2021\",\n}\n```\n```\n@article{yi2021asformer,\n  title={Asformer: Transformer for action segmentation},\n  author={Yi, Fangqiu and Wen, Hongyu and Jiang, Tingting},\n  journal={arXiv preprint arXiv:2110.08568},\n  year={2021}\n}\n```\n## License\nThe license in this repository only covers the code. For data.zip and models.zip we refer to the terms and conditions of the original datasets.\n\n\n## Acknowledgements\nThe code builds on the [github.com/RenzKa/sign-segmentation](https://github.com/RenzKa/sign-segmentation) and [github.com/ChinaYi/ASFormer](https://github.com/ChinaYi/ASFormer) repositories. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Floubnabnl%2Fsign-segmentation-with-transformers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Floubnabnl%2Fsign-segmentation-with-transformers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Floubnabnl%2Fsign-segmentation-with-transformers/lists"}