{"id":18439510,"url":"https://github.com/idiap/atco2-corpus","last_synced_at":"2025-04-07T21:32:35.346Z","repository":{"id":62982138,"uuid":"564199144","full_name":"idiap/atco2-corpus","owner":"idiap","description":"A Corpus for Research on Robust Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications","archived":false,"fork":false,"pushed_at":"2023-03-24T07:35:04.000Z","size":5134,"stargazers_count":57,"open_issues_count":2,"forks_count":5,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-23T01:02:27.724Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/idiap.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-10T07:51:21.000Z","updated_at":"2025-03-22T23:06:15.000Z","dependencies_parsed_at":"2024-11-06T06:27:50.631Z","dependency_job_id":"cf2633e4-5fc6-4617-a817-47d7747d1821","html_url":"https://github.com/idiap/atco2-corpus","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fatco2-corpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fatco2-corpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fatco2-corpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fatco2-corpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/idiap","download_url":"https://codeload.github.com/idiap/atco2-corpus/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247732700,"owners_count":20986907,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T06:25:12.875Z","updated_at":"2025-04-07T21:32:30.329Z","avatar_url":"https://github.com/idiap.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications \n\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/idiap/atco2-corpus/blob/master/LICENSE\"\u003e\n        \u003cimg alt=\"GitHub\" src=\"https://img.shields.io/badge/License-MIT-green.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/idiap/atco2-corpus\"\u003e\n        \u003cimg alt=\"GitHub\" src=\"https://img.shields.io/badge/GitHub-Open%20source-green\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/psf/black\"\u003e\n        \u003cimg alt=\"Black\" src=\"https://img.shields.io/badge/code%20style-black-000000.svg\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://huggingface.co/Jzuluaga/wav2vec2-large-960h-lv60-self-en-atc-uwb-atcc-and-atcosim\"\u003e\n        \u003cimg alt=\"GitHub\" src=\"https://img.shields.io/badge/%F0%9F%A4%97-ASR%20on%20Hub-yellow\"\u003e\n    \u003c/a\u003e\n    \u003ca 
href=\"https://huggingface.co/Jzuluaga/bert-base-ner-atc-en-atco2-1h\"\u003e\n        \u003cimg alt=\"GitHub\" src=\"https://img.shields.io/badge/%F0%9F%A4%97-NER%20on%20Hub-yellow\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://huggingface.co/Jzuluaga/bert-base-token-classification-for-atc-en-uwb-atcc\"\u003e\n        \u003cimg alt=\"GitHub\" src=\"https://img.shields.io/badge/%F0%9F%A4%97-Speaker_role%20on%20Hub-yellow\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\nCode for the paper [ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications](https://arxiv.org/abs/2211.04054).\n\n\u003cdetails\u003e\n  \u003csummary markdown=\"span\"\u003e\u003cb\u003ePersonal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world\u003c/b\u003e. A clear example is air traffic control (ATC) communications....\u003c/summary\u003e  \n  \n    ATC aims at guiding aircraft and controlling the \n    airspace in a safe and optimal manner. These voice-based dialogues \n    are carried between an air traffic controller (ATCO) and pilots via \n    very-high frequency radio channels. In order to incorporate these \n    novel technologies into ATC (low-resource domain), large-scale \n    annotated datasets are required to develop the data-driven AI \n    systems. Two examples are automatic speech recognition (ASR) and \n    natural language understanding (NLU). In this paper, we introduce the \n    ATCO2 corpus, a dataset that aims at fostering research on the \n    challenging ATC field, which has lagged behind due to lack of \n    annotated data. The ATCO2 corpus covers 1) data collection and pre-\n    processing, 2) pseudo-annotations of speech data, and 3) extraction \n    of ATC-related named entities. The ATCO2 corpus is split into three \n    subsets. 
1) ATCO2-test-set corpus contains 4 hours of ATC speech with \n    manual transcripts and a subset with gold annotations for named-\n    entity recognition (callsign, command, value). 2) The ATCO2-PL-set \n    corpus consists of 5281 hours of unlabeled ATC data enriched with \n    automatic transcripts from an in-domain speech recognizer, contextual \n    information, speaker turn information, signal-to-noise ratio estimate \n    and English language detection score per sample. Both available for \n    purchase through ELDA at this http URL. 3) The ATCO2-test-set-1h \n    corpus is a one-hour subset from the original test set corpus, that \n    we are offering for free at this https URL. We expect the ATCO2 \n    corpus will foster research on robust ASR and NLU not only in the \n    field of ATC communications but also in the general research \n    community.\n\u003c/details\u003e\n\n\u003cp align=\"center\"\u003e\n \u003cfigure\u003e\n  \u003cimg src=\"data/imgs/atco2_corpus_info.png\" alt=\"Our system\" style=\"width:90%\"\u003e\n  \u003cfigcaption\u003e ATCO2 corpus ecosystem. Blue circles denote annotations only available for ATCO2 test set corpus. Green circles denote annotations and metadata available for both ATCO2 test set and ATCO2 pseudo-labeled corpus sets. 
\u003c/figcaption\u003e\n\u003c/figure\u003e \n\u003c/p\u003e\n\n\n**Repository written by**: [Juan Pablo Zuluaga](https://juanpzuluaga.github.io/).\n\n---\n## Table of Contents\n- [Preparing Environment](#preparing-environment)\n- [Usage](#usage)\n    - [Download the Data](#download-the-data)\n    - [Training one model](#training-one-model)\n    - [Train baselines](#train-baselines)\n    - [Train your LM with KenLM (optional)](#train-your-lm-with-kenlm-optional)\n    - [Evaluate models (optional)](#evaluate-models-optional)\n- [Related work](#related-work)\n- [Cite us](#how-to-cite-us)\n\n# Preparing Environment\n\nThe first step is to create an environment with the packages required for data preparation, formatting, and running the experiments. You can run the following commands to create the conda environment (assuming CUDA 11.7):\n\n- Step 1: Using `python 3.10`: install python and the requirements\n\n```bash\n# clone this repository and move into its root folder\ngit clone https://github.com/idiap/atco2-corpus\ncd atco2-corpus\nconda create -n atco2_corpus python==3.10\nconda activate atco2_corpus\npython -m pip install -r requirements.txt\n```\n\nBefore running any script, make sure the `en_US` locale is set and `PYTHONPATH` includes the repository root folder.\n\n```bash\nexport LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8\nexport PYTHONPATH=$PYTHONPATH:$(pwd) # assuming you are in the repository root folder\n```\n\n# Usage\n\nThere are several steps to replicate/use our proposed models:\n\n## Out-of-the-box model on HuggingFace\n\n\n# What can you do with the ATCO2 corpus? \n\n### Automatic Speech Recognition\n\n- This system produces a text transcript of what was said in the ATC communication. Its output is normally used by the systems below.\n\n\n### Speaker Role Identification\n\n\n- With this module, you can detect who is speaking (ATCO or pilot) in a given communication.\n\n### Named-Entity Recognition\n\n- Here, the aim is to understand what was said in the communication. 
With the ATCO2 corpus, you can train a system that detects callsigns, commands, and values in the communication.\n\n\n\n\n# Related work\n\nHere is a list of papers related to AI/ML for air traffic control communications:\n\n- Fine-tuning a pretrained BERT model on the named entity recognition task to perform text-based diarization for ATC communications: \n    - paper: [BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications](https://arxiv.org/abs/2110.05781)\n    - code: https://github.com/idiap/bert-text-diarization-atc\n- Fine-tuning a pretrained [Wav2vec 2.0](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec) model for automatic speech recognition: \n    - paper: [How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications](https://arxiv.org/abs/2203.16822)\n    - code: https://github.com/idiap/w2v2-air-traffic\n\n- How to use contextual data (biasing) in ATC automatic speech recognition:\n    - paper: [A two-step approach to leverage contextual data: speech recognition in air-traffic communications](https://arxiv.org/abs/2202.03725)\n- Ethics in the collection of ATC audio data: [Legal and Ethical Challenges in Recording Air Traffic Control Speech](https://aclanthology.org/2022.legal-1.14/)\n\n\nSome other papers:\n\n- [Boosting of contextual information in ASR for air-traffic call-sign recognition](http://www.fit.vutbr.cz/research/groups/speech/publi/2021/kocour21_interspeech.pdf)\n- [Grammar Based Identification Of Speaker Role For Improving ATCO And Pilot ASR](https://arxiv.org/abs/2108.12175)\n- [Contextual Semi-Supervised Learning: An Approach To Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems](https://arxiv.org/abs/2104.03643)\n- [Automatic Processing Pipeline for Collecting and Annotating Air-Traffic Voice Communication Data](https://www.mdpi.com/2673-4591/13/1/8)\n- [Automatic 
Call Sign Detection: Matching Air Surveillance Data with Air Traffic Spoken Communications](https://www.mdpi.com/2504-3900/59/1/14)\n- [Improving callsign recognition with air-surveillance data in air-traffic communication](https://arxiv.org/abs/2108.12156)\n- [Automatic Speech Recognition Benchmark for Air-Traffic Communications](https://arxiv.org/abs/2006.10304)\n\n\n---\n# How to cite us\n\nIf you use this code for your research, please cite our papers with the following bibtex items:\n\n```\n# article 1 - MAIN\n@article{zuluaga2022atco2,\n  title={ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications},\n  author={Zuluaga-Gomez, Juan and Vesel{\\'y}, Karel and Sz{\\\"o}ke, Igor and Motlicek, Petr and others},\n  journal={arXiv preprint arXiv:2211.04054},\n  year={2022}\n}\n\n# article 2 - Mainly on ASR\n@inproceedings{zuluaga2023does,\n  title={How does pre-trained Wav2Vec 2.0 perform on domain-shifted ASR? 
An extensive benchmark on air traffic control communications},\n  author={Zuluaga-Gomez, Juan and Prasad, Amrutha and Nigmatulina, Iuliia and Sarfjoo, Seyyed Saeed and Motlicek, Petr and Kleinert, Matthias and Helmke, Hartmut and Ohneiser, Oliver and Zhan, Qingran},\n  booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},\n  pages={205--212},\n  year={2023},\n  organization={IEEE}\n}\n\n# article 3 - Mainly on sequence classification and BERT  \n@inproceedings{zuluaga2023bertraffic,\n  title={Bertraffic: Bert-based joint speaker role and speaker change detection for air traffic control communications},\n  author={Zuluaga-Gomez, Juan and Sarfjoo, Seyyed Saeed and Prasad, Amrutha and Nigmatulina, Iuliia and Motlicek, Petr and Ondrej, Karel and Ohneiser, Oliver and Helmke, Hartmut},\n  booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},\n  pages={633--640},\n  year={2023},\n  organization={IEEE}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidiap%2Fatco2-corpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fidiap%2Fatco2-corpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidiap%2Fatco2-corpus/lists"}