{"id":18720415,"url":"https://github.com/qcri/arabic_speech_code_switching","last_synced_at":"2026-01-25T21:37:06.652Z","repository":{"id":54347495,"uuid":"165069913","full_name":"qcri/Arabic_speech_code_switching","owner":"qcri","description":"The first Dialectal Arabic Code Switching - DACS corpus from broadcast speech. Annotated at the token-level, considering both the linguistic and the acoustic cues. This dataset is a potential benchmark for DCS in spontaneous speech.","archived":false,"fork":false,"pushed_at":"2022-04-03T12:53:12.000Z","size":274064,"stargazers_count":14,"open_issues_count":1,"forks_count":1,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-05-19T14:52:26.000Z","etag":null,"topics":["acoustic","arabic","asr","codeswitching","dialect-identification","egyptian","evaluation","lexical","mordern-standard-arabic"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qcri.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-01-10T14:02:32.000Z","updated_at":"2024-08-26T09:19:04.000Z","dependencies_parsed_at":"2022-08-13T12:50:36.695Z","dependency_job_id":null,"html_url":"https://github.com/qcri/Arabic_speech_code_switching","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/qcri/Arabic_speech_code_switching","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qcri%2FArabic_speech_code_switching","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qcri%2FArabic_speech_code_switching/tags","releases_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/qcri%2FArabic_speech_code_switching/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qcri%2FArabic_speech_code_switching/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/qcri","download_url":"https://codeload.github.com/qcri/Arabic_speech_code_switching/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qcri%2FArabic_speech_code_switching/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28759417,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T20:56:06.009Z","status":"ssl_error","status_checked_at":"2026-01-25T20:54:48.203Z","response_time":113,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["acoustic","arabic","asr","codeswitching","dialect-identification","egyptian","evaluation","lexical","mordern-standard-arabic"],"created_at":"2024-11-07T13:30:56.878Z","updated_at":"2026-01-25T21:37:06.634Z","avatar_url":"https://github.com/qcri.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- # Dialectal-Arabic-Code-Switching-Dataset--\u003e\n\n# Dialectal Arabic Code-Switching Dataset\n\nThis release includes the annotated two-hour Egyptian dataset from the ADI-5 development split in the MGB-3 challenge [1].\nThe released MGB-3 data includes speech features 
and textual features extracted from ASR transcription.\n\nUnlike MGB-3:EGY, in this dataset the audio is *manually segmented* into smaller utterances (with silences of 500 msec or more) and the speech is *transcribed* verbatim by a lay native Egyptian speaker.\n\nThe transcribed data is then annotated for word-level Code-Switching (CS) information by 3 annotators. Following the guidelines described in the paper,\nthe annotators were asked to classify each word into one of the following four categories:\n(i) *MSA*: MSA word with MSA pronunciation; (ii) *EGY*: Egyptian word; (iii) *MIX*: MSA word with dialectal pronunciation; and (iv) *FRN*: Foreign word, i.e., not Arabic.\nIn addition, a 'NULL' tag was assigned when the word is unintelligible or cannot be assigned to one of the four labels.\n\nMore details are in the paper:\n\n```\n@inproceedings{chowdhury2020cs,\n  title={Effects of Dialectal Code-Switching on Speech Modules: A Study using Egyptian Arabic Broadcast Speech},\n  author={Chowdhury, Shammur Absar and Samih, Younes and Eldesouki, Mohamed and Ali, Ahmed},\n  booktitle={INTERSPEECH},\n  year={2020}\n}\n```\n\nAlso available as a [Paper](http://www.interspeech2020.org/uploadfile/pdf/Wed-1-10-5.pdf)\n\n## Data Format\n*DACS_word_level.feat*\nThe input data, containing words and their corresponding labels, are presented in `DACS_word_level.feat`. 
The file contains the following fields (space separated), including\n`#id word_index_in_sentence word word_start word_duration word_end label1 label2 label3`\n\nwhere\n`#id` is the corresponding wav id\n\n`word_index_in_sentence` indicates the position of the word in the utterance.\n\n`word` is the manually transcribed word (in Buckwalter transliteration format).\n\n`word_start` is the start time of the word in secs.\n\n`word_duration` is the duration of the word in secs.\n\n`word_end` is the end time of the word in secs.\n\n`phone phone_conf phone_start phone_duration phone_end` give the same information for each phone (force-aligned).\n\n`label[1-3]` is the annotation label provided by annotator [1-3].\n\n*segments_dacs*\nThis file describes how the MGB-3:EGY audio was manually segmented into utterances. The file includes:\n`segmented_id audio_id segment_start segment_end`\n\nwhere\n`segmented_id` is the wav id of the utterance\n\n`audio_id` is the id of the original audio file from MGB-3:EGY.\n\n`segment_start/end` are the start and end times of the segmented utterance.\n\n*mgb3_audio_list.txt*\nA list of the audio files (MGB-3:EGY) used for this dataset. The files can be downloaded directly given the URL of the audio.\n\n\n\n\n[1] Ali, Ahmed, Stephan Vogel, and Steve Renals. \"Speech recognition challenge in the wild: Arabic MGB-3.\" 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017.\n\n\n\u003c!-- booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC'20)}, --\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqcri%2Farabic_speech_code_switching","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqcri%2Farabic_speech_code_switching","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqcri%2Farabic_speech_code_switching/lists"}