{"id":41848006,"url":"https://github.com/camel-lab/camel_arabic_frequency_lists","last_synced_at":"2026-01-25T10:05:41.388Z","repository":{"id":246647086,"uuid":"820400411","full_name":"CAMeL-Lab/Camel_Arabic_Frequency_Lists","owner":"CAMeL-Lab","description":"The repository for the CAMeL Arabic Frequency Lists dataset","archived":false,"fork":false,"pushed_at":"2025-02-15T16:31:08.000Z","size":19,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-09-09T22:06:21.934Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CAMeL-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-06-26T11:50:24.000Z","updated_at":"2025-02-22T22:15:39.000Z","dependencies_parsed_at":"2025-09-09T20:33:35.490Z","dependency_job_id":"a0874c04-8dec-4651-91c3-f1277f9e0367","html_url":"https://github.com/CAMeL-Lab/Camel_Arabic_Frequency_Lists","commit_stats":null,"previous_names":["camel-lab/camel_arabic_frequency_lists"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/CAMeL-Lab/Camel_Arabic_Frequency_Lists","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FCamel_Arabic_Frequency_Lists","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FCamel_Arabic_Frequency_Lists/tags","releases_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FCamel_Arabic_Frequency_Lists/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FCamel_Arabic_Frequency_Lists/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CAMeL-Lab","download_url":"https://codeload.github.com/CAMeL-Lab/Camel_Arabic_Frequency_Lists/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FCamel_Arabic_Frequency_Lists/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28751113,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T09:58:17.166Z","status":"ssl_error","status_checked_at":"2026-01-25T09:55:56.104Z","response_time":113,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-25T10:05:40.501Z","updated_at":"2026-01-25T10:05:41.373Z","avatar_url":"https://github.com/CAMeL-Lab.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# CAMeL_Arabic_Frequency_Lists\n\n## Summary\nThe CAMeL Arabic Frequency Lists dataset is derived from the pretraining datasets used to pretrain the family of [CAMeLBERT models](https://huggingface.co/collections/CAMeL-Lab/camelbert-653f42bfcbc8ae32a51a692d) (16.1M unique word types / 17.3B word 
tokens). Three main varieties of Arabic were used: Classical Arabic (CA), Dialectal Arabic (DA), and Modern Standard Arabic (MSA).\n\n**To download, please click on one of the links below**:\n- [CA_freq_lists.tsv.zip](https://github.com/CAMeL-Lab/Camel_Arabic_Frequency_Lists/releases/download/v1.0/CA_freq_lists.tsv.zip): Classical Arabic frequency list.\n- [DA_freq_lists.tsv.zip](https://github.com/CAMeL-Lab/Camel_Arabic_Frequency_Lists/releases/download/v1.0/DA_freq_lists.tsv.zip): Dialectal Arabic (mixed dialects) frequency list.\n- [MSA_freq_lists.tsv.zip](https://github.com/CAMeL-Lab/Camel_Arabic_Frequency_Lists/releases/download/v1.0/MSA_freq_lists.tsv.zip): Modern Standard Arabic frequency list.\n- [MIX_freq_lists.tsv.zip](https://github.com/CAMeL-Lab/Camel_Arabic_Frequency_Lists/releases/download/v1.0/MIX_freq_lists.tsv.zip): Combined CA+DA+MSA frequency list (a union of all three sets with frequencies aggregated).\n\nFor details about the different genres and sources of the data, please refer to the CAMeLBERT paper [here](https://aclanthology.org/2021.wanlp-1.10/).\n\nEach of the frequency list files contains the unique Arabic-only word types along with their frequencies as they appeared in the pretraining data. We excluded digits, punctuation, and non-Arabic script tokens.\n\nAll files are tab-separated, with the first column being the word in Arabic script and the second column being the frequency (note that due to the mixed text direction, the column order may be *displayed* in reverse). 
See the following example:\n\n- Examples from CA: out of 2.4M unique word types from a corpus of 847M word tokens.\n```\nفي\t16664531\nمن\t15695517\nبن\t13571947\nالله\t11433931\nعن\t9140820\n...\nالمستعان\t6285\nالورقة\t6284\nالروياني\t6284\nالثريا\t6283\nيسافر\t6283\n\n....\nفكعمرة\t4\nفكعرض\t4\nفكضامن\t4\nفكرؤيته\t4\nفكتفريق\t4\n```\n\n- Examples from DA: out of 6.7M unique word types from a corpus of 5.8B word tokens.\n```\nمن\t127245884\nفي\t101567242\nالله\t72525262\nعلي\t65410197\nلا\t52420507\n...\nقضيته\t70256\nدره\t70235\nتعطيك\t70226\nتهديد\t70216\nالاوراق\t70213\n...\nهالمكااان\t35\nهالشوز\t35\nهالرغد\t35\nهالثبات\t35\nنننس\t35\n\n```\n\n- Examples from MSA: out of 11.4M unique word types from a corpus of 12.6B word tokens.\n```\nفي\t255725161\nمن\t205864175\nعلى\t122591931\nو\t68783652\nأن\t64519408\n...\nالسائل\t128423\nثانوى\t128420\nالحيوانية\t128417\nنزيف\t128393\nعصابة\t128386\n...\nسهرن\t52\nستنسيه\t52\nستمتلكه\t52\nستكفينا\t52\nستضره\t52\n\n```\n\n- Examples from MIX: out of 16.1M unique word types from a corpus of 17.3B word tokens.\n```\nفي\t373956934\nمن\t348805576\nعلى\t132084198\nو\t121102569\nالله\t111745498\n...\nوفدا\t213505\nالمنافقين\t213483\nالبيلاروسي\t213461\nالطيبين\t213441\nاساسي\t213409\n...\nكهلون\t91\nكفعال\t91\nكعروة\t91\nكالوفرة\t91\nكالمستهزىء\t91\n\n```\n\n## Citation\n```\n@software{Khalifa:2021:Camel_Frequency,\nauthor = {Khalifa, Salam and Inoue, Go and Alhafni, Bashar and Baimukan, Nurpeiis and Bouamor, Houda and Habash, Nizar},\ntitle = {{Camel Arabic Frequency Lists }},\nyear = 2021,\nurl = {https://github.com/CAMeL-Lab/Camel_Arabic_Frequency_Lists}\n}\n```\n\n## Contributors\n- Salam Khalifa\n- CAMeLBERT paper 
authors\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamel-lab%2Fcamel_arabic_frequency_lists","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcamel-lab%2Fcamel_arabic_frequency_lists","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamel-lab%2Fcamel_arabic_frequency_lists/lists"}