{"id":41847753,"url":"https://github.com/camel-lab/gumar-ngrams","last_synced_at":"2026-01-25T10:04:47.891Z","repository":{"id":131318182,"uuid":"161309255","full_name":"CAMeL-Lab/Gumar-Ngrams","owner":"CAMeL-Lab","description":"The complete [1 to 5]-gram Gumar Corpus in the style of Google n-grams. ","archived":false,"fork":false,"pushed_at":"2020-02-05T08:57:48.000Z","size":59,"stargazers_count":10,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-09T22:06:34.687Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CAMeL-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2018-12-11T09:25:57.000Z","updated_at":"2024-11-16T13:24:35.000Z","dependencies_parsed_at":null,"dependency_job_id":"e8396eef-66e3-415a-b470-76d09a9cf6b5","html_url":"https://github.com/CAMeL-Lab/Gumar-Ngrams","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/CAMeL-Lab/Gumar-Ngrams","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FGumar-Ngrams","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FGumar-Ngrams/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FGumar-Ngrams/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
CAMeL-Lab%2FGumar-Ngrams/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CAMeL-Lab","download_url":"https://codeload.github.com/CAMeL-Lab/Gumar-Ngrams/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2FGumar-Ngrams/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28751106,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T09:58:17.166Z","status":"ssl_error","status_checked_at":"2026-01-25T09:55:56.104Z","response_time":113,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-25T10:04:47.784Z","updated_at":"2026-01-25T10:04:47.850Z","avatar_url":"https://github.com/CAMeL-Lab.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# The Gumar Corpus N-grams\n\n\u003e Copyright © 2017-2018 New York University Abu Dhabi\n\u003e\n\u003e Computational Approaches to Modeling Language (CAMeL) Lab\n\n## About\n\nWe present the Gumar Corpus n-grams.\nThe n-grams are generated from the\n[Gumar Corpus](https://camel.abudhabi.nyu.edu/gumar/), a large-scale corpus of\nGulf Arabic containing more than 100 million words [1,2].\nThe n-grams cover orders 1 through 5 (1-grams to 5-grams), together with their\nrespective frequency counts and the number of documents they 
appear in.\nThe n-grams are counted across the entire corpus and also across each dialect\ncategory individually.\nThe n-gram files follow a format similar to that of Google n-grams, except for\nthe year column, which we do not produce.\n\n## Preprocessing\n\n* All documents of the corpus are converted into plain text.\n* Basic UTF-8 character cleaning.\n* Punctuation separation.\n\n## Dialect Categorization\n\nBelow are categorizations of the dialects and their respective document counts.\nFor specific information per document please refer to the spreadsheet included\nwith this package.\n\n| Tag      | Dialect                                      | Document Count |\n|:--------:|:--------------------------------------------:|:--------------:|\n| SA       | Saudi                                        | 770            |\n| AE       | Emirati                                      | 115            |\n| KW       | Kuwaiti                                      | 87             |\n| OM       | Omani                                        | 14             |\n| QA       | Qatari                                       | 10             |\n| BA       | Bahraini                                     | 8              |\n| MSA      | Modern Standard Arabic                       | 82             |\n| EGY      | Egyptian                                     | 3              |\n| LEV      | Levantine                                    | 5              |\n| MOR      | Moroccan                                     | 1              |\n| IRQ      | Iraqi                                        | 5              |\n| YEM      | Yemeni                                       | 1              |\n| UNID_GA  | Unidentified Gulf Arabic                     | 116            |\n| MIXED_GA | Mixed Gulf Arabic                            | 11             |\n| MIXED    | Gulf Arabic mixed with other Arabic dialects | 4              |\n\n## Download\n\nYou can\n[download the GUMAR n-grams 
here](https://github.com/CAMeL-Lab/Gumar-Ngrams/releases).\n\nThe n-grams are split by dialect into separate compressed archives of the form\n`\u003cTAG\u003e.tar.xz` where *\\\u003cTAG\u003e* is one of the dialect tags listed above.\nThere is an additional file `ALL.tar.xz` that contains n-grams of all the\ndialects combined.\n\nOnce downloaded, you can extract the files by running the following:\n\n```bash\ntar -xJf \u003cTAG\u003e.tar.xz\n```\n\nThis will generate a folder `\u003cTAG\u003e/` in the current working directory.\n\n## Directory Structure\n\nEach folder contains the following n-gram files:\n\n* `1-grams_\u003cTAG\u003e.tsv`\n* `2-grams_\u003cTAG\u003e.tsv`\n* `3-grams_\u003cTAG\u003e.tsv`\n* `4-grams_\u003cTAG\u003e.tsv`\n* `5-grams_\u003cTAG\u003e.tsv`\n\n## Format\n\nEach n-gram file consists of three tab-separated columns as follows:\n\n    \u003cn-gram\u003e TAB \u003cfrequency\u003e TAB \u003c# of documents\u003e NEWLINE\n\nThe tokens of each \\\u003cn-gram\u003e of order greater than one are separated by a single space.\n\nExample of a 2-grams row:\n\n\u003cpre dir=\"rtl\"\u003e\nانتظر منك\t85\t69\n\u003c/pre\u003e\n\n*\\* Note that the example above is displayed right-to-left but the columns are\nin the order described.*\n\nEach n-gram file is sorted by `\u003cfrequency\u003e` in descending order.\n\n## Data Sources\n\nIf you would like more details on the data used to generate the n-grams,\ntake a look at the [Gumar_Info.tsv](./Gumar_Info.tsv) file.\nIt is a Tab Separated Values file containing author and title\ninformation for each document, as well as its dialect and the link it was\ndownloaded from. Duplicate entries for title-author pairs indicate that a\ndocument was split into multiple files.\n\n*\\* Please note that some entries in [Gumar_Info.tsv](./Gumar_Info.tsv)\ncontaining double-quotes have been escaped. We recommend using a TSV reader\n(e.g., Microsoft Excel, Apple Numbers, Google Docs, etc.) 
to parse these\nproperly.*\n\n## Citation\n\nPlease use the following citation when referencing or using this resource:\n\n\u003e Khalifa, Salam, Nizar Habash, Dana Abdulrahim, and Sara Hassan.\n\u003e \"A Large Scale Corpus of Gulf Arabic.\" In Language Resources and Evaluation\n\u003e Conference. 2016. Portorož, Slovenia\n\n## License\n\nThe Gumar Corpus n-grams are licensed under a\n[Creative Commons Attribution 3.0 Unported License](http://creativecommons.org/licenses/by/3.0/).\n\n## References\n\n[1] [Khalifa, Salam, Nizar Habash, Dana Abdulrahim, and Sara Hassan. \"A Large Scale Corpus of Gulf Arabic.\" In Language Resources and Evaluation Conference. 2016. Portorož, Slovenia](http://www.lrec-conf.org/proceedings/lrec2016/pdf/823_Paper.pdf)\n\n[2] [Khalifa, Salam, Nizar Habash, Fadhl Eryani, Ossama Obeid, Dana Abdulrahim, and Meera Al Kaabi. \"A Morphologically Annotated Corpus of Emirati Arabic\". In Language Resources and Evaluation Conference. 2018. Miyazaki, Japan](http://www.lrec-conf.org/proceedings/lrec2018/pdf/529.pdf)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamel-lab%2Fgumar-ngrams","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcamel-lab%2Fgumar-ngrams","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamel-lab%2Fgumar-ngrams/lists"}