https://github.com/camel-lab/gumar-ngrams
The complete [1 to 5]-gram Gumar Corpus in the style of Google n-grams.
- Host: GitHub
- URL: https://github.com/camel-lab/gumar-ngrams
- Owner: CAMeL-Lab
- Created: 2018-12-11T09:25:57.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2020-02-05T08:57:48.000Z (over 6 years ago)
- Last Synced: 2025-09-09T22:06:34.687Z (9 months ago)
- Size: 57.6 KB
- Stars: 10
- Watchers: 1
- Forks: 5
- Open Issues: 0
# The Gumar Corpus N-grams
> Copyright © 2017-2018 New York University Abu Dhabi
>
> Computational Approaches to Modeling Language (CAMeL) Lab
## About
We present the Gumar Corpus n-grams.
The n-grams are generated from the
[Gumar Corpus](https://camel.abudhabi.nyu.edu/gumar/), a large-scale corpus of
Gulf Arabic containing more than 100 million words [1,2].
The n-grams cover orders 1 through 5 (1-, 2-, 3-, 4- and 5-grams), each with its
frequency count and the number of documents it appears in.
The n-grams are counted across the entire corpus and also across each dialect
category individually.
The n-gram files follow a format similar to that of Google n-grams, except for
the year column, which we do not produce.
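To make the two reported numbers concrete, here is a minimal sketch of how a frequency count and a document count can be derived for every n-gram up to order 5. It is an illustration only, not the pipeline used to build this resource; the function name `count_ngrams` and the toy token lists are hypothetical, and documents are assumed to be already tokenized.

```python
from collections import Counter


def count_ngrams(documents, max_order=5):
    """Count n-grams of order 1..max_order over tokenized documents.

    Returns two Counters keyed by the space-joined n-gram:
    total occurrences, and number of documents containing the n-gram.
    """
    freq = Counter()
    doc_freq = Counter()
    for tokens in documents:
        seen = set()  # n-grams found in this document, counted once each
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                ngram = " ".join(tokens[i:i + n])
                freq[ngram] += 1
                seen.add(ngram)
        doc_freq.update(seen)
    return freq, doc_freq


# Toy input (hypothetical tokens): two already-tokenized documents.
docs = [["a", "b", "a", "b"], ["a", "b", "c"]]
freq, doc_freq = count_ngrams(docs)
print(freq["a b"], doc_freq["a b"])
```

Here `a b` occurs three times in total but in only two documents, which is the distinction between the two count columns.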
## Preprocessing
* All documents of the corpus are converted into plain text.
* Basic UTF-8 character cleaning.
* Punctuation separation.
## Dialect Categorization
Below are categorizations of the dialects and their respective document counts.
For per-document information, please refer to the spreadsheet included with
this package.
| Tag | Dialect | Document Count |
|:--------:|:--------------------------------------------:|:--------------:|
| SA | Saudi | 770 |
| AE | Emirati | 115 |
| KW | Kuwaiti | 87 |
| OM | Omani | 14 |
| QA | Qatari | 10 |
| BA | Bahraini | 8 |
| MSA | Modern Standard Arabic | 82 |
| EGY | Egyptian | 3 |
| LEV | Levantine | 5 |
| MOR | Moroccan | 1 |
| IRQ | Iraqi | 5 |
| YEM | Yemeni | 1 |
| UNID_GA | Unidentified Gulf Arabic | 116 |
| MIXED_GA | Mixed Gulf Arabic | 11 |
| MIXED | Gulf Arabic mixed with other Arabic dialects | 4 |
## Download
You can
[download the GUMAR n-grams here](https://github.com/CAMeL-Lab/Gumar-Ngrams/releases).
The n-grams are split by dialect into separate compressed archives of the form
`<dialect>.tar.xz`, where *\<dialect\>* is one of the dialect tags listed above.
There is an additional file `ALL.tar.xz` that contains n-grams of all the
dialects combined.
Once downloaded, you can extract the files by running the following:
```bash
tar -xJf <dialect>.tar.xz
```
This will generate a folder `<dialect>/` in the current working directory.
## Directory Structure
Each folder contains the following n-gram files:
* `1-grams_<dialect>.tsv`
* `2-grams_<dialect>.tsv`
* `3-grams_<dialect>.tsv`
* `4-grams_<dialect>.tsv`
* `5-grams_<dialect>.tsv`
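The layout above can be turned into a small path helper. This is a sketch under stated assumptions: the dialect tag is assumed to fill the slot after the underscore in each file name and to name the extracted folder, and the combined `ALL` set is assumed to follow the same pattern. The names `DIALECT_TAGS` and `ngram_file` and the `data` root folder are hypothetical.

```python
from pathlib import Path

# Tags copied from the dialect table, plus "ALL" for the combined archive.
DIALECT_TAGS = {
    "SA", "AE", "KW", "OM", "QA", "BA", "MSA", "EGY", "LEV", "MOR",
    "IRQ", "YEM", "UNID_GA", "MIXED_GA", "MIXED", "ALL",
}


def ngram_file(root, tag, order):
    """Build the expected path of one n-gram file (assumed naming scheme)."""
    if tag not in DIALECT_TAGS:
        raise ValueError(f"unknown dialect tag: {tag!r}")
    if not 1 <= order <= 5:
        raise ValueError(f"order must be between 1 and 5, got {order}")
    return Path(root) / tag / f"{order}-grams_{tag}.tsv"


# "data" is a hypothetical folder holding the extracted archives.
print(ngram_file("data", "SA", 2).as_posix())
```

Validating the tag and the order up front keeps a typo from surfacing later as a confusing missing-file error.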
## Format
Each n-gram file consists of three tab-separated columns as follows:
`<n-gram> TAB <count> TAB <# of documents> NEWLINE`
Tokens within an \<n-gram\> of order greater than one are separated by a single space.
Example of a 2-grams row:
انتظر منك 85 69
*\* Note that the example above is displayed right-to-left but the columns are
in the order described.*
Each n-gram file is sorted by `<count>` in descending order.
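A row in this format can be parsed with plain string splitting. The sketch below assumes the three columns are, in order, the n-gram, its frequency count, and its document count, as in the example row above. The names `NgramRow`, `parse_ngram_line` and `read_ngrams` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class NgramRow(NamedTuple):
    tokens: Tuple[str, ...]  # the n-gram, split on single spaces
    count: int               # frequency count
    doc_count: int           # number of documents containing the n-gram


def parse_ngram_line(line: str) -> NgramRow:
    """Parse one tab-separated row into an NgramRow."""
    ngram, count, doc_count = line.rstrip("\r\n").split("\t")
    return NgramRow(tuple(ngram.split(" ")), int(count), int(doc_count))


def read_ngrams(lines: Iterable[str]) -> Iterator[NgramRow]:
    """Yield parsed rows, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_ngram_line(line)


# The 2-gram example row from this section, with explicit tabs.
row = parse_ngram_line("انتظر منك\t85\t69\n")
print(len(row.tokens), row.count, row.doc_count)
```

Splitting on tabs directly, rather than going through a CSV parser, is a deliberate choice here: because punctuation is separated into its own tokens, a quote character may appear as a token, and a quote-aware parser could misread it. When reading from disk, opening the file with `encoding="utf-8"` and passing the file object to `read_ngrams` should work.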
## Data Sources
If you would like more details on the data used to generate the n-grams,
take a look at the [Gumar_Info.tsv](./Gumar_Info.tsv) file.
It is a tab-separated values (TSV) file containing author and title
information for each document, as well as its dialect and the link it was
downloaded from. Duplicate entries for title-author pairs indicate that a
document was split into multiple files.
*\* Please note that double quotes in some entries of
[Gumar_Info.tsv](./Gumar_Info.tsv) have been escaped. We recommend using a
TSV reader (e.g., Microsoft Excel, Apple Numbers, or Google Docs) to parse
these properly.*
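For programmatic access, a quote-aware TSV reader can be sketched with Python's standard `csv` module. This is a sketch under an assumption that is not confirmed here: that the escaping follows the spreadsheet convention of wrapping a field in double quotes and doubling any quote inside it. The function name `read_tsv`, the column names and the sample row are hypothetical and do not reflect the real file.

```python
import csv
import io


def read_tsv(stream):
    """Read tab-separated rows, undoing spreadsheet-style quote escaping."""
    reader = csv.reader(stream, delimiter="\t", quotechar='"', doublequote=True)
    return [row for row in reader]


# Hypothetical sample: the column names and values are made up for illustration.
sample = 'col_a\tcol_b\n"a ""quoted"" title"\tsomeone\n'
rows = read_tsv(io.StringIO(sample))
print(rows[1][0])
```

When reading the actual file, opening it with `open(path, encoding="utf-8", newline="")` is the form the `csv` module documentation recommends.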
## Citation
Please use the following citation when referencing or using this resource:
> Khalifa, Salam, Nizar Habash, Dana Abdulrahim, and Sara Hassan.
> "A Large Scale Corpus of Gulf Arabic." In Language Resources and Evaluation
> Conference. 2016. Portorož, Slovenia
## License
The Gumar Corpus n-grams are licensed under a
[Creative Commons Attribution 3.0 Unported License](http://creativecommons.org/licenses/by/3.0/).
## References
[1] [Khalifa, Salam, Nizar Habash, Dana Abdulrahim, and Sara Hassan. "A Large Scale Corpus of Gulf Arabic." In Language Resources and Evaluation Conference. 2016. Portorož, Slovenia](http://www.lrec-conf.org/proceedings/lrec2016/pdf/823_Paper.pdf)
[2] [Khalifa, Salam, Nizar Habash, Fadhl Eryani, Ossama Obeid, Dana Abdulrahim, and Meera Al Kaabi. "A Morphologically Annotated Corpus of Emirati Arabic." In Language Resources and Evaluation Conference. 2018. Miyazaki, Japan](http://www.lrec-conf.org/proceedings/lrec2018/pdf/529.pdf)