https://github.com/camel-lab/gumar-ngrams

The complete [1 to 5]-gram Gumar Corpus in the style of Google n-grams.

Host: GitHub
URL: https://github.com/camel-lab/gumar-ngrams
Owner: CAMeL-Lab
Created: 2018-12-11T09:25:57.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2020-02-05T08:57:48.000Z (over 6 years ago)
Last Synced: 2025-09-09T22:06:34.687Z (10 months ago)
Size: 57.6 KB
Stars: 10
Watchers: 1
Forks: 5
Open Issues: 0
Metadata Files:
- Readme: README.md

# The Gumar Corpus N-grams

> Copyright © 2017-2018 New York University Abu Dhabi
>
> Computational Approaches to Modeling Language (CAMeL) Lab

## About

We present the Gumar Corpus n-grams.
The n-grams are generated from the
[Gumar Corpus](https://camel.abudhabi.nyu.edu/gumar/), a large-scale corpus of
Gulf Arabic containing more than 100 million words [1,2].
The n-grams cover orders 1 through 5 (1-, 2-, 3-, 4- and 5-grams), each with its
frequency count and the number of documents it appears in.
The n-grams are counted across the entire corpus and also across each dialect
category individually.
The n-gram files follow a format similar to that of Google n-grams, except for
the year column, which we do not produce.

## Preprocessing

* All documents of the corpus are converted into plain text.
* Basic UTF-8 character cleaning.
* Punctuation separation.

## Dialect Categorization

Below are categorizations of the dialects and their respective document counts.
For specific information per document please refer to the spreadsheet attached
with this package.

| Tag | Dialect | Document Count |
|:--------:|:--------------------------------------------:|:--------------:|
| SA | Saudi | 770 |
| AE | Emirati | 115 |
| KW | Kuwaiti | 87 |
| OM | Omani | 14 |
| QA | Qatari | 10 |
| BA | Bahraini | 8 |
| MSA | Modern Standard Arabic | 82 |
| EGY | Egyptian | 3 |
| LEV | Levantine | 5 |
| MOR | Moroccan | 1 |
| IRQ | Iraqi | 5 |
| YEM | Yemeni | 1 |
| UNID_GA | Unidentified Gulf Arabic | 116 |
| MIXED_GA | Mixed Gulf Arabic | 11 |
| MIXED | Gulf Arabic mixed with other Arabic dialects | 4 |

## Download

You can
[download the GUMAR n-grams here](https://github.com/CAMeL-Lab/Gumar-Ngrams/releases).

The n-grams are split by dialect into separate compressed archives of the form
`<dialect>.tar.xz`, where `<dialect>` is one of the dialect tags listed above.
There is an additional file `ALL.tar.xz` that contains n-grams of all the
dialects combined.

Once downloaded, you can extract the files by running the following:

```bash
tar -xJf <dialect>.tar.xz
```

This will generate a folder `<dialect>/` in the current working directory.
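For scripted workflows, the same extraction can be done from Python's standard library. The sketch below is only an illustration: `extract_ngrams` is a hypothetical helper name, and the archive path is whatever file you downloaded from the releases page.

```python
import tarfile
from pathlib import Path


def extract_ngrams(archive_path, dest_dir="."):
    """Extract a .tar.xz archive into dest_dir and return its member names.

    Hypothetical helper, not part of the Gumar release.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:xz" selects LZMA decompression, the counterpart of `tar -xJ`.
    with tarfile.open(archive_path, mode="r:xz") as archive:
        names = archive.getnames()
        # Use the safer "data" extraction filter where this Python has it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest, **kwargs)
    return names
```

Calling `extract_ngrams("ALL.tar.xz")` should behave like the `tar` command above, extracting into the current working directory.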

## Directory Structure

Each folder contains the following n-gram files:

* `1-grams_<dialect>.tsv`
* `2-grams_<dialect>.tsv`
* `3-grams_<dialect>.tsv`
* `4-grams_<dialect>.tsv`
* `5-grams_<dialect>.tsv`

## Format

Each n-gram file consists of three tab separated columns as follows:

`<n-gram>` TAB `<frequency>` TAB `<# of documents>` NEWLINE

The tokens of each `<n-gram>` longer than one token are separated by single spaces.

Example of a 2-grams row:


انتظر منك	85	69

*\* Note that the example above is displayed right-to-left but the columns are
in the order described.*

Each n-gram file is sorted by `<frequency>` in descending order.
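A minimal Python sketch of reading this format follows. It assumes the three columns are the n-gram, its frequency count and its document count, in that order, as in the example row above. `NgramRow`, `parse_ngram_line` and `read_ngrams` are hypothetical names chosen for illustration.

```python
from typing import Iterator, NamedTuple, TextIO, Tuple


class NgramRow(NamedTuple):
    tokens: Tuple[str, ...]  # the n-gram, split on single spaces
    frequency: int           # second column (assumed: frequency count)
    doc_count: int           # third column (number of documents)


def parse_ngram_line(line: str) -> NgramRow:
    """Split one tab-separated row into tokens and two integer counts."""
    ngram, frequency, doc_count = line.rstrip("\r\n").split("\t")
    return NgramRow(tuple(ngram.split(" ")), int(frequency), int(doc_count))


def read_ngrams(handle: TextIO) -> Iterator[NgramRow]:
    """Yield one NgramRow per non-empty line of an open n-gram file."""
    for line in handle:
        if line.strip():
            yield parse_ngram_line(line)
```

Open the file with `encoding="utf-8"` before passing it to `read_ngrams`. Because the rows are read lazily, the larger files do not need to fit in memory.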

## Data Sources

If you would like more details on the data used to generate the n-grams,
take a look at the [Gumar_Info.tsv](./Gumar_Info.tsv) file.
It is a Tab Separated Values file containing author and title
information for each document, as well as its dialect and the link it was
downloaded from. Duplicate entries for title-author pairs indicate that a
document was split into multiple files.

*\* Please note that some entries in [Gumar_Info.tsv](./Gumar_Info.tsv)
containing double-quotes have been escaped. We recommend using a TSV reader
(e.g., Microsoft Excel, Apple Numbers, Google Docs, etc.) to parse these
properly.*
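If you prefer to parse the file programmatically, Python's `csv` module can act as the TSV reader. The sketch below assumes the escaping follows the common spreadsheet convention (fields wrapped in double quotes, with embedded quotes doubled), which this README does not spell out. `read_tsv_rows` is a hypothetical helper name.

```python
import csv
import io


def read_tsv_rows(handle):
    """Yield each row of a tab-separated file as a list of fields."""
    # delimiter="\t" switches the csv module to TSV. quotechar and
    # doublequote undo spreadsheet-style quoting ("" inside "...").
    # That quoting style is an assumption about Gumar_Info.tsv.
    reader = csv.reader(handle, delimiter="\t", quotechar='"', doublequote=True)
    for row in reader:
        yield row


# Made-up sample row for illustration only, not taken from Gumar_Info.tsv.
sample = 'Some Author\t"A ""quoted"" title"\tSA\n'
sample_rows = list(read_tsv_rows(io.StringIO(sample, newline="")))
```

For the real file, open it with `encoding="utf-8"` and `newline=""` so that the `csv` module handles line endings itself.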

## Citation

Please use the following citation when referencing or using this resource:

> Khalifa, Salam, Nizar Habash, Dana Abdulrahim, and Sara Hassan.
> "A Large Scale Corpus of Gulf Arabic." In Language Resources and Evaluation
> Conference. 2016. Portorož, Slovenia

## License

The Gumar Corpus n-grams are licensed under a
[Creative Commons Attribution 3.0 Unported License](http://creativecommons.org/licenses/by/3.0/).

## References

[1] [Khalifa, Salam, Nizar Habash, Dana Abdulrahim, and Sara Hassan. "A Large Scale Corpus of Gulf Arabic." In Language Resources and Evaluation Conference. 2016. Portorož, Slovenia](http://www.lrec-conf.org/proceedings/lrec2016/pdf/823_Paper.pdf)

[2] [Khalifa, Salam, Nizar Habash, Fadhl Eryani, Ossama Obeid, Dana Abdulrahim, and Meera Al Kaabi. "A Morphologically Annotated Corpus of Emirati Arabic". In Language Resources and Evaluation Conference. 2018. Miyazaki, Japan](http://www.lrec-conf.org/proceedings/lrec2018/pdf/529.pdf)
