Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/layik/kurdi

dump kurdi stuff here
https://github.com/layik/kurdi

Last synced: 12 days ago
JSON representation

dump kurdi stuff here

Awesome Lists containing this project

README

        

---
title: Various Kurdi related work done by Kurdish developers.
output: github_document
---
# kurdi

There are some hopefully useful files/scripts/chunks etc. to share with Kurdi developers.

1. kurdi_words.txt: a list of Kurdish words (currently 1,668,692), unique and alphabetically ordered (thanks to @dolanskurd).
Note that in the bar chart below, each of (و) and (ی) counted as both vowel and consonant.

2. unicode_list.txt: list of unicode values for Kurdish alphabet (Arabic script) standard accepted and published on http://unicode.ekrg.org/ku_unicodes.html

3. [gettext](https://en.wikipedia.org/wiki/Gettext) translations, includes ku.po for Drupal. Most of the translations come from https://localize.drupal.org/translate/languages/ku (now almost dead

4. KRG health institutions data (lat/lng and names) throughout KRG (see health)

Now that we have some good unique and cleaned up word-iest. We can do some statistics on them (in R for now):

```{r}
w = readLines("https://raw.githubusercontent.com/layik/kurdi/master/corpus/kurdi_words.txt")
length(unique(w)) == length(w)
length(w)
# sample of those including ئا
length(grep("ئا", w))
# read in list of Kurdi chars
ku_v = readLines("https://raw.githubusercontent.com/layik/kurdi/master/corpus/letters_lines.txt")
message("Kurdish alphabet: ", length(ku_v), " letters.")

letters_used = sapply(ku_v, function(x){
length(grep(x, w))
})

# change h to doucheshme
names(letters_used)[names(letters_used) == 'ه'] = "ھ"
letters_used = sort(letters_used, decreasing = TRUE)

library(ggplot2)
ggplot() +
geom_bar(aes(x=names(letters_used),y=letters_used), stat='identity') +
xlab('Alphabet') + ylab('Frequency') +
theme(axis.text.x = element_text(face = "bold", size = 18)) +
scale_y_continuous(labels = scales::comma) +
scale_x_discrete(limits=names(letters_used))

letters_used['ە']
```

The dataset could give us the wrong information. Just because we do not know the curation and also authors of the entries picked up from the World WILD Web. What can we do? How about looking at the CKB Wikipedia? At least the titles of the articles would have gone through some good revisions.

OK, let us go for it:

```{r}
source("ckb-kurdi.R")
```