https://github.com/layik/kurdi

dump kurdi stuff here
https://github.com/layik/kurdi

Last synced: 3 months ago
JSON representation

dump kurdi stuff here

Host: GitHub
URL: https://github.com/layik/kurdi
Owner: layik
Created: 2015-06-25T17:55:21.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2021-09-12T00:44:00.000Z (almost 4 years ago)
Last Synced: 2025-02-14T14:43:22.455Z (5 months ago)
Language: HTML
Size: 10.9 MB
Stars: 19
Watchers: 4
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.Rmd

Awesome Lists containing this project

awesome-kurdish - kurdi - Various Kurdi related work done by Kurdish developers. (Natural Language Processing / Libraries and Tools)

README

        ---

title: Various Kurdi related work done by Kurdish developers.

output: github_document

---

# kurdi

There are some hopefully useful files/scripts/chunks etc. to share with Kurdi developers.

1. kurdi_words.txt: a list of Kurdish words (currently 1,668,692), unique and alphabetically ordered (thanks to @dolanskurd).

Note that in the bar chart below, each of (و) and (ی) counted as both vowel and consonant. 

2. unicode_list.txt: list of unicode values for Kurdish alphabet (Arabic script) standard accepted and published on http://unicode.ekrg.org/ku_unicodes.html

3. [gettext](https://en.wikipedia.org/wiki/Gettext) translations, includes ku.po for Drupal. Most of the translations come from https://localize.drupal.org/translate/languages/ku (now almost dead

4. KRG health institutions data (lat/lng and names) throughout KRG (see health)

Now that we have some good unique and cleaned up word-iest. We can do some statistics on them (in R for now):

```{r}

w = readLines("https://raw.githubusercontent.com/layik/kurdi/master/corpus/kurdi_words.txt")

length(unique(w)) == length(w)

length(w)

# sample of those including ئا

length(grep("ئا", w))

# read in list of Kurdi chars

ku_v = readLines("https://raw.githubusercontent.com/layik/kurdi/master/corpus/letters_lines.txt")

message("Kurdish alphabet: ",  length(ku_v), " letters.")

letters_used = sapply(ku_v, function(x){

  length(grep(x, w))

})

# change h to doucheshme

names(letters_used)[names(letters_used) == 'ه'] = "ھ"

letters_used = sort(letters_used, decreasing = TRUE)

library(ggplot2)

ggplot() + 

  geom_bar(aes(x=names(letters_used),y=letters_used), stat='identity') + 

  xlab('Alphabet') + ylab('Frequency') + 

  theme(axis.text.x = element_text(face = "bold", size = 18)) + 

  scale_y_continuous(labels = scales::comma) + 

  scale_x_discrete(limits=names(letters_used))

letters_used['ە']

```

The dataset could give us the wrong information. Just because we do not know the curation and also authors of the entries picked up from the World WILD Web. What can we do? How about looking at the CKB Wikipedia? At least the titles of the articles would have gone through some good revisions.

OK, let us go for it:

```{r}

source("ckb-kurdi.R")

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/layik/kurdi

Awesome Lists containing this project

README