Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/layik/kurdi
dump kurdi stuff here
https://github.com/layik/kurdi
Last synced: 12 days ago
JSON representation
dump kurdi stuff here
- Host: GitHub
- URL: https://github.com/layik/kurdi
- Owner: layik
- Created: 2015-06-25T17:55:21.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2021-09-12T00:44:00.000Z (over 3 years ago)
- Last Synced: 2024-11-12T09:50:23.602Z (about 2 months ago)
- Language: HTML
- Size: 10.9 MB
- Stars: 19
- Watchers: 4
- Forks: 4
- Open Issues: 0
-
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
- awesome-kurdish - kurdi - Various Kurdi related work done by Kurdish developers. (Natural Language Processing / Libraries and Tools)
README
---
title: Various Kurdi related work done by Kurdish developers.
output: github_document
---
# kurdiThere are some hopefully useful files/scripts/chunks etc. to share with Kurdi developers.
1. kurdi_words.txt: a list of Kurdish words (currently 1,668,692), unique and alphabetically ordered (thanks to @dolanskurd).
Note that in the bar chart below, each of (و) and (ی) counted as both vowel and consonant.2. unicode_list.txt: list of unicode values for Kurdish alphabet (Arabic script) standard accepted and published on http://unicode.ekrg.org/ku_unicodes.html
3. [gettext](https://en.wikipedia.org/wiki/Gettext) translations, includes ku.po for Drupal. Most of the translations come from https://localize.drupal.org/translate/languages/ku (now almost dead
4. KRG health institutions data (lat/lng and names) throughout KRG (see health)
Now that we have some good unique and cleaned up word-iest. We can do some statistics on them (in R for now):
```{r}
w = readLines("https://raw.githubusercontent.com/layik/kurdi/master/corpus/kurdi_words.txt")
length(unique(w)) == length(w)
length(w)
# sample of those including ئا
length(grep("ئا", w))
# read in list of Kurdi chars
ku_v = readLines("https://raw.githubusercontent.com/layik/kurdi/master/corpus/letters_lines.txt")
message("Kurdish alphabet: ", length(ku_v), " letters.")letters_used = sapply(ku_v, function(x){
length(grep(x, w))
})# change h to doucheshme
names(letters_used)[names(letters_used) == 'ه'] = "ھ"
letters_used = sort(letters_used, decreasing = TRUE)library(ggplot2)
ggplot() +
geom_bar(aes(x=names(letters_used),y=letters_used), stat='identity') +
xlab('Alphabet') + ylab('Frequency') +
theme(axis.text.x = element_text(face = "bold", size = 18)) +
scale_y_continuous(labels = scales::comma) +
scale_x_discrete(limits=names(letters_used))letters_used['ە']
```The dataset could give us the wrong information. Just because we do not know the curation and also authors of the entries picked up from the World WILD Web. What can we do? How about looking at the CKB Wikipedia? At least the titles of the articles would have gone through some good revisions.
OK, let us go for it:
```{r}
source("ckb-kurdi.R")
```