https://github.com/dohliam/hawaiian-corpus
Data from a corpus of written Hawaiian
- Host: GitHub
- URL: https://github.com/dohliam/hawaiian-corpus
- Owner: dohliam
- License: other
- Created: 2016-06-27T16:53:06.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2016-06-27T16:58:39.000Z (almost 10 years ago)
- Last Synced: 2025-01-26T10:08:45.186Z (over 1 year ago)
- Topics: bigrams, corpora, corpus, corpus-data, corpus-linguistics, frequency, frequency-list, hawaii, hawaiian, hawaiian-electronic-library, hawaiian-language, n-grams, ngram, olelo-hawaii, stoplist, stopwords, ulukau
- Homepage: https://dohliam.github.io/corpus/haw/
- Size: 22.2 MB
- Stars: 14
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# hawaiian-corpus - Data from a corpus of written Hawaiian
This repository contains data based on a corpus of texts written in the Hawaiian language (_ʻŌlelo Hawaiʻi_). The data includes frequency lists, stopwords, and lists of most common n-grams. The text in the corpus was obtained from [Ulukau](http://ulukau.org/), the Hawaiian Electronic Library.
The corpus contains 10.7 million words in total and is restricted to modern (post-20th century), non-scriptural text. An overview of the corpus statistics, including the most common words and n-grams, is available in [corpus_stats-haw.md](corpus_stats-haw.md).
## Data
Files included in this repository:
* [Hawaiian frequency list](data/freqlist_haw.txt): A list of all the words in the corpus, arranged by frequency
* [Hawaiian stopwords list](data/stoplist_haw.txt): A list of stopwords derived from the frequency file (this is being actively verified and updated for eventual inclusion in the [stopwords-json](https://github.com/6/stopwords-json) project)
* [List of Hawaiian bigrams](data/ngrams/2grams_haw.txt): A list of the most common sequences of two words, arranged by frequency
* [List of Hawaiian 3-grams](data/ngrams/3grams_haw.txt): A list of the most common sequences of three words, arranged by frequency
* [List of Hawaiian 4-grams](data/ngrams/4grams_haw.txt): A list of the most common sequences of four words, arranged by frequency
* [Statistics for the Hawaiian corpus](data/corpus_stats-haw.md)
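
The files listed above could be read with a few lines of Python. The following is a minimal sketch, not a documented interface of this repository: it assumes that each line of a frequency or n-gram file holds one item and one integer count separated by whitespace (count first or last), and that the stoplist holds one word per line. Check the actual files before relying on either assumption. The function names `parse_counts`, `load_stopwords`, and `content_words` are hypothetical, and the sample counts are invented for illustration rather than taken from the corpus.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def parse_counts(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (item, count) pairs from frequency-list lines.

    Assumption: the count is the first or the last whitespace-separated
    field, and the remaining fields form the word or n-gram.
    """
    for raw in lines:
        fields = raw.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[-1].isdecimal():
            yield " ".join(fields[:-1]), int(fields[-1])
        elif fields[0].isdecimal():
            yield " ".join(fields[1:]), int(fields[0])


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Collect stopwords, assuming one word per line."""
    return {line.strip() for line in lines if line.strip()}


def content_words(
    pairs: Iterable[Tuple[str, int]], stopwords: Set[str], limit: int = 10
) -> List[Tuple[str, int]]:
    """Return up to `limit` of the most frequent items that are not stopwords."""
    kept = [(item, count) for item, count in pairs if item not in stopwords]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Invented sample lines standing in for the real files.
sample_freqlist = ["ka 120", "i 90", "hale 35"]
sample_stoplist = ["ka", "i"]
print(content_words(parse_counts(sample_freqlist), load_stopwords(sample_stoplist)))
```

To run this against the real data, pass open file handles instead of the sample lists, for example `open("data/freqlist_haw.txt", encoding="utf-8")` and `open("data/stoplist_haw.txt", encoding="utf-8")`. Opening the files as UTF-8 matters because Hawaiian text uses the ʻokina and macron vowels.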
## License
CC0.