Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alexgustafsson/word-frequencies
Data and tools to compile word frequencies, trigrams and more for use with NLP, spelling correction etc.
language nlp nltk numpy python python3 sklearn spelling-correction trigram
Last synced: 16 days ago
- Host: GitHub
- URL: https://github.com/alexgustafsson/word-frequencies
- Owner: AlexGustafsson
- License: unlicense
- Created: 2020-09-03T15:34:42.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-07-14T14:14:38.000Z (over 3 years ago)
- Last Synced: 2024-12-11T02:04:29.710Z (2 months ago)
- Topics: language, nlp, nltk, numpy, python, python3, sklearn, spelling-correction, trigram
- Language: Python
- Homepage:
- Size: 1.93 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Word frequencies
### Data and tools to compile word frequencies for use with NLP, spelling correction etc.
***
### Goal
The goal of this project is to provide easy-to-use data and tools for word frequencies in various languages. Contributions that add more data to the project are welcome.
### Available tools
The `scripts` directory contains several scripts, each of which can be used as a library or as a CLI tool.
There are scripts to fetch data from Wikipedia articles and Project Gutenberg books, as well as from regional sources.
The scripts in `scripts/processing` can be used to download and compile large text files for a language, as well as to compile a frequency map.
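To illustrate what a frequency map is, here is a minimal sketch using only the Python standard library. It is not the code in `scripts/processing`; the function names and the tokenization rule (lowercased runs of letters) are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: a word is a run of Unicode letters
# (digits and underscores are excluded).
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercased words."""
    return WORD_RE.findall(text.lower())


def word_frequencies(text):
    """Map each word to the number of times it occurs."""
    return Counter(tokenize(text))


def ngram_frequencies(text, n):
    """Map each run of n consecutive words to its count.

    Sentence boundaries are ignored in this sketch, so an n-gram
    may span two sentences.
    """
    words = tokenize(text)
    return Counter(zip(*(words[i:] for i in range(n))))


sample = "Det var en gång en katt. En katt!"
words = word_frequencies(sample)
bigrams = ngram_frequencies(sample, 2)
```

Here `words.most_common(5)` would list the five most frequent words, and `ngram_frequencies(sample, 3)` would count trigrams in the same way.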
The `scripts/ai/test_ai` and `scripts/ai/train_ai` scripts can be used to train an MLE (maximum likelihood estimation) model using NLTK. The model predicts the likelihood of a specific word occurring in a sentence and can also generate new sentences.
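As a rough illustration of what an MLE language model computes, the sketch below builds a bigram model from scratch with the standard library. It deliberately does not use NLTK and is not the code in `scripts/ai`; all function names, the `<s>` and `</s>` boundary markers and the toy corpus are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def train_bigram_mle(sentences):
    """Count, for every word, which words follow it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        # Pad with hypothetical start and end markers.
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            followers[prev][word] += 1
    return followers


def score(followers, word, prev):
    """MLE estimate: P(word | prev) = count(prev, word) / count(prev, *)."""
    candidates = followers.get(prev, Counter())
    total = sum(candidates.values())
    return candidates[word] / total if total else 0.0


def generate(followers, rng, max_words=20):
    """Sample a sentence word by word, weighted by the bigram counts."""
    words, prev = [], "<s>"
    for _ in range(max_words):
        candidates = followers.get(prev)
        if not candidates:
            break
        prev = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if prev == "</s>":
            break
        words.append(prev)
    return words


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
model = train_bigram_mle(corpus)
sentence = generate(model, random.Random(0))
```

With this toy corpus, `score(model, "cat", "the")` is 2/3 because "cat" follows "the" in two of the three sentences. A trigram model works the same way with two words of context instead of one.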
### Available data
Data is currently available for Swedish and English. Since v1.0.0, the data is no longer stored in the repository itself; it is published with the releases instead.
#### Swedish
1. Word frequencies - roughly 300 000 words and how often they occur.
2. Character frequencies - roughly 600 characters and how often they occur.
3. Bigrams - roughly 1 600 000 unique bigrams.
4. Trigrams - roughly 2 150 650 unique trigrams.

Note: Because of the sources the data was retrieved from, the character frequencies likely contain foreign characters such as `海` as well as characters from the International Phonetic Alphabet. To mitigate this, filter out characters that occur no more than `n` times, where `n` is low (0-15).
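The filtering suggested in the note can be sketched as follows. The example assumes the character frequencies have already been loaded into a dictionary from character to count; the function name and the sample counts are made up for illustration and do not come from the released data.

```python
def filter_rare_characters(frequencies, n):
    """Keep only the characters that occur more than n times."""
    return {char: count for char, count in frequencies.items() if count > n}


# Made-up counts for illustration only.
frequencies = {"a": 1000, "ö": 120, "海": 3, "ɛ": 15}
common = filter_rare_characters(frequencies, 15)
```

With `n = 15`, both `海` (3 occurrences) and `ɛ` (exactly 15 occurrences) are dropped, since neither occurs more than 15 times.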
#### English
The word frequencies were compiled by using the compilation script in this repository.
#### Others
Is your language missing from the released compilations? Fear not! The scripts in this repository are designed to be as multilingual as possible. The languages currently available reflect my personal interest, but the scripts can fetch any language available on Wikipedia, Project Gutenberg etc. Their modular design also makes it easy to add regional sources such as local media.
### Contributing
Any help with the project is more than welcome. If you're unable to add a change yourself, open an issue and let someone else take a look at it.
### Disclaimer
_All of the content compiled in releases was created by third parties. The selection is unbiased from this project's point of view, but may contain biased content. It is a non-goal of this project to provide a censored, altered or cherry-picked dataset. As such, the data may contain racial slurs and derogatory terms. Keep that in mind if you use the dataset for creating applications or AIs._