https://github.com/aajanki/finnish-word-frequencies
Counting word frequencies in a Finnish text corpus
- Host: GitHub
- URL: https://github.com/aajanki/finnish-word-frequencies
- Owner: aajanki
- License: mit
- Created: 2024-03-10T08:29:53.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-03-19T18:27:36.000Z (9 months ago)
- Last Synced: 2024-04-24T03:01:44.344Z (8 months ago)
- Topics: finnish, nlp, word-frequency
- Language: Python
- Homepage:
- Size: 7.63 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Finnish word frequencies
A script for counting the word frequencies in the Finnish subset of
[the C4 dataset](https://huggingface.co/datasets/allenai/c4).

```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

python -m src.main --limit 10000
```

In Docker:
```
docker build --network host --tag fi-vocabulary:latest .
docker volume create fi-frequencies
docker volume inspect fi-frequencies
docker run -it --rm --mount source=fi-frequencies,target=/app/results --dns 8.8.8.8 fi-vocabulary:latest python -m src.main --limit 10000
```

## Text classifiers
The models directory contains simple models for detecting spam and
computer code. They are used to filter out uninteresting documents.

The models have been trained using the scripts at src/trainmodels with
manually labelled training samples. The training data is under the
trainingdata directory. The documents are part of the [C4
dataset](https://huggingface.co/datasets/allenai/c4) which is made
available under the [ODC Attribution
license](https://opendatacommons.org/licenses/by/1-0/).
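
As a rough illustration of how such a filter could sit in front of the frequency count, the sketch below skips documents flagged by a classifier and tallies the words of the rest. It is not the code from this repository: `looks_like_code` is a hypothetical symbol-density heuristic standing in for the trained models, and `tokenize` and `count_word_frequencies` are assumed names chosen for the example.

```python
import re
from collections import Counter
from typing import Callable, Iterable

# Assumed tokenizer: runs of Unicode letters, so å, ä and ö are kept.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens."""
    return WORD_RE.findall(text.lower())


def looks_like_code(text: str) -> bool:
    """Hypothetical stand-in for a trained classifier.

    Flags a document when more than 5% of its characters are
    symbols that are common in source code.
    """
    symbols = sum(text.count(ch) for ch in "{}();=<>")
    return symbols > 0.05 * max(len(text), 1)


def count_word_frequencies(
    documents: Iterable[str],
    is_uninteresting: Callable[[str], bool] = looks_like_code,
) -> Counter:
    """Count word frequencies over the documents that pass the filter."""
    counts: Counter = Counter()
    for doc in documents:
        if is_uninteresting(doc):
            continue  # filtered out, contributes nothing to the counts
        counts.update(tokenize(doc))
    return counts


docs = [
    "Kissa istuu puussa ja kissa nukkuu.",
    "for (i = 0; i < n; i++) { x = f(x); }",
]
freqs = count_word_frequencies(docs)
print(freqs.most_common(3))
```

Passing the filter in as a function keeps the counting loop independent of how documents are classified, so a trained model's prediction function could be swapped in for the heuristic.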