
https://github.com/aajanki/finnish-word-frequencies

Counting word frequencies in a Finnish text corpus

finnish nlp word-frequency


# Finnish word frequencies

A script for counting the word frequencies in the Finnish subset of
[the C4 dataset](https://huggingface.co/datasets/allenai/c4).

```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

python -m src.main --limit 10000
```
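
The core counting step can be illustrated with a minimal sketch. This is not the repository's implementation: the function name `count_word_frequencies`, the tokenizing regular expression, and the reading of `--limit` as a cap on the number of documents processed are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import islice
from typing import Iterable, Optional

# Assumption: a "word" is a run of Unicode letters (so å, ä and ö are included),
# optionally joined by hyphens or apostrophes. The real tokenizer may differ.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def count_word_frequencies(
    documents: Iterable[str], limit: Optional[int] = None
) -> Counter:
    """Count lowercased word occurrences in at most `limit` documents."""
    counts: Counter = Counter()
    # islice with limit=None iterates over every document.
    for text in islice(documents, limit):
        counts.update(WORD_RE.findall(text.lower()))
    return counts


# Hypothetical sample documents standing in for C4 records.
sample_docs = [
    "Kissa istuu matolla. Kissa nukkuu.",
    "Koira ja kissa leikkivät.",
    "Tämä dokumentti jää rajan ulkopuolelle.",
]

frequencies = count_word_frequencies(sample_docs, limit=2)
for word, count in frequencies.most_common(3):
    print(word, count)
```

Because the function accepts any iterable of strings, the same sketch would work on a streamed dataset without loading it into memory.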

In Docker:

```
docker build --network host --tag fi-vocabulary:latest .
docker volume create fi-frequencies
docker volume inspect fi-frequencies
docker run -it --rm --mount source=fi-frequencies,target=/app/results --dns 8.8.8.8 fi-vocabulary:latest python -m src.main --limit 10000
```

## Text classifiers

The `models` directory contains simple models for detecting spam and
computer code. They are used to filter out uninteresting documents.
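
The filtering step might look roughly like the sketch below. The two detector functions are hypothetical keyword and punctuation heuristics standing in for the trained models, whose actual type and interface are not described here. `filter_documents`, the thresholds, and the trigger phrases are likewise assumptions.

```python
from typing import Callable, Iterable, Iterator, Sequence

# A classifier here is any callable that returns True for an unwanted document.
Classifier = Callable[[str], bool]


def looks_like_code(text: str) -> bool:
    """Hypothetical stand-in for the code detector: flags punctuation-dense text."""
    if not text:
        return False
    symbols = sum(text.count(ch) for ch in "{};=()<>")
    return symbols / len(text) > 0.05  # arbitrary illustrative threshold


def looks_like_spam(text: str) -> bool:
    """Hypothetical stand-in for the spam detector: matches made-up trigger phrases."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in ("osta nyt", "ilmainen bonus"))


def filter_documents(
    documents: Iterable[str], classifiers: Sequence[Classifier]
) -> Iterator[str]:
    """Yield only the documents that no classifier flags as unwanted."""
    for text in documents:
        if not any(classify(text) for classify in classifiers):
            yield text


docs = [
    "Helsinki on Suomen pääkaupunki.",
    "for (i = 0; i < n; i++) { total += x[i]; }",
    "Osta nyt ja saat ilmainen bonus heti!",
]
kept = list(filter_documents(docs, [looks_like_code, looks_like_spam]))
print(kept)
```

Treating each classifier as a plain callable keeps the filter independent of how the models are trained or loaded, so a trained model's predict function could be wrapped and passed in the same way.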

The models were trained with the scripts in `src/trainmodels` on
manually labelled training samples. The training data is under the
`trainingdata` directory. The documents are part of the [C4
dataset](https://huggingface.co/datasets/allenai/c4) which is made
available under the [ODC Attribution
license](https://opendatacommons.org/licenses/by/1-0/).