https://github.com/ccoreilly/spacy-catala-generator

Training script and dataset used for the Catalan spaCy model

# Training script and dataset for [spacy-catala](https://github.com/ccoreilly/spacy-catala)

> Note: this repository uses Git LFS for the files under the `train` directory.
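
Because the training files are stored with Git LFS, a clone made without LFS support contains small pointer files instead of the real data. The sketch below is one way to detect that situation before training. The helper names `is_lfs_pointer` and `list_unfetched` are hypothetical and not part of this repository; the check relies only on the first line of the Git LFS pointer format.

```shell
# Hypothetical helper (not part of this repository): succeeds when FILE is
# still a Git LFS pointer, i.e. its real content has not been fetched yet.
is_lfs_pointer() {
    # Pointer files are short text files whose first line names the LFS spec.
    head -c 64 "$1" 2>/dev/null \
        | LC_ALL=C grep -q '^version https://git-lfs\.github\.com/spec/v1'
}

# Hypothetical helper: print every file under DIR that is still a pointer.
list_unfetched() {
    find "$1" -type f | while IFS= read -r f; do
        if is_lfs_pointer "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running `list_unfetched train` from the repository root should print nothing once the data is present. If it lists files, `git lfs install` followed by `git lfs pull` normally fetches the real content, assuming Git LFS is installed.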

```
$ ./train.sh
Usage: train.sh --nvectors [vector_size]
E.g.: train.sh ca_fasttext_md 1.2.0 cc.ca.300.vec.gz train dev --nvectors 50000
Train a spacy model.
```

`vectors_path` (the third argument, `cc.ca.300.vec.gz` in the example above) expects gzipped text vectors. These are not included in the repository; you can download them with:

```
$ curl -L -o cc.ca.300.vec.gz https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ca.300.vec.gz
```
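
After downloading, it can be worth confirming that the file really is a gzipped text-format vectors file. A fastText `.vec` file starts with a header line containing the number of words and the vector dimension. The following is a minimal sketch of such a check; `vec_header` and `has_at_least_vectors` are hypothetical helper names, not part of `train.sh`.

```shell
# Hypothetical helper: print the header line ("<word count> <dimension>")
# of a gzipped fastText text-format vectors file.
vec_header() {
    gzip -dc "$1" | head -n 1
}

# Hypothetical helper: succeeds when the file declares at least N vectors,
# e.g. before asking for a given --nvectors value.
has_at_least_vectors() {
    count=$(vec_header "$1" | cut -d ' ' -f 1)
    [ "$count" -ge "$2" ]
}
```

For example, `vec_header cc.ca.300.vec.gz` prints the header, and `has_at_least_vectors cc.ca.300.vec.gz 20000` checks that the file holds enough entries for the medium model's vector count.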

To recreate the large model in [spacy-catala](https://github.com/ccoreilly/spacy-catala), run:

```
$ ./train.sh ca_fasttext_wiki_lg 1.0.0 cc.ca.300.vec.gz train dev
```

The medium-sized model was pruned to the 20000 most common vectors using:

```
$ ./train.sh ca_fasttext_wiki_md 1.0.0 cc.ca.300.vec.gz train dev --nvectors 20000
```
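
To get a feel for which words fall inside a given `--nvectors` cutoff, the first entries of the vectors file can be listed directly. This is only an illustrative sketch: it assumes the file lists words in descending frequency order, as fastText's published vectors do, and it does not reproduce how `train.sh` or spaCy actually prunes vectors. `top_words` is a hypothetical helper name.

```shell
# Hypothetical helper: print the first N words of a gzipped fastText
# text-format vectors file, skipping the header line.
# Assumes entries are ordered by descending word frequency.
top_words() {
    gzip -dc "$1" | tail -n +2 | head -n "$2" | cut -d ' ' -f 1
}
```

For instance, `top_words cc.ca.300.vec.gz 20000 | tail -n 5` shows the last few words that would fit under a 20000-entry cutoff by file order.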