Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ccoreilly/spacy-catala-generator
Training and dataset used for the catalan spacy model
- Host: GitHub
- URL: https://github.com/ccoreilly/spacy-catala-generator
- Owner: ccoreilly
- Created: 2020-03-08T15:49:50.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2021-03-02T09:51:14.000Z (almost 4 years ago)
- Last Synced: 2024-10-15T00:46:04.914Z (2 months ago)
- Topics: catala, catalan, catalan-language, spacy, spacy-models
- Language: Shell
- Homepage:
- Size: 17.5 MB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Training script and dataset for [spacy-catala](https://github.com/ccoreilly/spacy-catala)
> Note: this repository uses Git LFS for the files under the `train` directory.
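
If the clone was made without Git LFS, the files under `train` may still be small pointer files rather than the real data. The sketch below is one way to check for that; `is_lfs_pointer` and `find_lfs_pointers` are hypothetical helpers, not part of this repository, and they rely only on the fact that a Git LFS pointer file starts with a `version https://git-lfs.github.com/spec/v1` line.

```shell
# Hypothetical helper (not part of this repository): succeed if the given
# file is still a Git LFS pointer rather than the real content.
is_lfs_pointer() {
    # Only the first bytes are needed; a pointer file begins with this line.
    head -c 64 -- "$1" | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Hypothetical helper: list every pointer file under a directory
# (defaults to "train", the directory named in the note above).
find_lfs_pointers() {
    find "${1:-train}" -type f | while IFS= read -r f; do
        if is_lfs_pointer "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

If `find_lfs_pointers train` prints any paths, running `git lfs install` followed by `git lfs pull` should replace the pointers with the actual files.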
```
$ ./train.sh
Usage: train.sh --nvectors [vector_size]
E.g.: train.sh ca_fasttext_md 1.2.0 cc.ca.300.vec.gz train dev --nvectors 50000
Train a spacy model.
```
`vectors_path` expects gzipped text vectors. These are not included; you can download them with:
```
$ curl -L -o cc.ca.300.vec.gz https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ca.300.vec.gz
```
To recreate the large model in [spacy-catala](https://github.com/ccoreilly/spacy-catala), run:
```
$ ./train.sh ca_fasttext_wiki_lg 1.0.0 cc.ca.300.vec.gz train dev
```
The medium-sized model was pruned to the 20,000 most common vectors using:
```
$ ./train.sh ca_fasttext_wiki_md 1.0.0 cc.ca.300.vec.gz train dev --nvectors 20000
```
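
Before choosing a value for `--nvectors`, it can help to see how many vectors the downloaded file actually contains. The sketch below assumes the standard fastText text format, in which the first line holds the word count and the dimension separated by a space; `vec_header` and `has_at_least_vectors` are hypothetical helpers and are not part of `train.sh`.

```shell
# Hypothetical helper (not part of train.sh): print the header line of a
# gzipped fastText .vec file, which has the form "<word_count> <dimension>".
vec_header() {
    gzip -dc -- "$1" | head -n 1
}

# Hypothetical check: succeed only if the file declares at least $2 vectors,
# i.e. pruning to that many vectors is possible.
has_at_least_vectors() {
    [ "$(vec_header "$1" | cut -d' ' -f1)" -ge "$2" ]
}
```

For example, `vec_header cc.ca.300.vec.gz` prints the declared count and dimension, and `has_at_least_vectors cc.ca.300.vec.gz 20000` can guard a call to `./train.sh ... --nvectors 20000`.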