https://github.com/multivacplatform/multivac-ml
Pre-trained ML models for Apache Spark
machine-learning nlp spark spark-ml
- Host: GitHub
- URL: https://github.com/multivacplatform/multivac-ml
- Owner: multivacplatform
- License: mit
- Created: 2018-11-15T19:35:45.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-02-23T15:13:45.000Z (over 7 years ago)
- Last Synced: 2025-09-05T05:50:12.366Z (10 months ago)
- Topics: machine-learning, nlp, spark, spark-ml
- Language: Scala
- Homepage: https://multivac.iscpif.fr
- Size: 947 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# multivac-ml
[License](https://github.com/multivacplatform/multivac-ml/blob/master/LICENSE) [Build status](https://travis-ci.org/multivacplatform/multivac-ml) [Discourse](https://discourse.iscpif.fr/c/multivac) [Chat](https://chat.iscpif.fr/channel/multivac) [Codacy](https://app.codacy.com/app/maziyarpanahi/multivac-ml?utm_source=github.com&utm_medium=referral&utm_content=multivacplatform/multivac-ml&utm_campaign=Badge_Grade_Dashboard)
Pre-trained Apache Spark ML pipelines for NLP, classification, and other tasks.
## Project Structure
- [models](models): offline ML models (for download)
  - [models/word2vec](models/word2vec): Word2Vec model
  - [models/nlp](models/nlp): part-of-speech models
- [demo](demo): demo project
## Facts and Figures
### POS Tagger models
**English POS tagger model (UD_English-EWT)**
Only the `en_ewt-ud-train.conllu` file was used to train the model.
Precision, Recall and F1-Score against the test dataset `en_ewt-ud-test.conllu`
|Tokens |Precision |Recall |F1-Score |
|-------|-----------|-------|---------|
| 25831 |0.93 |0.91 |0.92 |
Precision, Recall and F1-Score against the training dataset `en_ewt-ud-train.conllu`
|Tokens |Precision |Recall |F1-Score |
|-------|-----------|-------|---------|
| 63785 |0.98 |0.98 |0.98 |
> **Precision** is "how useful the POS results are", and **Recall** is "how complete the results are". Precision can be seen as a measure of **exactness or quality**, whereas recall is a measure of **completeness or quantity**. https://en.wikipedia.org/wiki/Precision_and_recall
> The **F1 score** is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. https://en.wikipedia.org/wiki/F1_score
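
To make the relationship between the three columns concrete, here is a minimal Scala sketch of the standard precision, recall, and F1 formulas. `PosMetrics` and its methods are hypothetical names introduced for illustration; they are not part of multivac-ml, and this is not necessarily how the figures above were computed or averaged across tags.

```scala
// Hypothetical helper for illustration only; not part of multivac-ml.
object PosMetrics {
  /** Precision = TP / (TP + FP); returns 0.0 when nothing was predicted. */
  def precision(truePos: Long, falsePos: Long): Double =
    if (truePos + falsePos == 0) 0.0
    else truePos.toDouble / (truePos + falsePos)

  /** Recall = TP / (TP + FN); returns 0.0 when there are no gold labels. */
  def recall(truePos: Long, falseNeg: Long): Double =
    if (truePos + falseNeg == 0) 0.0
    else truePos.toDouble / (truePos + falseNeg)

  /** F1 = harmonic mean of precision and recall; 0.0 when both are 0. */
  def f1(p: Double, r: Double): Double =
    if (p + r == 0.0) 0.0 else 2.0 * p * r / (p + r)

  /** Rounds to two decimals, the precision used in the tables above. */
  def round2(x: Double): Double = math.round(x * 100.0) / 100.0
}
```

Plugging in the reported values, `PosMetrics.round2(PosMetrics.f1(0.93, 0.91))` gives `0.92` for the test set and `PosMetrics.round2(PosMetrics.f1(0.98, 0.98))` gives `0.98` for the training set, consistent with the F1-Score columns.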



[Read more on evaluation of the models](models/nlp)
## Open Data
**Multivac ML data**: [https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSWU7K](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSWU7K)
**Multivac Open Data**: [https://dataverse.harvard.edu/dataverse/multivac](https://dataverse.harvard.edu/dataverse/multivac)
## Dataset Citation
> Panahi, Maziyar; Chavalarias, David, 2018, "Multivac Machine Learning Models", https://doi.org/10.7910/DVN/WSWU7K, Harvard Dataverse, V2
## Code of Conduct
This project, like all github.com/multivacplatform projects, is governed by the [Multivac Platform Open Source Code of Conduct](https://github.com/multivacplatform/code-of-conduct/blob/master/code-of-conduct.md). Additionally, see the [Typelevel Code of Conduct](http://typelevel.org/conduct) for specific examples of harassing behavior that are not tolerated.
## Copyright and License
Code and documentation copyright (c) 2018-2019 [ISCPIF - CNRS](http://iscpif.fr). Code released under the [MIT license](https://github.com/multivacplatform/multivac-ml/blob/master/LICENSE).