https://github.com/multivacplatform/multivac-ml
Pre-trained ML models for Apache Spark
- Host: GitHub
- URL: https://github.com/multivacplatform/multivac-ml
- Owner: multivacplatform
- License: mit
- Created: 2018-11-15T19:35:45.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2019-02-23T15:13:45.000Z (almost 6 years ago)
- Last Synced: 2024-11-13T08:37:23.828Z (3 months ago)
- Topics: machine-learning, nlp, spark, spark-ml
- Language: Scala
- Homepage: https://multivac.iscpif.fr
- Size: 947 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# multivac-ml [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/multivacplatform/multivac-ml/blob/master/LICENSE) [![Build Status](https://travis-ci.org/multivacplatform/multivac-ml.svg?branch=master)](https://travis-ci.org/multivacplatform/multivac-ml) [![multivac discuss](https://img.shields.io/badge/multivac-discuss-ff69b4.svg)](https://discourse.iscpif.fr/c/multivac) [![multivac channel](https://img.shields.io/badge/multivac-chat-ff69b4.svg)](https://chat.iscpif.fr/channel/multivac) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/0df6364b08e84dadadf83e1bc902a58b)](https://app.codacy.com/app/maziyarpanahi/multivac-ml?utm_source=github.com&utm_medium=referral&utm_content=multivacplatform/multivac-ml&utm_campaign=Badge_Grade_Dashboard)
Pre-trained Apache Spark ML Pipelines for NLP, Classification, etc.

## Project Structure
- [models](models) Offline ML Models (for downloads)
- [models/word2vec](models/word2vec) (Word2Vec Model)
- [models/nlp](models/nlp) (Part of Speech Models)
- [demo](demo) Demo project

## Facts and Figures
### POS Tagger models

**English POS tagger model (UD_English-EWT)**

Only the `en_ewt-ud-train.conllu` file was used to train the model.

Precision, Recall and F1-Score against the test dataset `en_ewt-ud-test.conllu`:

|Tokens |Precision |Recall |F1-Score |
|-------|-----------|-------|---------|
| 25831 |0.93 |0.91 |0.92 |

Precision, Recall and F1-Score against the training dataset `en_ewt-ud-train.conllu`:

|Tokens |Precision |Recall |F1-Score |
|-------|-----------|-------|---------|
| 63785 |0.98 |0.98 |0.98 |

> **Precision** is "how useful the POS results are", and **Recall** is "how complete the results are". Precision can be seen as a measure of **exactness or quality**, whereas recall is a measure of **completeness or quantity**. https://en.wikipedia.org/wiki/Precision_and_recall
>
> The **F1 score** is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. https://en.wikipedia.org/wiki/F1_score
![Precision](https://wikimedia.org/api/rest_v1/media/math/render/svg/26106935459abe7c266f7b1ebfa2a824b334c807)
![Recall](https://wikimedia.org/api/rest_v1/media/math/render/svg/4c233366865312bc99c832d1475e152c5074891b)
![F1 Score](https://wikimedia.org/api/rest_v1/media/math/render/svg/057ffc6b4fa80dc1c0e1f2f1f6b598c38cdd7c23)
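
The definitions above can be sketched in a few lines of plain Scala. This is a minimal illustration, not the evaluation code used in this repository: the object name `PosMetrics` and its method names are hypothetical, and the README does not say whether counts were tallied per tag or aggregated over all tags. The `main` method plugs the reported test-set precision and recall into the F1 formula as a sanity check.

```scala
// Illustrative sketch only: names are hypothetical, not part of multivac-ml.
object PosMetrics {
  /** Precision = TP / (TP + FP); defined as 0.0 when nothing was predicted. */
  def precision(truePos: Long, falsePos: Long): Double =
    if (truePos + falsePos == 0) 0.0 else truePos.toDouble / (truePos + falsePos)

  /** Recall = TP / (TP + FN); defined as 0.0 when there is nothing to find. */
  def recall(truePos: Long, falseNeg: Long): Double =
    if (truePos + falseNeg == 0) 0.0 else truePos.toDouble / (truePos + falseNeg)

  /** F1 = harmonic mean of precision and recall; 0.0 when both are zero. */
  def f1(p: Double, r: Double): Double =
    if (p + r == 0.0) 0.0 else 2 * p * r / (p + r)

  def main(args: Array[String]): Unit = {
    // Precision and recall as reported for the test dataset in the table above.
    val score = f1(0.93, 0.91)
    println(s"F1 from reported test-set precision and recall: $score")
  }
}
```

With the rounded test-set values (0.93 and 0.91), the harmonic mean comes out at roughly 0.92, which agrees with the F1-Score column in the table.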
[Read more on evaluation of the models](models/nlp)
## Open Data
**Multivac ML data**: [https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSWU7K](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSWU7K)

**Multivac Open Data**: [https://dataverse.harvard.edu/dataverse/multivac](https://dataverse.harvard.edu/dataverse/multivac)
## Dataset Citation
> Panahi, Maziyar; Chavalarias, David, 2018, "Multivac Machine Learning Models", https://doi.org/10.7910/DVN/WSWU7K, Harvard Dataverse, V2

## Code of Conduct
This, and all github.com/multivacplatform projects, are under the [Multivac Platform Open Source Code of Conduct](https://github.com/multivacplatform/code-of-conduct/blob/master/code-of-conduct.md). Additionally, see the [Typelevel Code of Conduct](http://typelevel.org/conduct) for specific examples of harassing behavior that are not tolerated.

## Copyright and License
Code and documentation copyright (c) 2018-2019 [ISCPIF - CNRS](http://iscpif.fr). Code released under the [MIT license](https://github.com/multivacplatform/multivac-ml/blob/master/LICENSE).