https://github.com/multivacplatform/multivac-ml

Pre-trained ML models for Apache Spark
Host: GitHub
URL: https://github.com/multivacplatform/multivac-ml
Owner: multivacplatform
License: mit
Created: 2018-11-15T19:35:45.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2019-02-23T15:13:45.000Z (almost 6 years ago)
Last Synced: 2024-11-13T08:37:23.828Z (3 months ago)
Topics: machine-learning, nlp, spark, spark-ml
Language: Scala
Homepage: https://multivac.iscpif.fr
Size: 947 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# multivac-ml [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/multivacplatform/multivac-ml/blob/master/LICENSE) [![Build Status](https://travis-ci.org/multivacplatform/multivac-ml.svg?branch=master)](https://travis-ci.org/multivacplatform/multivac-ml) [![multivac discuss](https://img.shields.io/badge/multivac-discuss-ff69b4.svg)](https://discourse.iscpif.fr/c/multivac) [![multivac channel](https://img.shields.io/badge/multivac-chat-ff69b4.svg)](https://chat.iscpif.fr/channel/multivac) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/0df6364b08e84dadadf83e1bc902a58b)](https://app.codacy.com/app/maziyarpanahi/multivac-ml?utm_source=github.com&utm_medium=referral&utm_content=multivacplatform/multivac-ml&utm_campaign=Badge_Grade_Dashboard)

Pre-trained Apache Spark ML pipelines for NLP, classification, and related tasks.

## Project Structure

-   [models](models): offline ML models (for download)

    -   [models/word2vec](models/word2vec): Word2Vec model

    -   [models/nlp](models/nlp): part-of-speech models

-   [demo](demo): demo project

## Facts and Figures

### POS Tagger models

**English POS tagger model (UD_English-EWT)**

Only the `en_ewt-ud-train.conllu` file was used to train the model.

Precision, Recall and F1-Score against the test dataset `en_ewt-ud-test.conllu`

|Tokens |Precision  |Recall |F1-Score |
|-------|-----------|-------|---------|
| 25831 |0.93       |0.91   |0.92     |

Precision, Recall and F1-Score against the training dataset `en_ewt-ud-train.conllu`

|Tokens |Precision  |Recall |F1-Score |
|-------|-----------|-------|---------|
| 63785 |0.98       |0.98   |0.98     |

> **Precision** is "how useful the POS results are", and **Recall** is "how complete the results are". Precision can be seen as a measure of **exactness or quality**, whereas recall is a measure of **completeness or quantity**. https://en.wikipedia.org/wiki/Precision_and_recall

> The **F1 score** is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. https://en.wikipedia.org/wiki/F1_score

![Precision](https://wikimedia.org/api/rest_v1/media/math/render/svg/26106935459abe7c266f7b1ebfa2a824b334c807)

![Recall](https://wikimedia.org/api/rest_v1/media/math/render/svg/4c233366865312bc99c832d1475e152c5074891b)

![F1 Score](https://wikimedia.org/api/rest_v1/media/math/render/svg/057ffc6b4fa80dc1c0e1f2f1f6b598c38cdd7c23)
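The formulas above can be sketched in a few lines of Scala. This is a minimal illustration, not the evaluation code used by this project: the object name `PosMetrics` and the helper signatures are assumptions made for the example, and the README does not state how per-tag scores were aggregated into the single figures in the tables. The `main` method only recomputes the F1-Score column from the precision and recall values reported above.

```scala
// Illustrative sketch only: PosMetrics is a hypothetical helper,
// not part of multivac-ml.
object PosMetrics {

  // Precision = TP / (TP + FP): the share of predicted tags that are correct.
  def precision(truePositives: Long, falsePositives: Long): Double = {
    val predicted = truePositives + falsePositives
    if (predicted == 0L) 0.0 else truePositives.toDouble / predicted
  }

  // Recall = TP / (TP + FN): the share of gold tags that were recovered.
  def recall(truePositives: Long, falseNegatives: Long): Double = {
    val relevant = truePositives + falseNegatives
    if (relevant == 0L) 0.0 else truePositives.toDouble / relevant
  }

  // F1 = 2 * P * R / (P + R): the harmonic mean of precision and recall.
  def f1(p: Double, r: Double): Double =
    if (p + r == 0.0) 0.0 else 2.0 * p * r / (p + r)

  def main(args: Array[String]): Unit = {
    // Precision and recall taken from the tables above.
    val testF1  = f1(0.93, 0.91) // test set row
    val trainF1 = f1(0.98, 0.98) // training set row
    println(s"test F1 = $testF1, train F1 = $trainF1")
  }
}
```

Rounded to two decimals, the recomputed values agree with the F1-Score column of both tables (0.92 for the test set, 0.98 for the training set).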

[Read more on evaluation of the models](models/nlp)

## Open Data

**Multivac ML data**: [https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSWU7K](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSWU7K)

**Multivac Open Data**: [https://dataverse.harvard.edu/dataverse/multivac](https://dataverse.harvard.edu/dataverse/multivac)

## Dataset Citation

> Panahi, Maziyar; Chavalarias, David, 2018, "Multivac Machine Learning Models", https://doi.org/10.7910/DVN/WSWU7K, Harvard Dataverse, V2

## Code of Conduct

This project, like all github.com/multivacplatform projects, is governed by the [Multivac Platform Open Source Code of Conduct](https://github.com/multivacplatform/code-of-conduct/blob/master/code-of-conduct.md). Additionally, see the [Typelevel Code of Conduct](http://typelevel.org/conduct) for specific examples of harassing behavior that is not tolerated.

## Copyright and License

Code and documentation copyright (c) 2018-2019 [ISCPIF - CNRS](http://iscpif.fr). Code released under the [MIT license](https://github.com/multivacplatform/multivac-ml/blob/master/LICENSE).