https://github.com/bees4ever/seaqube

Semantic Quality Benchmark for Word Embeddings, i.e. Natural Language Models in Python. Acronym `SeaQuBe` or `seaqube`.
https://github.com/bees4ever/seaqube

augmentation benchmark fasttext gensim nlp spacy spacy-nlp wordembeddings

Last synced: 6 months ago
JSON representation

Semantic Quality Benchmark for Word Embeddings, i.e. Natural Language Models in Python. Acronym `SeaQuBe` or `seaqube`.

Host: GitHub
URL: https://github.com/bees4ever/seaqube
Owner: bees4ever
License: mit
Created: 2020-09-17T12:58:07.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2021-01-28T19:08:08.000Z (over 4 years ago)
Last Synced: 2025-04-20T00:59:16.985Z (6 months ago)
Topics: augmentation, benchmark, fasttext, gensim, nlp, spacy, spacy-nlp, wordembeddings
Language: Python
Homepage:
Size: 5.89 MB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          


    


    






# SeaQuBe

Semantic Quality Benchmark for Word Embeddings, i.e. Natural Language Models in Python. Acronym `SeaQuBe` or `seaqube`.

This python framework provides several text augmentation implementations and word embedding quality evaluation methods. It is designed to fit in your machine learning pipeline. The `BaseAugmentation` class provides the same api as the python package [nlpaug](https://github.com/makcedward/nlpaug/), so that this packages can used together smoothly. However `BaseAugmentation` provides also other methods. Detailed examples see beneath.

`SeaQuBe` provides also a toolkit to wrap a trained nlp model to a nice interactive tool.

 [![PyPI version](https://badge.fury.io/py/seaqube.svg)](https://badge.fury.io/py/seaqube)

## Features

*  Text Data Augmentation

*  Chaining and Reducing of Text Data Augmentations

*  Word Embedding Quality Methods

*  Interactive NLM Model Wrapper

## Demo

*   [Augmentation in three lines](https://github.com/bees4ever/SeaQuBe#quick-demo)

*   [Example of Basic Text Augmentation](https://github.com/bees4ever/SeaQuBe/blob/master/examples/basic_augmentation.ipynb)

*   [Example of Text Augmentation Chaining](https://github.com/bees4ever/SeaQuBe/blob/master/examples/chained_augmentation.ipynb)

*   [Example of Word Embedding Evaluation](https://github.com/bees4ever/SeaQuBe/blob/master/examples/word_embedding_evaluation.ipynb)

*   [Example of Interactive NLP](https://github.com/bees4ever/SeaQuBe/blob/master/examples/nlp.ipynb)

## Augmentation

| Level  | Augmenter  | Description |

|:---:|:---:|:---:|

| Character | QwertyAugmentation | Simulate keyboard distance error |

| Corpus | UnigramAugmentation | Replace ubiquitous words with other ubiquitous words |

| Word | Active2PassiveAugmentation | Change surface of document using an simple active-to-passive transformer |

| Word | EDAAugmentation | Augment document using the [EDA](https://github.com/jasonwei20/eda_nlp) algorithm |

| Word | EmbeddingAugmentation | Replace similar word using [WordNet](https://wordnet.princeton.edu/) |

| Word | TranslationAugmentation | Change surface of document using translation and back-translation (with [GoogleTranslate](https://translate.google.com/))|

## Augmentation Chainer

The streaming feature of augmentation is implemented in the ``AugmentationStreamer`` class. One `Reduceing` class exist, more can implemented

extending the ``BaseReduction`` class.  

| Action  | Class  | Description |

|:---:|:---:|:---:|

|Streaming|AugmentationStreamer| Run augmentation for each document through all chained augmentations.  |

|Reducing| UniqueCorpusReduction | Getting a list of documents, only unique documents are returned.  

## Word Embedding Evaluation

| Method  | Description |

|:---:|:---:|

|WordAnalogyBenchmark|This method benchmark how go relations of the type: `a is to b as c is to d` can be solved correctly.|

|WordSimilarityBenchmark|This methods compares the similarity of a word pair, calculated by a model with a human estimated similarity score.|

|WordOutliersBenchmark|This method benchmark how good a outlier of a group of words can be detected.|

|SemanticWordnetBenchmark|Based on the WordNet graph, the goodnes of the semantic / similarity of a nlp model is benchmarked.|

## Installation

`SeaQuBe` can be installed from PyPip using: `pip install seaqube` or run in the main directory: `python setup.py install`.

### External Dependencies

Some external dependencies are not installed automatically, but `seaqube` or `nltk` might throw errors with an instruction what to do.

For example ``seqube`` might ask you to run:

````bash 

python -c "from seaqube import download;download('vec4ir')"

````

## Quick Demo

````python

from seaqube.augmentation.word import Active2PassiveAugmentation, EDAAugmentation, TranslationAugmentation, EmbeddingAugmentation

translate = TranslationAugmentation(max_length=2)

translate.doc_augment(['This', 'is', 'a', 'tokenized', 'corpus'])

````

## Setup Dev Environment

_TODO_

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/bees4ever/seaqube

Awesome Lists containing this project

README