https://github.com/x-tabdeveloping/scikit-embeddings
Tokenization, streaming and embedding components for scikit-learn pipelines.
- Host: GitHub
- URL: https://github.com/x-tabdeveloping/scikit-embeddings
- Owner: x-tabdeveloping
- License: MIT
- Created: 2023-08-13T13:14:44.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-30T15:28:55.000Z (over 2 years ago)
- Last Synced: 2025-02-08T17:14:39.506Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 111 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# scikit-embeddings
Utilities for training, storing, and using word and document embeddings in scikit-learn pipelines.
## WARNING: DO NOT USE THIS REPO FOR ANYTHING SERIOUS
This was a stupid experiment, and I will almost definitely phase it out in favour of [yasep](https://github.com/x-tabdeveloping/yasep). Please do not rely on this repo for your projects.
Love, Marton <3
## Features
- Train Word and Paragraph embeddings in scikit-learn compatible pipelines.
- Fast and performant trainable tokenizer components from `tokenizers`.
- Components and pipelines that are easy to integrate into your scikit-learn workflows and machine learning pipelines.
- Easy serialization and integration with the Hugging Face Hub for quickly publishing your embedding pipelines.
### What scikit-embeddings is not for:
- Training transformer models and deep neural language models (if you want to do this, do it with [transformers](https://huggingface.co/docs/transformers/index))
- Using pretrained sentence transformers (use [embetter](https://github.com/koaning/embetter))
## Installation
You can easily install scikit-embeddings from PyPI:
```bash
pip install scikit-embeddings
```
If you want to use GloVe embedding models, install it along with glovpy:
```bash
pip install scikit-embeddings[glove]
```
## Example Pipelines
You can use scikit-embeddings with many different pipeline architectures; I will list a few here:
### Word Embeddings
You can train classic vanilla word embeddings by building a pipeline that contains a `WordLevel` tokenizer and an embedding model:
```python
from skembeddings.tokenizers import WordLevelTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    WordLevelTokenizer(),
    Word2VecEmbedding(n_components=100, algorithm="cbow"),
)
embedding_pipe.fit(texts)  # texts: an iterable of raw documents
```
### fastText-like
You can train an embedding pipeline that uses subword information by choosing a tokenizer that produces subwords.
You may want to use `Unigram`, `BPE` or `WordPiece` for these purposes.
fastText also uses skip-gram by default, so let's switch to that.
```python
from skembeddings.tokenizers import UnigramTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    UnigramTokenizer(),
    Word2VecEmbedding(n_components=250, algorithm="sg"),
)
embedding_pipe.fit(texts)
```
### Paragraph Embeddings
You can train Doc2Vec paragraph embeddings with your choice of tokenization.
```python
from skembeddings.tokenizers import WordPieceTokenizer
from skembeddings.models import ParagraphEmbedding
from skembeddings.pipeline import EmbeddingPipeline, PretrainedPipeline

embedding_pipe = EmbeddingPipeline(
    WordPieceTokenizer(),
    ParagraphEmbedding(n_components=250, algorithm="dm"),
)
embedding_pipe.fit(texts)
```
## Serialization
Pipelines can be safely serialized to disk:
```python
embedding_pipe.to_disk("output_folder/")
pretrained = PretrainedPipeline("output_folder/")
```
Or published to the Hugging Face Hub:
```python
from huggingface_hub import login
login()
embedding_pipe.to_hub("username/name_of_pipeline")
pretrained = PretrainedPipeline("username/name_of_pipeline")
```
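In either case, the loaded pipeline can then be used like any scikit-learn transformer. A minimal sketch, assuming `pretrained` exposes the standard `transform()` method (which the classification example below relies on) and returns one embedding vector per document:
```python
# Usage sketch (assumption): embed a few documents with the loaded pipeline.
# `transform()` is expected to return a 2D array with one row per input text.
docs = [
    "The first document to embed.",
    "Another short piece of text.",
]
doc_embeddings = pretrained.transform(docs)
print(doc_embeddings.shape)  # e.g. (2, n_components)
```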
## Text Classification
You can include an embedding pipeline in your classification pipelines by adding a classification head on top.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# X: list of raw documents, y: their labels
X_train, X_test, y_train, y_test = train_test_split(X, y)
cls_pipe = make_pipeline(pretrained, LogisticRegression())
cls_pipe.fit(X_train, y_train)
y_pred = cls_pipe.predict(X_test)
print(classification_report(y_test, y_pred))
```
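For a self-contained run, any text classification dataset will do; a sketch using scikit-learn's 20 Newsgroups loader (an illustrative choice, not part of the original example) to supply `X` and `y`:
```python
from sklearn.datasets import fetch_20newsgroups

# Illustrative data source (assumption): raw newsgroup posts as X,
# their topic labels as y. Any list of strings plus labels works.
bunch = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X, y = bunch.data, bunch.target
```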