https://github.com/x-tabdeveloping/scikit-embeddings
Tokenization, streaming and embedding components for scikit-learn pipelines.
- Host: GitHub
- URL: https://github.com/x-tabdeveloping/scikit-embeddings
- Owner: x-tabdeveloping
- License: MIT
- Created: 2023-08-13T13:14:44.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-30T15:28:55.000Z (over 2 years ago)
- Last Synced: 2025-02-08T17:14:39.506Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 111 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# scikit-embeddings
Utilities for training, storing, and using word and document embeddings in scikit-learn pipelines.
## WARNING: DO NOT USE THIS REPO FOR ANYTHING SERIOUS
This was a stupid experiment, and I will almost definitely phase it out in favour of [yasep](https://github.com/x-tabdeveloping/yasep). Please do not rely on this repo for your projects.
Love, Marton <3
## Features
- Train Word and Paragraph embeddings in scikit-learn compatible pipelines.
- Fast and performant trainable tokenizer components from `tokenizers`.
- Components and pipelines that are easy to integrate into your scikit-learn workflows and machine learning pipelines.
- Easy serialization and integration with the Hugging Face Hub for quickly publishing your embedding pipelines.
### What scikit-embeddings is not for:
- Training transformer models and deep neural language models (if you want to do this, do it with [transformers](https://huggingface.co/docs/transformers/index))
- Using pretrained sentence transformers (use [embetter](https://github.com/koaning/embetter))
## Installation
You can easily install scikit-embeddings from PyPI:
```bash
pip install scikit-embeddings
```
If you want to use GloVe embedding models, install it along with glovpy:
```bash
pip install scikit-embeddings[glove]
```
## Example Pipelines
You can use scikit-embeddings with many different pipeline architectures; I will list a few here:
### Word Embeddings
You can train classic vanilla word embeddings by building a pipeline that contains a `WordLevel` tokenizer and an embedding model:
```python
from skembeddings.tokenizers import WordLevelTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    WordLevelTokenizer(),
    Word2VecEmbedding(n_components=100, algorithm="cbow"),
)
embedding_pipe.fit(texts)  # texts: an iterable of raw documents
```
### fastText-like
You can train an embedding pipeline that uses subword information by choosing a tokenizer that produces subwords.
You may want to use `Unigram`, `BPE` or `WordPiece` for these purposes.
fastText also uses skip-gram by default, so let's switch to that.
```python
from skembeddings.tokenizers import UnigramTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    UnigramTokenizer(),
    Word2VecEmbedding(n_components=250, algorithm="sg"),
)
embedding_pipe.fit(texts)
```
### Paragraph Embeddings
You can train Doc2Vec paragraph embeddings with your choice of tokenization.
```python
from skembeddings.tokenizers import WordPieceTokenizer
from skembeddings.models import ParagraphEmbedding
from skembeddings.pipeline import EmbeddingPipeline, PretrainedPipeline

embedding_pipe = EmbeddingPipeline(
    WordPieceTokenizer(),
    ParagraphEmbedding(n_components=250, algorithm="dm"),
)
embedding_pipe.fit(texts)
```
## Serialization
Pipelines can be safely serialized to disk:
```python
embedding_pipe.to_disk("output_folder/")
pretrained = PretrainedPipeline("output_folder/")
```
Or published to the Hugging Face Hub:
```python
from huggingface_hub import login
login()
embedding_pipe.to_hub("username/name_of_pipeline")
pretrained = PretrainedPipeline("username/name_of_pipeline")
```
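In either case, the loaded pipeline can then be used like any scikit-learn transformer. A minimal sketch, assuming `pretrained` exposes the standard `transform()` method (which the classification example below relies on) and returns one embedding vector per document:
```python
# Usage sketch (assumption): embed a few documents with the loaded pipeline.
# `transform()` is expected to return a 2D array with one row per input text.
docs = [
    "The first document to embed.",
    "Another short piece of text.",
]
doc_embeddings = pretrained.transform(docs)
print(doc_embeddings.shape)  # e.g. (2, n_components)
```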
## Text Classification
You can include an embedding pipeline in your classification pipelines by adding a classification head on top.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# X: list of raw documents, y: their labels
X_train, X_test, y_train, y_test = train_test_split(X, y)
cls_pipe = make_pipeline(pretrained, LogisticRegression())
cls_pipe.fit(X_train, y_train)
y_pred = cls_pipe.predict(X_test)
print(classification_report(y_test, y_pred))
```
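For a self-contained run, any text classification dataset will do; a sketch using scikit-learn's 20 Newsgroups loader (an illustrative choice, not part of the original example) to supply `X` and `y`:
```python
from sklearn.datasets import fetch_20newsgroups

# Illustrative data source (assumption): raw newsgroup posts as X,
# their topic labels as y. Any list of strings plus labels works.
bunch = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X, y = bunch.data, bunch.target
```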