{"id":15913103,"url":"https://github.com/x-tabdeveloping/scikit-embeddings","last_synced_at":"2025-04-03T03:16:32.991Z","repository":{"id":188057613,"uuid":"678032924","full_name":"x-tabdeveloping/scikit-embeddings","owner":"x-tabdeveloping","description":"Tokenization, streaming and embedding components for scikit-learn pipelines.","archived":false,"fork":false,"pushed_at":"2023-09-30T15:28:55.000Z","size":114,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-08T17:14:39.506Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/x-tabdeveloping.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-13T13:14:44.000Z","updated_at":"2023-08-23T10:32:34.000Z","dependencies_parsed_at":"2024-10-06T16:23:14.061Z","dependency_job_id":null,"html_url":"https://github.com/x-tabdeveloping/scikit-embeddings","commit_stats":null,"previous_names":["x-tabdeveloping/tokendo","x-tabdeveloping/scikit-embeddings"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Fscikit-embeddings","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Fscikit-embeddings/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Fscikit-embeddings/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabd
eveloping%2Fscikit-embeddings/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/x-tabdeveloping","download_url":"https://codeload.github.com/x-tabdeveloping/scikit-embeddings/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246927844,"owners_count":20856198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-06T16:23:07.016Z","updated_at":"2025-04-03T03:16:32.834Z","avatar_url":"https://github.com/x-tabdeveloping.png","language":"Python","readme":"\u003cimg align=\"left\" width=\"82\" height=\"82\" src=\"assets/logo.svg\"\u003e\n\n# scikit-embeddings\n\n\u003cbr\u003e\nUtilities for training, storing and using word and document embeddings in scikit-learn pipelines.\n\n## WARNING: DO NOT USE THIS REPO FOR ANYTHING SERIOUS\nThis was a stupid experiment, and I will almost definitely phase it out in favour of [yasep](https://github.com/x-tabdeveloping/yasep). 
Please do not rely on this repo for your projects.\n\nLove, Marton \u003c3\n\n## Features\n - Train Word and Paragraph embeddings in scikit-learn compatible pipelines.\n - Fast and performant trainable tokenizer components from `tokenizers`.\n - Easy to integrate components and pipelines into your scikit-learn workflows and machine learning pipelines.\n - Easy serialization and integration with the Hugging Face Hub for quickly publishing your embedding pipelines.\n\n### What scikit-embeddings is not for:\n - Training transformer models and deep neural language models (if you want to do this, do it with [transformers](https://huggingface.co/docs/transformers/index))\n - Using pretrained sentence transformers (use [embetter](https://github.com/koaning/embetter))\n\n## Installation\n\nYou can easily install scikit-embeddings from PyPI:\n\n```bash\npip install scikit-embeddings\n```\n\nIf you want to use GloVe embedding models, install along with glovpy:\n\n```bash\npip install scikit-embeddings[glove]\n```\n\n## Example Pipelines\n\nYou can use scikit-embeddings with many different pipeline architectures; here are a few examples:\n\n### Word Embeddings\n\nYou can train classic vanilla word embeddings by building a pipeline that contains a `WordLevel` tokenizer and an embedding model:\n\n```python\nfrom skembeddings.tokenizers import WordLevelTokenizer\nfrom skembeddings.models import Word2VecEmbedding\nfrom skembeddings.pipeline import EmbeddingPipeline\n\nembedding_pipe = EmbeddingPipeline(\n    WordLevelTokenizer(),\n    Word2VecEmbedding(n_components=100, algorithm=\"cbow\")\n)\nembedding_pipe.fit(texts)\n```\n\n### fastText-like\n\nYou can train an embedding pipeline that uses subword information by choosing a tokenizer that produces subword units, such as `Unigram`, `BPE` or `WordPiece`.\nfastText also uses skip-gram by default, so let's switch to that.\n\n```python\nfrom skembeddings.tokenizers import UnigramTokenizer\nfrom skembeddings.models import 
Word2VecEmbedding\nfrom skembeddings.pipeline import EmbeddingPipeline\n\nembedding_pipe = EmbeddingPipeline(\n    UnigramTokenizer(),\n    Word2VecEmbedding(n_components=250, algorithm=\"sg\")\n)\nembedding_pipe.fit(texts)\n```\n\n### Paragraph Embeddings\n\nYou can train Doc2Vec paragraph embeddings with your tokenizer of choice.\n\n```python\nfrom skembeddings.tokenizers import WordPieceTokenizer\nfrom skembeddings.models import ParagraphEmbedding\nfrom skembeddings.pipeline import EmbeddingPipeline, PretrainedPipeline\n\nembedding_pipe = EmbeddingPipeline(\n    WordPieceTokenizer(),\n    ParagraphEmbedding(n_components=250, algorithm=\"dm\")\n)\nembedding_pipe.fit(texts)\n```\n\n## Serialization\n\nPipelines can be safely serialized to disk:\n\n```python\nembedding_pipe.to_disk(\"output_folder/\")\n\npretrained = PretrainedPipeline(\"output_folder/\")\n```\n\nOr published to the Hugging Face Hub:\n\n```python\nfrom huggingface_hub import login\n\nlogin()\nembedding_pipe.to_hub(\"username/name_of_pipeline\")\n\npretrained = PretrainedPipeline(\"username/name_of_pipeline\")\n```\n\n## Text Classification\n\nYou can include an embedding model in your classification pipelines by adding a classification head.\n\n```python\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import classification_report\nfrom sklearn.pipeline import make_pipeline\n\nX_train, X_test, y_train, y_test = train_test_split(X, y)\n\ncls_pipe = make_pipeline(pretrained, LogisticRegression())\ncls_pipe.fit(X_train, y_train)\n\ny_pred = cls_pipe.predict(X_test)\nprint(classification_report(y_test, 
y_pred))\n```\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Fscikit-embeddings","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx-tabdeveloping%2Fscikit-embeddings","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Fscikit-embeddings/lists"}