{"id":18831652,"url":"https://github.com/riccorl/transformers-embedder","last_synced_at":"2025-12-14T12:29:22.558Z","repository":{"id":37100136,"uuid":"316529291","full_name":"Riccorl/transformers-embedder","owner":"Riccorl","description":"A Word Level Transformer layer based on PyTorch and 🤗 Transformers.","archived":false,"fork":false,"pushed_at":"2024-01-31T04:40:19.000Z","size":927,"stargazers_count":34,"open_issues_count":2,"forks_count":5,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-14T04:16:32.603Z","etag":null,"topics":["allennlp","bert","deep-learning","embeddings","hidden-states","huggingface","huggingface-transformers","language-model","natural-language-processing","nlp","preprocess","pretrained-models","python","pytorch","sentences","tokenizer","transformer","transformer-embedder","transformers","transformers-embedder"],"latest_commit_sha":null,"homepage":"https://riccorl.github.io/transformers-embedder","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Riccorl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-11-27T14:54:14.000Z","updated_at":"2024-10-07T08:43:31.000Z","dependencies_parsed_at":"2023-12-26T10:56:45.131Z","dependency_job_id":null,"html_url":"https://github.com/Riccorl/transformers-embedder","commit_stats":{"total_commits":283,"total_committers":9,"mean_commits":"31.444444444444443","dds":0.3533568904593639,"last_synced_commit":"bacf4c5c89fb0fa6b550b1b60174cf15fd03d875"},"previous_names":["riccorl/transformer-embedder"],"tags_count":83,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Riccorl%2Ftransformers-embedder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Riccorl%2Ftransformers-embedder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Riccorl%2Ftransformers-embedder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Riccorl%2Ftransformers-embedder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Riccorl","download_url":"https://codeload.github.com/Riccorl/transformers-embedder/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248819412,"owners_count":21166477,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["allennlp","bert","deep-learning","embeddings","hidden-states","huggingface","huggingface-transformers","language-model","natural-language-processing","nlp","preprocess","pretrained-models","python","pytorch","sentences","tokenizer","transformer","transformer-embedder","transformers","transformers-embedder"],"created_at":"2024-11-08T01:55:39.149Z","updated_at":"2025-12-14T12:29:22.495Z
","avatar_url":"https://github.com/Riccorl.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n\n# Transformers Embedder\n\n[![Open in Visual Studio Code](https://img.shields.io/badge/preview%20in-vscode.dev-blue)](https://github.dev/Riccorl/transformers-embedder)\n[![PyTorch](https://img.shields.io/badge/PyTorch-orange?logo=pytorch)](https://pytorch.org/)\n[![Transformers](https://img.shields.io/badge/4.34-🤗%20Transformers-6670ff)](https://huggingface.co/transformers/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000)](https://github.com/psf/black)\n\n[![Upload to PyPi](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-pypi.yml/badge.svg)](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-pypi.yml)\n[![Upload to PyPi](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-conda.yml/badge.svg)](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-conda.yml)\n[![PyPi Version](https://img.shields.io/github/v/release/Riccorl/transformers-embedder)](https://github.com/Riccorl/transformers-embedder/releases)\n[![Anaconda-Server Badge](https://anaconda.org/riccorl/transformers-embedder/badges/version.svg)](https://anaconda.org/riccorl/transformers-embedder)\n[![DeepSource](https://deepsource.io/gh/Riccorl/transformers-embedder.svg/?label=active+issues)](https://deepsource.io/gh/Riccorl/transformers-embedder/?ref=repository-badge)\n\n\u003c/div\u003e\n\nA Word Level Transformer layer based on PyTorch and 🤗 Transformers.\n\n## How to use\n\nInstall the library from [PyPI](https://pypi.org/project/transformers-embedder):\n\n```bash\npip install transformers-embedder\n```\n\nor from [Conda](https://anaconda.org/riccorl/transformers-embedder):\n\n```bash\nconda install -c riccorl transformers-embedder\n```\n\nIt offers a PyTorch layer and a tokenizer that support almost every pretrained model from Huggingface \n[🤗Transformers](https://huggingface.co/transformers/) library. Here is a quick example:\n\n```python\nimport transformers_embedder as tre\n\ntokenizer = tre.Tokenizer(\"bert-base-cased\")\n\nmodel = tre.TransformersEmbedder(\n    \"bert-base-cased\", subword_pooling_strategy=\"sparse\", layer_pooling_strategy=\"mean\"\n)\n\nexample = \"This is a sample sentence\"\ninputs = tokenizer(example, return_tensors=True)\n```\n\n```text\n{\n   'input_ids': tensor([[ 101, 1188, 1110, 170, 6876, 5650,  102]]),\n   'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]),\n   'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]])\n   'scatter_offsets': tensor([[0, 1, 2, 3, 4, 5, 6]]),\n   'sparse_offsets': {\n        'sparse_indices': tensor(\n            [\n                [0, 0, 0, 0, 0, 0, 0],\n                [0, 1, 2, 3, 4, 5, 6],\n                [0, 1, 2, 3, 4, 5, 6]\n            ]\n        ), \n        'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]), \n        'sparse_size': torch.Size([1, 7, 7])\n    },\n   'sentence_length': 7  # with special tokens included\n}\n```\n\n```python\noutputs = model(**inputs)\n```\n\n```text\n# outputs.word_embeddings.shape[1:-1]       # remove [CLS] and [SEP]\ntorch.Size([1, 5, 768])\n# len(example)\n5\n```\n\n## Info\n\nOne of the annoyance of using transformer-based models is that it is not trivial to compute word embeddings \nfrom the sub-token embeddings they output. 
## Info

One of the annoyances of using transformer-based models is that it is not trivial to compute word embeddings
from the sub-token embeddings they output. With this API it's as easy as using 🤗 Transformers to get
word-level embeddings from theoretically every transformer model it supports.

### Model

#### Subword Pooling Strategy

The `TransformersEmbedder` class offers three ways to get the embeddings:

- `subword_pooling_strategy="sparse"`: computes the mean of the embeddings of the sub-tokens of each word
  (i.e. the embeddings of the sub-tokens are pooled together) using a sparse matrix multiplication. This is
  the default strategy.
- `subword_pooling_strategy="scatter"`: computes the mean of the embeddings of the sub-tokens of each word
  using a scatter-gather operation. It is not deterministic, but it works with ONNX export.
- `subword_pooling_strategy="none"`: returns the raw output of the transformer model, without sub-token pooling.

Here is a small feature table:

|             |      Pooling       |   Deterministic    |        ONNX        |
|-------------|:------------------:|:------------------:|:------------------:|
| **Sparse**  | :white_check_mark: | :white_check_mark: |        :x:         |
| **Scatter** | :white_check_mark: |        :x:         | :white_check_mark: |
| **None**    |        :x:         | :white_check_mark: | :white_check_mark: |

#### Layer Pooling Strategy

There are also multiple types of output you can get through the `layer_pooling_strategy` parameter:

- `layer_pooling_strategy="last"`: returns the last hidden state of the transformer model
- `layer_pooling_strategy="concat"`: returns the concatenation of the selected `output_layers` of the
  transformer model
- `layer_pooling_strategy="sum"`: returns the sum of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="mean"`: returns the average of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="scalar_mix"`: returns the output of a parameterised scalar mixture layer over the
  selected `output_layers` of the transformer model

If you also want all the outputs from the Hugging Face model, you can set `return_all=True` to get them.

```python
class TransformersEmbedder(torch.nn.Module):
    def __init__(
        self,
        model: Union[str, tr.PreTrainedModel],
        subword_pooling_strategy: str = "sparse",
        layer_pooling_strategy: str = "last",
        output_layers: Tuple[int] = (-4, -3, -2, -1),
        fine_tune: bool = True,
        return_all: bool = True,
    )
```
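For example, an embedder that pools words with the scatter strategy and mixes the last four layers with learned weights could be configured as follows. This is only a sketch built from the constructor parameters shown above; the comment on `fine_tune` is an assumption about its meaning.

```python
import transformers_embedder as tre

model = tre.TransformersEmbedder(
    "bert-base-cased",
    subword_pooling_strategy="scatter",   # word pooling that also works with ONNX export
    layer_pooling_strategy="scalar_mix",  # learned mixture of the layers below
    output_layers=(-4, -3, -2, -1),       # last four hidden layers
    fine_tune=False,                      # presumably keeps the transformer weights frozen
    return_all=True,                      # also return the raw Hugging Face outputs
)
```

As an intuition for the `sparse` subword pooling strategy described above (a toy sketch in plain PyTorch, not the library's actual implementation): word pooling amounts to multiplying a sparse `(words × sub-tokens)` matrix, whose non-zero entries are `1 / number of sub-tokens of the word`, with the sub-token embeddings.

```python
import torch

# Toy example: 3 words, the second of which splits into two sub-tokens.
subtoken_embeddings = torch.randn(4, 768)     # (num_subtokens, hidden_size)

indices = torch.tensor([[0, 1, 1, 2],         # word index of each sub-token
                        [0, 1, 2, 3]])        # sub-token position
values = torch.tensor([1.0, 0.5, 0.5, 1.0])   # 1 / number of sub-tokens per word
pooling = torch.sparse_coo_tensor(indices, values, size=(3, 4))

word_embeddings = torch.sparse.mm(pooling, subtoken_embeddings)  # (num_words, hidden_size)
```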
### Tokenizer

The `Tokenizer` class provides the `tokenize` method to preprocess the input for the `TransformersEmbedder`
layer. You can pass raw sentences, pre-tokenized sentences, and batches of sentences; it preprocesses them
and returns a dictionary with the inputs for the model. By passing `return_tensors=True` it returns the
inputs as `torch.Tensor`.

By default, if you pass text (or a batch) as strings, it uses the Hugging Face tokenizer to tokenize them.

```python
text = "This is a sample sentence"
tokenizer(text)

text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text)
```

You can pass a pre-tokenized sentence (or batch of sentences) by setting `is_split_into_words=True`:

```python
text = ["This", "is", "a", "sample", "sentence"]
tokenizer(text, is_split_into_words=True)

text = [
    ["This", "is", "a", "sample", "sentence", "1"],
    ["This", "is", "sample", "sentence", "2"],
]
tokenizer(text, is_split_into_words=True)
```

#### Examples

First, initialize the tokenizer

```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
```

- You can pass a single sentence as a string:

```python
text = "This is a sample sentence"
tokenizer(text)
```

```text
{
    'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 102]],
    'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]],
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1]],
    'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6]],
    'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [0, 0, 0, 0, 0, 0, 0],
                [0, 1, 2, 3, 4, 5, 6],
                [0, 1, 2, 3, 4, 5, 6]
            ]
        ),
        'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]),
        'sparse_size': torch.Size([1, 7, 7])
    },
    'sentence_lengths': [7],
}
```

- A sentence pair

```python
text = "This is a sample sentence A"
text_pair = "This is a sample sentence B"
tokenizer(text, text_pair)
```

```text
{
    'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 138, 102, 1188, 1110, 170, 6876, 5650, 139, 102]],
    'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]],
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
    'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]],
    'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  0,  0,  0,  0,  0],
                [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
                [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
            ]
        ),
        'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
        'sparse_size': torch.Size([1, 15, 15])
    },
    'sentence_lengths': [15],
}
```

- A batch of sentences or sentence pairs. Using `padding=True` and `return_tensors=True`, the tokenizer
  returns the text ready for the model:

```python
batch = [
    ["This", "is", "a", "sample", "sentence", "1"],
    ["This", "is", "sample", "sentence", "2"],
    ["This", "is", "a", "sample", "sentence", "3"],
    # ...
    ["This", "is", "a", "sample", "sentence", "n", "for", "batch"],
]
tokenizer(batch, padding=True, return_tensors=True)

batch_pair = [
    ["This", "is", "a", "sample", "sentence", "pair", "1"],
    ["This", "is", "sample", "sentence", "pair", "2"],
    ["This", "is", "a", "sample", "sentence", "pair", "3"],
    # ...
    ["This", "is", "a", "sample", "sentence", "pair", "n", "for", "batch"],
]
tokenizer(batch, batch_pair, padding=True, return_tensors=True)
```
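Feeding such a padded batch to the model then mirrors the quick example at the top of this README. The following is only a sketch: `is_split_into_words=True` is added because the input is pre-tokenized, and the comment on the output shape is an assumption.

```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
model = tre.TransformersEmbedder("bert-base-cased", subword_pooling_strategy="sparse")

batch = [
    ["This", "is", "a", "sample", "sentence"],
    ["This", "is", "another", "one"],
]
inputs = tokenizer(batch, is_split_into_words=True, padding=True, return_tensors=True)
outputs = model(**inputs)

# one vector per word (special tokens included), padded to the longest sentence
print(outputs.word_embeddings.shape)
```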
#### Custom fields

It is possible to add custom fields to the model input and tell the `tokenizer` how to pad them using
`add_padding_ops`. Start by initializing the tokenizer with the model name:

```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
```

Then add the custom fields to it:

```python
custom_fields = {
    "custom_filed_1": [
        [0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
    ]
}
```

Now we can add the padding logic for our custom field `custom_filed_1`. The `add_padding_ops` method takes
as input:

- `key`: name of the field in the tokenizer input
- `value`: value to use for padding
- `length`: length to pad to. It can be an `int`, or one of two string values: `subword`, where the element
  is padded to match the length of the sub-words, and `word`, where the element is padded relative to the
  length of the batch after the sub-words have been merged.

```python
tokenizer.add_padding_ops("custom_filed_1", 0, "word")
```

Finally, we can tokenize the input with the custom field:

```python
text = [
    "This is a sample sentence",
    "This is another example sentence just make it longer, with a comma too!"
]

tokenizer(text, padding=True, return_tensors=True, additional_inputs=custom_fields)
```

The inputs are ready for the model, custom field included.

```text
>>> inputs

{
    'input_ids': tensor(
        [
            [ 101, 1188, 1110, 170, 6876, 5650, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [ 101, 1188, 1110, 1330, 1859, 5650, 1198, 1294, 1122, 2039, 117, 1114, 170, 3254, 1918, 1315, 106, 102]
        ]
    ),
    'token_type_ids': tensor(
        [
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        ]
    ),
    'attention_mask': tensor(
        [
            [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
        ]
    ),
    'scatter_offsets': tensor(
        [
            [ 0, 1, 2, 3, 4, 5, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
            [ 0, 1, 2, 3, 4, 5, 6,  7,  8,  9, 10, 11, 12, 13, 13, 14, 15, 16]
        ]
    ),
    'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  1,  1,  1,  1,  1,  1,  1,  1],
                [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16],
                [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
            ]
        ),
        'sparse_values': tensor(
            [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
             1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
             1.0000, 1.0000, 0.5000, 0.5000, 1.0000, 1.0000, 1.0000]
        ),
        'sparse_size': torch.Size([2, 17, 18])
    },
    'sentence_lengths': [7, 17],
}
```

## Acknowledgements

Some code in the `TransformersEmbedder` class is taken from the [PyTorch Scatter](https://github.com/rusty1s/pytorch_scatter/)
library. The pretrained models and the core of the tokenizer are from [🤗 Transformers](https://huggingface.co/transformers/).