{"id":15646302,"url":"https://github.com/kaleidophon/token2index","last_synced_at":"2025-04-07T13:07:53.656Z","repository":{"id":56632920,"uuid":"269975463","full_name":"Kaleidophon/token2index","owner":"Kaleidophon","description":"A lightweight but powerful library to build token indices for NLP tasks, compatible with major Deep Learning frameworks like PyTorch and Tensorflow.","archived":false,"fork":false,"pushed_at":"2024-12-06T08:21:19.000Z","size":1271,"stargazers_count":51,"open_issues_count":0,"forks_count":6,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-31T11:05:39.352Z","etag":null,"topics":["deep-learning","deeplearning","i2t","i2w","indexing","itos","nlp","numpy","python","pytorch","rnn","rnns","seq2seq","stoi","t2i","tensorflow","token","transformer","transformers","w2i"],"latest_commit_sha":null,"homepage":"http://token2index.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Kaleidophon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-06-06T12:30:08.000Z","updated_at":"2025-01-24T04:38:01.000Z","dependencies_parsed_at":"2025-01-09T17:38:04.079Z","dependency_job_id":"c8892e61-d8a5-403c-a499-dc36ea2560db","html_url":"https://github.com/Kaleidophon/token2index","commit_stats":{"total_commits":109,"total_committers":1,"mean_commits":109.0,"dds":0.0,"last_synced_commit":"205198d5afb4f326788a228fb58dfef0447526b9"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/Kaleidophon%2Ftoken2index","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kaleidophon%2Ftoken2index/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kaleidophon%2Ftoken2index/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kaleidophon%2Ftoken2index/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Kaleidophon","download_url":"https://codeload.github.com/Kaleidophon/token2index/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247657281,"owners_count":20974345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","deeplearning","i2t","i2w","indexing","itos","nlp","numpy","python","pytorch","rnn","rnns","seq2seq","stoi","t2i","tensorflow","token","transformer","transformers","w2i"],"created_at":"2024-10-03T12:12:21.679Z","updated_at":"2025-04-07T13:07:53.625Z","avatar_url":"https://github.com/Kaleidophon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# :zap: :card_index: token2index: A lightweight but powerful library for token indexing\n\n[![Build](https://travis-ci.org/Kaleidophon/token2index.svg?branch=master)](https://travis-ci.org/github/Kaleidophon/token2index/builds)\n[![Documentation Status](https://readthedocs.org/projects/token2index/badge/?version=latest)](https://token2index.readthedocs.io/en/latest/?badge=latest)\n[![Coverage 
Status](https://coveralls.io/repos/github/Kaleidophon/token2index/badge.svg?branch=master)](https://coveralls.io/github/Kaleidophon/token2index?branch=master)\n[![Compatibility](https://img.shields.io/badge/Python-3.5%20%7C%203.6%20%7C%203.7%20%7C%203.8-blue)]()\n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)\n\n``token2index`` is a small yet powerful library for quickly and easily creating a data structure that maps \ntokens to indices, primarily aimed at Natural Language Processing applications. The library is fully tested and \nhas no additional dependencies. The documentation can be found [here](https://token2index.readthedocs.io/en/latest/); some feature highlights are \nshown below.\n\n**Who / what is this for?**\n\nThis class is written for NLP applications where we want to assign an index to every word in a sequence, e.g. to later look up the corresponding \nword embeddings. Building an index and indexing batches of sequences for Deep Learning models using frameworks like PyTorch or Tensorflow are common steps, but they are often written from \nscratch every time. 
This package provides a ready-made solution combining many useful features, like reading vocabulary files, building indices from a corpus or indexing entire batches in a single\nfunction call, all while being fully tested.\n\n### :sparkles: Feature Highlights\n\n* **Building and extending vocab**\n\n    One way to build the index from a corpus is to use the `build()` function:\n\n    ```python\n    \u003e\u003e\u003e from t2i import T2I\n    \u003e\u003e\u003e t2i = T2I.build([\"colorless green ideas dream furiously\", \"the horse raced past the barn fell\"])\n    \u003e\u003e\u003e t2i\n    T2I(Size: 13, unk_token: \u003cunk\u003e, eos_token: \u003ceos\u003e, pad_token: \u003cpad\u003e, {'colorless': 0, 'green': 1, 'ideas': 2, 'dream': 3, 'furiously': 4, 'the': 5, 'horse': 6, 'raced': 7, 'past': 8, 'barn': 9, 'fell': 10, '\u003cunk\u003e': 11, '\u003ceos\u003e': 12, '\u003cpad\u003e': 13})\n    ```\n  \n    The index can always be extended again later using `extend()`:\n    \n    ```python\n    \u003e\u003e\u003e t2i = t2i.extend(\"completely new words\")\n    T2I(Size: 16, unk_token: \u003cunk\u003e, eos_token: \u003ceos\u003e, pad_token: \u003cpad\u003e, {'colorless': 0, 'green': 1, 'ideas': 2, 'dream': 3, 'furiously': 4, 'the': 5, 'horse': 6, 'raced': 7, 'past': 8, 'barn': 9, 'fell': 10, 'completely': 13, 'new': 14, 'words': 15, '\u003cunk\u003e': 16, '\u003ceos\u003e': 17, '\u003cpad\u003e': 18})\n    ```\n  \n    Both methods and `index()` also work with an already tokenized corpus in the form of \n    \n        [[\"colorless\", \"green\", \"ideas\", \"dream\", \"furiously\"], [\"the\", \"horse\", \"raced\", \"past\", \"the\", \"barn\", \"fell\"]]    \n\n* **Easy indexing (of batches)**\n    \n    Index multiple sentences at once in a single function call!\n\n    ```python\n    \u003e\u003e\u003e t2i.index([\"the green horse raced \u003ceos\u003e\", \"ideas are a dream \u003ceos\u003e\"])\n    [[5, 1, 6, 7, 12], [2, 11, 11, 3, 12]]\n    ```\n    \n    where 
unknown tokens are always mapped to `unk_token`.\n    \n* **Easy conversion back to strings**\n    \n    Converting indices back to strings is just as easy:\n    \n    ```python\n    \u003e\u003e\u003e t2i.unindex([5, 14, 16, 3, 6])\n    'the new \u003cunk\u003e dream horse'\n    ```\n\n* **Automatic padding**\n\n    Are you indexing multiple sentences of different lengths and want to add padding? No problem! `index()` has two\n    options available via the `pad_to` argument. The first is padding to the maximum length of all the sentences:\n    \n    ```python\n    \u003e\u003e\u003e padded_sents = t2i.index([\"the green horse raced \u003ceos\u003e\", \"ideas \u003ceos\u003e\"], pad_to=\"max\")\n    \u003e\u003e\u003e padded_sents\n    [[5, 1, 6, 7, 12], [2, 12, 13, 13, 13]]\n    \u003e\u003e\u003e t2i.unindex(padded_sents)\n    [['the green horse raced \u003ceos\u003e', 'ideas \u003ceos\u003e \u003cpad\u003e \u003cpad\u003e \u003cpad\u003e']]\n    ```\n  \n    Alternatively, you can also pad to a pre-defined length:\n    \n    ```python\n    \u003e\u003e\u003e padded_sents = t2i.index([\"the green horse \u003ceos\u003e\", \"past ideas \u003ceos\u003e\"], pad_to=5)\n    \u003e\u003e\u003e padded_sents\n    [[5, 1, 6, 12, 13], [8, 2, 12, 13, 13]]\n    \u003e\u003e\u003e t2i.unindex(padded_sents)\n    [['the green horse \u003ceos\u003e \u003cpad\u003e', 'past ideas \u003ceos\u003e \u003cpad\u003e \u003cpad\u003e']]\n    ```\n    \n* **Vocab from file**\n\n    Using `T2I.from_file()`, the index can be created directly by reading from an existing vocab file. \n    Refer to its documentation [here](https://token2index.readthedocs.io/en/latest/#t2i.T2I.from_file) for more info.\n    \n* **Fixed memory size**\n\n    Although the `defaultdict` class from Python's `collections` module can also map unknown \n    keys to a default value, it grows in size with every new key that is looked up. 
The memory size of `T2I` stays fixed after the index is built.\n    \n* **Support for special tokens**\n    \n    To enable flexibility in modern NLP applications, `T2I` allows for an arbitrary number of special tokens (like a \n    masking or a padding token) during init! \n    \n    ```python\n    \u003e\u003e\u003e t2i = T2I(special_tokens=[\"\u003cmask\u003e\"])\n    \u003e\u003e\u003e t2i\n    T2I(Size: 3, unk_token: \u003cunk\u003e, eos_token: \u003ceos\u003e, pad_token: \u003cpad\u003e, {'\u003cunk\u003e': 0, '\u003ceos\u003e': 1, '\u003cmask\u003e': 2, '\u003cpad\u003e': 3})\n    ```\n\n* **Explicitly supported programmer laziness**\n\n    Too lazy to type? The library saves you a few keystrokes here and there. Instead of calling `t2i.index(...)`, you can\n    directly call `t2i(...)` to index one or multiple sequences. Furthermore, key functions like `index()`, `unindex()`,\n    `build()` and `extend()` accept both strings and iterables of strings as arguments.\n\n### :electric_plug: Compatibility with other frameworks (Numpy, PyTorch, Tensorflow)\n\n`T2I` is also easily compatible with frameworks like Numpy, PyTorch and \nTensorflow, without needing them as requirements:\n\n**Numpy**\n\n```python\n\u003e\u003e\u003e import numpy as np\n\u003e\u003e\u003e t = np.array(t2i.index([\"the new words are ideas \u003ceos\u003e\", \"the green horse \u003ceos\u003e \u003cpad\u003e \u003cpad\u003e\"]))\n\u003e\u003e\u003e t\narray([[ 5, 15, 16, 17,  2, 18],\n   [ 5,  1,  6, 18, 19, 19]])\n\u003e\u003e\u003e t2i.unindex(t)\n['the new words \u003cunk\u003e ideas \u003ceos\u003e', 'the green horse \u003ceos\u003e \u003cpad\u003e \u003cpad\u003e']\n```\n\n**PyTorch**\n\n```python\n\u003e\u003e\u003e import torch\n\u003e\u003e\u003e t = torch.LongTensor(t2i.index([\"the new words are ideas \u003ceos\u003e\", \"the green horse \u003ceos\u003e \u003cpad\u003e \u003cpad\u003e\"]))\n\u003e\u003e\u003e t\ntensor([[ 5, 15, 16, 17,  2, 18],\n    [ 5,  1,  6, 18, 19, 
19]])\n\u003e\u003e\u003e t2i.unindex(t)\n['the new words \u003cunk\u003e ideas \u003ceos\u003e', 'the green horse \u003ceos\u003e \u003cpad\u003e \u003cpad\u003e']\n```\n\n**Tensorflow**\n\n```python\n\u003e\u003e\u003e import tensorflow as tf\n\u003e\u003e\u003e t = tf.convert_to_tensor(t2i.index([\"the new words are ideas \u003ceos\u003e\", \"the green horse \u003ceos\u003e \u003cpad\u003e \u003cpad\u003e\"]), dtype=tf.int32)\n\u003e\u003e\u003e t\ntensor([[ 5, 15, 16, 17,  2, 18],\n    [ 5,  1,  6, 18, 19, 19]])\n\u003e\u003e\u003e t2i.unindex(t)\n['the new words \u003cunk\u003e ideas \u003ceos\u003e', 'the green horse \u003ceos\u003e \u003cpad\u003e \u003cpad\u003e']\n```\n\n### :inbox_tray: Installation\n\nInstallation can simply be done using ``pip``:\n\n    pip3 install token2index\n\n### :mortar_board: Citing\n\nIf you use ``token2index`` for research purposes, please cite the library using the following citation info:\n\n    @misc{ulmer2020token2index,\n        title={token2index: A lightweight but powerful library for token indexing},\n        author={Ulmer, Dennis},\n        journal={https://github.com/Kaleidophon/token2index},\n        year={2020}\n    }\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkaleidophon%2Ftoken2index","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkaleidophon%2Ftoken2index","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkaleidophon%2Ftoken2index/lists"}