{"id":13545293,"url":"https://github.com/google-research/retvec","last_synced_at":"2025-04-08T02:36:05.672Z","repository":{"id":180775767,"uuid":"454537837","full_name":"google-research/retvec","owner":"google-research","description":"RETVec is an efficient, multilingual, and adversarially-robust text vectorizer.","archived":false,"fork":false,"pushed_at":"2025-03-31T19:07:45.000Z","size":11391,"stargazers_count":293,"open_issues_count":24,"forks_count":24,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-04-01T01:36:38.237Z","etag":null,"topics":["deep-learning","natural-language-processing","nlp","python","tensorflow","text-classification"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-01T20:19:58.000Z","updated_at":"2025-03-15T09:20:34.000Z","dependencies_parsed_at":"2024-09-22T08:30:28.939Z","dependency_job_id":"84e82ea2-9f7b-443b-b7c8-c807f29329ee","html_url":"https://github.com/google-research/retvec","commit_stats":{"total_commits":67,"total_committers":5,"mean_commits":13.4,"dds":"0.19402985074626866","last_synced_commit":"d5f7a48d8752db16606b088fd39b7a5edbe917a0"},"previous_names":["google-research/retvec"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fretvec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/google-research%2Fretvec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fretvec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fretvec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-research","download_url":"https://codeload.github.com/google-research/retvec/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247765459,"owners_count":20992314,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","natural-language-processing","nlp","python","tensorflow","text-classification"],"created_at":"2024-08-01T11:01:00.409Z","updated_at":"2025-04-08T02:36:05.665Z","avatar_url":"https://github.com/google-research.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"# RETVec: Resilient \u0026 Efficient Text Vectorizer\n\n_NOTE (4/3/2025): This repository has been archived and is no longer actively maintained._\n\n## Overview\nRETVec is a next-gen text vectorizer designed to be efficient and multilingual, and to provide built-in adversarial resilience using robust word embeddings trained with [similarity learning](https://github.com/tensorflow/similarity/). You can read the paper [here](https://arxiv.org/abs/2302.09207).\n\nRETVec is trained to be resilient against character-level manipulations including insertion, deletion, typos, homoglyphs, LEET substitution, and more. 
The RETVec model is trained on top of a novel character encoder that can encode all UTF-8 characters and words efficiently. Thus, RETVec works out of the box on over 100 languages without the need for a lookup table or fixed vocabulary size. Furthermore, RETVec is a layer, which means that it can be inserted into any TF model without the need for a separate pre-processing step.\n\nRETVec's speed and size (~200k instead of millions of parameters) also make it a great choice for on-device and web use cases. It is [natively supported in TensorFlow Lite](notebooks/tf_lite_retvec.ipynb) via [custom ops in TensorFlow Text](https://www.tensorflow.org/text/api_docs/python/text/utf8_binarize), and we provide a JavaScript implementation of RETVec which allows you to deploy web models via TensorFlow.js.\n\nPlease see our example colabs on how to get started with training your own models with RETVec. [train_retvec_model_tf.ipynb](notebooks/train_retvec_model_tf.ipynb) is a great starting point for training a TF model using RETVec.\n\n## Demos\n\nTo see RETVec in action, visit [our demos](https://google-research.github.io/retvec/).\n\n## Getting started\n\n\n### Installation\n\nYou can use `pip` to install the latest TensorFlow version of RETVec:\n\n```bash\npip install retvec\n```\n\nRETVec has been tested on TensorFlow 2.6+ and Python 3.8+.\n\n### Basic Usage\n\nYou can use RETVec as the vectorization layer in any TensorFlow model with just a single line of code. RETVec operates on raw strings with pre-processing options built-in (e.g. lowercasing text). For example:\n\n```python\nimport tensorflow as tf\nfrom tensorflow.keras import layers\n\n# RETVecTokenizer is provided by the retvec package installed above\nfrom retvec.tf import RETVecTokenizer\n\n# Define the input layer, which accepts raw strings\ninputs = layers.Input(shape=(1, ), name=\"input\", dtype=tf.string)\n\n# Add the RETVec Tokenizer layer using the RETVec embedding model -- that's it!\nx = RETVecTokenizer(sequence_length=128)(inputs)\n\n# Create your model like normal\n# e.g. 
a simple LSTM model for classification with NUM_CLASSES classes\nx = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)\nx = layers.Bidirectional(layers.LSTM(64))(x)\noutputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)\nmodel = tf.keras.Model(inputs, outputs)\n```\n\nThen you can compile, train, and save your model as usual! As demonstrated in our paper, models trained using RETVec are more resilient against adversarial attacks and typos, and are computationally efficient. RETVec is also supported in TFJS and TF Lite, making it well suited to on-device mobile and web use cases.\n\n### Colabs\n\nDetailed example colabs for RETVec can be found under [notebooks](notebooks/). These are a good way to get started with RETVec. You can run the notebooks in Google Colab by clicking the Google Colab button. If none of the examples are similar to your use case, please let us know!\n\nWe have the following example colabs:\n\n- Training RETVec-based models using TensorFlow: [train_retvec_model_tf.ipynb](notebooks/train_retvec_model_tf.ipynb) for GPU/CPU training, and [train_tpu.ipynb](notebooks/train_tpu.ipynb) for a TPU-compatible training example.\n- Converting RETVec models into TF Lite models to run on-device: [tf_lite_retvec.ipynb](notebooks/tf_lite_retvec.ipynb)\n- (Coming soon!) Using RETVec JS to deploy RETVec models on the web using TensorFlow.js\n\n## Citing\nPlease cite this reference if you use RETVec in your research:\n\n```bibtex\n@article{retvec2023,\n    title={RETVec: Resilient and Efficient Text Vectorizer},\n    author={Elie Bursztein and Marina Zhang and Owen Vallis and Xinyu Jia and Alexey Kurakin},\n    year={2023},\n    eprint={2302.09207}\n}\n```\n\n## Contributing\nTo contribute to the project, please check out the [contribution guidelines](CONTRIBUTING.md). 
Thank you!\n\n## Disclaimer\nThis is not an official Google product.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fretvec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-research%2Fretvec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fretvec/lists"}