{"id":13608535,"url":"https://github.com/tensorflow/text","last_synced_at":"2025-05-11T05:47:02.401Z","repository":{"id":37549184,"uuid":"189305903","full_name":"tensorflow/text","owner":"tensorflow","description":"Making text a first-class citizen in TensorFlow.","archived":false,"fork":false,"pushed_at":"2025-04-04T21:04:24.000Z","size":14349,"stargazers_count":1256,"open_issues_count":195,"forks_count":352,"subscribers_count":40,"default_branch":"master","last_synced_at":"2025-05-08T17:19:21.928Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://www.tensorflow.org/beta/tutorials/tensorflow_text/intro","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tensorflow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-05-29T22:10:03.000Z","updated_at":"2025-05-03T15:34:48.000Z","dependencies_parsed_at":"2022-07-12T16:22:44.900Z","dependency_job_id":"3c0dae12-9f9c-4ecf-9f29-8399e13caeec","html_url":"https://github.com/tensorflow/text","commit_stats":{"total_commits":811,"total_committers":122,"mean_commits":6.647540983606557,"dds":0.6362515413070284,"last_synced_commit":"9474dd91c98695d9a9aa147535b9fc91b122fc3a"},"previous_names":[],"tags_count":68,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Ftext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Ftext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Fte
xt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Ftext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tensorflow","download_url":"https://codeload.github.com/tensorflow/text/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253523720,"owners_count":21921818,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:01:28.068Z","updated_at":"2025-05-11T05:47:02.381Z","avatar_url":"https://github.com/tensorflow.png","language":"C++","readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/tensorflow/text/master/docs/include/tftext.png\" width=\"60%\"\u003e\u003cbr\u003e\u003cbr\u003e\n\u003c/div\u003e\n\n-----------------\n\n[![PyPI version](https://img.shields.io/pypi/v/tensorflow-text)](https://badge.fury.io/py/tensorflow-text)\n[![PyPI nightly version](https://img.shields.io/pypi/v/tensorflow-text-nightly?color=informational\u0026label=pypi%20%40%20nightly)](https://badge.fury.io/py/tensorflow-text-nightly)\n[![PyPI Python 
version](https://img.shields.io/pypi/pyversions/tensorflow-text)](https://pypi.org/project/tensorflow-text/)\n[![Documentation](https://img.shields.io/badge/api-reference-blue.svg)](https://github.com/tensorflow/text/blob/master/docs/api_docs/python/index.md)\n[![Contributions\nwelcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)\n\n\u003c!-- TODO(broken):  Uncomment when badges are made public.\n### Continuous Integration Test Status\n\n| Build      | Status |\n| ---             | ---    |\n| **Linux**   | [![Status](https://storage.googleapis.com/tf-text-badges/ubuntu-gpu-py3.svg)] |\n| **MacOS**   | [![Status](https://storage.googleapis.com/tf-text-badges/ubuntu-gpu-py3.svg)] |\n| **Windows**   | [![Status](https://storage.googleapis.com/tf-text-badges/ubuntu-gpu-py3.svg)] |\n--\u003e\n\n# TensorFlow Text - Text processing in TensorFlow\n\n**IMPORTANT**: When installing TF Text with `pip install`, please note the\nversion of TensorFlow you are running, as you should specify the corresponding\nminor version of TF Text (e.g., 
for tensorflow==2.3.x use tensorflow_text==2.3.x).\n\n## INDEX\n* [Introduction](#introduction)\n* [Documentation](#documentation)\n* [Unicode](#unicode)\n* [Normalization](#normalization)\n* [Tokenization](#tokenization)\n  * [Whitespace Tokenizer](#whitespacetokenizer)\n  * [UnicodeScript Tokenizer](#unicodescripttokenizer)\n  * [Unicode split](#unicode-split)\n  * [Offsets](#offsets)\n  * [TF.Data Example](#tfdata-example)\n  * [Keras API](#keras-api)\n* [Other Text Ops](#other-text-ops)\n  * [Wordshape](#wordshape)\n  * [N-grams \u0026 Sliding Window](#n-grams--sliding-window)\n* [Installation](#installation)\n  * [Install using PIP](#install-using-pip)\n  * [Build from source steps:](#build-from-source-steps)\n\n## Introduction\n\nTensorFlow Text provides a collection of text-related classes and ops ready to\nuse with TensorFlow 2.0. The library can perform the preprocessing regularly\nrequired by text-based models, and includes other features useful for sequence\nmodeling not provided by core TensorFlow.\n\nThe benefit of using these ops in your text preprocessing is that they are done\nin the TensorFlow graph. You do not need to worry about tokenization in\ntraining being different from the tokenization at inference, or about managing\npreprocessing scripts.\n\n## Documentation\n\nPlease visit [http://tensorflow.org/text](http://tensorflow.org/text) for all\ndocumentation. This site includes API docs, guides for working with TensorFlow\nText, as well as tutorials for building specific models.\n\n## Unicode\n\nMost ops expect that the strings are in UTF-8. 
If you're using a different\nencoding, you can use the core TensorFlow transcode op to transcode into UTF-8.\nYou can also use the same op to coerce your string to structurally valid UTF-8\nif your input could be invalid.\n\n```python\ndocs = tf.constant([u'Everything not saved will be lost.'.encode('UTF-16-BE'),\n                    u'Sad☹'.encode('UTF-16-BE')])\nutf8_docs = tf.strings.unicode_transcode(docs, input_encoding='UTF-16-BE',\n                                         output_encoding='UTF-8')\n```\n\n## Normalization\n\nWhen dealing with different sources of text, it's important that the same words\nare recognized to be identical. A common technique for case-insensitive matching\nin Unicode is case folding (similar to lower-casing). (Note that case folding\ninternally applies NFKC normalization.)\n\nWe also provide Unicode normalization ops for transforming strings into a\ncanonical representation of characters, with Normalization Form KC being the\ndefault ([NFKC](http://unicode.org/reports/tr15/)).\n\n```python\nprint(text.case_fold_utf8(['Everything not saved will be lost.']))\nprint(text.normalize_utf8(['Äffin']))\nprint(text.normalize_utf8(['Äffin'], 'nfkd'))\n```\n\n```sh\ntf.Tensor(['everything not saved will be lost.'], shape=(1,), dtype=string)\ntf.Tensor(['\\xc3\\x84ffin'], shape=(1,), dtype=string)\ntf.Tensor(['A\\xcc\\x88ffin'], shape=(1,), dtype=string)\n```\n\n## Tokenization\n\nTokenization is the process of breaking up a string into tokens. Commonly, these\ntokens are words, numbers, and/or punctuation.\n\nThe main interfaces are `Tokenizer` and `TokenizerWithOffsets`, which have the\nsingle methods `tokenize` and `tokenize_with_offsets` respectively. There are\nmultiple implementing tokenizers available now. Each of these implements\n`TokenizerWithOffsets` (which extends `Tokenizer`), which includes an option for\ngetting byte offsets into the original string. 
This allows the caller to know\nwhich bytes in the original string the token was created from.\n\nAll of the tokenizers return RaggedTensors with the inner-most dimension of\ntokens mapping to the original individual strings. As a result, the resulting\nshape's rank is increased by one. Please review the\n[ragged tensor guide](https://www.tensorflow.org/guide/ragged_tensor) if you are\nunfamiliar with them.\n\n### WhitespaceTokenizer\n\nThis is a basic tokenizer that splits UTF-8 strings on ICU-defined whitespace\ncharacters (e.g., space, tab, newline).\n\n```python\ntokenizer = text.WhitespaceTokenizer()\ntokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\n```\n\n```sh\n[['everything', 'not', 'saved', 'will', 'be', 'lost.'], ['Sad\\xe2\\x98\\xb9']]\n```\n\n### UnicodeScriptTokenizer\n\nThis tokenizer splits UTF-8 strings based on Unicode script boundaries. The\nscript codes used correspond to International Components for Unicode (ICU)\nUScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html\n\nIn practice, this is similar to the `WhitespaceTokenizer`, with the most apparent\ndifference being that it will split punctuation (USCRIPT_COMMON) from language\ntexts (e.g., 
USCRIPT_LATIN, USCRIPT_CYRILLIC, etc.) while also separating language\ntexts from each other.\n\n```python\ntokenizer = text.UnicodeScriptTokenizer()\ntokens = tokenizer.tokenize(['everything not saved will be lost.',\n                             u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\n```\n\n```sh\n[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],\n ['Sad', '\\xe2\\x98\\xb9']]\n```\n\n### Unicode split\n\nWhen tokenizing languages without whitespace to segment words, it is common to\njust split by character, which can be accomplished using the\n[unicode_split](https://www.tensorflow.org/api_docs/python/tf/strings/unicode_split)\nop found in core.\n\n```python\ntokens = tf.strings.unicode_split([u\"仅今年前\".encode('UTF-8')], 'UTF-8')\nprint(tokens.to_list())\n```\n\n```sh\n[['\\xe4\\xbb\\x85', '\\xe4\\xbb\\x8a', '\\xe5\\xb9\\xb4', '\\xe5\\x89\\x8d']]\n```\n\n### Offsets\n\nWhen tokenizing strings, it is often desired to know where in the original\nstring the token originated from. For this reason, each tokenizer which\nimplements `TokenizerWithOffsets` has a *tokenize_with_offsets* method that will\nreturn the byte offsets along with the tokens. `start_offsets` lists the byte\noffset in the original string at which each token starts (inclusive), and\n`end_offsets` lists the byte offset at which each token ends (exclusive, i.e.,\nthe first byte *after* the token).\n\n```python\ntokenizer = text.UnicodeScriptTokenizer()\n(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(\n    ['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\nprint(start_offsets.to_list())\nprint(end_offsets.to_list())\n```\n\n```sh\n[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],\n ['Sad', '\\xe2\\x98\\xb9']]\n[[0, 11, 15, 21, 26, 29, 33], [0, 3]]\n[[10, 14, 20, 25, 28, 33, 34], [3, 6]]\n```\n\n### TF.Data Example\n\nTokenizers work as expected with the tf.data API. 
A simple example is provided\nbelow.\n\n```python\ndocs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],\n                                           [\"It's a trap!\"]])\ntokenizer = text.WhitespaceTokenizer()\ntokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))\niterator = iter(tokenized_docs)\nprint(next(iterator).to_list())\nprint(next(iterator).to_list())\n```\n\n```sh\n[['Never', 'tell', 'me', 'the', 'odds.']]\n[[\"It's\", 'a', 'trap!']]\n```\n\n### Keras API\n\nWhen you use different tokenizers and ops to preprocess your data, the resulting\noutputs are Ragged Tensors. The Keras API now makes it easy to train a model\nusing Ragged Tensors without having to worry about padding or masking the data,\neither by using the `ToDense` layer, which handles all of this for you, or by\nrelying on the built-in support of Keras layers for natively working on ragged\ndata.\n\n```python\nmodel = tf.keras.Sequential([\n  tf.keras.layers.InputLayer(input_shape=(None,), dtype='int32', ragged=True),\n  text.keras.layers.ToDense(pad_value=0, mask=True),\n  tf.keras.layers.Embedding(100, 16),\n  tf.keras.layers.LSTM(32),\n  tf.keras.layers.Dense(32, activation='relu'),\n  tf.keras.layers.Dense(1, activation='sigmoid')\n])\n```\n\n## Other Text Ops\n\nTF.Text packages other useful preprocessing ops. We will review a couple below.\n\n### Wordshape\n\nA common feature used in some natural language understanding models is to check\nif the text string has a certain property. For example, a sentence-breaking\nmodel might contain features which check for word capitalization or if a\npunctuation character is at the end of a string.\n\nWordshape defines a variety of useful regular-expression-based helper functions\nfor matching various relevant patterns in your input text. 
Here are a few\nexamples.\n\n```python\ntokenizer = text.WhitespaceTokenizer()\ntokens = tokenizer.tokenize(['Everything not saved will be lost.',\n                             u'Sad☹'.encode('UTF-8')])\n\n# Is capitalized?\nf1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)\n# Are all letters uppercased?\nf2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)\n# Does the token contain punctuation?\nf3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)\n# Is the token a number?\nf4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)\n\nprint(f1.to_list())\nprint(f2.to_list())\nprint(f3.to_list())\nprint(f4.to_list())\n```\n\n```sh\n[[True, False, False, False, False, False], [True]]\n[[False, False, False, False, False, False], [False]]\n[[False, False, False, False, False, True], [True]]\n[[False, False, False, False, False, False], [False]]\n```\n\n### N-grams \u0026 Sliding Window\n\nN-grams are sequential words given a sliding window size of *n*. When combining\nthe tokens, there are three reduction mechanisms supported. 
For text, you would\nwant to use `Reduction.STRING_JOIN`, which appends the strings to each other.\nThe default separator character is a space, but this can be changed with the\n`string_separator` argument.\n\nThe other two reduction methods are most often used with numerical values, and\nthese are `Reduction.SUM` and `Reduction.MEAN`.\n\n```python\ntokenizer = text.WhitespaceTokenizer()\ntokens = tokenizer.tokenize(['Everything not saved will be lost.',\n                             u'Sad☹'.encode('UTF-8')])\n\n# Ngrams, in this case bi-gram (n = 2)\nbigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)\n\nprint(bigrams.to_list())\n```\n\n```sh\n[['Everything not', 'not saved', 'saved will', 'will be', 'be lost.'], []]\n```\n\n## Installation\n\n### Install using PIP\n\nWhen installing TF Text with `pip install`, please note the version\nof TensorFlow you are running, as you should specify the corresponding version\nof TF Text. For example, if you're using TF 2.0, install the 2.0 version of TF\nText, and if you're using TF 1.15, install the 1.15 version of TF Text.\n\n```bash\npip install -U tensorflow-text==\u003cversion\u003e\n```\n\n### A note about different operating system packages\n\nAfter version 2.10, we will only be providing pip packages for Linux x86_64 and\nIntel-based Macs. TensorFlow Text has always leveraged the release\ninfrastructure of the core TensorFlow package to more easily maintain compatible\nreleases with minimal maintenance, allowing the team to focus on TF Text itself\nand contributions to other parts of the TensorFlow ecosystem.\n\nFor other systems like Windows, Aarch64, and Apple Macs, TensorFlow relies on\n[build collaborators](https://blog.tensorflow.org/2022/09/announcing-tensorflow-official-build-collaborators.html),\nand so we will not be providing packages for them. 
However, we will continue to\naccept PRs to make building for these OSs easy for users, and will try to point\nto community efforts related to them.\n\n\n### Build from source steps:\n\nNote that TF Text needs to be built in the same environment as TensorFlow. Thus,\nif you manually build TF Text, it is highly recommended that you also build\nTensorFlow.\n\nIf building on MacOS, you must have coreutils installed. It is probably easiest\nto install with Homebrew.\n\n1. [Build and install TensorFlow](https://www.tensorflow.org/install/source).\n1. Clone the TF Text repo:\n   ```Shell\n   git clone https://github.com/tensorflow/text.git\n   cd text\n   ```\n1. Run the build script to create a pip package:\n   ```Shell\n   ./oss_scripts/run_build.sh\n   ```\n   After this step, there should be a `*.whl` file in the current directory, with a\n   file name similar to `tensorflow_text-2.5.0rc0-cp38-cp38-linux_x86_64.whl`.\n1. Install the package into your environment:\n   ```Shell\n   pip install ./tensorflow_text-*-*-*-os_platform.whl\n   ```\n\n### Build or test using TensorFlow's SIG docker image:\n\n1.  Pull an image from\n    [TensorFlow SIG docker builds](https://hub.docker.com/r/tensorflow/build/tags).\n\n1.  Run a container based on the pulled image and create a bash session.\n    This can be done by running `docker run -it {image_name} bash`. \u003cbr /\u003e\n    `{image_name}` can be any name with the `{tf_version}-python{python_version}` format,\n    for example `2.10-python3.10` for Python 3.10 and TF 2.10.\n1.  Clone the TF-Text GitHub repository inside the container: `git clone https://github.com/tensorflow/text.git`. \u003cbr /\u003e\n    Once cloned, change to the working directory using `cd text/`.\n1.  Run the configuration scripts: `./oss_scripts/configure.sh` and `./oss_scripts/prepare_tf_dep.sh`. \u003cbr /\u003e\n    This will update Bazel and the TF dependencies to match the TensorFlow installed in the container.\n1.  
To run the tests, use the bazel command: `bazel test --test_output=errors tensorflow_text:all`. This will run all the tests declared in the `BUILD` file. \u003cbr /\u003e\n    To run a specific test, modify the above command, replacing `:all` with the test name (for example `:fast_bert_normalizer`).\n\n1.  Build the pip package/wheel: \\\n    `bazel build --config=release_cpu_linux\n    oss_scripts/pip_package:build_pip_package` \\\n    `./bazel-bin/oss_scripts/pip_package/build_pip_package\n    /{wheel_dir}` \u003cbr /\u003e\n\n    Once the build is complete, you should see the wheel available under the\n    `{wheel_dir}` directory.\n","funding_links":[],"categories":["Industrial Strength NLP","C++","AutoML NLP","Deep Learning Framework","Industry Strength Natural Language Processing","文本数据和NLP","🔹 **SentencePiece Implementations**","Other 💛💛💛💛💛\u003ca name=\"Other\" /\u003e"],"sub_categories":["High-Level DL APIs","NLP"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftensorflow%2Ftext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftensorflow%2Ftext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftensorflow%2Ftext/lists"}