{"id":13450559,"url":"https://github.com/tensorflow/mesh","last_synced_at":"2025-10-04T14:30:43.213Z","repository":{"id":33114685,"uuid":"149666254","full_name":"tensorflow/mesh","owner":"tensorflow","description":"Mesh TensorFlow: Model Parallelism Made Easier","archived":false,"fork":false,"pushed_at":"2023-11-17T19:39:54.000Z","size":2352,"stargazers_count":1598,"open_issues_count":98,"forks_count":254,"subscribers_count":51,"default_branch":"master","last_synced_at":"2025-01-16T09:30:17.985Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tensorflow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS"}},"created_at":"2018-09-20T20:23:34.000Z","updated_at":"2025-01-04T08:40:30.000Z","dependencies_parsed_at":"2023-02-14T12:31:21.934Z","dependency_job_id":"3df62e58-43d3-4910-ba88-b14cc333e9dc","html_url":"https://github.com/tensorflow/mesh","commit_stats":{"total_commits":654,"total_committers":51,"mean_commits":"12.823529411764707","dds":0.746177370030581,"last_synced_commit":"4513b6cf14bd7f5d6f8b108f9e06ca00e5b1b1ff"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Fmesh","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Fmesh/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Fmesh/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Fmesh/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tensorflow","download_url":"https://codeload.github.com/tensorflow/mesh/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235260670,"owners_count":18961634,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T07:00:36.090Z","updated_at":"2025-10-04T14:30:37.831Z","avatar_url":"https://github.com/tensorflow.png","language":"Python","funding_links":[],"categories":["MoE Application","Topics","The Data Science Toolbox","LLM Training Frameworks","LLM训练框架","Deep Learning","分布式机器学习","TensorFlow Tools, Libraries, and Frameworks","Python","Transformers and LLMs","Tensor Flow","Open Source Projects","🛠️ Libraries"],"sub_categories":["Deep Learning Packages","LLM 评估工具","TensorFlow","Frameworks and Libraries","Automated Machine Learning","📚 arXiv"],"readme":"# Mesh TensorFlow - Model Parallelism Made Easier\n\n[![PyPI\nversion](https://badge.fury.io/py/mesh-tensorflow.svg)](https://badge.fury.io/py/mesh-tensorflow)\n[![GitHub\nIssues](https://img.shields.io/github/issues/tensorflow/mesh.svg)](https://github.com/tensorflow/mesh/issues)\n[![Contributions\nwelcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Build Status](https://github.com/tensorflow/mesh/workflows/build/badge.svg)](https://github.com/tensorflow/mesh/actions?query=workflow%3Abuild)\n\n\n# 
Introduction\n\nMesh TensorFlow (`mtf`) is a language for distributed deep learning, capable of\nspecifying a broad class of distributed tensor computations.  The purpose of\nMesh TensorFlow is to formalize and implement distribution strategies for your\ncomputation graph over your hardware/processors. For example: \"Split the batch\nover rows of processors and split the units in the hidden layer across columns\nof processors.\" Mesh TensorFlow is implemented as a layer over TensorFlow.\n\nWatch our [YouTube video](https://www.youtube.com/watch?v=HgGyWS40g-g).\n\n\n## Do I need Mesh TensorFlow?\n\nIf you just want data-parallel training (batch-splitting), then you do not need\nMesh TensorFlow, though Mesh TensorFlow can do this.  The most common reasons\nfor more sophisticated parallel computation are:\n\n* The parameters of the model do not fit on one device - e.g. a\n5-billion-parameter language model.\n\n* An example is so large that the activations do not fit on one device - e.g.\na large 3D image model ([`experimental/unet.py`](https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/experimental/unet.py)).\n\n* Lower-latency parallel inference (at batch size 1).\n\n## The Mesh TensorFlow Approach to Distributed Computation\n\n* A \"Mesh\" is an n-dimensional array of processors, connected by a network.\n\n* Each tensor is distributed (split and/or replicated) across all processors\n  in a mesh.\n\n* Tensor dimensions and mesh dimensions are named.  The layouts of all tensors\n  follow from a set of user-defined layout rules which specify which\n  tensor-dimensions are split across which mesh-dimensions.  This ensures that\n  the corresponding dimensions in different tensors are split in the same\n  manner.\n\n* Layouts do not affect results - only performance.\n\n* The implementation of an operation involves parallel computation on all\n  processors in the mesh, and sometimes also collective communication.  
A\n  processor usually just manipulates the slices of the input tensors already\n  resident on that processor, and produces the slice of the output that goes on\n  that processor.\n\n## Getting Started\n\n### Installation\n\nTo install the latest stable version, run\n\n```sh\npip install mesh-tensorflow\n```\n\nTo install the latest development version, run\n\n```sh\npip install -e \"git+https://github.com/tensorflow/mesh.git#egg=mesh-tensorflow\"\n```\n\nInstalling `mesh-tensorflow` does not automatically install or update\nTensorFlow. We recommend installing it via `pip install tensorflow` or `pip\ninstall tensorflow-gpu`. See TensorFlow’s\n[installation instructions for details](https://www.tensorflow.org/install/).\nIf you're using a development version of Mesh TensorFlow, you may need to\nuse TensorFlow's nightly package (`tf-nightly`).\n\n### Example Network (MNIST)\n\nTo illustrate, let us consider a simple model for the MNIST image-classification\ntask.  Our network has one hidden layer with 1024 units, and an output layer\nwith 10 units (corresponding to the 10 digit classes).\n\nThe code consists of two parts, the first describing the mathematical\noperations, and the second describing the devices and tensor/computation layout.\nFor the full example, see [`examples/mnist.py`](\nhttps://github.com/tensorflow/mesh/blob/master/examples/mnist.py).\nTODO(noam): verify that this code works.\n\n```Python\n# tf_images is a tf.Tensor with shape [100, 28, 28] and dtype tf.float32\n# tf_labels is a tf.Tensor with shape [100] and dtype tf.int32\ngraph = mtf.Graph()\nmesh = mtf.Mesh(graph, \"my_mesh\")\nbatch_dim = mtf.Dimension(\"batch\", 100)\nrows_dim = mtf.Dimension(\"rows\", 28)\ncols_dim = mtf.Dimension(\"cols\", 28)\nhidden_dim = mtf.Dimension(\"hidden\", 1024)\nclasses_dim = mtf.Dimension(\"classes\", 10)\nimages = mtf.import_tf_tensor(\n    mesh, tf_images, shape=[batch_dim, rows_dim, cols_dim])\nlabels = mtf.import_tf_tensor(mesh, tf_labels, [batch_dim])\nw1 
= mtf.get_variable(mesh, \"w1\", [rows_dim, cols_dim, hidden_dim])\nw2 = mtf.get_variable(mesh, \"w2\", [hidden_dim, classes_dim])\n# einsum is a generalization of matrix multiplication (see numpy.einsum)\nhidden = mtf.relu(mtf.einsum(images, w1, output_shape=[batch_dim, hidden_dim]))\nlogits = mtf.einsum(hidden, w2, output_shape=[batch_dim, classes_dim])\nloss = mtf.reduce_mean(mtf.layers.softmax_cross_entropy_with_logits(\n    logits, mtf.one_hot(labels, classes_dim), classes_dim))\nw1_grad, w2_grad = mtf.gradients([loss], [w1, w2])\nupdate_w1_op = mtf.assign(w1, w1 - w1_grad * 0.001)\nupdate_w2_op = mtf.assign(w2, w2 - w2_grad * 0.001)\n```\n\nIn the code above, we have built a Mesh TensorFlow graph, which is simply\na Python structure.  We have completely defined the mathematical operations.\nIn the code below, we specify the mesh of processors and the layout of the\ncomputation.\n\n```Python\ndevices = [\"gpu:0\", \"gpu:1\", \"gpu:2\", \"gpu:3\"]\nmesh_shape = [(\"all_processors\", 4)]\nlayout_rules = [(\"batch\", \"all_processors\")]\nmesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(\n    mesh_shape, layout_rules, devices)\nlowering = mtf.Lowering(graph, {mesh:mesh_impl})\ntf_update_ops = [lowering.lowered_operation(update_w1_op),\n                 lowering.lowered_operation(update_w2_op)]\n```\n\nThe particular layout above implements data-parallelism, splitting the batch of\nexamples evenly across all four processors.  Any Tensor with a \"batch\" dimension\n(e.g. `images`, `hidden`, `logits`, and their gradients) is split in that dimension\nacross all processors, while any tensor without a \"batch\" dimension (e.g. the\nmodel parameters) is replicated identically on every processor.\n\nAlternatively, for model-parallelism, we can set\n`layout_rules=[(\"hidden\", \"all_processors\")]`.  In this case,\nany tensor with a \"hidden\" dimension (e.g. `hidden`, `w1`, `w2`)  is split,\nwhile any other tensor (e.g. 
`images`, `logits`) is fully replicated.\n\nWe can even combine data-parallelism and model-parallelism on a 2-dimensional\nmesh of processors.  We split the batch along one dimension of the mesh, and the\nunits in the hidden layer along the other dimension of the mesh, as below.  In\nthis case, the hidden layer is actually tiled between the four processors, being\nsplit in both the \"batch\" and \"hidden\" dimensions.\n\n```Python\nmesh_shape = [(\"processor_rows\", 2), (\"processor_cols\", 2)]\nlayout_rules = [(\"batch\", \"processor_rows\"), (\"hidden\", \"processor_cols\")]\n```\n\n## Where does the network communication happen?\n\nSome Mesh TensorFlow operations cause network communication.  For example, an\neinsum (generalized matrix multiplication) is computed as follows:\n\n* On each processor, compute the einsum of the slices of the two operands that\n  are local to that processor.\n* If no reduced-out dimensions are split, then we are done.\n* If reduced-out dimensions are split, then perform an \"allreduce\" operation\n  on the resulting slices - summing across any mesh dimensions over which the\n  reduced-out dimensions are split.\n\nWhere the allreduces happen depends on the computation layout.\nFor example, in a data-parallel layout where the \"batch\" dimension is split,\nallreduces will happen when computing the parameter gradients, since this\ninvolves matrix multiplications which reduce out the \"batch\" dimension.\n\n## How do I pick a layout?\n\nWhile results do not depend on layout (except in the realm of roundoff errors\nand random seeds), performance and memory consumption depend heavily on layout.\nFortunately, the auto_mtf subpackage provides a method for automatically\nchoosing a layout.  
For more information about what auto_mtf is doing to choose\na layout, see its [README](mesh_tensorflow/auto_mtf/README.md) file.\n\n```Python\nimport mesh_tensorflow as mtf\nimport mesh_tensorflow.auto_mtf\n\ngraph = mtf.Graph()\nmesh = mtf.Mesh(graph, \"my_mesh\")\n# Insert model code here.\noutputs = [logits, loss]  # iterable of mtf.Tensor, the outputs you're computing\nmesh_shape = [(\"processor_rows\", 2), (\"processor_cols\", 2)]\nlayout_rules = mtf.auto_mtf.layout(graph, mesh_shape, outputs)\n```\n\nIt is possible for advanced users to eke out additional performance by tuning\nthe layout (and model) further.  Mesh TensorFlow helps by accumulating and\nprinting counters of computation/communication.  To start, here are some\ntricks/guidelines.\n\n* It is illegal for two dimensions of the same tensor to be split across the\n  same mesh dimension.\n* For any compute-intensive operation (e.g. einsum), make sure that all\n  mesh-dimensions are used to split dimensions of the inputs or outputs.\n  Otherwise, computation is duplicated.\n* To keep the ratio of compute/communication high (i.e. not be bandwidth-bound),\n  split dimensions into large chunks.  This should be familiar in the\n  data-parallelism case, where we want a large batch size per processor to avoid\n  spending most of our time communicating.\n\n# The Mesh TensorFlow Language\n\nMesh TensorFlow (v0.0) is implemented as a Python library which can generate\npart of a TensorFlow graph.  The user first builds a `mtf.Graph` (the analog of\na TensorFlow graph) made up of `mtf.Tensor`s and `mtf.Operation`s.  As in\nTensorFlow, this graph consists of simple Python objects.  The user then creates\na `mtf.Lowering` object, which lowers the `mtf.Graph` into TensorFlow, adding to\nthe default TensorFlow graph.\n\nThe Mesh TensorFlow language is nearly identical to TensorFlow, with the\nfamiliar notion of a Graph, Tensors, Operations, and automatic gradient\ncomputation.  
The principal differences are as follows:\n\n## Meshes replace devices\n\nA `Mesh` is an n-dimensional array of processors with named dimensions.  Each\n`Tensor` is assigned to a `Mesh`, instead of a device.\n\n## Tensor dimensions are named\n\nEach `Tensor` has a static `Shape`, which is a tuple of different \"Dimensions\".\nA `Dimension` is a `(name, size)` pair. For example, the shape of a `Tensor`\nrepresenting a batch of images might be:\n\n`[(\"batch\", 100), (\"rows\", 28), (\"cols\", 28), (\"channels\", 3)]`.\n\n## Layouts\n\nA `Tensor` is laid out on its mesh with one slice on each processor.  A `Tensor`\n\"layout\" is an injective partial map specifying which dimensions of the tensor\nare (evenly) split across which dimensions of the mesh.  No dimension of a\ntensor may be split across two dimensions of its mesh and no two dimensions of a\ntensor may be split across the same dimension of its mesh.  The user defines a\nglobal set of layout rules in the form of (tensor-dimension-name,\nmesh-dimension-name) pairs.  A dimension of a tensor is split across a dimension\nof its mesh if there is a matching rule.\n\n### Example Layouts\n\nTake our example `Tensor` `image_batch` with shape:\n`[(\"batch\", 100), (\"rows\", 28), (\"cols\", 28), (\"channels\", 3)]`\n\nAssume that this `Tensor` is assigned to a mesh of 8 processors with shape:\n`[(\"processor_rows\", 2), (\"processor_cols\", 4)]`\n\n* If we use an empty set of layout rules `[]`, we get no splitting.  Each\n  processor contains the whole `Tensor`.\n\n* If we use the layout rules `\"batch:processor_cols\"`, then the `\"batch\"`\n  dimension of the `Tensor` is split across the `\"processor_cols\"` dimension of\n  the mesh.  This means that each processor contains a Tensor slice with shape\n  `[25, 28, 28, 3]`.  
For example, processors (0, 3) and (1, 3) contain\n  identical slices - `image_batch[75:100, :, :, :]`.\n\n* If we use the layout rules `\"rows:processor_rows;cols:processor_cols\"`,\n  then the image is split in two dimensions, with each processor containing one\n  spatial tile with shape `[100, 14, 7, 3]`.  For example, processor (0, 1)\n  contains the slice `image_batch[:, 0:14, 7:14, :]`.\n\nSome layout rules would lead to illegal layouts:\n\n* `\"batch:processor_rows;rows:processor_rows\"` is illegal because two tensor\n  dimensions cannot be split across the same mesh dimension.\n\n* `\"channels:processor_rows\"` is illegal because the size of the tensor\n  dimension is not evenly divisible by the size of the mesh dimension.\n\n## Einsum\n\nMesh TensorFlow uses Einstein-summation notation, `mtf.einsum(inputs,\noutput_shape)`, using the (named) `Dimensions` as the symbols.  Matrix\nmultiplication, broadcast, sum-reduction, and transposition can all be expressed\nas special cases of `mtf.einsum`, though the familiar interfaces are also\nsupported.  The operation is lowered to slice-wise `tf.einsum`s, followed by\nallreduce across any mesh-dimensions corresponding to the summed-out Tensor\ndimensions.\n\n## Reshape can be expensive\n\n`mtf.reshape(x, new_shape)` is used to change a `Tensor`'s shape, potentially\nleading to a new tensor layout and hence network communication.\n\n# CPU/GPU/TPU implementations\n\nMesh TensorFlow works on CPU, GPU and TPU.  The TPU implementation is very\ndifferent from the CPU/GPU implementation.\n\nMulti-CPU/GPU meshes are implemented with `PlacementMeshImpl`.  In this case\nMesh TensorFlow emits separate TensorFlow operations placed on the different\ndevices, all in one big TensorFlow graph.\n\nTPU meshes are implemented with `SimdMeshImpl`.  
In this case,\nMesh TensorFlow emits TensorFlow operations (and communication collectives) from\nthe perspective of one core, and this same program runs on every core, relying\non the fact that each core actually performs the same operations.  This\npiggy-backs on the TPU data-parallelism infrastructure, which operates the same\nway.  This \"SIMD\" approach keeps the TensorFlow and XLA graphs from growing with\nthe number of cores.  The differences between cores are as follows:\n\n* different slices of the variables (this works now)\n* different positions in the collective communication (this works now)\n* different slices of the infed and outfed tensors.  We currently work around\n  this by requiring that all imported/exported tensors be fully-replicated.  In\n  the future, we should handle this correctly.\n\n# Experimental features\n\nThe input pipeline of Mesh TensorFlow models might become a bottleneck when\ntraining with large inputs (e.g., high resolution images). We provide new APIs\nand a new input pipeline for you to run Mesh TensorFlow models. You can find\nthem under the [`experimental/`](https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/experimental/)\nfolder. We suggest that you give them a try when your input is so large that\nrunning Mesh TensorFlow models with the default APIs is almost infeasible.\nTo be more specific:\n\n* The BROADCAST mode in TPUEstimator does not scale up to large inputs (images\n  of tens of millions of pixels). We provide a new input pipeline:\n  [`experimental/input_reader.py`](https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/experimental/input_reader.py).\n  See [`experimental/model_executor.py`](https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/experimental/model_executor.py)\n  on how to use it.\n* If your model takes images as input and has convolution layers, 
you cannot\n  directly map image height and width dimensions to mesh dimensions, due to the\n  sliding-window nature of convolution. Instead, you should use spatial\n  partitioning. We provide examples in\n  [`experimental/unet.py`](https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/experimental/unet.py).\n* If you want more control over the training and evaluation loop, instead of using\n  the default API (TPUEstimator) to run your model, you can use low level APIs\n  in [`experimental/model_executor.py`](https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/experimental/model_executor.py).\n\nNote that we did not test the experimental code on GPUs. We ran it on TPUs.\nWe believe that some debugging would be required for it to work on GPUs.\n\n# Instructions for running on cloud-tpu\n\nNote: It requires `tensorflow\u003e=1.11.0`.\n\n## Prerequisite\n\nPlease go through the\n[Transformer tutorial](https://cloud.google.com/tpu/docs/tutorials/transformer).\n\n## Create VM and TPU instance in Cloud console\n\nTODO(trandustin,ylc): update given mtf pypi package\n\n```sh\nctpu up -name=ylc-mtf-donut -tf-version=nightly -tpu-size=v2-8 -zone=us-central1-b\n```\n\n## SSH into VM\n\n```sh\ngit clone https://github.com/tensorflow/mesh.git\ncd mesh/\npip install --user .\n```\n\n## Run the Transformer model (no Tensor2Tensor dependencies)\n\n```sh\npip install tensorflow_datasets\n\ncd mesh/\nDATA_DIR=gs://noam-mtf/data\nMODEL_DIR=gs://noam-mtf/transformer_standalone\nTPU=noam-mtf-donut\n\n# MODEL HPARAMS AND DIRECTORY  (uncomment one)\n# base model\nMODEL=./transformer/gin/model_base.gin\n# 5B parameters (too big for this dataset, only trains with model-parallelism)\n# MODEL=./transformer/gin/model_5b.gin\n\n# UNCOMMENT ONE OF THESE\n# Data-parallelism\nLAYOUT=./transformer/gin/layout_data_parallel.gin\n# Model-parallelism\n# LAYOUT=./transformer/gin/layout_model_parallel.gin\n# Data-parallelism and Model-Parallelism\n# 
LAYOUT=./transformer/gin/layout_data_and_model_parallel.gin\n\n# TRAIN\npython examples/transformer_standalone.py \\\n  --tpu=$TPU --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --gin_file=$MODEL \\\n  --gin_file=$LAYOUT --gin_param=\"run.mode='train'\"\n\n# EVAL\npython examples/transformer_standalone.py \\\n  --tpu=$TPU --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --gin_file=$MODEL \\\n  --gin_file=$LAYOUT --gin_param=\"run.mode='evaluate'\"\n```\n\nThe above code will train on the LM1B language modeling benchmark, as specified\nin `examples/transformer_standalone_defaults.gin`. To train a\nsequence-to-sequence model on WMT14 en-de, change `utils.run.dataset` to\n`wmt_translate_ende/ende_subwords8k_t2t` and set `utils.run.mode` to `True`.\nNote that the `wmt_translate_ende/ende_subwords8k_t2t` dataset was removed from\nTensorFlow Datasets in\n[commit 211cb6f](https://github.com/tensorflow/datasets/commit/211cb6f082c5cc3c482e37d70234142a8fda2db3),\nso in order to train a model using this dataset you need to install a version of\nTFDS before this commit. 
Then, you can decode the WMT en-de development set\nand evaluate it using [SacreBLEU](https://github.com/mjpost/sacreBLEU) like so:\n\n```\n# INFER\npip3 install sacrebleu\nmkdir ~/input ~/output\nDECODE_INPUT=/home/$USER/input/ende.dev\nDECODE_OUTPUT=/home/$USER/output/ende.dev.out\n~/.local/bin/sacrebleu -t wmt13 -l en-de --echo src \u003e $DECODE_INPUT\npython examples/transformer_standalone.py \\\n  --tpu=$TPU --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --gin_file=$MODEL \\\n  --gin_file=$LAYOUT \\\n  --gin_param=\"decode_from_file.input_filename='$DECODE_INPUT'\" \\\n  --gin_param=\"decode_from_file.output_filename='$DECODE_OUTPUT'\" \\\n  --gin_param=\"run.mode='infer'\"\n\n# Compute BLEU score for dev set\ncat $DECODE_OUTPUT | ~/.local/bin/sacrebleu -t wmt13 -l en-de -tok intl\n```\n\n\n## Run the Transformer model with Tensor2Tensor config\n```sh\ngit clone https://github.com/tensorflow/tensor2tensor.git\ncd tensor2tensor/\npip install --user  .\n```\n\nBefore running the model, you need to prepare the training data and bucket for\nstoring checkpoints. Refer to the\n[Transformer tutorial](https://cloud.google.com/tpu/docs/tutorials/transformer)\nto learn how to generate the training data and create buckets.\n\n```sh\nCONF=mtf_transformer_paper_tr_0_mesh_8\nNAME=ende_$CONF\\_0828\nMODEL=mtf_transformer\nPROBLEM=translate_ende_wmt32k_packed\n\nDATA_DIR=gs://xxxx\nOUT_DIR=gs://xxxx\nTPU_NAME=ylc-mtf-donut\n\ntensor2tensor/bin/t2t-trainer \\\n  --model=$MODEL \\\n  --hparams_set=$CONF \\\n  --problem=$PROBLEM \\\n  --train_steps=10000 \\\n  --eval_steps=200 \\\n  --data_dir=$DATA_DIR \\\n  --output_dir=$OUT_DIR \\\n  --use_tpu=True \\\n  --cloud_tpu_name=$TPU_NAME\n```\n\n\n## Run the toy model without Tensor2Tensor dependencies\n\n  This toy model contains two fully-connected layers which aim to train an\n  identity function: f(x) = x. 
Since there are 8 TPU cores, we can arbitrarily\n  change the FLAGS.mesh_shape and FLAGS.layout to achieve different\n  data-parallelism and model-parallelism strategies.\n\n```sh\nMODEL_DIR=gs://xxxx\nTPU_NAME=ylc-mtf-donut\n\n# 2-way data-parallelism and 4-way model-parallelism.\n# In this configuration, we split the batch dimension across 2 cores and the\n# hidden dimension across 4 cores.\npython examples/toy_model_tpu.py \\\n  --tpu=$TPU_NAME \\\n  --model_dir=$MODEL_DIR \\\n  --io_size=8 \\\n  --hidden_size=8 \\\n  --mesh_shape='x:2;y:4' \\\n  --layout='batch:x;hidden:y'\n\n# 8-way model-parallelism.\n# In this configuration, we split the hidden dimension across 8 cores.\npython examples/toy_model_tpu.py \\\n  --tpu=$TPU_NAME \\\n  --model_dir=$MODEL_DIR \\\n  --io_size=8 \\\n  --hidden_size=8 \\\n  --mesh_shape='all:8' \\\n  --layout='hidden:all'\n```\n\n## References\n\n\u003e N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool,\n\u003e P. Hawkins, H. Lee, M. Hong, C. Young, R. Sepassi, and B. Hechtman.\n\u003e [Mesh-TensorFlow: Deep learning for supercomputers.](https://arxiv.org/abs/1811.02084)\n\u003e In _Neural Information Processing Systems_, 2018.\n\n```none\n@inproceedings{shazeer2018mesh,\n  author = {Noam Shazeer and Youlong Cheng and Niki Parmar and Dustin Tran and Ashish Vaswani and Penporn Koanantakool and Peter Hawkins and HyoukJoong Lee and Mingsheng Hong and Cliff Young and Ryan Sepassi and Blake Hechtman},\n  title = {{Mesh-TensorFlow}: Deep Learning for Supercomputers},\n  booktitle = {Neural Information Processing Systems},\n  year = {2018},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftensorflow%2Fmesh","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftensorflow%2Fmesh","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftensorflow%2Fmesh/lists"}