{"id":13437956,"url":"https://github.com/bytedance/effective_transformer","last_synced_at":"2025-04-05T15:09:55.780Z","repository":{"id":43225244,"uuid":"262027881","full_name":"bytedance/effective_transformer","owner":"bytedance","description":"Running BERT without Padding","archived":false,"fork":false,"pushed_at":"2022-03-18T23:07:30.000Z","size":474,"stargazers_count":472,"open_issues_count":7,"forks_count":54,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-29T14:11:47.001Z","etag":null,"topics":["bert","inference","machine-learning","tensorflow","transformer"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bytedance.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-07T11:08:58.000Z","updated_at":"2025-03-16T03:06:33.000Z","dependencies_parsed_at":"2022-07-23T10:46:29.308Z","dependency_job_id":null,"html_url":"https://github.com/bytedance/effective_transformer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2Feffective_transformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2Feffective_transformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2Feffective_transformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bytedance%2Feffective_transformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bytedance","download_url":"https://codeload.git
hub.com/bytedance/effective_transformer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247353749,"owners_count":20925329,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","inference","machine-learning","tensorflow","transformer"],"created_at":"2024-07-31T03:01:01.657Z","updated_at":"2025-04-05T15:09:55.753Z","avatar_url":"https://github.com/bytedance.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# Effective Transformer\n\nEffective Transformer is built on top of the NVIDIA open-source project [FasterTransformer](https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer/v1) with many advanced optimizations.\nOur experiments show that Effective Transformer can significantly reduce execution time and memory consumption, especially for large batch sizes.\n\n## Running BERT without Padding\n\nWhen using BERT to encode a batch of input sequences, we usually treat the input batch as a matrix whose number of columns equals the maximum length of all sequences.\nNVIDIA [FasterTransformer](https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer/v1) is very efficient when all sequences in a batch have roughly the same length.\nHowever, if the lengths of sequences in the same batch vary a lot, padding them to the same length wastes both memory and computation.\n\nConsider the following case:\n\n``` python\nbert_input = [[\"Hi\"], [\"Picking\"], [\"The\", \"seed\", \"of\", \"Job's\", 
\"tears\"]]\nbert_tokens = [[1], [2], [3,4,5,6,7]]\nbert_tokens_padded = [[1, 0, 0, 0, 0], [2, 0, 0, 0, 0], [3, 4, 5, 6, 7]]\nbert_tokens_mask = [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 1, 1, 1, 1]]\n```\n\nThis input includes 3 sequences and the maximum length is 5. If we simply treat it as a 3x5 matrix, only 7 out of 15 values are meaningful.\n\nIn Effective Transformer, we still take the input batch as a padded matrix, but padding values are dynamically removed and restored during different calculation stages.\n\nBy calculating the prefix sum of the input [mask matrix](https://github.com/google-research/bert/blob/master/modeling.py#L115), we can access real inputs in each sequence in a matrix with no padding values.\nThe following figure illustrates how to access valid inputs and dynamically remove and restore padding values during the calculation.\nAll valid inputs are colored in green while padding values are colored in gray.\n\n\u003cimg src=\"./images/1.png\" width=\"50%\" height=\"50%\"\u003e\n\n\n## Environment requirements\n\n* CMake \u003e= 3.12\n* gcc \u003e= 6\n* CUDA 10.0\n* Python \u003e= 3.5\n* TensorFlow 1.15.x\n\n\n## Features\n\n* dynamic batch size\n* inference with float32 and float16\n\n## Performance\n\nBERT-Base, layers=12, head_num=12, size_per_head=64\n\nIntel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz\n\nSequence lengths are generated by:\n\n``` python\n# per-sequence lengths, uniform in [2 * avg_seq_len - max_seq_len, max_seq_len]\nseq_len = np.random.randint(\n    low = 2 * avg_seq_len - max_seq_len,\n    high = max_seq_len + 1,\n    size = (batch_size),\n    dtype = np.int32)\n```\n\n### Tesla V100, float16, maximum sequence length=32, average sequence length≈20\n| batch_size | XLA (in ms)| Faster Transformer (in ms) | Speedup over XLA | Effective Transformer (in ms) | Speedup over XLA\n|:-------------:|:-------------:|:---------:|:-----------:|:-----------:|:-----------:|\n| 100  | 15.08 | 10.39 | 1.45 | 8.75 | 1.72 |\n| 200  | 28.08 | 19.64 | 1.43 | 15.32 | 1.83 |\n| 300  | 41.37 | 29.65 | 1.40 | 22.18 | 1.86 
|\n| 400  | 53.65 | 38.52 | 1.39 | 28.31 | 1.89 |\n| 500  | 66.86 | 48.13 | 1.39 | 33.08 | 2.02 |\n| 1000  | 131.46 | 95.01 | 1.38 | 64.34 | 2.04 |\n\n### Tesla V100, float16, maximum sequence length=64, average sequence length≈40\n| batch_size | XLA (in ms)| Faster Transformer (in ms) | Speedup over XLA | Effective Transformer (in ms) | Speedup over XLA\n|:-------------:|:-------------:|:---------:|:-----------:|:-----------:|:-----------:|\n| 100  | 28.31 | 20.27 | 1.40 | 16.03 | 1.77 |\n| 200  | 54.47 | 40.08 | 1.36 | 30.15 | 1.81 |\n| 300  | 80.53 | 59.11 | 1.36 | 41.27 | 1.95 |\n| 400  | 106.5 | 78.38 | 1.36 | 54.12 | 1.97 |\n| 500  | 132.35 | 98.03 | 1.37 | 65.92 | 2.01 |\n| 1000  | 261.18 | 190.91 | 1.38 | 133.61 | 1.95 |\n\n### Tesla V100, float32, maximum sequence length=64, average sequence length≈40\n| batch_size | XLA (in ms)| Faster Transformer (in ms) | Speedup over XLA | Effective Transformer (in ms) | Speedup over XLA\n|:-------------:|:-------------:|:---------:|:-----------:|:-----------:|:-----------:|\n| 100  | 103.13 | 98.52 | 1.05 | 67.45 | 1.53 |\n| 200  | 207.40 | 198.86 | 1.04 | 125.44 | 1.65 |\n| 300  | 304.99 | 290.55 | 1.05 | 197.07 | 1.55 |\n| 400  | 405.98 | 386.04 | 1.05 | 247.39 | 1.64 |\n| 500  | 516.88 | 496.90 | 1.04 | 325.37 | 1.59 |\n\n### Tesla T4, float16, maximum sequence length=32, average sequence length≈20\n| batch_size | XLA (in ms)| FasterTransformer (in ms) | Speedup over XLA | EffectiveTransformer (in ms) | Speedup over XLA |\n|:----------:|:----------:|:---------:|:-----------:|:-----------:|:-----------:|\n| 100  | 44.94 | 35.07 | 1.28 | 28.63 | 1.57 |\n| 200  | 90.09 | 67.08 | 1.34 | 53.84 | 1.67 |\n| 300  | 136.88 | 100.96 | 1.35 | 82.74 | 1.65 |\n| 400  | 184.80 | 133.13 | 1.39 | 109.09 | 1.69 |\n| 500  | 242.79 | 166.54 | 1.46 | 136.66 | 1.78 |\n\n### Tesla T4, float16, maximum sequence length=64, average sequence length≈40\n| batch_size | XLA (in ms)| FasterTransformer (in ms) | Speedup over XLA | 
EffectiveTransformer (in ms) | Speedup over XLA |\n|:----------:|:----------:|:---------:|:-----------:|:-----------:|:-----------:|\n| 100  | 87.23 | 65.86 | 1.30 | 52.01 | 1.68 |\n| 200  | 176.91 | 138.53 | 1.34 | 108.33 | 1.63 |\n| 300  | 261.25 | 204.99 | 1.36 | 157.84 | 1.65 |\n| 400  | 355.34 | 272.96 | 1.33 | 202.61 | 1.75 |\n| 500  | 452.62 | 343.89 | 1.33 | 250.78 | 1.80 |\n\n## Run demo\n\nUsing the prebuilt Python package requires Python 3.5+, TensorFlow 1.15.x, and CUDA 10.0; tested on Debian 9.\n\n```\n$ cd effective_transformer\n$ pip install -e python\n\n$ python benchmark.py --help\nusage: benchmark.py [-h] [-c CONFIG] [-p {fp32,fp16}] [-b BATCH_SIZE]\n                    [-m MAX_SEQ_LENGTH] [-a AVG_SEQ_LENGTH]\n\nBert performance measuring sample.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -c CONFIG, --config CONFIG\n                        Bert config file.\n  -p {fp32,fp16}, --precision {fp32,fp16}\n                        Weight precision.\n  -b BATCH_SIZE, --batch_size BATCH_SIZE\n                        Batch size.\n  -m MAX_SEQ_LENGTH, --max_seq_length MAX_SEQ_LENGTH\n                        Max sequence length.\n  -a AVG_SEQ_LENGTH, --avg_seq_length AVG_SEQ_LENGTH\n                        Average sequence length.\n```\n\n## Build from source\n`TF_PATH : path to libtensorflow_framework.so`\n```\n$ mkdir build \u0026\u0026 cd build\n$ cmake -DTF_PATH=/your/path/to/pythonx.x/site-packages/tensorflow_core/ ..\n$ make\n$ cp lib/libtf_effectivetransformer.so ../python/effective_transformer/libtf_effectivetransformer.so.1.15\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytedance%2Feffective_transformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbytedance%2Feffective_transformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytedance%2Feffective_transformer/lists"}