{"id":13535047,"url":"https://github.com/zhihu/cuBERT","last_synced_at":"2025-04-02T00:32:07.727Z","repository":{"id":43040202,"uuid":"175349798","full_name":"zhihu/cuBERT","owner":"zhihu","description":"Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL","archived":false,"fork":false,"pushed_at":"2020-11-18T07:36:58.000Z","size":100227,"stargazers_count":518,"open_issues_count":8,"forks_count":83,"subscribers_count":21,"default_branch":"master","last_synced_at":"2024-08-02T08:09:52.459Z","etag":null,"topics":["bert","cuda","deep-learning","inference","mkl","predict","tensorflow","transformer"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zhihu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-03-13T05:04:39.000Z","updated_at":"2024-08-02T04:00:45.000Z","dependencies_parsed_at":"2022-07-19T12:59:14.124Z","dependency_job_id":null,"html_url":"https://github.com/zhihu/cuBERT","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhihu%2FcuBERT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhihu%2FcuBERT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhihu%2FcuBERT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zhihu%2FcuBERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zhihu","download_url":"https://codeload.github.com/zhihu/cuBERT/tar.gz/refs/heads/master","host":{"name":"GitHub",
"url":"https://github.com","kind":"github","repositories_count":222788514,"owners_count":17037777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","cuda","deep-learning","inference","mkl","predict","tensorflow","transformer"],"created_at":"2024-08-01T08:00:49.032Z","updated_at":"2024-11-02T23:30:18.068Z","avatar_url":"https://github.com/zhihu.png","language":"C++","funding_links":[],"categories":["BERT Deploy Tricks:","C++"],"sub_categories":[],"readme":"Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL\n=====================================================================================\n\n[![Build Status](https://travis-ci.org/zhihu/cuBERT.svg?branch=master)](https://travis-ci.org/zhihu/cuBERT)\n\nHighly customized and optimized BERT inference directly on NVIDIA (CUDA,\nCUBLAS) or Intel MKL, *without* tensorflow and its framework overhead.\n\n**ONLY** BERT (Transformer) is supported.\n\n# Benchmark\n\n### Environment\n\n* Tesla P4\n* 28 * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz\n* Debian GNU/Linux 8 (jessie)\n* gcc (Debian 4.9.2-10+deb8u1) 4.9.2\n* CUDA: release 9.0, V9.0.176\n* MKL: 2019.0.1.20181227\n* tensorflow: 1.12.0\n* BERT: seq_length = 32\n\n### GPU (cuBERT)\n\n|batch size|128 (ms) |32 (ms) |\n|---       |---      |---     |\n|tensorflow|255.2    |70.0    |\n|cuBERT    |**184.6**|**54.5**|\n\n### CPU (mklBERT)\n\n|batch size|128 (ms) |1 (ms)  |\n|---       |---      |---     |\n|tensorflow|1504.0   |69.9    |\n|mklBERT   |**984.9**|**24.0**|\n\nNote: MKL should be run under `OMP_NUM_THREADS=?` to control its 
thread\nnumber. Other useful environment variables and example values include:\n\n* `KMP_BLOCKTIME=0`\n* `KMP_AFFINITY=granularity=fine,verbose,compact,1,0`\n\n### Mixed Precision\n\ncuBERT can be accelerated by [Tensor Core](https://developer.nvidia.com/tensor-cores)\nand [Mixed Precision](https://devblogs.nvidia.com/tensor-cores-mixed-precision-scientific-computing)\non NVIDIA Volta and Turing GPUs. We support mixed precision with variables\nstored in fp16 and computation performed in fp32. The typical accuracy error\nis less than 1% compared with single precision inference, while inference\nruns more than 2x faster.\n\n# API\n\n[API .h header](/src/cuBERT.h)\n\n### Pooler\n\nWe support the following 2 pooling methods.\n\n* The standard BERT pooler, which is defined as:\n\n```python\nwith tf.variable_scope(\"pooler\"):\n  # We \"pool\" the model by simply taking the hidden state corresponding\n  # to the first token. We assume that this has been pre-trained\n  first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)\n  self.pooled_output = tf.layers.dense(\n    first_token_tensor,\n    config.hidden_size,\n    activation=tf.tanh,\n    kernel_initializer=create_initializer(config.initializer_range))\n```\n\n* Simple average pooler:\n\n```python\nself.pooled_output = tf.reduce_mean(self.sequence_output, axis=1)\n```\n\n### Output\n\nThe following outputs are supported:\n\n|cuBERT_OutputType      |python code                   |\n|---                    |---                           |\n|cuBERT_LOGITS          |[`model.get_pooled_output() * output_weights + output_bias`](https://github.com/google-research/bert/blob/d66a146741588fb208450bde15aa7db143baaa69/run_classifier.py#L607)|\n|cuBERT_PROBS           |`probs = tf.nn.softmax(logits, axis=-1)`|\n|cuBERT_POOLED_OUTPUT   |`model.get_pooled_output()`   |\n|cuBERT_SEQUENCE_OUTPUT |`model.get_sequence_output()` |\n|cuBERT_EMBEDDING_OUTPUT|`model.get_embedding_output()`|\n\n# Build from 
Source\n\n```shell\nmkdir build \u0026\u0026 cd build\n# to build with CUDA\ncmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_GPU=ON -DCUDA_ARCH_NAME=Common ..\n# or to build with MKL\ncmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_MKL_SUPPORT=ON ..\nmake -j4\n\n# install to /usr/local\n# it will also install MKL if -DcuBERT_ENABLE_MKL_SUPPORT=ON\nsudo make install\n```\n\nIf you would like to run tfBERT_benchmark for performance comparison,\nplease first install the tensorflow C API from https://www.tensorflow.org/install/lang_c.\n\n### Run Unit Test\n\nDownload the BERT test model `bert_frozen_seq32.pb` and `vocab.txt` from\n[Dropbox](https://www.dropbox.com/sh/ulcdmu9ysyg5lk7/AADndzKXOrHIXLYRc5k60Q-Ta?dl=0),\nand put them under the `build` directory before running `make test` or `./cuBERT_test`.\n\n### Python\n\nWe provide a simple Python wrapper built with Cython. It can be built and\ninstalled after building the C++ library as follows:\n\n```shell\ncd python\npython setup.py bdist_wheel\n\n# install\npip install dist/cuBERT-xxx.whl\n\n# test\npython cuBERT_test.py\n```\n\nPlease check the Python API usage and examples at [cuBERT_test.py](/python/cuBERT_test.py)\nfor more details.\n\n### Java\n\nThe Java wrapper is implemented through [JNA](https://github.com/java-native-access/jna).\n
After installing Maven and building the C++ library, it can be built as follows:\n\n```shell\ncd java\nmvn clean package # -DskipTests\n```\n\nWhen using the Java JAR, you need to set `jna.library.path` to the\nlocation of `libcuBERT.so` if it is not installed in the system path.\nAlso, `jna.encoding` should be set to UTF8 with `-Djna.encoding=UTF8`\nin the JVM start-up script.\n\nPlease check the Java API usage and example at [ModelTest.java](/java/src/test/java/com/zhihu/cubert/ModelTest.java)\nfor more details.\n\n# Install\n\nA pre-built Python binary package (currently only with MKL on Linux) can\nbe installed as follows:\n\n* Download and install [MKL](https://github.com/intel/mkl-dnn/releases)\nto the system path.\n\n* Download the wheel package and `pip install cuBERT-xxx-linux_x86_64.whl`\n\n* Run `python -c 'import libcubert'` to verify your installation.\n\n# Dependency\n\n### Protobuf\n\ncuBERT is built with [protobuf-c](https://github.com/protobuf-c/protobuf-c) to\navoid version and code conflicts with the tensorflow protobuf.\n\n### CUDA\n\nLibraries compiled with different CUDA versions are not compatible.\n\n### MKL\n\nMKL is dynamically linked. `sudo make install` installs both cuBERT and MKL.\n\n# Threading\n\nWe assume the typical use case of cuBERT is online serving, where\nconcurrent requests of different batch_size should be served as fast as\npossible. Thus, throughput and latency should be balanced, especially in\na pure CPU environment.\n\nAs the vanilla [class Bert](/src/cuBERT/Bert.h) is not thread-safe\nbecause of its internal buffers for computation, a wrapper [class BertM](/src/cuBERT/BertM.h)\nis written to hold locks of different `Bert` instances for thread safety.\n`BertM` chooses one underlying `Bert` instance in a round-robin\nmanner, and consecutive requests to the same `Bert` instance might be\nqueued on its corresponding lock.\n\n### GPU\n\nOne `Bert` is placed on one GPU card. 
The maximum number of concurrent requests is\nthe number of usable GPU cards on one machine, which can be controlled\nby `CUDA_VISIBLE_DEVICES` if it is specified.\n\n### CPU\n\nFor a pure CPU environment, things are more complicated than on GPU. There are 2\nlevels of parallelism:\n\n1. Request level. Concurrent requests will compete for CPU resources if the\nonline server itself is multi-threaded. If the server is single-threaded\n(for example, some server implementations in Python), things will be much\neasier.\n\n2. Operation level. The matrix operations are parallelized by OpenMP and\nMKL. The maximum parallelism is controlled by `OMP_NUM_THREADS`,\n`MKL_NUM_THREADS`, and many other environment variables. We suggest that\nusers first read [Using Threaded Intel® MKL in Multi-Thread Application](https://software.intel.com/en-us/articles/using-threaded-intel-mkl-in-multi-thread-application)\nand [Recommended settings for calling Intel MKL routines from multi-threaded applications](https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications).\n\nThus, we introduce `CUBERT_NUM_CPU_MODELS` for better control of request-level\nparallelism. This variable specifies the number of `Bert` instances\ncreated on CPU/memory, and plays the same role as `CUDA_VISIBLE_DEVICES` does for\nGPU.\n\n* If you have a limited number of CPU cores (old or desktop CPUs, or in\nDocker), it is not necessary to use `CUBERT_NUM_CPU_MODELS`. 
For example,\nwith 4 CPU cores, a request-level parallelism of 1 and an operation-level\nparallelism of 4 should work quite well.\n\n* But if you have many CPU cores, say 40, it might be better to try a\nrequest-level parallelism of 5 and an operation-level parallelism of 8.\n\nIn summary, `OMP_NUM_THREADS` or `MKL_NUM_THREADS` defines how many threads\none model can use, and `CUBERT_NUM_CPU_MODELS` defines how many models are\ncreated in total.\n\nAgain, per-request latency and overall throughput should be balanced,\nand the best balance depends on the model `seq_length`, `batch_size`, your CPU cores, your\nserver QPS, and many other things. You should run plenty of benchmarks\nto find the best trade-off. Good luck!\n\n# Authors\n\n* fanliwen\n* wangruixin\n* fangkuan\n* sunxian\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhihu%2FcuBERT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzhihu%2FcuBERT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzhihu%2FcuBERT/lists"}