{"id":13819190,"url":"https://github.com/openai/blocksparse","last_synced_at":"2025-05-15T18:10:47.299Z","repository":{"id":55435970,"uuid":"113274212","full_name":"openai/blocksparse","owner":"openai","description":"Efficient GPU kernels for block-sparse matrix multiplication and convolution","archived":false,"fork":false,"pushed_at":"2023-06-08T11:01:25.000Z","size":536,"stargazers_count":1023,"open_issues_count":36,"forks_count":200,"subscribers_count":198,"default_branch":"master","last_synced_at":"2024-10-29T15:51:27.104Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://blog.openai.com/block-sparse-gpu-kernels/","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-06T05:44:27.000Z","updated_at":"2024-10-28T16:40:22.000Z","dependencies_parsed_at":"2024-05-28T03:15:27.041Z","dependency_job_id":"5b883c9e-7219-4c03-bacf-4644468b3604","html_url":"https://github.com/openai/blocksparse","commit_stats":{"total_commits":25,"total_committers":7,"mean_commits":"3.5714285714285716","dds":0.28,"last_synced_commit":"89074c5ccf78e3a88b4aa2aefc9e208d4773dcbc"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fblocksparse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fblocksparse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fblocksparse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fblocksparse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openai","download_url":"https://codeload.github.com/openai/blocksparse/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254394724,"owners_count":22063984,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-04T08:00:41.962Z","updated_at":"2025-05-15T18:10:47.278Z","avatar_url":"https://github.com/openai.png","language":"Cuda","funding_links":[],"categories":["Cuda"],"sub_categories":[],"readme":"**Status:** Active (under active development, breaking changes may occur)\n\n# Blocksparse\n\nThe `blocksparse` package contains TensorFlow Ops and corresponding GPU kernels for block-sparse matrix multiplication.  Also included are related ops like edge bias, sparse weight norm and layer norm.\n\nTo learn more, see [the launch post on the OpenAI blog](https://blog.openai.com/block-sparse-gpu-kernels/).\n\n## Prerequisites\n\nFirst, you need at least one Nvidia GPU. 
For best performance, we recommend using a Pascal or Maxwell generation GPU -- this is the full list of features by GPU type:

| GPU Family | BSMatMul-ASM | BSMatMul-CudaC | BSConv |
|------------|--------------|----------------|--------|
| Kepler     | -            | X              | -      |
| Maxwell    | X (fastest)  | X              | X      |
| Pascal     | X (fastest)  | X              | X      |
| Volta      | -            | X (fastest)    | -      |

Note that BSMatMul-CudaC **only supports `feature_axis=0`**, while BSMatMul-ASM only supports `feature_axis=1`.

Additionally, you need:

- A working Linux installation (we run Ubuntu 16.04) with the Nvidia drivers for your GPU.
- CUDA 8 (in `/usr/local/cuda`)
- Python 2.7 or newer, or 3.5 or newer
- TensorFlow 1.4.0 or newer, [with GPU support](https://www.tensorflow.org/install/install_linux#install_tensorflow) (e.g. `pip install tensorflow-gpu`)

CUDA 9 and Volta will also work, but you must update the build targets (`-gencode=arch=compute_70,code=sm_70`) and build TensorFlow from source.

## Installation

```
pip install blocksparse
```

## Usage

This example performs a block-sparse matrix multiplication:

```
from blocksparse.matmul import BlocksparseMatMul
import tensorflow as tf
import numpy as np

hidden_size = 4096
block_size = 32
minibatch_size = 64

# Create a (random) sparsity pattern
sparsity = np.random.randint(2, size=(hidden_size//block_size, hidden_size//block_size))

# Initialize the sparse matrix multiplication object
bsmm = BlocksparseMatMul(sparsity, block_size=block_size)

# Input to graph
x = tf.placeholder(tf.float32, shape=[None, hidden_size])

# Initialize block-sparse weights
w = tf.get_variable("w", bsmm.w_shape, dtype=tf.float32)

# Block-sparse matrix multiplication
y = bsmm(x, w)

# Run
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
result = sess.run([y], feed_dict={x: np.ones((minibatch_size, hidden_size), dtype='float32')})
print(result)
```

For a more involved example using block-sparse ops to train a language model, see [`examples/`](./examples/).

## Development

If you're interested in hacking on the ops and kernels, go ahead and build from source:

    git clone git@github.com:openai/blocksparse.git
    cd blocksparse

    make compile
    pip install dist/*.whl

    # test it if you like
    test/blocksparse_matmul_test.py
    test/blocksparse_conv_test.py

If your CUDA is not in `/usr/local/cuda`, or you have several versions (e.g. both `/usr/local/cuda-8.0` and `/usr/local/cuda-9.0`), set `CUDA_HOME` to the base path to use when running `make compile`.
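For example, with CUDA 9 installed alongside the default toolkit, the build can be pointed at it explicitly (the exact path here is an assumption about your system):

    CUDA_HOME=/usr/local/cuda-9.0 make compile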
## API Documentation

### blocksparse.matmul

    class BlocksparseMatMul(object)

        def __init__(self, layout, block_size=32, feature_axis=1)
        """
        layout: a 2d array of ones and zeros specifying the block layout
        block_size: values 32, 16, 8 supported
        feature_axis: when block_size is less than 32, memory access becomes far more
                      efficient with a (C,N) activation layout
        """

        # shape helpers for generating tensors (N=minibatch)
        self.w_shape
        def i_shape(self, N)
        def o_shape(self, N)

        # return the coordinates (c,k) in the layout that correspond to a given block id
        def block_coord(self, block)

        # experimental ortho init
        def ortho_init(self)

        # in practice, identity_init + layernorm is all you need for initialization
        # with gpu=True the init is performed by a kernel on the device
        def identity_init(self, gpu=False)

        # to implement weight normalization (in practice, layernorm works much better)
        def l2_normalize(self, W, gain=None, epsilon=1e-6, dtype=np.float32)

        def __call__(self, I, W, dw_dtype=tf.float32)
        """
        Execute the op.  Note that the weight variable is independent from the bsmm object.
        This allows multiple weights to be tied to the same bsmm layout.

        dw_dtype: allows control over dw precision format.
        """


    def group_param_grads(param_grad, group_size=8, cast32=True)
    """
    param_grad: the tensorflow parameter gradient for a given bsmm weight variable
                (returned from tf.gradients)
    group_size: desired group size, up to 8 supported

    This causes the tf graph to be rewritten so that weight grad matmuls from different
    time steps (and shared weights across time) are combined into a single, more
    efficient matmul.  See the sketch at the end of this section.
    """


    class SparseProj(object):
        def __init__(self, nhidden, nproj=None, proj_stride=None, block_size=32, gather_lut=None)
        """
        Experimental class to support dense-to-sparse and sparse-to-dense projections.
        Basically the same as the tensorflow ops, but faster and with support for
        alternate precision formats.  They assume a unique one-to-one mapping, so
        atomics need not be used in the backward ops.
        """

        def gather(self, x)
        def scatter(self, x)
        def scatter_add(self, x, y)
        def scatter_mul(self, x, y)
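To make the `group_param_grads` rewrite concrete, here is a minimal sketch applying it to a weight shared across a small unrolled loop. The loop, loss, and optimizer are illustrative assumptions, not part of the library:

```
import numpy as np
import tensorflow as tf
from blocksparse.matmul import BlocksparseMatMul, group_param_grads

hidden_size = 4096
block_size = 32

layout = np.random.randint(2, size=(hidden_size//block_size, hidden_size//block_size))
bsmm = BlocksparseMatMul(layout, block_size=block_size)

x = tf.placeholder(tf.float32, shape=[None, hidden_size])
w = tf.get_variable("w", bsmm.w_shape, dtype=tf.float32)

# Apply the same weight at every "time step", as a simple RNN would
h = x
for _ in range(4):
    h = bsmm(h, w)
loss = tf.reduce_sum(tf.square(h))

# Rewrite the graph so the per-step weight-grad matmuls are combined
# into a single matmul (assumes the rewritten gradient is returned)
[dw] = tf.gradients(loss, [w])
dw = group_param_grads(dw, group_size=4)

train_op = tf.train.GradientDescentOptimizer(1e-4).apply_gradients([(dw, w)])
```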
### blocksparse.conv

    class BlocksparseConv(object):
        def __init__(self, BCK, TRS, DHW, MPQ=None, strides=(1,1,1), dilates=(1,1,1), padding="SAME", edge_bias=False)
        """
        BCK: (                                             # block(B)/input(C)/output(K) feature dims
                 ( (c0, c1, c2, ...), (k0, k1, k2, ...) ), # block 0: c,k are indices into the C,K dims
                 ( (c0, c1, c2, ...), (k0, k1, k2, ...) ), # block 1
                 ( (c0, c1, c2, ...), (k0, k1, k2, ...) ), # block 2 ...
             )
        TRS: (T,R,S) or (R,S) or (S,)         - filter spatial size dims
        DHW: (D,H,W) or (H,W) or (W,)         - input image spatial size dims
        MPQ: (M,P,Q) or (P,Q) or (Q,) or None - output image spatial size dims (used for ambiguous dims in strided transpose conv)
        strides: (1,1,1) or (1,1) or (1,)
        dilates: (1,1,1) or (1,1) or (1,)
        padding: (1,1,1) or (1,1) or (1,) or "SAME" or "VALID"
        edge_bias: True/False
        """

        # shape helpers for setting up variables or test tensors
        def edge_bias_shape(self)
        def f_shape(self, block=None)
        def i_shape(self, N)
        def o_shape(self, N)

        # execute op passing in param variables and input
        def __call__(self, F, I, edge_bias=None):

        # for implementing weight norm
        def l2_normalize(self, F, gain=None, epsilon=1e-6, dtype=np.float32):

    class BlocksparseDeconv(BlocksparseConv)
        def __init__(self, BCK, TRS, DHW, MPQ=None, strides=(1,1,1), dilates=(1,1,1), padding="SAME", edge_bias=False)
        """
        Deconvolution.  Same params as above.
        """

    def cwise_linear(x, a=None, b=None)
    """
    In the NCHW tensor format, tensorflow is extremely slow at simple broadcasting
    ops on the middle C dim.  This op lets you do:
        y = a*x + b
        y = a*x
        y = x + b

    where a and b are of shape (1,C,1,1).
    This is useful for ops like weight norm.
    """

### blocksparse.ew

    # same as the tf ops, but generally more efficient and with support for custom precision formats
    def        add(x, y, name=None)
    def   multiply(x, y, name=None)
    def   subtract(x, y, name=None)
    def     divide(x, y, name=None)
    def    maximum(x, y, name=None)
    def    minimum(x, y, name=None)

    def   negative(x,    name=None)
    def reciprocal(x,    name=None)
    def     square(x,    name=None)
    def       sqrt(x,    name=None)
    def        exp(x,    name=None)
    def        log(x,    name=None)
    def    sigmoid(x,    name=None)
    def       tanh(x,    name=None)
    def       relu(x,    name=None)
    def        elu(x, alpha=1.0, name=None)

    # here args can be the 4 independent gate tensors, or a single merged gate
    # tensor (which gets split in 4 internally)
    def fused_lstm_gates(c, *args, name=None)

    def split4(x)
    def concat4(x0, x1, x2, x3)

    # a custom cast op to help explore novel precision formats
    def float_cast(x, dtype, dx_dtype=None)

    # a much faster (and non-deterministic) dropout op
    # also supports novel precision formats
    def dropout(x, keep_prob=0.8, mask=None)

    # an op to use in tf.gradients when adding together multiple contributions of a gradient.
    # only 8 inputs are supported: you'd never want a single op to wait on all possible
    # inputs before it starts executing in the graph, and capping the fan-in here
    # reduces the memory footprint
    def add_n8(xs, name=None)
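As a rough illustration, the ew ops can stand in for their tf counterparts, with `float_cast` bracketing a reduced-precision stretch of the graph. The shapes and the `tf.float16` target here are assumptions:

```
import tensorflow as tf
import blocksparse.ew as ew

x = tf.placeholder(tf.float32, shape=[64, 4096])

# drop into a reduced-precision format for part of the graph
h = ew.float_cast(x, dtype=tf.float16)

# drop-in replacements for the corresponding tf elementwise ops
h = ew.relu(ew.add(h, ew.square(h)))

# fast, non-deterministic dropout (also precision-format aware)
h = ew.dropout(h, keep_prob=0.9)

# cast back to fp32 for the rest of the graph
y = ew.float_cast(h, dtype=tf.float32)
```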
### blocksparse.norms

    def layer_norm(x, g, b, axis=1, epsilon=1e-6, relu=False)
    """
    Very fast layernorm that supports both bsmm feature_axis activation layouts.
    Also includes an optional integrated relu (applied at the end).
    """

    # basic batch norm ops for the NCHW layout
    def batch_norm(x, g, b, epsilon=1e-6)
    def batch_norm_inference(x, g, b, m, v, epsilon=1e-6)
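Putting the pieces together, here is a minimal sketch of the identity_init + layernorm recipe suggested in the `BlocksparseMatMul` docs above. The `g`/`b` shapes and the assumption that `identity_init()` returns a TensorFlow-compatible initializer are ours, not the README's:

```
import numpy as np
import tensorflow as tf
from blocksparse.matmul import BlocksparseMatMul
from blocksparse.norms import layer_norm

hidden_size = 4096
block_size = 32

layout = np.random.randint(2, size=(hidden_size//block_size, hidden_size//block_size))
bsmm = BlocksparseMatMul(layout, block_size=block_size)

x = tf.placeholder(tf.float32, shape=[None, hidden_size])

# assumes identity_init() yields an initializer usable with tf.get_variable
w = tf.get_variable("w", bsmm.w_shape, dtype=tf.float32, initializer=bsmm.identity_init())
g = tf.get_variable("g", shape=[hidden_size], initializer=tf.ones_initializer())
b = tf.get_variable("b", shape=[hidden_size], initializer=tf.zeros_initializer())

# layernorm over the feature axis, with the integrated relu applied at the end
y = layer_norm(bsmm(x, w), g, b, axis=1, relu=True)
```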