{"id":13592179,"url":"https://github.com/kyegomez/BitNet","last_synced_at":"2025-04-08T23:31:23.572Z","repository":{"id":201025076,"uuid":"706794274","full_name":"kyegomez/BitNet","owner":"kyegomez","description":"Implementation of \"BitNet: Scaling 1-bit Transformers for Large Language Models\" in pytorch","archived":false,"fork":false,"pushed_at":"2024-04-13T14:41:51.000Z","size":38278,"stargazers_count":1254,"open_issues_count":16,"forks_count":117,"subscribers_count":34,"default_branch":"main","last_synced_at":"2024-04-14T03:24:10.037Z","etag":null,"topics":["artificial-intelligence","deep-neural-networks","deeplearning","gpt4","machine-learning","multimodal","multimodal-deep-learning"],"latest_commit_sha":null,"homepage":"https://discord.gg/qUtxnK2NMf","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kyegomez.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["kyegomez"],"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"lfx_crowdfunding":null,"custom":null}},"created_at":"2023-10-18T16:19:06.000Z","updated_at":"2024-04-28T03:59:58.085Z","dependencies_parsed_at":"2024-04-28T04:10:00.665Z","dependency_job_id":null,"html_url":"https://github.com/kyegomez/BitNet","commit_stats":null,"previous_names":["kyegomez/bitnet"],"tags_count":0,"template":false,"template_full_name":"kyegomez/Python-Package-Template","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyegomez%2FBitNet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyegomez%2FBitNet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyegomez%2FBitNet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyegomez%2FBitNet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kyegomez","download_url":"https://codeload.github.com/kyegomez/BitNet/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223346562,"owners_count":17130461,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","deep-neural-networks","deeplearning","gpt4","machine-learning","multimodal","multimodal-deep-learning"],"created_at":"2024-08-01T16:01:06.614Z","updated_at":"2025-04-08T23:31:23.567Z","avatar_url":"https://github.com/kyegomez.png","language":"Python","funding_links":["https://github.com/sponsors/kyegomez"],"categories":["Python","Repos"],"sub_categories":[],"readme":"[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)\n\n# BitNet\n![bitnet](/bitnet.png)\nPyTorch Implementation of the linear methods and model from the paper \"BitNet: Scaling 1-bit Transformers for Large Language Models\"\n\n[Paper link:](https://arxiv.org/pdf/2310.11453.pdf)\n\nBitLinear = tensor -\u003e layernorm -\u003e Binarize -\u003e abs max quantization -\u003e dequant\n\n\"The implementation of the BitNet architecture is quite simple, requiring only the replacement of linear projections (i.e., nn.Linear in PyTorch) in the Transformer. \" -- BitNet is really easy to implement just swap out the linears with the BitLinear modules! \n\n## **NEWS**\n- **New Iteration** 🔥 There is an all-new iteration from the paper \"[The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)\", we're implementing it now. Join the Agora discord and contribute! [Join Here](https://discord.gg/hFzevCjG8c)\n- **New Optimizations** The first `BitLinear` has been optimized and we now have a Bit Attention `BitMGQA` That implements BitLinear into the attention mechanism. Multi Grouped Query Attention is also widely recognized as the best attention for its fast decoding and long context handling, thanks to Frank for his easy to use implementation!\n- **BitLinear 1.5 Launch 🔥**: The new BitLinear 1.5 is still in progress 🔥 [Here is the file]() There are still some bugs like with the dequantization algorithm and we still need to replace the multiplication with elementwisw addition, if you could help with this, this would be amazing.\n- **NOTICE**: A model obviously needs to be finetuned from scratch to use BitLinear, just changing the linear methods in an already trained model isn't going to work. Finetune or train from scratch.\n\n## Appreciation\n- Dimitry, Nullonix for analysis and code review and revision\n- Vyom, for providing 4080 to train!\n\n## Installation\n```bash\npip3 install bitnet\n```\n\n## Usage\nWe have a vast selection of example scripts here and in the [examples folder:](/examples/), let me know if you want assistance with a use-case in the discord!\n\n\n### `BitLinear`\n- Example of the BitLinear layer which is the main innovation of the paper!\n```python\nimport torch\n\nfrom bitnet import BitLinear\n\n# Input\nx = torch.randn(10, 1000, 512)\n\n# BitLinear layer\nlayer = BitLinear(512, 400)\n\n# Output\ny = layer(x)\n\nprint(y)\n\n```\n\n### BitLinearNew\n```python\nimport torch\nfrom bitnet import BitLinearNew\n\n# Create a random tensor of shape (16, 10)\nx = torch.randn(16, 1000, 512)\n\n# Create an instance of the BitLinearNew class with input size 10, output size 20, and 2 groups\nlayer = BitLinearNew(\n    512,\n    20,\n)\n\n# Perform a forward pass through the BitLinearNew layer with input x\noutput = layer(x)\n\n# Print the output tensor\nprint(output)\nprint(output.shape)\n```\n----\n\n### `BitNetTransformer`\n- Fully implemented Transformer as described in the diagram with MHA, and BitFeedforwards\n- Can be utilized not just for text but for images and maybe even video or audio processing\n- Complete with residuals and skip connections for gradient flow\n\n```python\n# Import the necessary libraries\nimport torch\nfrom bitnet import BitNetTransformer\n\n# Create a random tensor of integers\nx = torch.randint(0, 20000, (1, 1024))\n\n# Initialize the BitNetTransformer model\nbitnet = BitNetTransformer(\n    num_tokens=20000,  # Number of unique tokens in the input\n    dim=1024,  # Dimension of the input and output embeddings\n    depth=6,  # Number of transformer layers\n    heads=8,  # Number of attention heads\n    ff_mult=4,  # Multiplier for the hidden dimension in the feed-forward network\n)\n\n# Pass the tensor through the transformer model\nlogits = bitnet(x)\n\n# Print the shape of the output\nprint(logits)\n\n```\n\n\n### `BitAttention`\nThis Attention has been modified to use BitLinear instead of the default linear projection. It's also using Multi-Grouped Query Attention instead of regular multi-head attention for faster decoding and longer context handling.\n\n```python\nimport torch\nfrom bitnet import BitMGQA\n\n# Create a random tensor of shape (1, 10, 512)\nx = torch.randn(1, 10, 512)\n\n# Create an instance of the BitMGQA model with input size 512, 8 attention heads, and 4 layers\ngqa = BitMGQA(512, 8, 4)\n\n# Pass the input tensor through the BitMGQA model and get the output and attention weights\nout, _ = gqa(x, x, x, need_weights=True)\n\n# Print the shapes of the output tensor and attention tensor\nprint(out)\n```\n\n### `BitFeedForward`\n- Feedforward as shown in the diagram with BitLinear and a GELU:\n- Linear -\u003e GELU -\u003e Linear\n- You can add dropouts, or layernorms, or other layers for a better ffn\n\n```python\nimport torch\nfrom bitnet import BitFeedForward\n\n# Create a random input tensor of shape (10, 512)\nx = torch.randn(10, 512)\n\n# Create an instance of the BitFeedForward class with the following parameters:\n# - input_dim: 512\n# - hidden_dim: 512\n# - num_layers: 4\n# - swish: True (use Swish activation function)\n# - post_act_ln: True (apply Layer Normalization after each activation)\n# - dropout: 0.1 (apply dropout with a probability of 0.1)\nff = BitFeedForward(512, 512, 4, swish=True, post_act_ln=True, dropout=0.1)\n\n# Apply the BitFeedForward network to the input tensor x\ny = ff(x)\n\n# Print the shape of the output tensor y\nprint(y)  # torch.Size([10, 512])\n```\n\n## Inference\n```python\nfrom bitnet import BitNetInference\n\nbitnet = BitNetInference()\nbitnet.load_model(\"../model_checkpoint.pth\")  # Download model\noutput_str = bitnet.generate(\"The dog jumped over the \", 512)\nprint(output_str)\n```\n\n## Huggingface Usage\n```python\nimport torch\nfrom transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nfrom bitnet import replace_linears_in_hf\n\n# Load a model from Hugging Face's Transformers\nmodel_name = \"bert-base-uncased\"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForSequenceClassification.from_pretrained(model_name)\n\n# Replace Linear layers with BitLinear\nreplace_linears_in_hf(model)\n\n# Example text to classify\ntext = \"Replace this with your text\"\ninputs = tokenizer(\n    text, return_tensors=\"pt\", padding=True, truncation=True, max_length=512\n)\n\n# Perform inference\nmodel.eval()  # Set the model to evaluation mode\nwith torch.no_grad():\n    outputs = model(**inputs)\n    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)\n    print(predictions)\n\n# Process predictions\npredicted_class_id = predictions.argmax().item()\nprint(f\"Predicted class ID: {predicted_class_id}\")\n\n# Optionally, map the predicted class ID to a label, if you know the classification labels\n# labels = [\"Label 1\", \"Label 2\", ...]  # Define your labels corresponding to the model's classes\n# print(f\"Predicted label: {labels[predicted_class_id]}\")\n```\n\n\n## Drop in Replacement for Pytorch Models\n```python\nimport torch\nfrom torch import nn\nfrom bitnet import replace_linears_in_pytorch_model\n\n# Define a simple model\nmodel = nn.Sequential(\n    nn.Linear(10, 20),\n    nn.ReLU(),\n    nn.Linear(20, 30),\n)\n\nprint(\"Before replacement:\")\nprint(model)\n\n# Replace nn.Linear with BitLinear\nreplace_linears_in_pytorch_model(model)\n\nprint(\"After replacement:\")\nprint(model)\n\n# Now you can use the model for training or inference\n# For example, pass a random input through the model\ninput = torch.randn(1, 10)\noutput = model(input)\n```\n\n\n### Optimized Cuda Kernel\n`python setup.py build_ext --inplace`\n\n```python\nimport torch\nimport gemm_lowbit_ext  # This imports the compiled module\n\n# Example usage\na = torch.randn(10, 20, dtype=torch.half, device='cuda')  # Example tensor\nb = torch.randn(20, 30, dtype=torch.half, device='cuda')  # Example tensor\nc = torch.empty(10, 30, dtype=torch.half, device='cuda')  # Output tensor\n\nw_scale = 1.0  # Example scale factor\nx_scale = 1.0  # Example scale factor\n\n# Call the custom CUDA GEMM operation\ngemm_lowbit_ext.gemm_lowbit(a, b, c, w_scale, x_scale)\n\nprint(c)  # View the result\n\n```\n\n\n## `BitLora`\nImplementation of BitLora!\n\n```python\nimport torch\nfrom bitnet import BitLora\n\n# Random text tensor\nx = torch.randn(1, 12, 200)\n\n# Create an instance of the BitLora model\nmodel = BitLora(in_features=200, out_features=200, rank=4, lora_alpha=1)\n\n# Perform the forward pass\nout = model(x)\n\n# Print the shape of the output tensor\nprint(out.shape)\n```\n\n\n## BitMamba\n```python\nimport torch\nfrom bitnet import BitMamba\n\n# Create a tensor of size (2, 10) with random values between 0 and 100\nx = torch.randint(0, 100, (2, 10))\n\n# Create an instance of the BitMamba model with input size 512, hidden size 100, output size 10, and depth size 6\nmodel = BitMamba(512, 100, 10, 6, return_tokens=True)\n\n# Pass the input tensor through the model and get the output\noutput = model(x)\n\n# Print the output tensor\nprint(output)\n\n# Print the shape of the output tensor\nprint(output.shape)\n\n```\n\n## `BitMoE`\n\n```python\nimport torch\nfrom bitnet.bit_moe import BitMoE\n\n# Create input tensor\nx = torch.randn(2, 4, 8)\n\n# Create BitMoE model with specified input and output dimensions\nmodel = BitMoE(8, 4, 2)\n\n# Forward pass through the model\noutput = model(x)\n\n# Print the output\nprint(output)\n```\n\n\n### 1 Bit Vision Transformers\nThis idea came to me out of nowhere but it seems to be pretty fun, as you can leverage bitlinear for vision tasks for ultra-compression. It would be nice to train this on imagenet if you could make a script, we'll provide the compute. Then the next stage would be to train a join vision language model gpt-4o\n\n```python\nimport torch\nfrom bitnet import OneBitViT\n\n# Create an instance of the OneBitViT model\nv = OneBitViT(\n    image_size=256,\n    patch_size=32,\n    num_classes=1000,\n    dim=1024,\n    depth=6,\n    heads=16,\n    mlp_dim=2048,\n)\n\n# Generate a random image tensor\nimg = torch.randn(1, 3, 256, 256)\n\n# Pass the image through the OneBitViT model to get predictions\npreds = v(img)  # (1, 1000)\n\n# Print the predictions\nprint(preds)\n\n```\n\n# License\nMIT\n\n# Citation\n```bibtex\n@misc{2310.11453,\nAuthor = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Huaijie Wang and Lingxiao Ma and Fan Yang and Ruiping Wang and Yi Wu and Furu Wei},\nTitle = {BitNet: Scaling 1-bit Transformers for Large Language Models},\nYear = {2023},\nEprint = {arXiv:2310.11453},\n}\n\n```\n\n\n# Todo\n- [x] Double check BitLinear implementation and make sure it works exactly as in paper \n- [x] Implement training script for `BitNetTransformer`\n- [x] Train on Enwiki8, copy and past code and data from Lucidrains repos\n- [x] Benchmark performance\n- [x] Look into Straight Through Estimator for non-differentiable backprop\n- [x] Implement BitFeedForward\n- [x] Clean up codebase \n- [x] Add unit tests for each module\n- [x] Implement the new BitNet1.5b from the [paper](https://arxiv.org/abs/2402.17764)\n- [ ] Implement the BitNet15b in Cuda\n- [ ] Implement the low bit gemm cuda kernel ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkyegomez%2FBitNet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkyegomez%2FBitNet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkyegomez%2FBitNet/lists"}