{"id":40564330,"url":"https://github.com/ai4sd/multiscale-byte-lm","last_synced_at":"2026-01-21T01:06:26.344Z","repository":{"id":271862157,"uuid":"905257865","full_name":"ai4sd/multiscale-byte-lm","owner":"ai4sd","description":"A hierarchical LM that scales to training on context windows of +5M tokens","archived":false,"fork":false,"pushed_at":"2025-06-25T20:32:36.000Z","size":3451,"stargazers_count":8,"open_issues_count":1,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-08-20T12:28:08.352Z","etag":null,"topics":["bytes","deep-learning","language-model","llm","machine-learning"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2502.14553","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ai4sd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-12-18T13:19:44.000Z","updated_at":"2025-06-25T20:32:39.000Z","dependencies_parsed_at":"2025-02-11T08:35:23.206Z","dependency_job_id":"5e6434ed-9056-4dbb-b98a-3ba7d50a8ef4","html_url":"https://github.com/ai4sd/multiscale-byte-lm","commit_stats":null,"previous_names":["ai4sd/multiscale-byte-lm","ai4sd/mblm"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/ai4sd/multiscale-byte-lm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ai4sd%2Fmultiscale-byte-lm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ai4sd%2Fmultiscale-byte-lm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ai4sd%2Fmultiscale-byte-lm
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ai4sd%2Fmultiscale-byte-lm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ai4sd","download_url":"https://codeload.github.com/ai4sd/multiscale-byte-lm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ai4sd%2Fmultiscale-byte-lm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28620574,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-20T23:49:58.628Z","status":"ssl_error","status_checked_at":"2026-01-20T23:47:29.996Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bytes","deep-learning","language-model","llm","machine-learning"],"created_at":"2026-01-21T01:06:25.822Z","updated_at":"2026-01-21T01:06:26.339Z","avatar_url":"https://github.com/ai4sd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multiscale Byte Language Model\n\n![PyPI - Version](https://img.shields.io/pypi/v/mblm) ![PyPI - Types](https://img.shields.io/pypi/types/mblm)\n![GitHub tag check runs](https://img.shields.io/github/check-runs/ai4sd/mblm/main)\n![GitHub License](https://img.shields.io/github/license/ai4sd/mblm)\n\nThe Multiscale Byte Language Model is a model-agnostic, hierarchical 
architecture for causal byte-level language modeling that scales to million-length sequences.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/ai4sd/multiscale-byte-lm/refs/heads/main/assets/mblm.png\" alt=\"mblm-architecture\" width=\"600\"/\u003e\n\u003c/p\u003e\n\n## Install\n\nMBLM is tested against Python versions 3.10, 3.11, 3.12 and 3.13.\n\nInstall from PyPI:\n\n```\npip install mblm\n```\n\nFor `uv`:\n\n```\nuv add mblm\n```\n\n### Using Torch and Mamba\n\nYou will need to **install a recent PyTorch version manually**. We use `\u003e=2.6.0`. It is best to do this after installing the package since some sub-dependencies might install their own (CPU) PyTorch version.\n\n```\npip install 'torch\u003e=2.6.0' --index-url https://download.pytorch.org/whl/cu124\n```\n\nFor `uv`:\n\n```\nuv pip install 'torch\u003e=2.6.0' --index-url https://download.pytorch.org/whl/cu124\n```\n\nFinally, in order to use the efficient [Mamba-SSM](https://github.com/state-spaces/mamba), follow their instructions on the homepage. You'll need Linux and a GPU available during installation.\n\n```\npip install \"mamba-ssm\u003e=2.2.2\" \"causal-conv1d\u003e=1.4.0\" --no-build-isolation\n```\n\nFor `uv`:\n\n```\nuv pip install \"mamba-ssm\u003e=2.2.2\" \"causal-conv1d\u003e=1.4.0\" --no-build-isolation\n```\n\nIf `mamba-ssm` is not available, we fall back to using `mambapy`, which is written in pure PyTorch.\n\n## Quickstart\n\n### Using a built-in stage block\n\nMBLM can be used with the default Transformer Decoder or Mamba block. 
The below model is a 2D MBLM with a global Mamba and local Transformer model.\n\n```py\nimport torch\n\nfrom mblm import (\n    MBLM,\n    MambaBlock,\n    MBLMModelConfig,\n    MBLMReturnType,\n    TransformerBlock,\n)\n\nmblm = MBLM(\n    MBLMModelConfig(\n        num_tokens=257,\n        hidden_dims=[1024, 1024],\n        seq_lens=[1024, 8],\n        num_layers=[5, 5],\n        pad_token_id=256,\n        train_checkpoint_chunks=None,\n        block=[\n            MambaBlock(\n                d_state=128,\n                d_conv=4,\n                expand=2,\n                headdim=64,\n                pos_emb_type=None,\n            ),\n            TransformerBlock(\n                attn_head_dims=64,\n                attn_num_heads=16,\n                attn_use_rot_embs=True,\n                use_flash_attn=True,\n                pos_emb_type=\"fixed\",\n            ),\n        ],\n    )\n)\n\nx = torch.randint(0, 258, (1, 12)).long()\n\n# Choose between any of the return types\nlogits = mblm.forward(x, return_type=MBLMReturnType.LOGITS)\nloss = mblm.forward(x, return_type=MBLMReturnType.LOSS)\nloss, logits = mblm.forward(x, return_type=MBLMReturnType.LOSS_LOGITS)\n\nassert logits.shape == (1, 12, 257)\nassert loss.ndim == 0\n```\n\nAlternatively, you can read configuration from a YAML string (or file):\n\n```py\nimport torch\nimport yaml\n\nfrom mblm import MBLM, MBLMModelConfig, MBLMReturnType\n\nyml_model_config = \"\"\"\nnum_tokens: 257\nhidden_dims: [1024, 1024]\nseq_lens: [1024, 8]\nnum_layers: [5, 5]\npad_token_id: 256\ntrain_checkpoint_chunks: null\nblock:\n    - d_state: 128\n      d_conv: 4\n      expand: 2\n      headdim: 64\n      pos_emb_type: null\n    - attn_head_dims: 64\n      attn_num_heads: 16\n      attn_use_rot_embs: true\n      use_flash_attn: true\n      pos_emb_type: fixed\n\"\"\"\n\nparsed_config = yaml.safe_load(yml_model_config)\nmblm = MBLM(MBLMModelConfig.model_validate(parsed_config))\nx = torch.randint(0, 258, (1, 
12)).long()\nmblm.forward(x, return_type=MBLMReturnType.LOSS)\n```\n\n### Custom stage blocks\n\nYou can define custom stage blocks for MBLM as follows. A stage block must provide a `block_type` field as well as a `to_model` function with the signature below that returns a `torch.nn.Module`. Beyond that, add whatever other parameters your block needs. Note that the default blocks (Transformer and Mamba) are already registered.\n\n```py\nimport torch\n\nfrom mblm import MBLM, MBLMModelConfig, MBLMReturnType, TransformerBlock\nfrom mblm.model.block import StageBlock\n\n# Define any custom model\nclass LSTM(torch.nn.Module):\n    def __init__(self, lstm: torch.nn.LSTM):\n        super().__init__()\n        self.lstm = lstm\n\n    def forward(self, input_ids: torch.Tensor) -\u003e torch.Tensor:\n        # Wrap the LSTM forward to extract the output\n        out, _ = self.lstm(input_ids)\n        return out\n\n# Add a block config and inherit from StageBlock\nclass LSTMBlock(StageBlock):\n    block_type: str = \"lstm\"\n\n    # Add whatever is needed\n    dropout: float\n\n    def to_model(self, model_dim: int, num_layers: int) -\u003e torch.nn.Module:\n        return LSTM(\n            torch.nn.LSTM(\n                input_size=model_dim,\n                hidden_size=model_dim,\n                batch_first=True,\n                dropout=self.dropout,\n                num_layers=num_layers,\n            )\n        )\n\nmblm = MBLM(\n    MBLMModelConfig(\n        num_tokens=257,\n        hidden_dims=[1024, 1024],\n        seq_lens=[1024, 8],\n        num_layers=[5, 5],\n        pad_token_id=256,\n        train_checkpoint_chunks=None,\n        block=[\n            LSTMBlock(\n                dropout=0.1,\n                pos_emb_type=None,\n            ),\n            TransformerBlock(\n                attn_head_dims=64,\n                attn_num_heads=16,\n                attn_use_rot_embs=True,\n                use_flash_attn=True,\n                
pos_emb_type=\"fixed\",\n            ),\n        ],\n    )\n)\n\nx = torch.randint(0, 258, (1, 12)).long()\nmblm.forward(x, return_type=MBLMReturnType.LOSS)\n```\n\nIf you want to parse a YAML config to a custom block, **register the block** before creating the model:\n\n```py\nimport torch\nimport yaml\n\nfrom mblm import MBLM, MBLMModelConfig, MBLMReturnType\nfrom mblm.model.block import StageBlock\nfrom mblm.model.config import block_registry  # Import this!\n\n# Define any custom model\nclass LSTM(torch.nn.Module):\n    def __init__(self, lstm: torch.nn.LSTM):\n        super().__init__()\n        self.lstm = lstm\n\n    def forward(self, input_ids: torch.Tensor) -\u003e torch.Tensor:\n        # Wrap the LSTM forward to extract the output\n        out, _ = self.lstm(input_ids)\n        return out\n\n# Add a block config and inherit from StageBlock\n@block_registry.register()\nclass LSTMBlock(StageBlock):\n    block_type: str = \"lstm\"\n\n    # Add whatever is needed\n    dropout: float\n    my_property: int\n\n    def to_model(self, model_dim: int, num_layers: int) -\u003e torch.nn.Module:\n        return LSTM(\n            torch.nn.LSTM(\n                input_size=model_dim,\n                hidden_size=model_dim,\n                batch_first=True,\n                dropout=self.dropout,\n                num_layers=num_layers,\n            )\n        )\n\nyml_model_config = \"\"\"\nnum_tokens: 257\nhidden_dims: [1024, 1024]\nseq_lens: [1024, 8]\nnum_layers: [5, 5]\npad_token_id: 256\ntrain_checkpoint_chunks: null\nblock:\n    - dropout: 0.1\n      my_property: 1\n      pos_emb_type: null\n    - attn_head_dims: 64\n      attn_num_heads: 16\n      attn_use_rot_embs: true\n      use_flash_attn: true\n      pos_emb_type: fixed\n\"\"\"\n\nparsed_config = yaml.safe_load(yml_model_config)\nmblm = MBLM(MBLMModelConfig.model_validate(parsed_config))\nx = torch.randint(0, 258, (1, 12)).long()\nmblm.forward(x, return_type=MBLMReturnType.LOSS)\n```\n\n### Custom 
datasets\n\nIf you want to use the MBLM trainer with [torchrun](https://pytorch.org/docs/stable/elastic/run.html) with a custom dataset, you will need to add a few special methods. Here is an end-to-end example where you launch training on your own:\n\n```py\n# Filename: train_my_mblm.py\n\nimport torch\nfrom typing_extensions import Unpack\n\nfrom mblm import MambaBlock, TransformerBlock\nfrom mblm.data.datasets import DistributedDataset, DistributedDatasetConfig\nfrom mblm.data.types import BatchWithLossMask, ModelMode\nfrom mblm.train.core.config import CoreTrainConfig\nfrom mblm.train.mblm import (\n    TrainEntryConfig,\n    TrainMBLMIoConfig,\n    TrainMBLMParams,\n    dataset_registry,\n    train_mblm,\n)\n\n# Register dataset with a unique ID\n@dataset_registry.register(\"mydataset\")\nclass MyDataset(DistributedDataset[BatchWithLossMask]):\n    def __init__(\n        self,\n        mode: ModelMode,\n        dataset_dir: str,\n        **args: Unpack[DistributedDatasetConfig],\n    ):\n        # Dummy example - Get data from anywhere, e.g., the disk\n        print(f\"Reading dataset from {dataset_dir}\")\n        if mode == ModelMode.TRAIN:\n            data = list(range(10_000))\n        else:\n            data = list(range(2_000))\n        self._data = data\n        super().__init__(\n            data_size=len(data),\n            is_sequential=True,  # We have a sequential dataset\n            **args,\n        )\n\n    def get_sample(self, from_idx: int):\n        \"\"\"\n        Tell the superclass how to get a single sample - here, a sequence of\n        the specified length.\n        \"\"\"\n        data = torch.tensor(self._data[from_idx : from_idx + self.seq_len])\n        return torch.ones_like(data), data\n\n    @staticmethod\n    def from_train_entry_config(\n        config: TrainEntryConfig,\n        mode: ModelMode,\n        worker_id: int,\n        num_workers: int,\n    ) -\u003e DistributedDataset[BatchWithLossMask]:\n        \"\"\"\n        
How to parse a training config to a dataset.\n        \"\"\"\n        return MyDataset(\n            dataset_dir=config.io.dataset_dir,\n            mode=mode,\n            seq_len=config.params.input_seq_len,\n            num_workers=num_workers,\n            worker_id=worker_id,\n        )\n\n    @staticmethod\n    def supports_test_mode() -\u003e bool:\n        \"\"\"\n        Whether or not this dataset supports a test mode. Some datasets might not\n        expose the answers in their test set so we cannot evaluate a model on it.\n        Override if necessary\n        \"\"\"\n        return True\n\n\nconfig = TrainEntryConfig(\n    io=TrainMBLMIoConfig(\n        dataset_dir=\"data/datasets/my-dataset\",\n        dataset_id=\"mydataset\",  # Must match the ID above\n        name_model=\"my-model\",\n        output_dir=\"data/outputs\",\n        num_models_to_save=3,\n        validate_amount=20,\n        log_train_loss_amount=100,\n    ),\n    train=CoreTrainConfig(\n        batch_size=1,\n        target_elements=1000,\n        target_elements_strategy=\"sequence\",\n        learning_rate=0.001,\n        gradient_accumulate_every=4,\n        gradient_clipping=1,\n        shuffle_train=True,\n        shuffle_eval=False,\n    ),\n    params=TrainMBLMParams(\n        input_seq_len=128,\n        num_tokens=257,\n        hidden_dims=[512, 512],\n        seq_lens=[16, 8],\n        num_layers=[5, 5],\n        pad_token_id=256,\n        train_checkpoint_chunks=None,\n        block=[\n            MambaBlock(\n                d_state=128,\n                d_conv=4,\n                expand=2,\n                headdim=64,\n                pos_emb_type=None,\n            ),\n            TransformerBlock(\n                attn_head_dims=64,\n                attn_num_heads=16,\n                attn_use_rot_embs=True,\n                use_flash_attn=True,\n                pos_emb_type=\"fixed\",\n            ),\n        ],\n    ),\n)\n\nif __name__ == \"__main__\":\n    
train_mblm(config)\n\n```\n\nThen, run the above file with:\n\n```sh\nOMP_NUM_THREADS=1 uv run torchrun --standalone \\\n    --nproc_per_node=gpu train_my_mblm.py\n```\n\nTraining is usually started from a config file in YAML format. The example above only illustrates how everything works together.\n\nCheck the [example configs](config) - they should look very similar to the config above - and how we launch training (with `scripts/train_mblm.py`). With any config, simply run:\n\n```bash\nbash scripts/train_mblm.py -c \u003cyour-config\u003e\n```\n\nThis launches [torchrun](https://pytorch.org/docs/stable/elastic/run.html) with all the necessary configuration.\n\nAlternatively, you can always subclass the core trainer and do things your way. There are many examples in the source directory and the end-to-end tests.\n\n## Streaming responses\n\nAs a byte language model, MBLM generates integer representations of bytes. We can hook into the generation process and stream all generated bytes directly to a [file object](https://docs.python.org/3/glossary.html#term-file-object) such as `sys.stdout` (for debugging or interactive sessions) or `io.TextIO` and `io.BinaryIO` stream interfaces.\n\nLet's assume our model is conditioned to generate the following text string:\n\n```\n👉🏽 bytes generated by a 🤖\n```\n\nIn UTF-8 bytes, this corresponds to:\n\n```sh\n# hex representation\nf0 9f 91 89 f0 9f 8f bd 20 62 79 74 65 73 20 67 65 6e 65 72 61 74 65 64 20 62 79 20 61 20 f0 9f a4 96\n\n# integer representation\n240 159 145 137 240 159 143 189 32 98 121 116 101 115 32 103 101 110 101 114 97 116 101 100 32 98 121 32 97 32 240 159 164 150\n```\n\nInternally, these integers are what the model generates. 
However, you may have trained the model to output a different modality, such as a PNG file or an MP4 video.\n\nFor simplicity, let's assume we have some `root_dir` and a function `create_mblm` to create an MBLM module.\n\n### Streaming to a file\n\nWe can **stream the response directly to a file** - no need to specify the encoding. All we need to do is open a file in binary mode. In this example, the output is UTF-8 encoded text.\n\n```py\nfrom pathlib import Path\n\nfrom mblm.utils.stream import ByteStreamer\n\nmblm = create_mblm(...)\n\n# any modality that the model learns to output - .png, .txt, .bin, etc.\nfile_path = Path(root_dir) / \"output.txt\"\n\n# open in binary mode and write raw bytes\nwith Path(file_path).open(\"wb\") as file:\n    with ByteStreamer(stream=file) as streamer:\n        mblm.generate(stream=streamer)\n\n# we can open the file and interpret its content as UTF-8\nwith Path(file_path).open(\"r\", encoding=\"utf8\") as file:\n    assert file.read() == \"👉🏽 bytes generated by a 🤖\"\n```\n\n### Streaming to stdout\n\nFor development and interactive sessions, we can **stream the response directly to the terminal**. We can either decode the bytes from UTF-8 on the fly or, when the bytes represent something other than text, stream the raw integer bytes to the terminal.\n\n```py\nimport sys\n\nfrom mblm.utils.stream import ByteStreamer\n\nmblm = create_mblm(...)\n\n# approach 1: stream to stdout and decode on the fly\nwith ByteStreamer(stream=sys.stdout, decode_utf8=True) as streamer:\n    mblm.generate(stream=streamer)\n\n# streams the decoded bytes to the terminal:\n# 👉🏽 bytes generated by a 🤖\n\n# approach 2: stream raw output to stdout\nwith ByteStreamer(stream=sys.stdout) as streamer:\n    mblm.generate(stream=streamer)\n\n# streams the bytes as integers to the terminal:\n# 240 159 145 ... 
159 164 150\n```\n\nUTF-8 decoding uses the [`replace` strategy](https://docs.python.org/3/library/codecs.html#error-handlers) for malformed data, which keeps decoding going even when a sequence is partially corrupted. Whenever `decode_utf8` is `False`, raw bytes are streamed and you'll need to deal with corrupted UTF-8 sequences on your own.\n\n## Local development setup\n\nWe use `uv` for packaging and dependency management. Before proceeding, install a recent version (\u003e= `0.5`) via the instructions on [the homepage](https://docs.astral.sh/uv/getting-started/installation/).\n\n### Install dependencies\n\n- With CUDA: `make install_cuda`\n- CPU only (e.g., macOS): `make install_cpu`\n\nYou may have noticed that there are two SSM/Mamba dependencies:\n\n- `mambapy`, defined in `pyproject.toml`\n- `mamba-ssm` (with `causal-conv1d`), defined in `Makefile`\n\nBecause the official Mamba implementation `mamba-ssm` requires a Linux machine and a GPU available during installation, we shim the dependencies. `mambapy` is used as a fallback for all unsupported platforms or when `mamba-ssm` is not installed. Because installing `mamba-ssm` is delicate, it must be installed manually:\n\n```sh\nmake install_mamba\n```\n\nFor experiments, we prefer the newer Mamba 2 block from `mamba-ssm`. If the import of this module fails, we fall back to a Mamba 1 block from `mambapy`, which is written in pure PyTorch.\n\n## Running scripts\n\n- Project-related tasks (e.g., installing dependencies, running tests) are defined in the [Makefile](Makefile)\n\n## Pre-Commit Hooks\n\nBefore every commit, we lint the _staged_ Python and Jupyter Notebook files and check if they are formatted correctly. Doing this locally speeds up development because one does not have to wait for the CI to catch issues. Errors found by these checks are not fixed automatically; instead, you will have to fix the files yourself before committing. 
You may bypass hooks with `git commit -m \u003cmessage\u003e --no-verify`. However, the CI will likely fail in this case.\n\nAll pre-commit hooks can be run manually as well:\n\n- `pre-commit run lint`\n- `pre-commit run check-format`\n\nNote that:\n\n- The `lint` command is similar to the `make lint` command, but the `make` command operates on _all_ files in the project and not just the staged files\n- While `check-format` simply _checks_ the format, `make format` will _actually_ format the files\n\n## Citation\n\n```bibtex\n@inproceedings{egli2025multiscalebytelanguagemodels,\n      title={Multiscale Byte Language Models - A Hierarchical Architecture for Causal Million-Length Sequence Modeling},\n      author={Eric Egli and Matteo Manica and Jannis Born},\n      booktitle={ICML 2025 Workshop on Long-Context Foundation Models},\n      year={2025},\n      url={https://arxiv.org/abs/2502.14553},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fai4sd%2Fmultiscale-byte-lm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fai4sd%2Fmultiscale-byte-lm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fai4sd%2Fmultiscale-byte-lm/lists"}