{"id":28059652,"url":"https://github.com/harrisonvanderbyl/rwkvstic","last_synced_at":"2025-05-12T08:08:26.086Z","repository":{"id":65453590,"uuid":"592628271","full_name":"harrisonvanderbyl/rwkvstic","owner":"harrisonvanderbyl","description":"Framework agnostic python runtime for RWKV models","archived":false,"fork":false,"pushed_at":"2023-08-24T07:27:46.000Z","size":4136,"stargazers_count":146,"open_issues_count":7,"forks_count":16,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-05-12T08:08:17.694Z","etag":null,"topics":["ai","deep-learning","pytorch","tensor"],"latest_commit_sha":null,"homepage":"https://hazzzardous-rwkv-instruct.hf.space","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/harrisonvanderbyl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-24T06:46:26.000Z","updated_at":"2025-04-01T13:44:25.000Z","dependencies_parsed_at":null,"dependency_job_id":"39347d41-83d7-4ba7-8f67-9124655889a8","html_url":"https://github.com/harrisonvanderbyl/rwkvstic","commit_stats":null,"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harrisonvanderbyl%2Frwkvstic","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harrisonvanderbyl%2Frwkvstic/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harrisonvanderbyl%2Frwkvstic/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harrisonvanderbyl%2Frwkvstic/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/harrisonvanderbyl","download_url":"https://codeload.github.com/harrisonvanderbyl/rwkvstic/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253700620,"owners_count":21949696,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","deep-learning","pytorch","tensor"],"created_at":"2025-05-12T08:08:25.543Z","updated_at":"2025-05-12T08:08:26.051Z","avatar_url":"https://github.com/harrisonvanderbyl.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# RWKVSTIC\n\nRwkvstic, pronounced however you want to, is a library for interfacing with and using RWKV-V4 based models.\n\nRwkvstic does not auto-install its dependencies, as its main purpose is to be dependency agnostic and usable with whatever library you prefer.\n\nWhen using BlinkDL's pretrained models, it is advised to have the `torch` package installed.\n\nSome options, when left blank, will elicit a prompt asking you to choose a value.\nFor this purpose, please ensure you have the `inquirer` package installed.\n\n## Note\n### As of RWKVSTIC 2.0, the default mode is GPU with FASTQUANT, my own custom implementation of strategy=\"cuda fp32i8\".\n\n### Please check out the strategy section on [RWKV](https://pypi.org/project/rwkv/) for other strategies, or look at the advanced modes below.\n\n## Tables and graphs\n\n### Rwkv-4 models -\u003e recommended vram\n\n```\nrwkvstic vram\nModel | 8bit | bf16/fp16 | fp32\n14B   | 16GB | 28GB      | 
\u003e50GB\n7B    | 8GB  | 14GB      | 28GB\n3B    | 2.8GB| 6GB       | 12GB\n1B5   | 1.3GB| 3GB       | 6GB\n```\n\n## Installation\n\n```bash\npip install rwkvstic\n```\n\n## Basic Usage\n\n```python\nfrom rwkvstic.load import RWKV\n\n# Load the model (supports full path, relative path, and remote paths)\n\nmodel = RWKV(\n    \"https://huggingface.co/BlinkDL/rwkv-4-pile-3b/resolve/main/RWKV-4-Pile-3B-Instruct-test1-20230124.pth\"\n)\n\nmodel.loadContext(newctx=\"Q: who is Jim Butcher?\\n\\nA:\")\noutput = model.forward(number=100)[\"output\"]\n\nprint(output)\n\n# Q: who is Jim Butcher?\n# A: Jim Butcher is a very popular American author of fantasy novels. He’s known for the Dresden Files series of novels.\u003c|endoftext|\u003e\n```\n\n## RWKV wrapper\n\n### You can pass any compatible rwkv strategy string to override the default behavior and use the original BlinkDL package\n\n```python\nmodel = RWKV(\n    \"https://huggingface.co/BlinkDL/rwkv-4-pile-3b/resolve/main/RWKV-4-Pile-3B-Instruct-test1-20230124.pth\",\n    strategy=\"cuda fp32\"\n)\n```\n\n## Exporting\n\n### You can export the default FASTQUANT mode for quick downloading and loading, as it has a smaller file size and uses less RAM and disk space\n\n```py\nmodel = RWKV(\n    \"https://huggingface.co/BlinkDL/rwkv-4-pile-3b/resolve/main/RWKV-4-Pile-3B-Instruct-test1-20230124.pth\",\n    export=\"myfilename\"\n)\n\n# exported model as myfilename.rwkv\n```\n\n```py\nmodel = RWKV(\n    \"myfilename.rwkv\",\n)\n```\n\n## Advanced Usage\n\n## Step 1: load the model with your choice of poison\n\n### PyTorch\n\n```python\nimport torch\nfrom rwkvstic.load import RWKV\nfrom rwkvstic.agnostic.backends import TORCH\n\n# this is the dtype used for trivial operations, such as vector-\u003evector operations and is the dtype that will determine the accuracy of the model\nruntimedtype = torch.float32 # torch.float64, torch.bfloat16\n\n# this is the dtype used for matrix-vector operations, and is the dtype that will determine the 
performance and memory usage of the model\ndtype = torch.bfloat16 # torch.float32, torch.float64, torch.bfloat16\n\nuseGPU = True # False\n\nmodel = RWKV(\"path/to/model.pth\", mode=TORCH, useGPU=useGPU, runtimedtype=runtimedtype, dtype=dtype)\n```\n\n### JAX\n\n```python\nfrom rwkvstic.load import RWKV\nfrom rwkvstic.agnostic.backends import JAX\n\n# Jax will automatically use the GPU if available, and will use the CPU if not available\n\nmodel = RWKV(\"path/to/model.pth\", mode=JAX)\n```\n\n### TensorFlow\n\n```python\nfrom rwkvstic.load import RWKV\nfrom rwkvstic.agnostic.backends import TF\n\nuseGPU = True # False\n\nmodel = RWKV(\"path/to/model.pth\", mode=TF, useGPU=useGPU)\n```\n\n### Numpy\n\n```python\nfrom rwkvstic.load import RWKV\nfrom rwkvstic.agnostic.backends import NUMPY\n\n# you masochistic bastard\nmodel = RWKV(\"path/to/model.pth\", mode=NUMPY)\n```\n\n### Streaming\n\n#### Trade performance for lower vram usage\n\n```python\nimport torch\nfrom rwkvstic.load import RWKV\nfrom rwkvstic.agnostic.backends import TORCH_STREAM\n\n# this is the dtype used for trivial operations, such as vector-\u003evector operations and is the dtype that will determine the accuracy of the model\nruntime_dtype = torch.float32 # torch.float64, torch.bfloat16\n\n# this is the dtype used for matrix-vector operations, and is the dtype that will determine the performance and memory usage of the model\ndtype = torch.bfloat16 # torch.float32, torch.float64, torch.bfloat16\n\n# this is the amount of GB you want to use for matrix storage, if the model is too large, matrices will be stored in ram and moved to the GPU as needed\ntarget = 4\n\n# Pin Memory is used to speed up the transfer of data to the GPU, but will use more memory, both on the GPU and on the CPU\npin_memory = True\n\nmodel = RWKV(\"path/to/model.pth\", mode=TORCH_STREAM, runtimedtype=runtime_dtype, dtype=dtype, target=target, pinMem=pin_memory)\n```\n\n### Multi-GPU\n\n#### Model weights are split (sharded) across multiple 
GPUs\n\n```python\nimport torch\nfrom rwkvstic.load import RWKV\nfrom rwkvstic.agnostic.backends import TORCH_SPLIT\n\n# this is the dtype used for trivial operations, such as vector-\u003evector operations and is the dtype that will determine the accuracy of the model\nruntime_dtype = torch.float32 # torch.float64, torch.bfloat16\n\n# this is the dtype used for matrix-vector operations, and is the dtype that will determine the performance and memory usage of the model\ndtype = torch.bfloat16 # torch.float32, torch.float64, torch.bfloat16\n\nmodel = RWKV(\"path/to/model.pth\", mode=TORCH_SPLIT, runtimedtype=runtime_dtype, dtype=dtype)\n```\n\n### Quantization\n\n#### Uses close to half the memory of float16, but is slightly less accurate, and is about 4x slower\n\n```python\nimport torch\nfrom rwkvstic.load import RWKV\nfrom rwkvstic.agnostic.backends import TORCH_QUANT\n\n# this is the dtype used for trivial operations, such as vector-\u003evector operations and is the dtype that will determine the accuracy of the model\nruntime_dtype = torch.float32 # torch.float64, torch.bfloat16\n\n# this is the number of chunks to split the matrix rows into before row quantization, the more chunks, the more accurate the model will be, but with some minor trade-offs\nchunksize = 4\n\nuseGPU = True # False\n\n# this is the amount of GB you want to use for matrix storage, if the model is too large, matrices will be stored in ram and moved to the GPU as needed, same as stream\ntarget = 4\n\nmodel = RWKV(\"path/to/model.pth\", mode=TORCH_QUANT, runtimedtype=runtime_dtype, chunksize=chunksize, useGPU=useGPU, target=target)\n```\n\n## Step 2: State management\n\n### The state\n\nThe state is a vectorized value that is a representation of all the previous inputs and outputs of the model. 
It is basically the memory of the model, and is used to generate the next output.\n\nThe model keeps an internal state, so the following calls are useful in that regard.\n\n```python\nmodel = RWKV(\"path/to/model.pth\")\n\nemptyState = model.emptyState\nmodel.setState(emptyState)\ncurrentMem = model.getState()\n```\n\n## Step 3: Injecting context\n\n### Injecting context\n\nWhen you want to influence the output of the model, you can inject context into the model. This is done by using the `loadContext` function.\n\n```python\nmodel = RWKV(\"path/to/model.pth\")\n\nmodel.loadContext(newctx=\"Q: who is Jim Butcher?\\n\\nA:\")\n\nprint(model.forward(number=100)[\"output\"])\n\nmodel.loadContext(newctx=\"Can you tell me more?\\n\\nA:\")\n```\n\n## Step 4: Generating output\n\n### Generating output\n\nWhen you want to generate output, you can use the `forward` function.\n\n```python\nmodel = RWKV(\"path/to/model.pth\")\n\nnumber = 100 # the number of tokens to generate\nstopStrings = [\"\\n\\n\"] # when any of these strings is generated, the model will stop generating output\n\nstopTokens = [0] # advanced, when the model has generated any of these tokens, it will stop generating output\n\ntemp = 1 # the temperature of the model, higher values will result in more random output, lower values will result in more predictable output\n\ntop_p = 0.9 # the top_p of the model, higher values will result in more random output, lower values will result in more predictable output\n\ndef progressLambda(properties):\n    # \"logits\", \"state\", \"output\", \"progress\", \"tokens\", \"total\", \"current\"\n    print(\"progress:\",properties[\"progress\"]/properties[\"total\"])\n\noutput = model.forward(number=number, stopStrings=stopStrings, stopTokens=stopTokens, temp=temp, top_p=top_p, progressLambda=progressLambda)\n\nprint(output[\"output\"]) # the generated output\nprint(output[\"state\"]) # the state of the model after generation\nprint(output[\"logits\"]) # the logits of the model after generation, before 
sampling\n```\n\n# Implementation Details\n\n## The RWKVOP object\n\nThis is a base class that, when overridden, allows operations to be swapped out for their equivalents in different frameworks. I'll show you the JAX one, as it's relatively simple.\n\n```python\n\nclass RWKVJaxOps(RWKVOp.module):\n    def __init__(self, layers, embed, preJax=False):\n        from jax import numpy as npjax\n        super().__init__(layers, embed)\n        # convert from torch to jax\n        self.initTensor = lambda x: npjax.array(x.float().cpu().numpy())\n        # jax math functions\n        self.sqrt = lambda x: npjax.sqrt(x)\n        self.mean = lambda x: npjax.mean(x)\n        self.relu = lambda x: npjax.maximum(x, 0)\n        self.exp = lambda x: npjax.exp(x)\n        self.matvec = npjax.matmul\n        self.lerp = lambda x, y, z: x*(1-z) + y*(z)\n        self.minimum = lambda x, y: npjax.minimum(x, y)\n        self.log = npjax.log\n        def ln(x, w, b):\n            xee2 = x - self.mean(x)\n\n            x2 = self.sqrt(self.mean(xee2*xee2) + 0.000009999999747378752)\n\n            return w*(xee2/x2) + b\n\n        self.layernorm = ln\n\n        # constants and stuff\n        self.klimit = npjax.array([18] * embed)\n        self.stack = lambda x: x\n\n        # module def\n        self.module = object\n\n        # function overwrites (used for advanced stuff)\n        self.initfunc = lambda x: x\n        self.layerdef = lambda x: x\n        self.mainfunc = lambda x: x\n\n        # The empty state\n        self.emptyState = npjax.array([[0.01]*embed]*4*layers)\n```\n\nThis can then be used to construct and infer the model.\n\n## Stream, Split And Quant\n\nThe stream, split and quant backends are all PyTorch variants that use some tricks to use less memory, or to distribute memory usage across multiple GPUs.\n\nI'll show you the important stuff, usually consisting of how the matrices are constructed, and how they are used to create a matvec.\n\n(Disclaimer, just similar to the actual code, 
not the actual code, actual code is messy and gross)\n\n### Stream\n\n```python\n# Pinning memory allows for faster transfer between CPU and GPU, but uses more memory\ndef pinmem(x):\n    return x.pin_memory() if pinMem and x.device.type == \"cpu\" else x\n\n\ndef initMatrix(x):\n    # if more memory is used than the target specified, then it is sent to the cpu\n    if torch.cuda.max_memory_reserved(0)/1024/1024/1024 \u003e target:\n        x = x.cpu()\n    else:\n        x = x.cuda(non_blocking=True)\n    return pinmem(x)\n\n# for the matvec, it just brings it to the correct device as needed\ndef matvec(z, y):\n    return z.to(y.device, non_blocking=True) @ y\n```\n\n### Split\n\n```python\ndef initMatrix(x):\n    devices = [torch.device(\"cuda\", i) for i in range(torch.cuda.device_count())]\n    # split the matrix into the number of devices\n    x = torch.split(x, x.shape[0]//len(devices), dim=0)\n    # send each part to a different device\n    x = [i.to(devices[i], non_blocking=True) for i in range(len(x))]\n    return x\n\n# for the matvec, split the vector into the number of devices, and then send each part to the correct device\ndef matvec(z, y):\n    devices = [torch.device(\"cuda\", i) for i in range(torch.cuda.device_count())]\n    y = torch.split(y, y.shape[0]//len(devices), dim=0)\n    y = [i.to(devices[i], non_blocking=True) for i in range(len(y))]\n    # do the matvec on each part\n    z = [z[i].mv(y[i]) for i in range(len(z))]\n    # put them all on one device\n    z = [i.to(devices[0], non_blocking=True) for i in z]\n    # add them all together\n    z = torch.sum(torch.stack(z), dim=0)\n    return z\n```\n\n### Quant\n\n```python\ndef QuantizeMatrix(x, runtimeDtype, device):\n    rang = 255\n    ran, mini = (x.max(0)[0]-x.min(0)[0])/rang,  x.min(0)[0]\n    x = x.double()\n    x = ((x-mini)/ran)\n\n    x = x.to(\n        dtype=torch.uint8, non_blocking=True, device=device)\n\n    return x, ran.to(runtimeDtype).to(device=device), 
mini.to(runtimeDtype).to(device=device)\n\ndef MatVec(x, y, runtimedtype):\n    # resize y into a 2d array\n    y = y.reshape(chunksize, -1)\n\n    # retrieve the quantized matrix, the spread, and the offset\n    rx, spread, zpoint = x\n\n    # spread the y vector across the spread matrix\n    yy = y*spread\n\n    # convert the quantized matrix back to the runtime dtype\n    rx = rx.to(dtype=runtimedtype)\n\n    # we can use matmul to do a batched matvec for each split matrix\n    xmain = rx.matmul(yy.reshape(yy.shape[0], -1, 1)).sum(0).squeeze()\n\n    # the offset is added to the result\n    return xmain + torch.tensordot(zpoint, y)\n\ndef initMatrix(x):\n    # by splitting the matrix before quantizing, it allows for much better results\n    splitmatrices = torch.chunk(x, chunksize, 1)\n    xx = [QuantizeMatrix(x, runtimedtype, dev)\n            for x in splitmatrices]\n    xxo = torch.stack([x[0] for x in xx])\n    xx1 = torch.stack([x[1] for x in xx])\n    xx2 = torch.stack([x[2] for x in xx])\n    return xxo, xx1, xx2\n```\n\n## PreQuantization\n\nYou can prequantize the matrices to save loading time, and bandwidth when downloading the model.\n\n```bash\ncd /path/to/folder/with/model\npython3 -m rwkvstic --pq\n\n# what model to prequantize?\n# -\u003e model.pth\n\nls\n# model.pth\n# model.pqth\n```\n\nYou can load these pre-quantized models as you would a normal file.\n\n```python\nfrom rwkvstic.load import RWKV\n\nmodel = RWKV(\"model.pqth\")\n```\n\n## Onnx export\n\nYou can export the model to onnx, and then use onnxruntime/rwkvstic to infer the model.\n\n```python\nfrom rwkvstic.load import RWKV\nfrom rwkvstic.agnostic.backends import ONNX_EXPORT\nimport torch\n\nmodel = RWKV(\"model.pth\", mode=ONNX_EXPORT, dtype=torch.float16) # or torch.float32\n# the model is exported to model_{layers}_{embed}.onnx\n# the external data is stored in model_{layers}_{embed}.bin\n```\n\n### rwkvstic onnx running\n\n```py\nfrom rwkvstic.load import RWKV\n\nmodel = 
RWKV(\"model_12_768.onnx\")\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharrisonvanderbyl%2Frwkvstic","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharrisonvanderbyl%2Frwkvstic","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharrisonvanderbyl%2Frwkvstic/lists"}