{"id":13789240,"url":"https://github.com/PotatoSpudowski/fastLLaMa","last_synced_at":"2025-05-12T05:31:56.098Z","repository":{"id":144927792,"uuid":"617081761","full_name":"PotatoSpudowski/fastLLaMa","owner":"PotatoSpudowski","description":"fastLLaMa: An experimental high-performance framework for running Decoder-only LLMs with 4-bit quantization in Python using a C/C++ backend.","archived":false,"fork":false,"pushed_at":"2023-06-02T15:34:53.000Z","size":7745,"stargazers_count":410,"open_issues_count":9,"forks_count":27,"subscribers_count":10,"default_branch":"main","last_synced_at":"2024-11-18T03:36:51.723Z","etag":null,"topics":["c","cpp","lama","lamacpp","python"],"latest_commit_sha":null,"homepage":"https://potatospudowski.github.io/fastLLaMa/","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PotatoSpudowski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-03-21T17:01:31.000Z","updated_at":"2024-11-09T08:44:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"80913e4d-f817-402f-ae8b-dcb638f028a9","html_url":"https://github.com/PotatoSpudowski/fastLLaMa","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PotatoSpudowski%2FfastLLaMa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PotatoSpudowski%2FfastLLaMa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PotatoSpudowski%2FfastLLaMa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Potat
oSpudowski%2FfastLLaMa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PotatoSpudowski","download_url":"https://codeload.github.com/PotatoSpudowski/fastLLaMa/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253682402,"owners_count":21946929,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","cpp","lama","lamacpp","python"],"created_at":"2024-08-03T21:01:00.478Z","updated_at":"2025-05-12T05:31:55.376Z","avatar_url":"https://github.com/PotatoSpudowski.png","language":"C","funding_links":[],"categories":["Tools"],"sub_categories":["Other"],"readme":"# fastLLaMa\n\n[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n\n`fastLLaMa` is an experimental high-performance framework designed to tackle the challenges associated with deploying large language models (LLMs) in production environments. \n\n\nIt offers a user-friendly Python interface to a C++ library, [llama.cpp](https://github.com/ggerganov/llama.cpp), enabling developers to create custom workflows, implement adaptable logging, and seamlessly switch contexts between sessions. 
This framework is geared towards enhancing the efficiency of operating LLMs at scale, with ongoing development focused on introducing features such as optimized cold boot times, Int4 support for NVIDIA GPUs, model artifact management, and multiple programming language support.\n\n```\n                ___            __    _    _         __ __      \n                | | '___  ___ _| |_ | |  | |   ___ |  \\  \\ ___ \n                | |-\u003c_\u003e |\u003c_-\u003c  | |  | |_ | |_ \u003c_\u003e ||     |\u003c_\u003e |\n                |_| \u003c___|/__/  |_|  |___||___|\u003c___||_|_|_|\u003c___|\n                                                            \n                                                                                        \n                                                                           \n                                                       .+*+-.                \n                                                      -%#--                  \n                                                    :=***%*++=.              \n                                                   :+=+**####%+              \n                                                   ++=+*%#                   \n                                                  .*+++==-                   \n                  ::--:.                           .**++=::                   \n                 #%##*++=......                    =*+==-::                   \n                .@@@*@%*==-==-==---:::::------::==*+==--::                   \n                 %@@@@+--====+===---=---==+=======+++----:                   \n                 .%@@*++*##***+===-=====++++++*++*+====++.                   \n                 :@@%*##%@@%#*%#+==++++++=++***==-=+==+=-                    \n                  %@%%%%%@%#+=*%*##%%%@###**++++==--==++                     \n                  #@%%@%@@##**%@@@%#%%%%**++*++=====-=*-                     \n                  -@@@@@@@%*#%@@@@@@@%%%%#+*%#++++++=*+.    
                 \n                   +@@@@@%%*-#@@@@@@@@@@@%%@%**#*#+=-.                       \n                    #%%###%:  ..+#%@@@@%%@@@@%#+-                            \n                    :***#*-         ...  *@@@%*+:                            \n                     =***=               -@%##**.                            \n                    :#*++                -@#-:*=.                            \n                     =##-                .%*..##                             \n                      +*-                 *:  +-                             \n                      :+-                :+   =.                             \n                       =-.               *+   =-                             \n                        :-:-              =--  :::                           \n                                                                           \n\n```\n---\n\n## Features\n- [x] Easy-to-use Python interface that allows developers to build custom workflows.\n    - [x] Pip install support.\n- [x] Ability to ingest system prompts.\n    - [x] System prompts remain in runtime memory; normal prompts are recycled.\n- [x] Customisable logger support.\n- [x] Low memory mode support using mmap.\n- [x] Quick context switching between sessions.\n    - [x] Ability to save and load session states.\n- [x] Quick LoRA adapter switching during runtime.\n    - [x] When LoRA adapters are converted to a bin file, the result of the matrix multiplication is cached to avoid an expensive calculation on every context switch.\n    - [x] Possible quantization of LoRA adapters with minimal performance degradation. 
(FP16 supported)\n    - [x] Attach and Detach support during runtime.\n    - [x] Support to attach and detach adapters for models running using mmap.\n- [ ] Cold boot time optimization using multithreading.\n    - [x] Improve loading using threads.\n    - [ ] Support for `aio_read` for posix.\n    - [ ] Experiment with Linux `io_uring`.\n- [x] [Web Socket Server](https://github.com/PotatoSpudowski/fastLLaMa/tree/websocket-server).\n- [x] [Web UI for chat](https://github.com/PotatoSpudowski/fastLLaMa/tree/webui).  \n- [ ] Implement Multimodal models like MiniGPT-4\n    - [ ] Implement ViT and Q-Former \n    - [ ] TBD ...\n- [ ] Int4 support for NVIDIA GPUs.\n- [ ] Model artifact management support.\n- [ ] Multiple programming language support.\n\n### Supported Models\n- [X] LLaMA 🦙\n- [X] Alpaca\n- [X] GPT4All\n- [X] Chinese LLaMA / Alpaca\n- [X] Vigogne (French)\n- [X] Vicuna\n- [X] Koala\n---\n\n## Requirements\n1. CMake\n\n    * For Linux: \\\n    ```sudo apt-get -y install cmake```\n\n    * For OS X: \\\n    ```brew install cmake```\n\n\n   * For Windows \\\nDownload cmake-*.exe installer from [Download page](https://cmake.org/download/) and run it.\n\n2. GCC 11 or greater\n3. Minimum C++ 17\n4. 
Python 3.x\n\n## Installation \n\nTo install `fastLLaMa` through pip use\n\n```bash\npip install git+https://github.com/PotatoSpudowski/fastLLaMa.git@main\n```\n\n## Usage\n\n### Importing the package\n\nTo import fastLLaMa just run\n\n```python\nfrom fastllama import Model \n```\n\n### Initializing the Model\n```python\nMODEL_PATH = \"./models/7B/ggml-model-q4_0.bin\"\n\nmodel = Model(\n        path=MODEL_PATH, #path to model\n        num_threads=8, #number of threads to use\n        n_ctx=512, #context size of model\n        last_n_size=64, #size of last n tokens (used for repetition penalty) (Optional)\n        seed=0, #seed for random number generator (Optional)\n        n_batch=128, #batch size (Optional)\n        use_mmap=False, #use mmap to load model (Optional)\n    )\n```\n\n### Ingesting Prompts\n```python\nprompt = \"\"\"Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.\n\nUser: Hello, Bob.\nBob: Hello. How may I help you today?\nUser: Please tell me the largest city in Europe.\nBob: Sure. 
The largest city in Europe is Moscow, the capital of Russia.\nUser: \"\"\"\n\nres = model.ingest(prompt, is_system_prompt=True) #ingest the prompt into the model\n```\n\n### Generating Output\n```python\ndef stream_token(x: str) -\u003e None:\n    \"\"\"\n    This function is called by the library to stream tokens\n    \"\"\"\n    print(x, end='', flush=True)\n\nres = model.generate(\n    num_tokens=100, \n    top_p=0.95, #top p sampling (Optional)\n    temp=0.8, #temperature (Optional)\n    repeat_penalty=1.0, #repetition penalty (Optional)\n    streaming_fn=stream_token, #streaming function\n    stop_words=[\"User:\", \"\\n\"] #stop generation when any of these words is encountered (Optional)\n    )\n```\n### Loading the Model Using Multiple Threads\n\n```python\nmodel = Model(\n        path=MODEL_PATH, #path to model\n        num_threads=8, #number of threads to use\n        n_ctx=512, #context size of model\n        last_n_size=64, #size of last n tokens (used for repetition penalty) (Optional)\n        seed=0, #seed for random number generator (Optional)\n        n_batch=128, #batch size (Optional)\n        load_parallel=True #load the model in parallel using multiple threads\n    )\n```\n\n### Saving Model State\n\nTo cache the session, you can use the `save_state` method.\n\n```python\nres = model.save_state(\"./models/fast_llama.bin\")\n```\n\n### Loading Model State\n\nTo load the session, use the `load_state` method.\n\n```python\nres = model.load_state(\"./models/fast_llama.bin\")\n```\n\n### Resetting the Model State\n\nTo reset the session, use the `reset` method.\n\n```python\nmodel.reset()\n```\n### Attaching LoRA Adapters to Base model during runtime\n\nTo attach a LoRA adapter during runtime, use the `attach_lora` method.\n\n```python\nLORA_ADAPTER_PATH = \"./models/ALPACA-7B-ADAPTER/ggml-adapter-model.bin\"\n\nmodel.attach_lora(LORA_ADAPTER_PATH)\n```\n\nNote: It is a good idea to reset the state of the model after attaching a LoRA adapter.\n\n### Detaching LoRA Adapters from Base model during runtime\n\nTo detach a LoRA adapter 
during runtime, use the `detach_lora` method.\n\n```python\nmodel.detach_lora()\n```\n\n### Calculating perplexity\n\nTo calculate the perplexity, use the `perplexity` method.\n\n```python\n\nwith open(\"test.txt\", \"r\") as f:\n    data = f.read(8000)\n       \ntotal_perplexity = model.perplexity(data)\nprint(f\"Total Perplexity: {total_perplexity:.4f}\")\n```\n\n### Getting the embeddings of the model\n\nTo get the embeddings of the model, use the `get_embeddings` method.\n\n```python\nembeddings = model.get_embeddings()\n```\n\n### Getting the logits of the model\n\nTo get the logits of the model, use the `get_logits` method.\n\n```python\nlogits = model.get_logits()\n```\n\n### Using the logger\n\n```python\nfrom fastllama import Logger\n\nclass MyLogger(Logger):\n    def __init__(self):\n        super().__init__()\n        self.file = open(\"logs.log\", \"w\")\n\n    def log_info(self, func_name: str, message: str) -\u003e None:\n        #Modify this to do whatever you want when you see info logs\n        print(f\"[Info]: Func('{func_name}') {message}\", flush=True, end='', file=self.file)\n        pass\n    \n    def log_err(self, func_name: str, message: str) -\u003e None:\n        #Modify this to do whatever you want when you see error logs\n        print(f\"[Error]: Func('{func_name}') {message}\", flush=True, end='', file=self.file)\n    \n    def log_warn(self, func_name: str, message: str) -\u003e None:\n        #Modify this to do whatever you want when you see warning logs\n        print(f\"[Warn]: Func('{func_name}') {message}\", flush=True, end='', file=self.file)\n```\n\nFor more clarity, check the `examples/python/` folder.\n\n### Running LLaMA\n```sh\n# obtain the original LLaMA model weights and place them in ./models\nls ./models\n65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model\n\n# convert the 7B model to ggml FP16 format\n# python [PythonFile] [ModelPath] [Floattype] [Vocab Only] [SplitType]\npython3 scripts/convert-pth-to-ggml.py 
models/7B/ 1 0\n\n# quantize the model to 4 bits\n./build/src/quantize models/7B/ggml-model-f16.bin models/7B/ggml-model-q4_0.bin 2\n\n# run the inference\n# Run the scripts from the root dir of the project for now!\npython ./examples/python/example.py\n```\n\n### Running Alpaca-LoRA \n\n```sh\n# Before running this command\n# You need to provide the HF model paths here\npython ./scripts/export-from-huggingface.py\n# Alternatively you can just download the ggml models from huggingface directly and run them! \n\npython3 ./scripts/convert-pth-to-ggml.py models/ALPACA-LORA-7B 1 0\n\n./build/src/quantize models/ALPACA-LORA-7B/ggml-model-f16.bin models/ALPACA-LORA-7B/alpaca-lora-q4_0.bin 2\n\npython ./examples/python/example-alpaca.py\n```\n\n### Using LoRA adapters during runtime\n\n```sh\n# Download lora adapters and paste them inside models folder\n# https://huggingface.co/tloen/alpaca-lora-7b\n\n\npython scripts/convert-lora-to-ggml.py models/ALPACA-7B-ADAPTER/ -t fp32 \n# Change -t to fp16 to use fp16 weights\n# In order to use LoRA adapters without caching, pass the --no-cache flag\n#   - Only supported for fp32 adapter weights\n\npython examples/python/example-lora-adapter.py\n\n# Make sure to set paths correctly for the base model and adapter inside the example\n# Commands: \n# load_lora: Attaches the adapter to the base model \n# unload_lora: Detaches the adapter (detaching for fp16 is yet to be added!)\n# reset: Resets the model state\n```\n\n### Running the webUI\n\nTo run the [WebSocket Server](https://github.com/PotatoSpudowski/fastLLaMa/tree/websocket-server) and the [WebUI](https://github.com/PotatoSpudowski/fastLLaMa/tree/webui), follow the instructions on the respective branches.\n\n### Memory/Disk Requirements\n\nAs the models are currently fully loaded into memory, you will need adequate disk space to save them\nand sufficient RAM to load them. 
At the moment, memory and disk requirements are the same.\n\n| model size | original size | quantized size (4-bit) |\n|-------|---------------|------------------------|\n| 7B    | 13 GB         | 3.9 GB                 |\n| 13B   | 24 GB         | 7.8 GB                 |\n| 30B   | 60 GB         | 19.5 GB                |\n| 65B   | 120 GB        | 38.5 GB                |\n\n**Info:** Inference may require extra memory at runtime!\\\n(Depends on the hyperparameters used during model initialization)\n\n### Contributing\n* Contributors can open PRs\n* Collaborators can push branches to the repo and merge PRs into the main branch\n* Collaborators will be invited based on contributions\n* Any help with managing issues and PRs is much appreciated!\n* Make sure to read about our [vision](https://github.com/PotatoSpudowski/fastLLaMa/discussions/46)\n\n### Notes\n\n* Tested on\n    * Hardware: Apple silicon, Intel, Arm (Pending)\n    * OS: macOS, Linux, Windows (Pending), Android (Pending)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPotatoSpudowski%2FfastLLaMa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FPotatoSpudowski%2FfastLLaMa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPotatoSpudowski%2FfastLLaMa/lists"}