{"id":13625151,"url":"https://github.com/xtekky/gpt4local","last_synced_at":"2025-08-25T22:10:13.707Z","repository":{"id":225204412,"uuid":"765349785","full_name":"xtekky/gpt4local","owner":"xtekky","description":"Openai-style, fast \u0026 lightweight local language model inference w/ documents","archived":false,"fork":false,"pushed_at":"2024-03-19T20:44:33.000Z","size":10181,"stargazers_count":112,"open_issues_count":0,"forks_count":28,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-02T02:45:53.935Z","etag":null,"topics":["ai","chatbot","chatbots","chatgpt","chatgpt-api","documents","gpt","gpt-4","gpt4free","language-model","llm","llm-inference","local","local-llm","openai","openai-api","python"],"latest_commit_sha":null,"homepage":"https://g4f.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xtekky.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-29T18:47:16.000Z","updated_at":"2025-03-30T10:47:39.000Z","dependencies_parsed_at":"2024-08-01T21:57:39.209Z","dependency_job_id":null,"html_url":"https://github.com/xtekky/gpt4local","commit_stats":null,"previous_names":["gpt4free/g4f-local","gpt4free/gpt4local","xtekky/gpt4local"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/xtekky/gpt4local","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtekky%2Fgpt4local","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtekky%2Fgpt4local/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x
tekky%2Fgpt4local/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtekky%2Fgpt4local/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xtekky","download_url":"https://codeload.github.com/xtekky/gpt4local/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtekky%2Fgpt4local/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272139422,"owners_count":24880304,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-25T02:00:12.092Z","response_time":1107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","chatbot","chatbots","chatgpt","chatgpt-api","documents","gpt","gpt-4","gpt4free","language-model","llm","llm-inference","local","local-llm","openai","openai-api","python"],"created_at":"2024-08-01T21:01:51.393Z","updated_at":"2025-08-25T22:10:13.679Z","avatar_url":"https://github.com/xtekky.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cimg width=\"1148\" alt=\"image\" src=\"https://github.com/gpt4free/gpt4local/assets/98614666/df91ae5f-fa4a-4eb3-9dca-f9d38aa3764b\"\u003e\n\n`g4l` is a high-level Python library that allows you to run language models using the `llama.cpp` bindings. 
It is a sister project to @gpt4free, which also provides AI, but using internet and external providers, as well as additional features such as text retrieval from documents.\n\nPull requests are welcome!\n\n#### Roadmap\n\n- [ ] GUI / playground\n- [ ] Support function calling \u0026 image models\n- [ ] TTS / STT models\n- [ ] Blog article creator (use of multiple queries to produce a high-quality blog article with efficient style prompting and context retrieval)\n- [ ] Allow for passing of more arguments\n- [ ] Improve compatibility / unit tests.\n- [ ] Native binding implementation / more low-level usage of `llama-cpp-python`\n- [ ] Ability to finetune models on datasets / dataset generator\n- [ ] Optimise for devices with low memory and compute (current minimum RAM is 8 GB \u0026 a GPU is preferred)\n- [ ] Blog articles explaining usage and how LLMs work.\n- [ ] Better model list / optimised parameters\n- [ ] Create custom local benchmarking.\n\n\n## Table of Contents\n1. [Requirements](#requirements)\n2. [Installation](#installation)\n3. [Downloading Models](#downloading-models)\n   - [Model Quantization](#model-quantization)\n   - [Best Models](#best-models)\n4. [Usage](#usage)\n   - [Basic Usage](#basic-usage)\n   - [Chat With Documents](#chat-with-documents)\n   - [Document Retrieval](#document-retrieval)\n   - [Advanced Usage](#advanced-usage)\n5. [Benchmark](#benchmark)\n6. [Why gpt4local?](#why-gpt4local)\n\n## Requirements\nTo use G4L, you need to have the llama.cpp Python bindings installed. You can install them using pip:\n```\npip3 install -U llama-cpp-python\n```\n\n## Installation\n1. Clone the G4L repository:\n```\ngit clone https://github.com/gpt4free/gpt4local\n```\n2. Navigate to the cloned directory:\n```\ncd gpt4local\n```\n3. Install the required dependencies:\n```\npip install -r requirements.txt\n```\n\n## Downloading Models\n1. Download the desired models in the `GGUF` format from [HuggingFace](https://huggingface.co/). 
You can find a variety of quantized `.gguf` models on [TheBloke's page](https://huggingface.co/TheBloke).\n2. Place the downloaded models in the [`./models`](/models) folder.\n\nSome popular models include:\n- [mistral-7b-instruct (v2)](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF)\n- [orca-mini-3b](https://gpt4all.io/models/gguf/orca-mini-3b-gguf2-q4_0.gguf)\n\n### Model Quantization\nThe models are available in different quantization levels, such as `q2_0`, `q4_0`, `q5_0`, and `q8_0`. Higher quantization 'bit counts' (4 bits or more) generally preserve more quality, whereas lower levels compress the model further, which can lead to a significant loss in quality. The standard quantization level is `q4_0`.\n\nKeep in mind the memory requirements for different model sizes:\n- 7b parameters ~ `8gb` of RAM\n- 13b parameters ~ `16gb` of RAM\n\n### Best Models\nAccording to [chat.lmsys.org](https://chat.lmsys.org/), the best models are:\n- Best **`7b`** model: `Mistral-7B-Instruct-v0.2`\n- Best open-source model: `Qwen1.5-72B-Chat` ([available here](https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GGUF/tree/main))\n\n## Usage\n\n### Basic Usage\n```py\nfrom g4l.local import LocalEngine\n\nengine = LocalEngine(\n    gpu_layers = -1,  # use all GPU layers\n    cores      = 0    # use all CPU cores\n)\n\nresponse = engine.chat.completions.create(\n    model    = 'orca-mini-3b-gguf2-q4_0',\n    messages = [{\"role\": \"user\", \"content\": \"hi\"}],\n    stream   = True\n)\n\nfor token in response:\n    # print tokens on one line; skip chunks whose content is empty\n    print(token.choices[0].delta.content or \"\", end=\"\", flush=True)\n```\n\nNote: The `model` parameter must match the file name of the `.gguf` model you placed in `./models`, without the `.gguf` extension!\n\n### Chat With Documents\n\n```py\nfrom g4l.local import LocalEngine, DocumentRetriever\n\nengine = LocalEngine(\n    gpu_layers = -1,  # use all GPU layers\n    cores      = 0,   # use all CPU cores\n    document_retriever = DocumentRetriever(\n        files       = 
['einstein-albert.pdf'], \n        embed_model = 'SmartComponents/bge-micro-v2', # https://huggingface.co/spaces/mteb/leaderboard\n    )\n)\n\nresponse = engine.chat.completions.create(\n    model    = 'mistral-7b-instruct',\n    messages = [\n        {\n            \"role\": \"user\", \"content\": \"how was einstein's work in the laboratory\"\n        }\n    ],\n    stream   = True\n)\n\nfor token in response:\n    print(token.choices[0].delta.content or \"\", end=\"\", flush=True)\n```\n\nNote: The embedding model is downloaded on first use, but it is small and lightweight.\n\n### Document Retrieval\nG4L provides a `DocumentRetriever` class that allows you to retrieve relevant information from documents based on a query. Here's an example of how to use it:\n\n```py\nfrom g4l.local import DocumentRetriever\n\nengine = DocumentRetriever(\n    files=['einstein-albert.txt'], \n    embed_model='SmartComponents/bge-micro-v2', # https://huggingface.co/spaces/mteb/leaderboard\n    verbose=True,\n)\n\nretrieval_data = engine.retrieve('what inventions did he do')\n\nfor node_with_score in retrieval_data:\n    node = node_with_score.node\n    score = node_with_score.score\n    text = node.text\n    metadata = node.metadata\n    # .get avoids a KeyError if a file has no page labels\n    page_label = metadata.get('page_label')\n    file_name = metadata.get('file_name')\n    \n    print(f\"Text: {text}\")\n    print(f\"Score: {score}\")\n    print(f\"Page Label: {page_label}\")\n    print(f\"File Name: {file_name}\")\n    print(\"---\")\n```\n\nYou can also get a ready-to-go prompt for the language model using the `retrieve_for_llm` method:\n\n```py\nretrieval_data = engine.retrieve_for_llm('what inventions did he do')\nprint(retrieval_data)\n```\n\nThe prompt template used by `retrieve_for_llm` is as follows:\n\n```py\nprompt = (f'Context information is below.\\n'\n    + '---------------------\\n'\n    + f'{context_batches}\\n'\n    + '---------------------\\n'\n    + 'Given the context information and not prior knowledge, answer 
the query.\\n'\n    + f'Query: {query_str}\\n'\n    + 'Answer: ')\n```\n\n### Advanced Usage\nG4L provides several configuration options to customize the behavior of the `LocalEngine`. Here are some of the available options:\n\n- `gpu_layers`: The number of layers to offload to the GPU. Use `-1` to offload all layers.\n- `cores`: The number of CPU cores to use. Use `0` to use all available cores.\n- `use_mmap`: Whether to use memory mapping for faster model loading. Default is `True`.\n- `use_mlock`: Whether to lock the model in memory to prevent swapping. Default is `False`.\n- `offload_kqv`: Whether to offload key, query, and value tensors to the GPU. Default is `True`.\n- `context_window`: The maximum context window size. Default is `4900`.\n\nYou can pass these options when creating an instance of `LocalEngine`:\n\n```py\nfrom g4l.local import LocalEngine\n\nengine = LocalEngine(\n    gpu_layers     = -1,\n    cores          = 0,\n    use_mmap       = True,\n    use_mlock      = False,\n    offload_kqv    = True,\n    context_window = 4900\n)\n```\n\n## Benchmark\nBenchmark run on a 2022 MacBook Air M2 with 8 GB of RAM.\n\n```\nPC: Mac Air M2\nCPU/GPU: M2 chip\nCores: All (8)\nGPU Layers: All\nGPU Offload: 100%\n\nNo power:\nModel: mistral-7b-instruct-v2\nNumber of iterations: 5\nAverage loading time: 1.85s\nAverage total tokens: 48.20\nAverage total time: 5.34s\nAverage speed: 9.02 t/s\n\nWith power:\nModel: mistral-7b-instruct-v2\nNumber of iterations: 5\nAverage loading time: 1.88s\nAverage total tokens: 317\nAverage total time: 17.7s\nAverage speed: 17.9 t/s\n```\n\n## Why gpt4local?\n- I coded G4L so that you can use language models through a familiar interface with quick installation, while preserving maximum performance.\n- Using the direct Python bindings, I was able to **max out** the performance by using 100% GPU, CPU, and RAM.\n- I tried different third-party packages that wrap `llama.cpp`, like LM Studio, which still had great performance but in my case reached only ~`7.83` tokens/s, in contrast to `9.02` t/s with 
native llama.cpp Python bindings.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxtekky%2Fgpt4local","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxtekky%2Fgpt4local","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxtekky%2Fgpt4local/lists"}