{"id":28410404,"url":"https://github.com/ml-explore/mlx-lm","last_synced_at":"2026-01-19T18:02:26.805Z","repository":{"id":282313719,"uuid":"946767878","full_name":"ml-explore/mlx-lm","owner":"ml-explore","description":"Run LLMs with MLX","archived":false,"fork":false,"pushed_at":"2026-01-16T19:36:25.000Z","size":1500,"stargazers_count":3310,"open_issues_count":94,"forks_count":375,"subscribers_count":43,"default_branch":"main","last_synced_at":"2026-01-17T04:49:03.200Z","etag":null,"topics":["llms","mlx"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ml-explore.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-11T16:38:30.000Z","updated_at":"2026-01-17T03:21:42.000Z","dependencies_parsed_at":"2025-03-13T22:34:46.473Z","dependency_job_id":"8a60e78f-e087-4d2c-b263-30d66e3fd92b","html_url":"https://github.com/ml-explore/mlx-lm","commit_stats":null,"previous_names":["ml-explore/mlx-lm"],"tags_count":21,"template":false,"template_full_name":null,"purl":"pkg:github/ml-explore/mlx-lm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-explore%2Fmlx-lm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-explore%2Fmlx-lm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-explore%2Fmlx-lm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/ml-explore%2Fmlx-lm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ml-explore","download_url":"https://codeload.github.com/ml-explore/mlx-lm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-explore%2Fmlx-lm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28578952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-19T17:42:58.221Z","status":"ssl_error","status_checked_at":"2026-01-19T17:40:54.158Z","response_time":67,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llms","mlx"],"created_at":"2025-06-02T11:35:35.821Z","updated_at":"2026-01-19T18:02:26.800Z","avatar_url":"https://github.com/ml-explore.png","language":"Python","funding_links":[],"categories":["Python","AI and Agents","Inference Runtimes \u0026 Backends","HarmonyOS","Agentic AI ![](https://img.shields.io/badge/_-AGENTIC-22d3ee?style=flat-square\u0026logo=openai\u0026logoColor=white)","Repos","Inference engines","Projects","\u003cimg src=\"./assets/cpu.svg\" width=\"16\" height=\"16\" style=\"vertical-align: middle;\"\u003e Backends","Industry Strength Natural Language Processing","Browse The Shelves","AI \u0026 LLM","Core MLX \u0026 Examples","A01_文本生成_文本对话","🚀 Inference Engines"],"sub_categories":["Windows Manager","Apple Silicon 
optimized (MLX / CoreML / ANE)","AI and Agents","Local LLM developer tools","LLM Apps \u0026 Interfaces","大语言对话模型及数据","LLM \u0026 GenAI Specialized"],"readme":"## MLX LM \n\nMLX LM is a Python package for generating text and fine-tuning large language\nmodels on Apple silicon with MLX.\n\nSome key features include:\n\n* Integration with the Hugging Face Hub to easily use thousands of LLMs with a\n  single command. \n* Support for quantizing and uploading models to the Hugging Face Hub.\n* [Low-rank and full model\n  fine-tuning](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/LORA.md)\n  with support for quantized models.\n* Distributed inference and fine-tuning with `mx.distributed`\n\nThe easiest way to get started is to install the `mlx-lm` package:\n\n**With `pip`**:\n\n```sh\npip install mlx-lm\n```\n\n**With `conda`**:\n\n```sh\nconda install -c conda-forge mlx-lm\n```\n\n### Quick Start\n\nTo generate text with an LLM use:\n\n```bash\nmlx_lm.generate --prompt \"How tall is Mt Everest?\"\n```\n\nTo chat with an LLM use:\n\n```bash\nmlx_lm.chat\n```\n\nThis will give you a chat REPL that you can use to interact with the LLM. The\nchat context is preserved during the lifetime of the REPL.\n\nCommands in `mlx-lm` typically take command line options which let you specify\nthe model, sampling parameters, and more. Use `-h` to see a list of available\noptions for a command, e.g.:\n\n```bash\nmlx_lm.generate -h\n```\n\nThe default model for generation and chat is\n`mlx-community/Llama-3.2-3B-Instruct-4bit`.  You can specify any MLX-compatible\nmodel with the `--model` flag. 
Thousands are available in the\n[MLX Community](https://huggingface.co/mlx-community) Hugging Face\norganization.\n\n### Python API\n\nYou can use `mlx-lm` as a module:\n\n```python\nfrom mlx_lm import load, generate\n\nmodel, tokenizer = load(\"mlx-community/Mistral-7B-Instruct-v0.3-4bit\")\n\nprompt = \"Write a story about Einstein\"\n\nmessages = [{\"role\": \"user\", \"content\": prompt}]\nprompt = tokenizer.apply_chat_template(\n    messages, add_generation_prompt=True,\n)\n\ntext = generate(model, tokenizer, prompt=prompt, verbose=True)\n```\n\nTo see a description of all the arguments you can do:\n\n```\n\u003e\u003e\u003e help(generate)\n```\n\nCheck out the [generation\nexample](https://github.com/ml-explore/mlx-lm/tree/main/mlx_lm/examples/generate_response.py)\nto see how to use the API in more detail. Check out the [batch generation\nexample](https://github.com/ml-explore/mlx-lm/tree/main/mlx_lm/examples/batch_generate_response.py)\nto see how to efficiently generate continuations for a batch of prompts.\n\nThe `mlx-lm` package also comes with functionality to quantize and optionally\nupload models to the Hugging Face Hub.\n\nYou can convert models using the Python API:\n\n```python\nfrom mlx_lm import convert\n\nrepo = \"mistralai/Mistral-7B-Instruct-v0.3\"\nupload_repo = \"mlx-community/My-Mistral-7B-Instruct-v0.3-4bit\"\n\nconvert(repo, quantize=True, upload_repo=upload_repo)\n```\n\nThis will generate a 4-bit quantized Mistral 7B and upload it to the repo\n`mlx-community/My-Mistral-7B-Instruct-v0.3-4bit`. It will also save the\nconverted model in the path `mlx_model` by default.\n\nTo see a description of all the arguments you can do:\n\n```\n\u003e\u003e\u003e help(convert)\n```\n\n#### Streaming\n\nFor streaming generation, use the `stream_generate` function. 
This yields\na generation response object.\n\nFor example,\n\n```python\nfrom mlx_lm import load, stream_generate\n\nrepo = \"mlx-community/Mistral-7B-Instruct-v0.3-4bit\"\nmodel, tokenizer = load(repo)\n\nprompt = \"Write a story about Einstein\"\n\nmessages = [{\"role\": \"user\", \"content\": prompt}]\nprompt = tokenizer.apply_chat_template(\n    messages, add_generation_prompt=True,\n)\n\nfor response in stream_generate(model, tokenizer, prompt, max_tokens=512):\n    print(response.text, end=\"\", flush=True)\nprint()\n```\n\n#### Sampling\n\nThe `generate` and `stream_generate` functions accept `sampler` and\n`logits_processors` keyword arguments. A sampler is any callable which accepts\na possibly batched logits array and returns an array of sampled tokens.  The\n`logits_processors` must be a list of callables which take the token history\nand current logits as input and return the processed logits. The logits\nprocessors are applied in order.\n\nSome standard sampling functions and logits processors are provided in\n`mlx_lm.sample_utils`.\n\n### Command Line\n\nYou can also use `mlx-lm` from the command line with:\n\n```\nmlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt \"hello\"\n```\n\nThis will download a Mistral 7B model from the Hugging Face Hub and generate\ntext using the given prompt.\n\nFor a full list of options run:\n\n```\nmlx_lm.generate --help\n```\n\nTo quantize a model from the command line run:\n\n```\nmlx_lm.convert --model mistralai/Mistral-7B-Instruct-v0.3 -q\n```\n\nFor more options run:\n\n```\nmlx_lm.convert --help\n```\n\nYou can upload new models to Hugging Face by specifying `--upload-repo` to\n`convert`. 
For example, to upload a quantized Mistral-7B model to the\n[MLX Hugging Face community](https://huggingface.co/mlx-community) you can do:\n\n```\nmlx_lm.convert \\\n    --model mistralai/Mistral-7B-Instruct-v0.3 \\\n    -q \\\n    --upload-repo mlx-community/my-4bit-mistral\n```\n\nModels can also be converted and quantized directly in the\n[mlx-my-repo](https://huggingface.co/spaces/mlx-community/mlx-my-repo) Hugging\nFace Space.\n\n### Long Prompts and Generations\n\n`mlx-lm` has some tools to scale efficiently to long prompts and generations:\n\n- A rotating fixed-size key-value cache.\n- Prompt caching.\n\nTo use the rotating key-value cache, pass the argument `--max-kv-size n` where\n`n` can be any integer. Smaller values like `512` will use very little RAM but\nresult in worse quality. Larger values like `4096` or higher will use more RAM\nbut have better quality.\n\nCaching prompts can substantially speed up reusing the same long context with\ndifferent queries. To cache a prompt use `mlx_lm.cache_prompt`. For example:\n\n```bash\ncat prompt.txt | mlx_lm.cache_prompt \\\n  --model mistralai/Mistral-7B-Instruct-v0.3 \\\n  --prompt - \\\n  --prompt-cache-file mistral_prompt.safetensors\n```\n\nThen use the cached prompt with `mlx_lm.generate`:\n\n```\nmlx_lm.generate \\\n    --prompt-cache-file mistral_prompt.safetensors \\\n    --prompt \"\\nSummarize the above text.\"\n```\n\nThe cached prompt is treated as a prefix to the supplied prompt. Note that\nwhen using a cached prompt, the model is read from the cache and need\nnot be supplied explicitly.\n\nPrompt caching can also be used in the Python API in order to avoid\nrecomputing the prompt. This is useful in multi-turn dialogues or across\nrequests that use the same context. See the\n[example](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/examples/chat.py)\nfor more usage details.\n\n### Supported Models\n\n`mlx-lm` supports thousands of LLMs available on the Hugging Face Hub. 
If the\nmodel you want to run is not supported, file an\n[issue](https://github.com/ml-explore/mlx-lm/issues/new) or, better yet, submit\na pull request. Many supported models are available in various quantization\nformats in the [MLX Community](https://huggingface.co/mlx-community) Hugging\nFace organization.\n\nFor some models the tokenizer may require you to enable the `trust_remote_code`\noption. You can do this by passing `--trust-remote-code` on the command line.\nIf you don't specify the flag explicitly, you will be prompted to trust remote\ncode in the terminal when running the model.\n\nTokenizer options can also be set in the Python API. For example:\n\n```python\nfrom mlx_lm import load\n\nmodel, tokenizer = load(\n    \"qwen/Qwen-7B\",\n    tokenizer_config={\"eos_token\": \"\u003c|endoftext|\u003e\", \"trust_remote_code\": True},\n)\n```\n\n### Large Models\n\n\u003e [!NOTE]\n    This requires macOS 15.0 or higher to work.\n\nModels which are large relative to the total RAM available on the machine can\nbe slow. `mlx-lm` will attempt to make them faster by wiring the memory\noccupied by the model and cache.\n\nIf you see the following warning message:\n\n\u003e [WARNING] Generating with a model that requires ...\n\nthen the model will likely be slow on the given machine. If the model fits in\nRAM then it can often be sped up by increasing the system wired memory limit.\nTo increase the limit, set the following `sysctl`:\n\n```bash\nsudo sysctl iogpu.wired_limit_mb=N\n```\n\nThe value `N` should be larger than the size of the model in megabytes but\nsmaller than the memory size of the machine.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fml-explore%2Fmlx-lm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fml-explore%2Fmlx-lm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fml-explore%2Fmlx-lm/lists"}