{"id":19921750,"url":"https://github.com/mzbac/mlx_sharding","last_synced_at":"2025-04-11T05:53:08.557Z","repository":{"id":249369283,"uuid":"831267978","full_name":"mzbac/mlx_sharding","owner":"mzbac","description":"Distributed Inference for mlx LLm","archived":false,"fork":false,"pushed_at":"2024-08-01T16:00:30.000Z","size":104,"stargazers_count":87,"open_issues_count":2,"forks_count":10,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-11T05:53:02.148Z","etag":null,"topics":["distributed-inference","mlx"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mzbac.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-20T05:00:41.000Z","updated_at":"2025-03-21T13:57:51.000Z","dependencies_parsed_at":"2025-01-06T04:10:59.831Z","dependency_job_id":"6df26b65-4f76-4aa9-8cbd-785b2104036d","html_url":"https://github.com/mzbac/mlx_sharding","commit_stats":null,"previous_names":["mzbac/mlx_sharding"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzbac%2Fmlx_sharding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzbac%2Fmlx_sharding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzbac%2Fmlx_sharding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzbac%2Fmlx_sharding/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mzbac","download_url":"https://codeload.git
hub.com/mzbac/mlx_sharding/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248351407,"owners_count":21089271,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed-inference","mlx"],"created_at":"2024-11-12T22:08:22.490Z","updated_at":"2025-04-11T05:53:08.537Z","avatar_url":"https://github.com/mzbac.png","language":"Python","funding_links":[],"categories":["Python","LLM \u0026 Inference"],"sub_categories":[],"readme":"# MLX Sharding\n\nThis project demonstrates how to implement pipeline parallelism for large language models using MLX. It includes tools for sharding a model, serving shards across multiple machines, and generating text using the distributed model. Additionally, it features an OpenAI API-compatible server for easier integration and usage.\n\n## Demo Video\n\nTo see the distributed inference in action, check out our demo video:\n\n[Sharding DeepSeek-Coder-V2-Lite-Instruct Demo](https://www.youtube.com/watch?v=saOboSfP76o)\n\n## Quick Start\n\n### Installation\n\nInstall the package using pip:\n\n```bash\npip install mlx-sharding\n```\n\n### Running the Servers\n\n1. For the shard node:\n\n   ```bash\n   mlx-sharding-server --model mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx --start-layer 14 --end-layer 27\n   ```\n\n2. 
For the primary node:\n\n   ```bash\n   mlx-sharding-api --model mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx --start-layer 0 --end-layer 14 --llm-shard-addresses \u003cyour shard node address\u003e\n   ```\n\n   Replace `\u003cyour shard node address\u003e` with the actual address of your shard node (e.g., `localhost:50051`).\n\n## Educational Purpose\n\nThis repository is designed for educational purposes to illustrate how pipeline parallelism can be implemented in MLX. It provides a basic framework for:\n\n1. Sharding a large language model\n2. Distributing model shards across multiple machines\n3. Implementing a simple pipeline for text generation\n4. Serving the model through an OpenAI API-compatible interface\n\nWhile not optimized for production use, this demo serves as a starting point for understanding and experimenting with pipeline parallelism in machine learning workflows.\n\n## Setup and Usage\n\n### 1. Model Preparation\n\nYou have two main options for preparing and using the model:\n\n#### Option A: Pre-Sharding the Model\n\nIf you prefer to pre-shard the model, use `sharding_weight.py`:\n\n```bash\npython sharding_weight.py --model \"mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx\" --output_dir shard_0 --start_layer 0 --end_layer 14 --total_layers 27\npython sharding_weight.py --model \"mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx\" --output_dir shard_1 --start_layer 14 --end_layer 27 --total_layers 27\n# Repeat for additional shards as needed\n```\n\n#### Option B: Dynamic Sharding\n\nYou can let the system dynamically load and shard the weights when starting the server. This option doesn't require pre-sharding.\n\n### 2. Distribute Shards (If Using Option A)\n\nIf you've pre-sharded the model, copy the shard directories to their respective machines. Skip this step for Option B.\n\n### 3. 
Start the Servers\n\nStart server instances based on your chosen approach:\n\n#### For Pre-Sharded Model (Option A)\n\nOn each machine with a shard, start a server instance. For example:\n\n```bash\npython -m shard.main --model mzbac/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx-shard-1\n```\n\n#### For Dynamic Sharding (Option B)\n\nStart the server with specific layer ranges:\n\n```bash\npython -m shard.main --model \"mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx\" --start-layer 0 --end-layer 14\n```\n\nNote the IP address and port printed by each server.\n\n### 4. Generate Text\n\n#### Using the generate script\n\nFor a dynamically sharded setup:\n\n```bash\npython generate.py --model \"mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx\" --start_layer 0 --end_layer 14 --server_address \u003cremote_ip1\u003e:\u003cport1\u003e,\u003cremote_ip2\u003e:\u003cport2\u003e --prompt \"Your prompt here\" --max_tokens 512\n```\n\nFor a pre-sharded setup:\n\n```bash\npython generate.py --model mzbac/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx-shard-0 --server_address \u003cremote_ip1\u003e:\u003cport1\u003e,\u003cremote_ip2\u003e:\u003cport2\u003e --prompt \"Your prompt here\" --max_tokens 512\n```\n\n#### Using the OpenAI API-compatible server\n\n1. Start the server:\n\n   For dynamic sharding:\n\n   ```bash\n   python -m shard.openai_api --model \"mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx\" --llm-shard-addresses localhost:50051,\u003cremote_ip1\u003e:\u003cport1\u003e,\u003cremote_ip2\u003e:\u003cport2\u003e --start-layer 0 --end-layer 14\n   ```\n\n   For pre-sharded model:\n\n   ```bash\n   python -m shard.openai_api --model mzbac/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx-shard-0 --llm-shard-addresses localhost:50051,\u003cremote_ip1\u003e:\u003cport1\u003e,\u003cremote_ip2\u003e:\u003cport2\u003e\n   ```\n\n2. 
Use the API endpoints:\n   - `/v1/completions`: Text completion endpoint\n   - `/v1/chat/completions`: Chat completion endpoint\n\nExample usage:\n\n```bash\ncurl localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n     \"messages\": [{\"role\": \"user\", \"content\": \"Say this is a test!\"}],\n     \"temperature\": 0.7\n   }'\n```\n\n### 5. Web User Interface\n\nThis project now includes a web-based user interface for easy interaction with the model. To use the UI:\n\n1. Ensure the OpenAI API-compatible server is running (as described in step 4).\n\n2. Navigate to `http://localhost:8080` (or the appropriate host and port if you've configured it differently) in your web browser.\n\n3. Use the interface to input prompts, adjust parameters, and view the model's responses.\n\nThe UI provides a user-friendly way to interact with the model, making it easier to experiment with different inputs and settings without needing to use command-line tools or write code.\n\n## Limitations and Considerations\n\n1. **Network Dependency**: The performance of this pipeline parallelism implementation is heavily dependent on network speed and latency between machines.\n\n2. **Error Handling**: The current implementation has basic error handling. In a production environment, you'd want to implement more robust error handling and recovery mechanisms.\n\n3. **Security**: This demo uses insecure gRPC channels. For any real-world application, implement proper security measures.\n\n4. **Shard Configuration**: Ensure that when using multiple shards, the layer ranges are set correctly to cover the entire model without overlap.\n\n## Extending the System\n\nTo extend the system for more shards:\n\n1. If pre-sharding, create additional shards using `sharding_weight.py`.\n2. Set up more server instances, one for each new shard.\n3. In `generate.py` or when using the OpenAI API server, include all shard addresses.\n4. 
Adjust the layer ranges accordingly when using dynamic sharding.\n\n## Requirements\n\n- Python 3.x\n- MLX library\n- gRPC and related dependencies\n- NumPy\n- Transformers library\n- Sufficient RAM on each machine to load and process its model shard\n\n## Acknowledgments\n\n- MLX team for providing the framework\n- Exo (\u003chttps://github.com/exo-explore/exo\u003e), whose implementation of pipeline parallelism heavily inspired this project\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmzbac%2Fmlx_sharding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmzbac%2Fmlx_sharding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmzbac%2Fmlx_sharding/lists"}