{"id":13638541,"url":"https://github.com/apple/ml-ane-transformers","last_synced_at":"2025-05-14T19:07:35.631Z","repository":{"id":37287605,"uuid":"499580907","full_name":"apple/ml-ane-transformers","owner":"apple","description":"Reference implementation of the Transformer architecture optimized for Apple Neural Engine (ANE)","archived":false,"fork":false,"pushed_at":"2023-04-25T09:24:38.000Z","size":1062,"stargazers_count":2618,"open_issues_count":3,"forks_count":89,"subscribers_count":46,"default_branch":"main","last_synced_at":"2025-05-03T20:02:42.234Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apple.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-06-03T16:36:06.000Z","updated_at":"2025-05-03T19:42:47.000Z","dependencies_parsed_at":"2024-01-12T04:46:08.874Z","dependency_job_id":"806ec9dc-e6ba-4290-a1c9-f8d0ca76468f","html_url":"https://github.com/apple/ml-ane-transformers","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-ane-transformers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-ane-transformers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-ane-transformers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-ane-transformers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/apple","download_url":"https://codeload.github.com/apple/ml-ane-transformers/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254209859,"owners_count":22032897,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T01:00:48.091Z","updated_at":"2025-05-14T19:07:30.644Z","avatar_url":"https://github.com/apple.png","language":"Python","funding_links":[],"categories":["Transformer Implementations By Communities","Reimplementations","Python"],"sub_categories":["Other"],"readme":"# Apple Neural Engine (ANE) Transformers\nUse `ane_transformers` as a reference PyTorch implementation if you are considering deploying your Transformer models on Apple devices with an A14 or newer and M1 or newer chip to achieve up to **10 times faster** and **14 times** lower peak memory consumption compared to baseline implementations.\n\n`ane_transformers.reference` comprises a standalone reference implementation and `ane_transformers.huggingface` comprises optimized versions of [Hugging Face](https://huggingface.co/models) model classes such as `distilbert` to demonstrate the application of the optimization principles laid out in our research article on existing third-party implementations.\n\nPlease check out our [research article](https://machinelearning.apple.com/research/apple-neural-engine) for a detailed explanation of the optimizations as well as interactive figures to explore latency and peak memory consumption data from our case study: [Hugging Face 
distilbert](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) model deployment on various devices and operating system versions. The figures below are non-interactive snapshots from the research article for iPhone 13 with iOS 16.0 installed:\n\n\u003cimg src=\"assets/iPhone13_iOS16_latency_comparison.png\"\u003e\n\n\u003cimg src=\"assets/iPhone13_iOS16_memory_comparison.png\"\u003e\n\n## Tutorial: Optimized Deployment of Hugging Face distilbert\n\nThis tutorial is a step-by-step guide to the model deployment process from the case study in our research article. The same code is used to generate the Hugging Face distilbert performance data in the figures above.\n\nTo begin the optimizations, we initialize the baseline model as follows:\n\n```python\nimport transformers\nmodel_name = \"distilbert-base-uncased-finetuned-sst-2-english\"\nbaseline_model = transformers.AutoModelForSequenceClassification.from_pretrained(\n    model_name,\n    return_dict=False,\n    torchscript=True,\n).eval()\n```\n\nThen we initialize the mathematically equivalent but optimized model, and we restore its parameters from those of the baseline model:\n\n```python\nfrom ane_transformers.huggingface import distilbert as ane_distilbert\noptimized_model = ane_distilbert.DistilBertForSequenceClassification(\n    baseline_model.config).eval()\noptimized_model.load_state_dict(baseline_model.state_dict())\n```\n\nNext we create sample inputs for the model:\n\n```python\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_name)\ntokenized = tokenizer(\n    [\"Sample input text to trace the model\"],\n    return_tensors=\"pt\",\n    max_length=128,  # token sequence length\n    padding=\"max_length\",\n)\n```\n\nWe then trace the optimized model to obtain the expected input format (TorchScript) for the coremltools conversion tool.\n\n```python\nimport torch\ntraced_optimized_model = torch.jit.trace(\n    optimized_model,\n    (tokenized[\"input_ids\"], 
tokenized[\"attention_mask\"])\n)\n```\n\nFinally, we use coremltools to generate the Core ML model package file and save it.\n\n```python\nimport coremltools as ct\nimport numpy as np\nane_mlpackage_obj = ct.convert(\n    traced_optimized_model,\n    convert_to=\"mlprogram\",\n    inputs=[\n        ct.TensorType(\n            f\"input_{name}\",\n            shape=tensor.shape,\n            dtype=np.int32,\n        ) for name, tensor in tokenized.items()\n    ],\n    compute_units=ct.ComputeUnit.ALL,\n)\nout_path = \"HuggingFace_ane_transformers_distilbert_seqLen128_batchSize1.mlpackage\"\nane_mlpackage_obj.save(out_path)\n```\n\nTo verify performance, developers can now launch Xcode and simply add this model package file as a resource in their projects. After clicking on the Performance tab, the developer can generate a performance report on locally available devices, for example, on the Mac that is running Xcode or another Apple device that is connected to that Mac. The figure below shows a performance report generated for this model on an iPhone 13 Pro Max with iOS 16.0 installed.\n\n\u003cimg src=\"assets/xcode_performance_report_comparison.png\"\u003e\n\n\nBased on the figure above, the latency is improved by a factor of 2.84 for the sequence length of 128 and batch size of 1 that were chosen for the tutorial. Higher sequence lengths, such as 512, and batch sizes, such as 8, will yield up to **10 times lower latency** and **14 times lower peak memory consumption**. 
Please refer to Figure 2 from our [research article](https://machinelearning.apple.com/research/apple-neural-engine) for detailed and interactive performance data.\n\nNote that the load and compilation times increase because the optimized model contains more operations, but these are one-time costs, and user experience will not be affected if the model is loaded asynchronously.\n\nNote that 4 of the 606 operations in the optimized model are executed on the CPU. These are the embedding-lookup operations, which are more efficient to run on the CPU for this particular model configuration.\n\n## A Note on Unit Tests\nThe unit tests measure, among other things, the ANE speed-up factor. Since the device spec for this reference implementation is M1 or newer chips for the Mac and A14 or newer chips for the iPhone and iPad, the speed-up unit tests will print a warning message if executed on devices outside of this spec. Even if the model is generated using an out-of-spec Mac, the model should work as expected on in-spec devices.\n\n## Installation \u0026 Troubleshooting\n- Fastest: `pip install ane_transformers`\n- Locally editable: `pip install -e .`\n- If installation fails with `ERROR: Failed building wheel for tokenizers` or `error: can't find Rust compiler`, please follow this [solution](https://github.com/huggingface/transformers/issues/2831#issuecomment-592724471)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapple%2Fml-ane-transformers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapple%2Fml-ane-transformers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapple%2Fml-ane-transformers/lists"}