{"id":17707935,"url":"https://github.com/usefulsensors/moonshine","last_synced_at":"2025-03-13T13:32:19.787Z","repository":{"id":259064310,"uuid":"867866058","full_name":"usefulsensors/moonshine","owner":"usefulsensors","description":"Fast and accurate automatic speech recognition (ASR) for edge devices","archived":false,"fork":false,"pushed_at":"2025-02-26T21:37:44.000Z","size":1902,"stargazers_count":2622,"open_issues_count":17,"forks_count":136,"subscribers_count":33,"default_branch":"main","last_synced_at":"2025-03-10T03:56:46.310Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/usefulsensors.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-04T22:10:28.000Z","updated_at":"2025-03-09T23:30:06.000Z","dependencies_parsed_at":"2025-01-26T17:18:58.127Z","dependency_job_id":"79378cd6-cfcb-4f2f-813e-2a4fdbcc89c2","html_url":"https://github.com/usefulsensors/moonshine","commit_stats":null,"previous_names":["usefulsensors/moonshine"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/usefulsensors%2Fmoonshine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/usefulsensors%2Fmoonshine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/usefulsensors%2Fmoonshine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/usefulsensors%2Fmoonshine/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/usefulsensors","download_url":"https://codeload.github.com/usefulsensors/moonshine/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243414516,"owners_count":20287135,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-25T02:00:34.447Z","updated_at":"2025-03-13T13:32:19.774Z","avatar_url":"https://github.com/usefulsensors.png","language":"Python","funding_links":[],"categories":["Python","Speech-to-Text (STT)"],"sub_categories":["Open-Source Models \u0026 Libraries"],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"logo.png\" width=\"192px\" /\u003e\n\u003c/p\u003e\n\n\u003ch1 style=\"text-align:center;\"\u003eMoonshine\u003c/h1\u003e\n\n[[Blog]](https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/) [[Paper]](https://arxiv.org/abs/2410.15608) [[Model Card]](https://github.com/usefulsensors/moonshine/blob/main/model-card.md) [[Podcast]](https://notebooklm.google.com/notebook/d787d6c2-7d7b-478c-b7d5-a0be4c74ae19/audio)\n\nMoonshine is a family of speech-to-text models optimized for fast and accurate automatic speech recognition (ASR) on resource-constrained devices. It is well-suited to real-time, on-device applications like live transcription and voice command recognition. 
Moonshine obtains word-error rates (WER) better than similarly-sized tiny.en and base.en Whisper models from OpenAI on the datasets used in the [OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) maintained by HuggingFace:\n\n\u003ctable\u003e\n\u003ctr\u003e\u003cth\u003eTiny\u003c/th\u003e\u003cth\u003eBase\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\n\n| WER        | Moonshine | Whisper |\n| ---------- | --------- | ------- |\n| Average    | **12.66** | 12.81   |\n| AMI        | 22.77     | 24.24   |\n| Earnings22 | 21.25     | 19.12   |\n| Gigaspeech | 14.41     | 14.08   |\n| LS Clean   | 4.52      | 5.66    |\n| LS Other   | 11.71     | 15.45   |\n| SPGISpeech | 7.70      | 5.93    |\n| Tedlium    | 5.64      | 5.97    |\n| Voxpopuli  | 13.27     | 12.00   |\n\n\u003c/td\u003e\u003ctd\u003e\n\n| WER        | Moonshine | Whisper |\n| ---------- | --------- | ------- |\n| Average    | **10.07** | 10.32   |\n| AMI        | 17.79     | 21.13   |\n| Earnings22 | 17.65     | 15.09   |\n| Gigaspeech | 12.19     | 12.83   |\n| LS Clean   | 3.23      | 4.25    |\n| LS Other   | 8.18      | 10.35   |\n| SPGISpeech | 5.46      | 4.26    |\n| Tedlium    | 5.22      | 4.87    |\n| Voxpopuli  | 10.81     | 9.76    |\n\n\u003c/td\u003e\u003c/tr\u003e \u003c/table\u003e\n\nMoonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments _5x faster_ than Whisper while maintaining the same (or better!) WER.\n\nMoonshine Base is approximately 400MB, while Tiny is around 190MB. Both publicly-released models currently support English only.\n\nThis repo hosts inference code and demos for Moonshine.\n\n- [Installation](#installation)\n  - [1. Create a virtual environment](#1-create-a-virtual-environment)\n  - [2a. 
Install the `useful-moonshine` package to use Moonshine with Torch, TensorFlow, or JAX](#2a-install-the-useful-moonshine-package-to-use-moonshine-with-torch-tensorflow-or-jax)\n  - [2b. Install the `useful-moonshine-onnx` package to use Moonshine with ONNX](#2b-install-the-useful-moonshine-onnx-package-to-use-moonshine-with-onnx)\n  - [3. Try it out](#3-try-it-out)\n- [Examples](#examples)\n  - [Live Captions](#live-captions)\n  - [Running in the Browser](#running-in-the-browser)\n  - [CTranslate2](#ctranslate2)\n  - [HuggingFace Transformers](#huggingface-transformers)\n- [TODO](#todo)\n- [Known Issues](#known-issues)\n- [Citation](#citation)\n\n## Installation\n\nWe currently offer two options for installing Moonshine:\n\n1. `useful-moonshine`, which uses Keras (with support for Torch, TensorFlow, and JAX backends)\n2. `useful-moonshine-onnx`, which uses the ONNX runtime\n\nThese instructions apply to both options; follow along to get started.\n\nNote: We like `uv` for managing Python environments, so we use it here. If you don't want to use it, skip the `uv` installation, create the virtual environment with `python -m venv env_moonshine` instead of `uv venv env_moonshine`, and drop the `uv` prefix from the `uv pip install` commands.\n\n### 1. Create a virtual environment\n\nFirst, [install](https://github.com/astral-sh/uv) `uv` for Python environment management.\n\nThen create and activate a virtual environment:\n\n```shell\nuv venv env_moonshine\nsource env_moonshine/bin/activate\n```\n\n### 2a. Install the `useful-moonshine` package to use Moonshine with Torch, TensorFlow, or JAX\n\nThe `useful-moonshine` inference code is written in Keras and can run with each of the backends that Keras supports: Torch, TensorFlow, and JAX. The backend you choose will determine which flavor of the `useful-moonshine` package to install. 
If you're just getting started, we suggest installing the (default) Torch backend:\n\n```shell\nuv pip install useful-moonshine@git+https://github.com/usefulsensors/moonshine.git\n```\n\nTo run the provided inference code, you have to instruct Keras to use the PyTorch backend by setting an environment variable:\n\n```shell\nexport KERAS_BACKEND=torch\n```\n\nTo run with the TensorFlow backend, run the following to install Moonshine and set the environment variable:\n\n```shell\nuv pip install useful-moonshine[tensorflow]@git+https://github.com/usefulsensors/moonshine.git\nexport KERAS_BACKEND=tensorflow\n```\n\nTo run with the JAX backend, run the following:\n\n```shell\nuv pip install useful-moonshine[jax]@git+https://github.com/usefulsensors/moonshine.git\nexport KERAS_BACKEND=jax\n# Use useful-moonshine[jax-cuda] for JAX on GPU\n```\n\n### 2b. Install the `useful-moonshine-onnx` package to use Moonshine with ONNX\n\nUsing Moonshine with the ONNX runtime is preferable if you want to run the models on single-board computers (SBCs) like the Raspberry Pi. We've prepared a separate version of the package with minimal dependencies to support these use cases. To use it, run the following:\n\n```shell\nuv pip install useful-moonshine-onnx@git+https://git@github.com/usefulsensors/moonshine.git#subdirectory=moonshine-onnx\n```\n\n### 3. Try it out\n\nYou can test whichever type of Moonshine you installed by transcribing the provided example audio file with the `.transcribe` function:\n\n```shell\npython\n\u003e\u003e\u003e import moonshine # or import moonshine_onnx\n\u003e\u003e\u003e moonshine.transcribe(moonshine.ASSETS_DIR / 'beckett.wav', 'moonshine/tiny') # or moonshine_onnx.transcribe(...)\n['Ever tried ever failed, no matter try again, fail again, fail better.']\n```\n\nThe first argument is a path to an audio file and the second is the name of a Moonshine model. 
`moonshine/tiny` and `moonshine/base` are the currently available models.\n\n## Examples\n\nSince the Moonshine models can be used with a variety of different runtimes and applications, we've included code samples showing how to use them in different situations. The [`demo`](/demo/) folder in this repository also has more information on many of them.\n\n### Live Captions\n\nYou can try the Moonshine ONNX models with live input from a microphone with the [live captions demo](/demo/README.md#demo-live-captioning-from-microphone-input).\n\n### Running in the Browser\n\nYou can try out the Moonshine ONNX models locally in a web browser with our [HuggingFace space](https://huggingface.co/spaces/UsefulSensors/moonshine-web). We've included the [source for this demo](/demo/moonshine-web/) in this repository; this is a great starting place for those wishing to build web-based applications with Moonshine.\n\n### CTranslate2\n\nThe files for the CTranslate2 versions of Moonshine are available at [huggingface.co/UsefulSensors/moonshine/tree/main/ctranslate2](https://huggingface.co/UsefulSensors/moonshine/tree/main/ctranslate2), but they require [a pull request to be merged](https://github.com/OpenNMT/CTranslate2/pull/1808) before they can be used with the mainline version of the framework. 
Until then, you should be able to try them with [our branch](https://github.com/njeffrie/CTranslate2/tree/master), with [this example script](https://github.com/OpenNMT/CTranslate2/pull/1808#issuecomment-2439725339).\n\n### HuggingFace Transformers\n\nBoth models are also available on the HuggingFace hub and can be used with the `transformers` library, as follows:\n\n```python\nimport torch\nfrom transformers import AutoProcessor, MoonshineForConditionalGeneration\nfrom datasets import load_dataset\n\nprocessor = AutoProcessor.from_pretrained(\"UsefulSensors/moonshine-tiny\")\nmodel = MoonshineForConditionalGeneration.from_pretrained(\"UsefulSensors/moonshine-tiny\")\n\nds = load_dataset(\"hf-internal-testing/librispeech_asr_dummy\", \"clean\", split=\"validation\")\naudio_array = ds[0][\"audio\"][\"array\"]\n\ninputs = processor(audio_array, return_tensors=\"pt\")\n\ngenerated_ids = model.generate(**inputs)\n\ntranscription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]\nprint(transcription)\n```\n\n## TODO\n* [x] Live transcription demo\n\n* [x] ONNX model\n\n* [x] HF transformers support\n\n* [x] Demo Moonshine running in the browser\n\n* [ ] CTranslate2 support (complete but [awaiting a merge](https://github.com/OpenNMT/CTranslate2/pull/1808))\n\n* [ ] MLX support\n\n* [ ] Fine-tuning code\n\n* [ ] HF transformers.js support\n\n* [ ] Long-form transcription demo \n\n## Known Issues\n\n### UserWarning: You are using a softmax over axis 3 of a tensor of shape torch.Size([1, 8, 1, 1])\nThis is a benign warning arising from Keras. For the first token in the decoding loop, the attention score matrix's shape is 1x1, which triggers this warning. 
You can safely ignore it, or run with `python -W ignore` to suppress the warning.\n\n## Citation\nIf you benefit from our work, please cite us:\n```\n@misc{jeffries2024moonshinespeechrecognitionlive,\n      title={Moonshine: Speech Recognition for Live Transcription and Voice Commands}, \n      author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},\n      year={2024},\n      eprint={2410.15608},\n      archivePrefix={arXiv},\n      primaryClass={cs.SD},\n      url={https://arxiv.org/abs/2410.15608}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fusefulsensors%2Fmoonshine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fusefulsensors%2Fmoonshine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fusefulsensors%2Fmoonshine/lists"}