https://github.com/axinc-ai/whisper-export

openvino version of openai/whisper

Host: GitHub
URL: https://github.com/axinc-ai/whisper-export
Owner: axinc-ai
License: mit
Fork: true (zhuzilin/whisper-openvino)
Created: 2022-12-18T09:28:37.000Z (over 2 years ago)
Default Branch: onnx-export
Last Pushed: 2024-01-09T01:32:57.000Z (over 1 year ago)
Last Synced: 2024-08-04T00:11:00.550Z (9 months ago)
Language: Jupyter Notebook
Homepage:
Size: 3.83 MB
Stars: 10
Watchers: 3
Forks: 4
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-openai-whisper - whisper-export - openvino version of openai/whisper

README

# Whisper ONNX Export Script

## ONNX Export

This repository is based on [whisper.openvino](https://github.com/zhuzilin/whisper-openvino), but `OpenVinoAudioEncoder` and `OpenVinoTextDecoder` have been replaced by the official `AudioEncoder` and `TextDecoder` so that the model can be exported to ONNX.

The following commands export the encoder and the decoder to ONNX:

```
python3 cli.py audio.wav --model medium --export_encoder
python3 cli.py audio.wav --model medium --export_decoder
```

You can also load weights that were saved as a `state_dict` from the original Whisper by passing the file with `--fine_tuning`:

```
python3 cli.py audio.wav --model medium --export_decoder --fine_tuning model.pth
```

The decoder fixes the size of kv_cache to avoid re-allocating tensors for each inference.
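The general idea of a fixed-size cache can be sketched as follows. This is a minimal illustration, not the repository's actual code: the class name `FixedKVCache`, its parameters, and the use of plain Python lists in place of tensors are all assumptions made for the example. The buffers are allocated once at their maximum length, and each decoding step writes into the next free slot while an offset tracks how many positions are valid.

```python
class FixedKVCache:
    """Preallocated key/value buffer with a write offset (illustrative only)."""

    def __init__(self, max_len, dim):
        # Allocate the full buffers once; they are never resized afterwards.
        self.keys = [[0.0] * dim for _ in range(max_len)]
        self.values = [[0.0] * dim for _ in range(max_len)]
        self.max_len = max_len
        self.offset = 0  # number of valid positions currently stored

    def append(self, key, value):
        """Write one step's key/value into the next free slot, in place."""
        if self.offset >= self.max_len:
            raise IndexError("kv cache is full")
        self.keys[self.offset][:] = key
        self.values[self.offset][:] = value
        self.offset += 1

    def valid(self):
        """Return only the filled part of the buffers."""
        return self.keys[: self.offset], self.values[: self.offset]


cache = FixedKVCache(max_len=4, dim=2)
cache.append([1.0, 2.0], [3.0, 4.0])
cache.append([5.0, 6.0], [7.0, 8.0])
keys, values = cache.valid()
print(len(keys), len(cache.keys))
```

Because the buffer shape never changes, an exported graph can treat the cache as a static input and output, with the offset passed alongside it.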

## Requirements

- Windows, macOS, or Linux

- torch 2.0

- onnx 1.13.1

# Whisper Original information

[[Blog]](https://openai.com/blog/whisper)

[[Paper]](https://cdn.openai.com/papers/whisper.pdf)

[[Model card]](model-card.md)

[[Colab example]](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb)

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

## Approach

![Approach](approach.png)

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
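As a rough illustration of the multitask format, the sketch below assembles the special-token prefix that selects the language and task. The helper `build_prompt` is hypothetical, and the token strings reflect the Whisper tokenizer as commonly documented; the real model operates on integer token ids rather than strings.

```python
from typing import List


def build_prompt(language: str, task: str, timestamps: bool = False) -> List[str]:
    """Assemble the special-token prefix that tells the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task}")
    # Start marker, then the language tag, then the task specifier.
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder to emit plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


print(build_prompt("ja", "translate"))
```

The same decoder therefore handles transcription and translation; only the prefix changes.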

## Setup

We used Python 3.9.9 and [PyTorch](https://pytorch.org/) 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.7 or later and recent PyTorch versions. The codebase also depends on a few Python packages, most notably [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) for their fast tokenizer implementation and [ffmpeg-python](https://github.com/kkroening/ffmpeg-python) for reading audio files. The following command will pull and install the latest commit from the original Whisper repository, along with its Python dependencies:

    pip install git+https://github.com/openai/whisper.git 

It also requires the command-line tool [`ffmpeg`](https://ffmpeg.org/) to be installed on your system, which is available from most package managers:

```bash
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```

You may need [`rust`](http://rust-lang.org) installed as well, in case [tokenizers](https://pypi.org/project/tokenizers/) does not provide a pre-built wheel for your platform. If you see installation errors during the `pip install` command above, please follow the [Getting started page](https://www.rust-lang.org/learn/get-started) to install a Rust development environment.

## Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed. 

|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |

For English-only applications, the `.en` models tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
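One way to use the table is to pick the largest model that fits the available GPU memory. The helper below is a hypothetical sketch, not part of Whisper: `pick_model` and `MODEL_VRAM_GB` are names invented for this example, and the VRAM figures are the approximate values listed in the table.

```python
# (model name, approximate required VRAM in GB), ordered smallest to largest.
MODEL_VRAM_GB = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]


def pick_model(available_vram_gb, english_only=False):
    """Return the largest model whose approximate VRAM need fits the budget."""
    best = None
    for name, vram in MODEL_VRAM_GB:
        if vram <= available_vram_gb:
            best = name  # later entries are larger, so keep overwriting
    if best is None:
        raise ValueError("no model fits in the available VRAM")
    # The large model has no English-only variant.
    if english_only and best != "large":
        best += ".en"
    return best


print(pick_model(6))
print(pick_model(1, english_only=True))
```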

Whisper's performance varies widely depending on the language. The figure below shows a WER breakdown by language on the Fleurs dataset, using the `large` model. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D in [the paper](https://cdn.openai.com/papers/whisper.pdf).

![WER breakdown by language](language-breakdown.svg)

## Command-line usage

The following commands transcribe speech in audio files using the `medium` model, either through this repository's `cli.py` or through the original `whisper` command:

    python3 cli.py audio.wav --model medium

    whisper audio.flac audio.mp3 audio.wav --model medium

The default setting (which selects the `small` model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the `--language` option:

    whisper japanese.wav --language Japanese

Adding `--task translate` will translate the speech into English:

    whisper japanese.wav --language Japanese --task translate

Run the following to view all available options:

    whisper --help

See [tokenizer.py](whisper/tokenizer.py) for the list of all available languages.

## Python usage

Transcription can also be performed within Python: 

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```

Internally, the `transcribe()` method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
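The windowing step can be illustrated with a simplified sketch. It assumes 16 kHz audio and uses a fixed, non-overlapping stride; the real `transcribe()` is understood to advance the window based on predicted timestamps, so this is an approximation of the idea rather than the actual implementation. The helper name `iter_windows` is hypothetical.

```python
SAMPLE_RATE = 16000        # samples per second (assumed input rate)
CHUNK_SECONDS = 30         # window length described above
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def iter_windows(audio, n_samples=N_SAMPLES):
    """Yield (start_index, window) pairs, zero-padding the final window."""
    for start in range(0, len(audio), n_samples):
        window = list(audio[start:start + n_samples])
        # Pad the last, shorter window with silence up to the full length.
        window += [0.0] * (n_samples - len(window))
        yield start, window


# 70 seconds of silence: two full windows plus one padded 10-second remainder.
audio = [0.0] * (70 * SAMPLE_RATE)
starts = [start / SAMPLE_RATE for start, _ in iter_windows(audio)]
print(starts)
```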

Below is an example usage of `whisper.detect_language()` and `whisper.decode()` which provide lower-level access to the model.

```python
import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
```
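The example treats `probs` as a mapping from language code to probability, which is why `max(probs, key=probs.get)` yields the most likely language. A small sketch of ranking several candidates follows; the helper `top_languages` is hypothetical and the probability values are made up for illustration.

```python
def top_languages(probs, k=3):
    """Return the k most probable (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up values standing in for the output of model.detect_language().
probs = {"en": 0.05, "ja": 0.90, "ko": 0.03, "zh": 0.02}

best = max(probs, key=probs.get)
ranked = top_languages(probs, k=2)
print(best, ranked)
```

Inspecting the runner-up can help flag clips where language detection is uncertain before committing to a decode.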

## License

The code and the model weights of Whisper are released under the MIT License. See [LICENSE](LICENSE) for further details.
