Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/MiscellaneousStuff/openai-whisper-cpu
Improving transcription performance of OpenAI Whisper for CPU based deployment
https://github.com/MiscellaneousStuff/openai-whisper-cpu
Last synced: 2 months ago
JSON representation
Improving transcription performance of OpenAI Whisper for CPU based deployment
- Host: GitHub
- URL: https://github.com/MiscellaneousStuff/openai-whisper-cpu
- Owner: MiscellaneousStuff
- License: mit
- Created: 2022-09-27T14:05:54.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-11-02T11:12:10.000Z (about 2 years ago)
- Last Synced: 2024-08-04T00:11:00.038Z (5 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 36.1 KB
- Stars: 231
- Watchers: 10
- Forks: 18
- Open Issues: 9
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-openai-whisper - OpenAI Whisper - CPU
README
# OpenAI Whisper - CPU
## About
Experiments applying quantization methods to OpenAI Whisper ASR model
to improve the inference speed and throughput on CPU-based deployments.
This is motivated by the fact that, although the Whisper model greatly
improves the accessibility of SOTA ASR and doesn't require depending
on the cloud for high quality transcription, many end users can not
run this model out-of-the-box as most consumer computers only contain
CPUs and do not contain high performance GPUs.This could lead to allowing the larger Whisper models to run faster
on laptops without a GPU.Hardware for experiments: \
CPU - AMD Ryzen 5 5600X \
RAM - 32GB DDR4 \
GPU - Nvidia GeForce RTX 3060 Ti \
HDD - M.2 SSD## Usage
Firstly, get the fork of the OpenAI Whisper repo with the
modifications needed for CPU dynamic quantization:```bash
git submodule init
git submodule update
```And then install the module using:
```bash
pip install -e ./whisper
```### Explanation
Quantization of the Whisper model requires changing the `Linear()`
layers within the model to `nn.Linear()`. This is because you need
to specifiy which layer types to dynamically quantize, such as:```python
quantized_model = torch.quantization.quantize_dynamic(
model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)
```However the whisper model is designed to be adaptable, i.e.
it can run at different precisions, so the `Linear()` layer contains
custom code to account for this. However, this is not required for
the quantized model. You can either change the `Linear()` layers in
"/whisper/whisper/model.py" yourself, or you can just use the above
installation instructions.## Results
Test audio is the first 30 seconds of: \
https://www.youtube.com/watch?v=oKOtzIo-uYw| Device | Whisper Model | Data Type | Linear Layer | Inference Time |
| --- | --- | ----------- | --- | --- |
| GPU | tiny | fp32 | Linear | 0.5 |
| CPU | tiny | fp32 | nn.Linear | 2.3 |
| CPU | tiny | qint8 (quant) | nn.Linear | 3.1 (0.74x slowdown) |Tiny quantized model is 9.67x faster than real time. \
Tiny quantized model is 0.74x slower than the original model.| Device | Whisper Model | Data Type | Linear Layer | Inference Time |
| --- | --- | ----------- | --- | --- |
| GPU | base | fp32 | Linear | 0.6 |
| CPU | base | fp32 | nn.Linear | 5.2 |
| CPU | base | qint8 (quant) | nn.Linear | 3.2 (1.62x speedup) |Base quantized model is 9.37x faster than real time. \
Base quantized model is 1.62x faster than the original model.| Device | Whisper Model | Data Type | Linear Layer | Inference Time |
| --- | --- | ----------- | --- | --- |
| GPU | small | fp32 | Linear | 0.7 |
| CPU | small | fp32 | nn.Linear | 19.1s |
| CPU | small | qint8 (quant) | nn.Linear | 6.9s (2.76x speedup) |Small quantized model is 4.34x faster than real time. \
Small quantized model is 2.76x faster than the original model.| Device | Whisper Model | Data Type | Linear Layer | Inference Time |
| --- | --- | ----------- | --- | ---
| GPU | medium | fp32 | Linear | 1.7s |
| CPU | medium | fp32 | nn.Linear | 60.7 |
| CPU | medium | qint8 (quant) | nn.Linear | 23.1 (2.62x speedup) |Medium quantized model is 1.29x faster than real time. \
Medium quantized model is 2.62x faster than the original model.# Docker
Build the docker image.
```
docker build -t whisper-cpu .
```
Run the quantized model.```
docker run --rm -v "$(pwd)/audio":/usr/src/app/audio -v "$(pwd)/script":/usr/src/app/script whisper-cpu python3 ./script/custom_whisper.py audio/path_to_dir_or_audio_file --language English --model medium.en
```- ```-v "$(pwd)/audio":/usr/src/app/audio``` this creates a volume to give docker access to your audio files.
- ```-v "$(pwd)/script":/usr/src/app/script``` this volume gives docker access to the custom start script. Transcription results are also stored here.- Note: you might want to adjust ```./script/custom_whisper.py``` for your own needs.