Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
WhisperPlus: Advancing Speech-to-Text Processing 🚀
https://github.com/kadirnar/whisper-plus
- Host: GitHub
- URL: https://github.com/kadirnar/whisper-plus
- Owner: kadirnar
- License: apache-2.0
- Created: 2023-11-21T09:59:50.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-04-29T19:23:16.000Z (6 months ago)
- Last Synced: 2024-05-01T16:28:37.042Z (6 months ago)
- Language: Python
- Homepage:
- Size: 184 KB
- Stars: 1,318
- Watchers: 16
- Forks: 107
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - kadirnar/whisper-plus
README
## 🛠️ Installation
```bash
pip install whisperplus git+https://github.com/huggingface/transformers
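# Optional: flash-attn enables Flash Attention 2 and requires a CUDA GPU;
# skip this step on CPU-only or Apple Silicon machines.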
pip install flash-attn --no-build-isolation
```

## 🤗 Model Hub
You can find the models on the [HuggingFace Model Hub](https://huggingface.co/models?search=whisper).
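Any of these checkpoints can be plugged into the pipelines below through their `model_id` argument. A minimal sketch (assuming the quantization and flash-attention arguments shown later are optional; `openai/whisper-tiny` is chosen purely for illustration):

```python
from whisperplus import SpeechToTextPipeline

# Hypothetical minimal construction; see the usage examples below for the
# full argument set.
pipeline = SpeechToTextPipeline(model_id="openai/whisper-tiny")
```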
## 🎙️ Usage
To use the whisperplus library, follow the steps below for different tasks:
### 🎵 YouTube URL to Audio
```python
from whisperplus import SpeechToTextPipeline, download_youtube_to_mp3
from transformers import BitsAndBytesConfig, HqqConfig
import torch

url = "https://www.youtube.com/watch?v=di3rHkEZuUw"
audio_path = download_youtube_to_mp3(url, output_dir="downloads", filename="test")

hqq_config = HqqConfig(
    nbits=4,
    group_size=64,
    quant_zero=False,
    quant_scale=False,
    axis=0,
    offload_meta=False,
)  # axis=0 is used by default

# Alternative 4-bit quantization config; the pipeline below uses hqq_config,
# but bnb_config can be passed as quant_config instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

pipeline = SpeechToTextPipeline(
model_id="distil-whisper/distil-large-v3",
quant_config=hqq_config,
flash_attention_2=True,
)transcript = pipeline(
    audio_path=audio_path,
    chunk_length_s=30,  # transcribe in 30-second chunks
    stride_length_s=5,  # seconds of overlap between consecutive chunks
    max_new_tokens=128,
    batch_size=100,
    language="english",
    return_timestamps=False,
)
print(transcript)
```

### 🍎 Apple MLX
```python
from whisperplus.pipelines import mlx_whisper
from whisperplus import download_youtube_to_mp3

url = "https://www.youtube.com/watch?v=1__CAdTJ5JU"
audio_path = download_youtube_to_mp3(url)

text = mlx_whisper.transcribe(
    audio_path, path_or_hf_repo="mlx-community/whisper-large-v3-mlx"
)["text"]
print(text)
```

### 🍏 Lightning MLX Whisper
```python
from whisperplus.pipelines.lightning_whisper_mlx import LightningWhisperMLX
from whisperplus import download_youtube_to_mp3

url = "https://www.youtube.com/watch?v=1__CAdTJ5JU"
audio_path = download_youtube_to_mp3(url)

whisper = LightningWhisperMLX(model="distil-large-v3", batch_size=12, quant=None)
output = whisper.transcribe(audio_path=audio_path)["text"]
print(output)
```

### 📰 Summarization
```python
from whisperplus.pipelines.summarization import TextSummarizationPipeline

# `transcript` is the text returned by the speech-to-text pipeline above.
summarizer = TextSummarizationPipeline(model_id="facebook/bart-large-cnn")
summary = summarizer.summarize(transcript)
print(summary[0]["summary_text"])
```

### 📰 Long Text Summarization
```python
from whisperplus.pipelines.long_text_summarization import LongTextSummarizationPipeline

summarizer = LongTextSummarizationPipeline(model_id="facebook/bart-large-cnn")
summary_text = summarizer.summarize(transcript)
print(summary_text)
```

### 💬 Speaker Diarization
You must first accept the license terms of these two models on the HuggingFace Hub:
- https://huggingface.co/pyannote/speaker-diarization-3.1
- https://huggingface.co/pyannote/segmentation-3.0

```bash
pip install -r requirements/speaker_diarization.txt
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```

```python
from whisperplus.pipelines.whisper_diarize import ASRDiarizationPipeline
from whisperplus import download_youtube_to_mp3, format_speech_to_dialogue

audio_path = download_youtube_to_mp3("https://www.youtube.com/watch?v=mRB14sFHw2E")

device = "cuda"  # or "cpu" / "mps"
pipeline = ASRDiarizationPipeline.from_pretrained(
    asr_model="openai/whisper-large-v3",
    diarizer_model="pyannote/speaker-diarization-3.1",
    use_auth_token=False,
    chunk_length_s=30,
    device=device,
)

output_text = pipeline(audio_path, num_speakers=2, min_speaker=1, max_speaker=2)
dialogue = format_speech_to_dialogue(output_text)
print(dialogue)
```

### ⭐ RAG - Chat with Video (LanceDB)
```bash
pip install sentence-transformers ctransformers langchain
```

```python
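# Note: input_file expects a plain-text transcript of the video. One way to
# produce it (a sketch, reusing `transcript` from the speech-to-text example
# above) is:
#     with open("transcript.txt", "w") as f:
#         f.write(transcript)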
from whisperplus.pipelines.chatbot import ChatWithVideo

chat = ChatWithVideo(
    input_file="transcript.txt",  # plain-text transcript of the video
    llm_model_name="TheBloke/Mistral-7B-v0.1-GGUF",
    llm_model_file="mistral-7b-v0.1.Q4_K_M.gguf",
    llm_model_type="mistral",
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
)

query = "What is this video about?"
response = chat.run_query(query)
print(response)
```

### 🌠 RAG - Chat with Video (AutoLLM)
```bash
pip install "autollm>=0.1.9"
```

```python
from whisperplus.pipelines.autollm_chatbot import AutoLLMChatWithVideo

# service_context_params
system_prompt = """
You are an friendly ai assistant that help users find the most relevant and accurate answers
to their questions based on the documents you have access to.
When answering the questions, mostly rely on the info in documents.
"""
query_wrapper_prompt = """
The document information is below.
---------------------
{context_str}
---------------------
Using the document information and mostly relying on it,
answer the query.
Query: {query_str}
Answer:
"""chat = AutoLLMChatWithVideo(
input_file="input_dir", # path of mp3 file
openai_key="YOUR_OPENAI_KEY", # optional
huggingface_key="YOUR_HUGGINGFACE_KEY", # optional
llm_model="gpt-3.5-turbo",
llm_max_tokens="256",
llm_temperature="0.1",
system_prompt=system_prompt,
query_wrapper_prompt=query_wrapper_prompt,
embed_model="huggingface/BAAI/bge-large-zh", # "text-embedding-ada-002"
)query = "what is this video about ?"
response = chat.run_query(query)
print(response)
```

### 🎙️ Text to Speech
```python
from whisperplus.pipelines.text2speech import TextToSpeechPipeline

tts = TextToSpeechPipeline(model_id="suno/bark")
audio = tts(text="Hello World", voice_preset="v2/en_speaker_6")
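
# Persisting the audio is not shown in the original example; a sketch,
# assuming the pipeline returns a NumPy waveform at Bark's default 24 kHz
# sample rate:
from scipy.io import wavfile

wavfile.write("hello.wav", rate=24_000, data=audio)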
```

### 🎥 AutoCaption
```bash
pip install moviepy
apt install imagemagick libmagick++-dev
sed -i 's/none/read,write/g' /etc/ImageMagick-6/policy.xml
```

```python
from whisperplus.pipelines.whisper_autocaption import WhisperAutoCaptionPipeline
from whisperplus import download_youtube_to_mp4

video_path = download_youtube_to_mp4(
    "https://www.youtube.com/watch?v=di3rHkEZuUw",
    output_dir="downloads",
    filename="test",
)  # Optional

caption = WhisperAutoCaptionPipeline(model_id="openai/whisper-large-v3")
caption(video_path=video_path, output_path="output.mp4", language="english")
```

## 😍 Contributing
```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```

## 📜 License
This project is licensed under the terms of the Apache License 2.0.
## 🤗 Citation
```bibtex
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```