{"id":13715424,"url":"https://github.com/Majdoddin/nlp","last_synced_at":"2025-05-07T04:30:45.964Z","repository":{"id":61320321,"uuid":"546784071","full_name":"Majdoddin/nlp","owner":"Majdoddin","description":null,"archived":false,"fork":false,"pushed_at":"2023-08-23T03:53:01.000Z","size":1437,"stargazers_count":457,"open_issues_count":2,"forks_count":56,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-11-14T03:34:28.434Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Majdoddin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-10-06T16:32:24.000Z","updated_at":"2024-11-07T19:20:56.000Z","dependencies_parsed_at":"2024-01-14T22:13:37.288Z","dependency_job_id":null,"html_url":"https://github.com/Majdoddin/nlp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Majdoddin%2Fnlp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Majdoddin%2Fnlp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Majdoddin%2Fnlp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Majdoddin%2Fnlp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Majdoddin","download_url":"https://codeload.github.com/Majdoddin/nlp/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252813650,"owners_count":2180
8362,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T00:00:58.852Z","updated_at":"2025-05-07T04:30:45.502Z","avatar_url":"https://github.com/Majdoddin.png","language":"Jupyter Notebook","funding_links":[],"categories":["Applications","Jupyter Notebook"],"sub_categories":[],"readme":"*If you like my code, please [donate!](./donation/donation.md)*\n# Pyannote plays and Whisper rhymes [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Majdoddin/nlp/blob/main/Pyannote_plays_and_Whisper_rhymes_v_2_0.ipynb)\n\n# Whisper's transcription plus Pyannote's Diarization \n\n**Update** - [@johnwyles](https://github.com/johnwyles) added HTML output for audio/video files from Google Drive, along with some fixes.\n\nUsing the new word-level timestamping of Whisper, the transcription words are highlighted as the video plays, with optional autoscroll. And the display on small displays is improved.\n\nMoreover, the model is loaded just once, thus the whole thing runs much faster now. You can also hardcode your Huggingface token. \n\n---\nAndrej Karpathy [suggested](https://twitter.com/karpathy/status/1574476200801538048?s=20\u0026t=s5IMMXOYjBI6-91dib6w8g) training a classifier on top of  OpenAI [Whisper](https://openai.com/blog/whisper/) model features to identify the speaker, so we can visualize the speaker in the transcript. 
But, as [pointed out](https://twitter.com/tarantulae/status/1574493613362388992?s=20\u0026t=s5IMMXOYjBI6-91dib6w8g) by Christian Perone, it seems that features from Whisper wouldn't be that great for speaker recognition, as its main objective is basically to ignore speaker differences.\n\nIn the following, I use [**`pyannote-audio`**](https://github.com/pyannote/pyannote-audio), a speaker diarization toolkit by Hervé Bredin, to identify the speakers, and then match the diarization with Whisper's transcription, linked to the video. The input can be a YouTube video or a video/audio file (also on Google Drive). I try it on a [Customer Support Call](https://youtu.be/hpZFJctBUHQ). Check the result [**here**](https://majdoddin.github.io/dyson.html).\n\nTo make it easier to match the transcriptions to diarizations by speaker change, Sarah Kaiser [suggested](https://github.com/openai/whisper/discussions/264#discussioncomment-3825375) running pyannote.audio first and then just running Whisper on the split-by-speaker chunks. \nFor the sake of performance (and transcription quality?), we concatenate the audio segments into a single audio file with a silent spacer as a separator, and run Whisper on it. Enjoy it!\n\n(For the sake of performance, I also tried concatenating the audio segments into a single audio file with a silent (or beep) spacer as a separator and running Whisper on it; see it on [colab](https://colab.research.google.com/drive/1HuvcY4tkTHPDzcwyVH77LCh_m8tP-Qet?usp=sharing). It [works](https://majdoddin.github.io/lexicap.html) on some audio and fails on other audio (Dyson's Interview). The problem is that Whisper does not reliably produce a timestamp at a spacer. See the discussions [#139](https://github.com/openai/whisper/discussions/139) and [#29](https://github.com/openai/whisper/discussions/29).)\n\nThe Markdown form used below is from [@ArthurFDLR](https://github.com/ArthurFDLR/whisper-youtube/).   
\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMajdoddin%2Fnlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMajdoddin%2Fnlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMajdoddin%2Fnlp/lists"}