Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/adesoji1/youtubesummaryai
Python script for YouTube summary. The service should summarize an YouTube video by url. It should works for long video and for different languages.
https://github.com/adesoji1/youtubesummaryai
Last synced: 17 days ago
JSON representation
Python script for YouTube summary. The service should summarize an YouTube video by url. It should works for long video and for different languages.
- Host: GitHub
- URL: https://github.com/adesoji1/youtubesummaryai
- Owner: Adesoji1
- Created: 2024-05-29T17:13:53.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-06-06T08:41:17.000Z (5 months ago)
- Last Synced: 2024-06-06T09:51:05.655Z (5 months ago)
- Language: Python
- Size: 20.1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
export CUDA_LAUNCH_BLOCKING=1
python your_script.py
# YouTube Video Summarizer
This repository contains Python scripts to summarize YouTube videos using two different approaches. The scripts extract transcripts from YouTube videos and use pre-trained AI models to generate concise summaries. The resulting summaries are saved in a CSV file along with the corresponding YouTube video URLs.
## Approaches
### Approach 1: YouTube Transcript API and BART Summarization
1. **Fetch Transcript**: Uses the `YouTubeTranscriptApi` to fetch video transcripts.
2. **Preprocessing**: Cleans the transcript text.
3. **Summarization**: Uses the BART model (`facebook/bart-large-cnn`) to summarize the cleaned transcript.
4. **CSV Export**: Saves the summary and video URL to a CSV file.```python
import os
import csv
import re
import torch
import time
from dotenv import load_dotenv
from googleapiclient.discovery import build
from transformers import pipeline
from youtube_transcript_api import YouTubeTranscriptApi# Load environment variables
load_dotenv()# Set up YouTube API
youtube_api_key = os.getenv('YOUTUBE_API_KEY')
youtube = build('youtube', 'v3', developerKey=youtube_api_key)def setup_summarizer(retries=3):
for attempt in range(retries):
try:
device = 0 if torch.cuda.is_available() else -1
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)
return summarizer
except RuntimeError as e:
if "CUDA-capable device(s) is/are busy or unavailable" in str(e):
torch.cuda.empty_cache()
time.sleep(10)
return pipeline("summarization", model="facebook/bart-large-cnn", device=-1)summarizer = setup_summarizer()
def get_video_id(url):
video_id = None
if 'youtube.com/watch?v=' in url:
video_id = url.split('v=')[1]
elif 'youtu.be/' in url:
video_id = url.split('youtu.be/')[1]
elif 'youtube.com/embed/' in url:
video_id = url.split('embed/')[1]
else:
raise ValueError("Invalid YouTube URL format")
if '&' in video_id:
video_id = video_id.split('&')[0]
return video_iddef get_transcript(video_id):
try:
transcript_list = YouTubeTranscriptApi.get_transcript(video_id)
transcript = ' '.join([t['text'] for t in transcript_list])
return transcript
except Exception as e:
return Nonedef preprocess_text(text):
return re.sub(r'[^\w\s]', '', text)def summarize_text(text):
summary = summarizer(text, max_length=200, min_length=60, do_sample=False)
return summary[0]['summary_text']def summarize_video(url):
try:
video_id = get_video_id(url)
except ValueError as e:
return str(e)
transcript = get_transcript(video_id)
if transcript:
cleaned_text = preprocess_text(transcript)
summary = summarize_text(cleaned_text)
return summary
else:
return "No transcript available or error in fetching transcript."def save_summary_to_csv(video_url, summary):
file_exists = os.path.isfile('video_summaries.csv')
with open('video_summaries.csv', mode='a', newline='', encoding='utf-8') as file:
writer = csv.writer(file)
if not file_exists:
writer.writerow(['YouTube Link', 'Summary'])
writer.writerow([video_url, summary])# Example usage
try:
video_url = 'https://youtu.be/rULO-Axqs40'
summary = summarize_video(video_url)
print("Summary:", summary)
save_summary_to_csv(video_url, summary)
except ValueError as e:
print(e)
```### Approach 2: Audio Transcription and T5 Summarization
1. **Download Audio**: Uses `yt-dlp` to download the audio of a YouTube video.
2. **Transcription**: Uses the `speech_recognition` library to transcribe the audio.
3. **Summarization**: Uses the T5 model (`google-t5/t5-small`) to summarize the transcript.
4. **CSV Export**: Saves the summary and video URL to a CSV file.```python
import re
import yt_dlp as youtube_dl
import speech_recognition as sr
from transformers import pipeline
from googleapiclient.discovery import build
from time import sleep
import torch
import pandas as pdAPI_KEY = 'YOUR_API_KEY'
youtube = build('youtube', 'v3', developerKey=API_KEY)def get_video_details(video_id):
request = youtube.videos().list(part='snippet', id=video_id)
response = request.execute()
return response['items'][0]['snippet']['title']def download_audio(video_url):
ydl_opts = {
'format': 'bestaudio/best',
'postprocessors': [{
'key': 'FFmpegExtractAudio',
'preferredcodec': 'wav',
'preferredquality': '192',
}],
'outtmpl': 'downloaded_audio.%(ext)s',
'quiet': True,
'cookiefile': 'cookies.txt',
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
ydl.download([video_url])
return 'downloaded_audio.wav'def download_audio_with_retries(video_url, max_retries=3):
attempt = 0
while attempt < max_retries:
try:
return download_audio(video_url)
except youtube_dl.utils.DownloadError as e:
if 'HTTP Error 429' in str(e):
sleep(10)
attempt += 1
else:
raise e
raise Exception("Failed to download audio after retries")def transcribe_audio(audio_file):
recognizer = sr.Recognizer()
with sr.AudioFile(audio_file) as source:
audio_data = recognizer.record(source)
try:
text = recognizer.recognize_google(audio_data, language='en-US')
return text
except sr.UnknownValueError:
return "Audio could not be understood"def summarize_text(text):
device = 0 if torch.cuda.is_available() else -1
summarizer = pipeline("summarization", model="google-t5/t5-small", device=device)
summary = summarizer(text, max_length=300, min_length=60, do_sample=False)
return summary[0]['summary_text']def extract_video_id(video_url):
patterns = [
r'youtu\.be\/([^#\&\?]{11})',
r'watch\?v=([^#\&\?]{11})',
r'embed\/([^#\&\?]{11})'
]for pattern in patterns:
match = re.search(pattern, video_url)
if match:
return match.group(1)
raise ValueError("Invalid YouTube URL or video ID format")def main(video_url):
try:
video_id = extract_video_id(video_url)
except ValueError as e:
print(e)
returntitle = get_video_details(video_id)
audio_file = download_audio_with_retries(video_url)
transcript = transcribe_audio(audio_file)
summary = summarize_text(transcript)
print("Video Title:", title)
print("Summary:", summary)results = {
"YouTube Link": [video_url],
"Summary": [summary]
}
df = pd.DataFrame(results)
df.to_csv('youtube_summaries.csv', index=False)if __name__ == "__main__":
video_url = input("Enter YouTube video URL: ")
main(video_url)
```## Future Work
1. **Application Development**: Build a web application or a desktop application that allows users to input YouTube URLs and get summaries directly.
2. **Language Support**: Enhance the script to support multiple languages for transcription and summarization.
3. **Batch Processing**: Add functionality to process multiple videos at once.
4. **Error Handling**: Improve error handling and logging mechanisms for more robust and user-friendly interactions.
5. **Real-time Summarization**: Explore possibilities for real-time video summarization during live streams.
6. **User Interface**: Develop a graphical user interface (GUI) for non-technical users.
7. **Performance Optimization**: Optimize the scripts for faster execution and lower resource consumption.## Requirements
- Python 3.11+
- `torch`
- `transformers`
- `youtube_transcript_api`
- `google-api-python-client`
- `yt-dlp`
- `speechrecognition`
- `pandas`
- `ffmpeg`## Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/YouTubeVideoSummarizer.git
cd YouTubeVideoSummarizer
```2. Install the required dependencies:
```bash
pip install -r requirements.txt
```3. Set up your environment variables in a `.env` file:
```
YOUTUBE_API_KEY=YOUR_API_KEY
```4. Run the script:
```bash
python script_name.py
```Replace `script_name.py` with the appropriate script you want to run.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
This README provides an overview of the YouTube video summarization project, details on the two approaches used, and outlines future work for further development and enhancement