# Video-Description-and-Summarization-Using-BLIP-and-BART-Models
This project processes videos by extracting frames, generating detailed visual descriptions for each frame using the BLIP model, and then summarizing these descriptions with the BART model.

## Key features include:

**Frame Extraction:** Extracts 1 frame per second from the input video for efficient processing.

**Visual Captioning:** Uses Salesforce's BLIP image-captioning model to generate natural language descriptions for each extracted frame.

**Summarization:** Combines the generated frame descriptions and summarizes them into a cohesive video summary using Facebook's BART large CNN model.

## Workflow Explanation:
**Video Loading:** The video is loaded with OpenCV's `cv2.VideoCapture`, which opens the video file at the specified path. The program checks whether the video was opened successfully; if not, it exits with an error message.

**FPS and Frame Count Extraction:** The video's frames per second (FPS) and total number of frames are read using OpenCV's `CAP_PROP_FPS` and `CAP_PROP_FRAME_COUNT` properties. These values are then used to calculate the video's total duration.

![image](https://github.com/user-attachments/assets/87900d4d-121e-4dd4-ad5a-bfebe4351582)
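The screenshot above illustrates these two steps. A minimal equivalent sketch (the file path and variable names are placeholders, not taken from the repository):

```python
import sys
import cv2

video_path = "input_video.mp4"  # placeholder path, not from the repository
cap = cv2.VideoCapture(video_path)

# Exit with an error message if the video could not be opened.
if not cap.isOpened():
    sys.exit(f"Error: could not open video {video_path}")

# Read FPS and total frame count, then derive the duration in seconds.
fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
duration_seconds = frame_count / fps if fps > 0 else 0.0
print(f"FPS: {fps:.2f}, frames: {frame_count}, duration: {duration_seconds:.1f}s")
```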

**Model Initialization:** Two models are loaded:

**BLIP Model (Salesforce/blip-image-captioning-base):** Used for generating natural language captions for individual frames.

**BART Model (facebook/bart-large-cnn):** Used for summarizing the textual descriptions generated by BLIP into a more concise and meaningful summary.

Both models are loaded into memory and moved to the appropriate device (GPU if available, otherwise CPU).

![image](https://github.com/user-attachments/assets/50475086-eda2-4d54-8dc2-e46d7dd0efcd)
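The screenshot above shows the loading code. A hedged sketch of the same step, using the Hugging Face `transformers` classes these checkpoints are commonly loaded with (the repository's exact class choices may differ):

```python
import torch
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BartTokenizer,
    BartForConditionalGeneration,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLIP: generates a caption for each extracted frame.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

# BART: summarizes the combined frame descriptions.
bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
bart_model = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-large-cnn"
).to(device)
```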

**Frame Extraction:** Frames are extracted from the video at a rate of one frame per second. This is done by reading every frame and keeping only those at intervals matching the FPS value (e.g., for a 30 FPS video, every 30th frame is kept). The extracted frames are stored in `frame_list`.

![image](https://github.com/user-attachments/assets/11c2487a-c278-4fb6-bef7-2e2baf4801c3)
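A sketch of this sampling loop, reusing `cap` and `fps` from the loading step above (`frame_list` matches the name used in this README; the rest is illustrative):

```python
frame_list = []
frame_index = 0
interval = max(int(round(fps)), 1)  # e.g. every 30th frame for a 30 FPS video

while True:
    ret, frame = cap.read()
    if not ret:
        break  # end of video
    if frame_index % interval == 0:
        # OpenCV reads frames as BGR; convert to RGB for the BLIP processor.
        frame_list.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    frame_index += 1

cap.release()
print(f"Extracted {len(frame_list)} frames")
```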

**Frame Descriptions:** For each extracted frame, the BLIP model generates a visual description. This step calls the `generate_description` helper, which processes the frame with BLIP and returns a caption for the content depicted in that frame.

![image](https://github.com/user-attachments/assets/9d271733-b8da-4c1b-9237-5abf0779da94)
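A hedged sketch of what such a `generate_description` helper could look like, built on the BLIP processor and model loaded earlier (generation parameters are illustrative, not taken from the repository):

```python
from PIL import Image

def generate_description(frame_rgb) -> str:
    """Caption a single RGB frame (NumPy array) with BLIP."""
    image = Image.fromarray(frame_rgb)
    inputs = blip_processor(images=image, return_tensors="pt").to(device)
    output_ids = blip_model.generate(**inputs, max_new_tokens=30)
    return blip_processor.decode(output_ids[0], skip_special_tokens=True)

# One description per extracted frame.
descriptions = [generate_description(frame) for frame in frame_list]
```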

**Combining Frame Descriptions:** Once all frames have been described, the individual descriptions are concatenated into a single large text block. This forms a comprehensive description of the video based on visual content across multiple frames.

**Summarization:** The combined text of frame descriptions is fed into the BART model. The model processes the text and generates a concise summary. The summary is printed out as the final description of the video.

![image](https://github.com/user-attachments/assets/38ef7abb-fa92-4637-8e19-2854ae310059)
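Together, these last two steps might look like the following sketch, reusing the BART tokenizer and model loaded earlier (beam search settings are illustrative):

```python
# Combine the per-frame captions into one text block.
combined_text = " ".join(descriptions)

# Summarize with BART; truncate to the model's 1024-token input limit.
inputs = bart_tokenizer(
    combined_text, return_tensors="pt", truncation=True, max_length=1024
).to(device)

summary_ids = bart_model.generate(
    inputs["input_ids"],
    num_beams=4,
    min_length=40,
    max_length=150,
    early_stopping=True,
)
summary = bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Video summary:", summary)
```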

## Output:
![image](https://github.com/user-attachments/assets/06c42ed9-6886-4f49-95ea-15ce9e3e1f31)

![image](https://github.com/user-attachments/assets/e06bc6cb-9c1a-4b0c-a655-88286cf3a06e)