https://github.com/deep-diver/vid2persona

This project breathes life into video characters by using AI to describe their personality and then chat with you as them.
https://github.com/deep-diver/vid2persona

Last synced: 2 months ago
JSON representation

This project breathes life into video characters by using AI to describe their personality and then chat with you as them.

Host: GitHub
URL: https://github.com/deep-diver/vid2persona
Owner: deep-diver
Created: 2024-03-05T02:40:25.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-03-12T03:23:46.000Z (over 1 year ago)
Last Synced: 2025-03-31T00:31:36.246Z (4 months ago)
Language: Jupyter Notebook
Size: 62.1 MB
Stars: 45
Watchers: 2
Forks: 6
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Vid2Persona

This project breathes life into video characters by using AI to describe their personality and then chat with you as them.

## Brainstormed workflow

1. get a person's description from the video clip using Large Multimodal Model
- We choose [Get video descriptions](https://cloud.google.com/vertex-ai/generative-ai/docs/video/video-descriptions#vid-desc-rest) service from [Generative AI on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai).

2. based on the description, ask Large Language Model to pretend to be the person
3. then, chatting with that personality
- We choose either [Gemini API from Google AI Studio](https://ai.google.dev/) or [Gemini API from Generative AI on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/gemini).

The final output is the Gradio based chatting application hosted on [Hugging Face Space](https://huggingface.co/spaces).

Optionally, we could leverage other open source technologies
- [diffusers](https://huggingface.co/docs/diffusers/en/index) to generate images of the person in different poses or the backgrounds
- [transformers](https://huggingface.co/docs/transformers/en/index) to replace closed Gemini model with open models such as [LLaMA2](https://llama.meta.com/), [Gemma](https://blog.google/technology/developers/gemma-open-models/), [Mistral](https://mistral.ai/), etc.

## Realized workflow

### Character description

We obtain a description from an input video using the [Gemini Pro 1.0 API](https://ai.google.dev/). We create a custom prompt (which we brainstormed with help of ChatGPT) to provide as inputs to the API along with the video. The prompt is available in [this file](./vid2persona/prompts/vlm.toml).

Refer to [this notebook](./notebooks/Ask_about_character.ipynb) for a rundown.

Here is an example of how a Gemini response looks like:

```json
{
"characters": [
{
"name": "Alice",
"physicalDescription": "Alice is a young woman with long, wavy brown hair and hazel eyes. She is of average height and has a slim build. Her most distinctive feature is her warm, friendly smile.",
"personalityTraits": [
"Alice is a kind, compassionate, and intelligent woman. She is always willing to help others and is a great listener. She is also very creative and has a great sense of humor.",
],
"likes": [
"Alice loves spending time with her friends and family.",
"She enjoys reading, writing, and listening to music.",
"She is also a big fan of traveling and exploring new places."
],
"dislikes": [
"Alice dislikes rudeness and cruelty.",
"She also dislikes being lied to or taken advantage of.",
"She is not a fan of heights or roller coasters."
],
"background": [
"Alice grew up in a small town in the Midwest.",
"She was always a good student and excelled in her studies.",
"After graduating from high school, she moved to the city to attend college.",
"She is currently working as a social worker."
],
"goals": [
"Alice wants to make a difference in the world.",
"She hopes to one day open her own counseling practice.",
"She also wants to travel the world and experience different cultures."
],
"relationships": [
"Alice is very close to her family and friends.",
"She is also in a loving relationship with her partner, Ben.",
"She has a good relationship with her colleagues and is well-respected by her clients."
]
}
]
}
```

### Chatting with the character

Next, we construct a system prompt from the response above and use it as an input to a Large Language Model (LLM). This prompt is available [here](./vid2persona/prompts/llm.toml). The system prompt helps the LLM to be character-aware.

Refer to [this notebook](./notebooks/llm_personality.ipynb) for a rundown.

> [!NOTE]
> If a video contains multiple characters, we construct the system prompt only for one.

You can find all of this collated into a single pipeline in [this demo](https://huggingface.co/spaces/chansung/vid2persona). Feel free to give it a try!

## Design considerations

We designed the overall pipeline like so for the following reasons:

* Videos can be hard to process efficiently and captioning them requires quite a lot compute cavalry. The existing open solutions didn't meet our needs. This why we delegated this part of the pipeline to Gemini.
* On the other hand, the literature around making LLMs accessible is widely popular, thanks to tools like `bitsandbytes`. For the second part of the pipeline, we wanted to provide the users the flexibility of "bring your own language model". This is also because there's an abundance of high-quality open LLMs particularly good at this task. For our project, we used [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) because it's small (7B) and also very performant.

For the scaling the second part of the pipeline, [`text-generation-inference`](https://huggingface.co/docs/text-generation-inference) is leveraged.

## Acknowledgments

This is a project built during the Gemini sprint held by Google's ML Developer Programs team. We are thankful to be granted good amount of GCP credits to finish up this project.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/deep-diver/vid2persona

Awesome Lists containing this project

README