# Video Driven Skill


Automate from how you actually work.



Turn screen recordings into skills you can run, edit, and reuse.




**Stack:** Java 17 · Spring Boot 4.1 · React 19 · Vite 8 · Tailwind CSS 4 · SQLite · FFmpeg · MIT License

---

## Overview

Video Driven Skill is an open-source automation studio that transforms **screen recordings** into **runnable, editable skill packages**. Upload a video, extract key frames, annotate intent, let a multimodal AI model draft the skill — then refine, run, version, archive, and export it.

The project is designed for teams and individuals who want automation to start from **how work is actually performed**, not from a blank script editor.

> **Workflow:** Record the process → Pick the frames that matter → Annotate intent → Generate a skill → Review & run → Export & deploy

---

## Features

- **Video-to-Skill Pipeline** — Upload an operation recording and automatically convert it into a structured skill package with `SKILL.md`, `package.json`, scripts, and variables.
- **Smart Frame Extraction** — Auto-extract key frames via FFmpeg, or manually capture the moments that matter.
- **Visual Annotation** — Mark up frames with arrows, notes, and corrections to tell the AI exactly what to do.
- **Multimodal AI Generation** — Leverages any OpenAI-compatible vision model to generate browser, Android, iOS, or desktop automation code.
- **In-Browser Code Editor** — Review, edit, and refine generated code with syntax highlighting and variable management.
- **Incremental Regeneration** — Regenerate the full skill or just a selected code range, with diff review between versions.
- **Local Skill Runner** — Run skills directly with streamed logs and optional screenshots.
- **Skill Repository** — Browse, search, import, export (ZIP), and drag-to-reorder your skill collection.
- **Knowledge Base** — Attach reference images, documents, and notes to each skill for richer context.
- **Archive System** — Preserve videos, frames, and requirements for building future skills from past material.

---

## Quick Start

Install [Docker](https://docs.docker.com/get-docker/) first, then choose the path that matches your goal.

### Option 1: Run pre-built images

Use this if you just want to run the app. The install script downloads the release Compose file, creates `.env`, pulls the pre-built images, and starts the stack.

#### macOS / Linux

```bash
curl -fsSL https://raw.githubusercontent.com/thought2code/video-driven-skill/main/scripts/install.sh | bash
```

#### Windows

```powershell
irm https://raw.githubusercontent.com/thought2code/video-driven-skill/main/scripts/install.ps1 | iex
```

Default install location:

- macOS / Linux: `~/video-driven-skill`
- Windows: `%USERPROFILE%\video-driven-skill`

Open `http://localhost` after the script finishes (Docker uses standard ports 80 / 443).

To use AI generation, set your API key in the generated `.env` file:

```env
AI_API_KEY=your-key-here
AI_BASE_URL=your-base-url
AI_MODEL=your-model
```
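
Before starting a generation, it can help to confirm that none of the three settings is empty or still set to the placeholder shown above. The following is a minimal sketch and not part of the project: `check_ai_env` is a hypothetical helper name, and it assumes plain `KEY=value` lines without quotes or `export` prefixes.

```bash
# Sketch: report which AI settings in a .env file are missing or still placeholders.
# The placeholder strings are the ones shown in this README.
check_ai_env() {
  local env_file="$1" key value status=0
  for key in AI_API_KEY AI_BASE_URL AI_MODEL; do
    # Use the last assignment for the key; commented-out lines do not match.
    value="$(grep -E "^${key}=" "$env_file" 2>/dev/null | tail -n 1 | cut -d= -f2- || true)"
    case "$value" in
      ""|your-key-here|your-base-url|your-model)
        echo "unset: $key"
        status=1
        ;;
    esac
  done
  return "$status"
}
```

Run it as `check_ai_env .env` from the directory that holds the generated file. It prints one `unset:` line per setting that still needs a value and returns a non-zero status if any are found.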

Common install options: `--tag v1.0.0`, `--dir` (followed by the target directory), and `--no-open`. For local development, `npm run dev` uses port 3000 instead.

### Option 2: Build from source

Use this for development, to run unreleased changes from `main`, or to build the images locally. It requires Docker and Git.

```bash
git clone https://github.com/thought2code/video-driven-skill.git
cd video-driven-skill
```

#### macOS / Linux

```bash
chmod +x scripts/run-in-docker.sh
./scripts/run-in-docker.sh
```

#### Windows

```bat
.\scripts\run-in-docker.cmd
```

On first run, `.env` is created from `.env.example`; set `AI_API_KEY` before using AI features:

```env
AI_API_KEY=your-key-here
AI_BASE_URL=your-base-url
AI_MODEL=your-model
```

For faster base-image pulls in China, add `--cn`. To skip opening the browser, add `--no-open`.

### Public HTTPS (Let's Encrypt)

The frontend runs **Caddy** as a reverse proxy. Set a public hostname in `.env` and Caddy will obtain and renew **Let's Encrypt** certificates automatically. With no domain configured, the stack serves **HTTP only** at `http://localhost`.

**Prerequisites**

1. A server with a public IP and Docker installed.
2. An **A record** for your hostname (e.g. `vds.example.com`) pointing to that IP.
3. A firewall or security group that allows inbound TCP on ports **80** and **443** (optionally **443/UDP** for HTTP/3).

**Configuration** (see `.env.example`):

```env
VDS_DOMAIN=vds.example.com
ACME_EMAIL=you@example.com
```

- `VDS_DOMAIN`: hostname only (no `https://` or path).
- `ACME_EMAIL`: optional, for Let's Encrypt expiry notices.
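
Because `VDS_DOMAIN` must be a bare hostname, a quick check before writing it to `.env` can catch a pasted URL. This is a rough sketch under stated assumptions: `is_bare_hostname` is a hypothetical helper, it reads "hostname only" as also excluding ports, and it requires at least one dot, so single-label names such as `localhost` are rejected.

```bash
# Sketch: succeed only for a dot-separated hostname with no scheme, path, or port.
is_bare_hostname() {
  local host="$1"
  # Each label is letters, digits, and inner hyphens; ":" and "/" never match.
  printf '%s\n' "$host" | grep -Eq \
    '^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)+$'
}
```

For example, `is_bare_hostname "vds.example.com"` succeeds, while `is_bare_hostname "https://vds.example.com"` and `is_bare_hostname "vds.example.com/app"` both fail.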

**Start**

```bash
docker compose up -d --build
```

On first start with `VDS_DOMAIN` set, allow time for ACME validation (often 30 seconds to a few minutes), then open `https://vds.example.com`. HTTP requests are redirected to HTTPS.

Certificates persist in Docker volumes `caddy-data` and `caddy-config`.

**Troubleshooting**

- Certificate not issued: verify DNS (`dig vds.example.com`) and that ports 80/443 are reachable from the internet.
- Logs: `docker compose logs -f frontend`

---

## Typical Workflow

1. **Upload** — Upload an operation recording (e.g., a screen capture of a workflow).
2. **Extract Frames** — Auto-extract key frames or manually capture the moments that matter.
3. **Annotate** — Mark up frames with arrows, notes, and corrections.
4. **Describe Intent** — Tell the AI what you want, e.g., "Collect item names from this page and export them."
5. **Generate** — Let the multimodal model produce a complete skill package.
6. **Review & Edit** — Inspect generated code, adjust variables, and refine the output.
7. **Run** — Execute the skill locally with streamed log output.
8. **Iterate** — Regenerate the full skill or just a selected section, with diff comparison.
9. **Export & Deploy** — Package as a ZIP or deploy to your local skill directory.

---

## Architecture

```text
video-driven-skill/
├── backend/ # Spring Boot — API, video processing, AI, skill runner
├── frontend/ # React + Vite — studio UI
├── docker-compose.yml # Docker deployment (build from source)
├── docker-compose.release.yml # GHCR images (no clone)
├── docker-compose.cn.yml # Optional mirror overlay (local build)
├── ARCHITECTURE.md # Architecture (English)
├── ARCHITECTURE.zh-CN.md # Architecture (Chinese)
├── scripts/
│ ├── install.sh / install.ps1 # Install from GHCR (no clone)
│ ├── run-in-docker.cmd / .sh # Build & run from source
│ └── kill-midscene.sh # Optional cleanup helper
```

### Backend (Spring Boot / Java 17)

| Module | Responsibility |
|------------------------------|------------------------------------------------------------------|
| `controller/` | REST API & WebSocket entry points |
| `service/VideoService` | Video upload, FFmpeg frame extraction, streaming |
| `service/AIService` | Prompt construction & multimodal API calls |
| `service/SkillService` | Skill CRUD, import/export, versioning |
| `service/SkillRunnerService` | Workspace setup, dependency injection, execution, log collection |
| `service/KnowledgeService` | Per-skill reference files & manifest |
| `model/` & `repository/` | SQLite-backed domain entities |

Runtime data lives under `~/video-driven-skill/` by default (override with `VIDEO_DRIVEN_SKILL_HOME`; on Windows, the same folder name under your user profile):

- `uploads/` — uploaded videos & extracted frames
- `skills/` — generated skill source files
- `archives/` — reusable video/frame/requirement resources
- `video-driven-skill.db` — SQLite database

With **Docker Compose**, the same layout is stored at `/data` inside the backend container (Compose volume `app-data`), not under `~/video-driven-skill/`. Inspect the host path with `docker volume inspect video-driven-skill_app-data`.
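
For runs outside Docker, the lookup described above can be sketched in a few lines of shell. This mirrors the README text only; the backend's actual resolution logic is not shown here, and `vds_home` and `vds_layout` are hypothetical helper names.

```bash
# Sketch: resolve the runtime data directory as described in this section.
vds_home() {
  # VIDEO_DRIVEN_SKILL_HOME wins when set and non-empty; otherwise use ~/video-driven-skill.
  printf '%s\n' "${VIDEO_DRIVEN_SKILL_HOME:-$HOME/video-driven-skill}"
}

# List the four expected entries and whether each exists yet.
vds_layout() {
  local root entry
  root="$(vds_home)"
  for entry in uploads skills archives video-driven-skill.db; do
    if [ -e "$root/$entry" ]; then
      echo "present  $root/$entry"
    else
      echo "missing  $root/$entry"
    fi
  done
}
```

Calling `vds_layout` on a fresh install should report every entry as missing until the backend has created it.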

### Frontend (React + Vite + Tailwind CSS)

| Component | Responsibility |
|--------------------------------------------------|---------------------------------------|
| `HomePage` | Upload, import, and recent resources |
| `PlaygroundPage` | Frame annotation & skill workspace |
| `FrameTimeline` / `FrameAnnotator` / `FrameList` | Visual evidence collection |
| `AIProcessor` | Generation control & streamed status |
| `SkillList` | Skill repository with drag-to-reorder |
| `SkillEditor` / `SkillExport` / `SkillRunner` | Review, export & execution |
| `RegeneratePanel` / `CodeComparisonView` | Iteration workflow |
| `KnowledgeBasePanel` | Extra context per skill |

### Skill Package Structure

```text
SKILL.md # Skill intent, instructions, and variable docs
package.json # Metadata
variables.json # User-editable runtime inputs
scripts/main.js # Executable entrypoint
knowledge/ # Optional reference files
```
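
When importing or hand-editing a package, a layout check along these lines can flag missing files early. It is a sketch only: `check_skill_package` is a hypothetical helper, and treating all four listed files as required is an assumption, since the listing above marks only `knowledge/` as optional.

```bash
# Sketch: verify that a directory contains the files listed in the package structure.
check_skill_package() {
  local dir="$1" path status=0
  for path in SKILL.md package.json variables.json scripts/main.js; do
    if [ ! -f "$dir/$path" ]; then
      echo "missing: $path"
      status=1
    fi
  done
  # knowledge/ is optional, so its absence is not reported.
  return "$status"
}
```

`check_skill_package path/to/skill` prints one line per missing file and returns a non-zero status if anything is absent.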

For a deeper walkthrough, see [ARCHITECTURE.md](ARCHITECTURE.md).

---

## API Overview

| Method | Path | Purpose |
|--------|---------------------------------------|--------------------------------|
| `POST` | `/api/videos/upload` | Upload a video |
| `POST` | `/api/videos/{id}/frames/auto` | Auto-extract frames |
| `POST` | `/api/videos/{id}/frames/manual` | Manual frame capture |
| `GET` | `/api/videos/{id}/stream` | Stream uploaded video |
| `GET` | `/api/skills` | List all skills |
| `PUT` | `/api/skills/order` | Persist skill ordering |
| `POST` | `/api/skills/generate` | Generate a skill |
| `GET` | `/api/skills/{id}` | Read a skill |
| `PUT` | `/api/skills/{id}/files` | Update skill files |
| `GET` | `/api/skills/{id}/export` | Export skill as ZIP |
| `POST` | `/api/skills/{id}/regenerate` | Generate candidate revision |
| `POST` | `/api/skills/{id}/partial-regenerate` | Regenerate selected code range |
| `POST` | `/api/skills/{id}/accept` | Accept candidate revision |
| `GET` | `/api/skills/{id}/versions` | List skill versions |
| `POST` | `/api/skills/{id}/deploy` | Deploy skill locally |
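
The endpoints in the table can be tried from a shell with a thin `curl` wrapper. The sketch below makes several assumptions: `VDS_BASE_URL` is a variable introduced here for illustration and is not read by the project, the default `http://localhost` assumes the Docker stack described in Quick Start, and request bodies are left to the caller because the table does not document them.

```bash
# Sketch: small curl wrapper for the endpoints in the API table.
# VDS_BASE_URL is hypothetical; it defaults to the Docker address from Quick Start.
vds_url() {
  local base="${VDS_BASE_URL:-http://localhost}"
  # Join base and path, tolerating a trailing slash on the base.
  printf '%s%s\n' "${base%/}" "$1"
}

vds_api() {
  local method="$1" path="$2"
  shift 2
  # -f makes curl exit non-zero on HTTP errors; extra arguments pass through to curl.
  curl -fsS -X "$method" "$(vds_url "$path")" "$@"
}
```

With the stack running, `vds_api GET /api/skills` lists skills, and `vds_api GET "/api/skills/$id/export" -o skill.zip` saves the exported ZIP for a skill whose id is stored in `$id`.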

---

## Security & Privacy

This repository is prepared for open-source use:

- No API keys or credentials are committed.
- Local databases, uploads, archives, generated skills, logs, and build outputs are git-ignored.
- Runtime configuration comes from environment variables or local `.env` files.
- **Do not** upload private recordings, credentials, customer data, or production screenshots to any public instance.

If you discover a security issue, please report it responsibly. See [SECURITY.md](SECURITY.md).

---

## License

This project is licensed under the **MIT License**. See [LICENSE](LICENSE) for details.

---


Built with care by the Video Driven Skill team.