https://github.com/marceloeatworld/yolo26-training

YOLO26 hand pose (21 keypoints) & face detection — trained for touchless kiosk interaction
https://github.com/marceloeatworld/yolo26-training

computer-vision directml face-detection hand-pose hand-tracking keypoint-detection onnx photobooth pose-estimation ultralytics yolo yolo26

Last synced: 12 days ago
JSON representation

YOLO26 hand pose (21 keypoints) & face detection — trained for touchless kiosk interaction

Host: GitHub
URL: https://github.com/marceloeatworld/yolo26-training
Owner: marceloeatworld
Created: 2026-04-03T17:18:04.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-04-03T20:33:02.000Z (2 months ago)
Last Synced: 2026-04-27T13:32:27.749Z (about 2 months ago)
Topics: computer-vision, directml, face-detection, hand-pose, hand-tracking, keypoint-detection, onnx, photobooth, pose-estimation, ultralytics, yolo, yolo26
Language: Shell
Size: 11.7 KB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # YOLO26 Hand Pose & Face Detection Models

Custom-trained [YOLO26](https://docs.ultralytics.com/models/yolo26/) models for real-time hand tracking and face detection. Built for touchless kiosk interaction via dwell-based cursor control.

## Models

| Model | Task | Keypoints | ONNX (FP16) | ONNX (FP32) | Output Shape |

|---|---|---|---|---|---|

| `yolo26_hand_pose` | Hand detection + pose | 21 | **20.5 MB** | 41.0 MB | `(1, 300, 69)` |

| `yolo26_face` | Face detection | — | **18.3 MB** | 36.4 MB | `(1, 300, 6)` |

Both models use YOLO26's **end-to-end NMS-free** architecture. Output is in **xyxy** format — no post-processing needed.

### Hand Pose Output Format

Each of the 300 detections contains 69 values:

```

[x1, y1, x2, y2, confidence, class_id, kp1_x, kp1_y, kp1_vis, ..., kp21_x, kp21_y, kp21_vis]

 ├─ bbox (xyxy) ─┘    │          │       └──────── 21 keypoints × 3 ────────────────────────┘

                       │          └─ always 0 (single class: hand)

                       └─ detection confidence

```

### Face Detection Output Format

Each of the 300 detections contains 6 values:

```

[x1, y1, x2, y2, confidence, class_id]

 ├─ bbox (xyxy) ─┘    │          └─ always 0 (single class: face)

                       └─ detection confidence

```

### 21 Hand Keypoints

```

    8 (index tip) ← used for cursor

    |

    7

    |      12    16    20

    6      |     |     |

    |      11    15    19

    5      |     |     |

    |  4   10    14    18

    |  |   |     |     |

    |  3   9     13    17

    |  |   |     |     |

    |  2   |     |     |

    |  |   |     |     |

    |  1   |     |     |

    └──┴───┴─────┴─────┘

              0 (wrist)

```

## Training Results

### Hand Pose

| | |

|---|---|

| **Base model** | [yolo26s-pose.pt](https://docs.ultralytics.com/models/yolo26/) (Ultralytics, pretrained on COCO-pose) |

| **Architecture** | YOLO26s-pose — 10.6M params, 25.0 GFLOPs |

| **Dataset** | [Ultralytics Hand Keypoints](https://docs.ultralytics.com/datasets/pose/hand-keypoints/) — 18,776 train / 7,847 val images |

| **Keypoints** | 21 per hand (wrist + 4 per finger), generated with Google MediaPipe |

| **GPU** | NVIDIA H200 NVL (143 GB VRAM) on [RunPod](https://runpod.io) |

| **Batch size** | 512 |

| **Epochs** | 100 |

| **Optimizer** | AdamW (lr=0.002, momentum=0.9) |

| **Training time** | **1h 45min** |

| **Training cost** | **~$6** (H200 NVL @ $3.40/hr) |

| **Framework** | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |

**Final metrics (epoch 100/100):**

| Metric | Score |

|---|---|

| **Pose mAP50** | **0.942** |

| **Pose mAP50-95** | **0.843** |

| Box mAP50 | 0.993 |

| Box mAP50-95 | 0.912 |

### Face Detection

| | |

|---|---|

| **Base model** | [yolo26s.pt](https://docs.ultralytics.com/models/yolo26/) (Ultralytics, pretrained on COCO) |

| **Architecture** | YOLO26s — 9.6M params, 20.5 GFLOPs |

| **Dataset** | [WiderFace](http://shuoyang1213.me/WIDERFACE/) — 12,876 train / 3,222 val images (downloaded via [HuggingFace CUHK-CSE mirror](https://huggingface.co/datasets/CUHK-CSE/wider_face)) |

| **GPU** | NVIDIA H200 NVL (143 GB VRAM) on [RunPod](https://runpod.io) |

| **Batch size** | 64 |

| **Epochs** | 50 |

| **Optimizer** | MuSGD (lr=0.01, momentum=0.9) |

| **Training time** | **54min** |

| **Training cost** | **~$3** (H200 NVL @ $3.40/hr) |

| **Framework** | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |

**Final metrics (epoch 50/50):**

| Metric | Score |

|---|---|

| **Box mAP50** | **0.744** |

| **Box mAP50-95** | **0.413** |

| Precision | 0.861 |

| Recall | 0.656 |

> Note: WiderFace scores appear low because the dataset includes extremely small faces (crowds, distant people). For kiosk use (single person close to camera), detection is highly reliable.

## Total Training Cost

| Model | GPU | Time | Cost |

|---|---|---|---|

| Hand pose (100 epochs) | H200 NVL (143GB) | 1h 45min | ~$6 |

| Face detection (50 epochs) | H200 NVL (143GB) | 54min | ~$3 |

| **Total** | | **2h 39min** | **~$9** |

## Quick Start

### Download Pre-trained Models

```bash

# Clone with Git LFS

git clone https://github.com/YOUR_USERNAME/yolo26-training.git

cd yolo26-training

# Models are in models/ (tracked with Git LFS)

ls -lh models/

```

### Export .pt to ONNX

```bash

pip install ultralytics onnx onnxslim

# Exports both FP32 and FP16 versions

bash scripts/export_onnx.sh checkpoints/yolo26_hand_pose.pt

bash scripts/export_onnx.sh checkpoints/yolo26_face.pt

```

### Inference (Python)

```python

from ultralytics import YOLO

# Hand pose

model = YOLO("checkpoints/yolo26_hand_pose.pt")

results = model.predict("image.jpg")

for r in results:

    keypoints = r.keypoints.xy   # [N, 21, 2] — pixel coordinates

    boxes = r.boxes.xyxy         # [N, 4] — bounding boxes

    confs = r.boxes.conf         # [N] — confidence scores

# Face detection

model = YOLO("checkpoints/yolo26_face.pt")

results = model.predict("photo.jpg")

has_face = len(results[0].boxes) > 0

```

### Inference (C# / ONNX Runtime + DirectML)

```csharp

// Load model

var session = new InferenceSession("yolo26_hand_pose.onnx", options);

// Run inference → output shape [1, 300, 69]

using var results = session.Run(inputs);

var tensor = results.First().AsTensor();

// Parse: [x1, y1, x2, y2, conf, class, kp1_x, kp1_y, kp1_vis, ...]

var conf = tensor[0, i, 4];        // confidence

var indexTipX = tensor[0, i, 30];   // keypoint 8 (index tip) x

var indexTipY = tensor[0, i, 31];   // keypoint 8 (index tip) y

```

## Train from Scratch

### On RunPod (or any cloud GPU)

```bash

pip install ultralytics onnxruntime-gpu onnx onnxslim

# Hand pose (~1h45 on H200, ~3h on A40)

bash scripts/train_hand_pose.sh

# Face detection (~54min on H200, ~2h on A40)

bash scripts/train_face_detect.sh

```

### With Docker

```bash

docker build -t yolo26-training -f docker/Dockerfile .

docker run --gpus all -v $(pwd)/models:/workspace/output yolo26-training

```

### Recommended GPUs

| GPU | VRAM | Hand Pose | Face Detect | Total Cost |

|---|---|---|---|---|

| **H200 NVL** | 143 GB | 1h 45min | 54min | **~$9** |

| H100 SXM | 80 GB | ~2h 30min | ~1h 15min | ~$10 |

| A100 | 80 GB | ~3h | ~1h 30min | ~$6 |

| A40 | 48 GB | ~4h | ~2h | ~$3 |

### Training Tips

- **Hand pose**: batch=512 works on 48GB+ VRAM. Lower to 128 on 24GB GPUs.

- **Face detect**: batch=64 recommended. WiderFace has 100+ faces/image — higher batch causes OOM.

- **Early stopping**: Both scripts use `patience` to stop early if metrics plateau.

- **WiderFace download**: Script auto-downloads from [HuggingFace CUHK-CSE mirror](https://huggingface.co/datasets/CUHK-CSE/wider_face) (Google Drive links are unreliable).

## Project Structure

```

yolo26-training/

├── README.md

├── .gitignore

├── .gitattributes             # Git LFS tracking for .onnx and .pt files

├── models/                    # Exported ONNX models

│   ├── yolo26_hand_pose_fp16.onnx   # FP16 (20.5 MB) — recommended

│   ├── yolo26_hand_pose_fp32.onnx   # FP32 (41.0 MB)

│   ├── yolo26_face_fp16.onnx        # FP16 (18.3 MB) — recommended

│   └── yolo26_face_fp32.onnx        # FP32 (36.4 MB)

├── checkpoints/               # PyTorch checkpoints

│   ├── yolo26_hand_pose.pt          # Hand pose (19.3 MB stripped)

│   └── yolo26_face.pt               # Face detect (19.3 MB stripped)

├── scripts/

│   ├── train_hand_pose.sh     # Hand pose training (auto-downloads dataset)

│   ├── train_face_detect.sh   # Face detection training (auto-downloads WiderFace)

│   └── export_onnx.sh         # PT → ONNX export (FP32 + FP16)

└── docker/

    └── Dockerfile             # GPU training container

```

## Technical Details

### Why YOLO26?

| Feature | Benefit |

|---|---|

| **NMS-free (end-to-end)** | No post-processing, consistent latency |

| **RLE keypoints** | More accurate keypoint localization |

| **DFL removal** | Cleaner ONNX, better DirectML/TensorRT compatibility |

| **43% faster on CPU** | Better fallback on weak GPUs |

| **Non-human keypoint support** | Better for hand keypoints (no human body bias) |

### ONNX Compatibility

| | |

|---|---|

| **Opset** | 17 |

| **FP16** | Supported natively by DirectML, TensorRT, CoreML |

| **Input** | `(1, 3, 640, 640)` NCHW, RGB, normalized [0, 1] |

| **Tested on** | ONNX Runtime 1.24 + DirectML (Windows), CPU (Linux) |

### FP16 vs FP32

| | FP16 | FP32 |

|---|---|---|

| **Size** | ~50% smaller | Full size |

| **Speed** | Faster on GPU (native FP16 compute) | Standard |

| **Accuracy** | <0.1% difference | Baseline |

| **Recommended** | Deployment | Debugging / fine-tuning |

## Datasets

| Dataset | Images | Annotations | Source |

|---|---|---|---|

| **Hand Keypoints** | 26,768 | 21 keypoints/hand (MediaPipe) | [Ultralytics](https://docs.ultralytics.com/datasets/pose/hand-keypoints/) |

| **WiderFace** | 32,203 | 393,703 face bboxes | [CUHK](http://shuoyang1213.me/WIDERFACE/) via [HuggingFace](https://huggingface.co/datasets/CUHK-CSE/wider_face) |

## License

- **Training code**: MIT

- **Base models**: [Ultralytics YOLO26](https://docs.ultralytics.com/models/yolo26/) — AGPL-3.0 (or [Enterprise License](https://www.ultralytics.com/license))

- **Hand Keypoints dataset**: See [Ultralytics docs](https://docs.ultralytics.com/datasets/pose/hand-keypoints/)

- **WiderFace dataset**: See [WiderFace terms](http://shuoyang1213.me/WIDERFACE/)

---

Trained on April 3, 2026 using [Ultralytics](https://ultralytics.com) 8.4.33 + PyTorch 2.8.0 + CUDA 12.8 on [RunPod](https://runpod.io).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/marceloeatworld/yolo26-training

Awesome Lists containing this project

README