https://github.com/marceloeatworld/yolo26-training

YOLO26 hand pose (21 keypoints) & face detection — trained for touchless kiosk interaction

# YOLO26 Hand Pose & Face Detection Models

Custom-trained [YOLO26](https://docs.ultralytics.com/models/yolo26/) models for real-time hand tracking and face detection. Built for touchless kiosk interaction via dwell-based cursor control.
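
Dwell-based control usually means a click fires when the cursor (here, the index fingertip) stays inside a small zone for a set time. The sketch below illustrates that idea only; `DwellDetector`, the 30-pixel radius, and the 1-second dwell time are hypothetical and not taken from this project.

```python
import math


class DwellDetector:
    """Fires once when the cursor stays within `radius` px for `dwell_s` seconds."""

    def __init__(self, radius=30.0, dwell_s=1.0):
        self.radius = radius      # assumed tolerance in pixels
        self.dwell_s = dwell_s    # assumed dwell time in seconds
        self.anchor = None        # (x, y) where the current dwell started
        self.start = None         # timestamp when the current dwell started
        self.fired = False        # True once this dwell has produced a click

    def update(self, x, y, t):
        """Feed one cursor sample at time `t` (seconds); return True on a click."""
        moved = (
            self.anchor is None
            or math.hypot(x - self.anchor[0], y - self.anchor[1]) > self.radius
        )
        if moved:
            # Cursor left the dwell zone: restart the timer at the new position.
            self.anchor, self.start, self.fired = (x, y), t, False
            return False
        if not self.fired and t - self.start >= self.dwell_s:
            self.fired = True
            return True
        return False
```

Feeding `update` the index-tip position once per frame yields at most one click per dwell; the cursor has to leave the zone before another click can fire.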

## Models

| Model | Task | Keypoints | ONNX (FP16) | ONNX (FP32) | Output Shape |
|---|---|---|---|---|---|
| `yolo26_hand_pose` | Hand detection + pose | 21 | **20.5 MB** | 41.0 MB | `(1, 300, 69)` |
| `yolo26_face` | Face detection | — | **18.3 MB** | 36.4 MB | `(1, 300, 6)` |

Both models use YOLO26's **end-to-end NMS-free** architecture. Boxes are output in **xyxy** format, so no NMS step is needed; the only post-processing is discarding low-confidence rows among the fixed 300 detections.

### Hand Pose Output Format

Each of the 300 detections contains 69 values:

```
[x1, y1, x2, y2, confidence, class_id, kp1_x, kp1_y, kp1_vis, ..., kp21_x, kp21_y, kp21_vis]

index 0-3    bbox (xyxy)
index 4      detection confidence
index 5      class_id, always 0 (single class: hand)
index 6-68   21 keypoints × 3 (x, y, visibility)
```
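
As a rough illustration of this layout, the sketch below unpacks one 69-value row into named fields. `HandDetection`, `parse_hand_row`, and the 0.5 confidence threshold are hypothetical choices, not part of this repository. Note that the diagram further down numbers keypoints from 0, so keypoint `k` starts at index `6 + 3 * k`.

```python
from dataclasses import dataclass


@dataclass
class HandDetection:
    box: tuple          # (x1, y1, x2, y2)
    confidence: float
    keypoints: list     # 21 tuples of (x, y, visibility)


def parse_hand_row(row, conf_threshold=0.5):
    """Turn one 69-value row into a HandDetection, or None if below threshold."""
    if len(row) != 69:
        raise ValueError(f"expected 69 values, got {len(row)}")
    confidence = float(row[4])
    if confidence < conf_threshold:  # threshold is an assumed example value
        return None
    keypoints = [
        (float(row[6 + 3 * k]), float(row[7 + 3 * k]), float(row[8 + 3 * k]))
        for k in range(21)
    ]
    return HandDetection(tuple(float(v) for v in row[:4]), confidence, keypoints)
```

The function accepts any indexable row, for example one row of the `(1, 300, 69)` output array; `det.keypoints[8]` is then the index tip.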

### Face Detection Output Format

Each of the 300 detections contains 6 values:

```
[x1, y1, x2, y2, confidence, class_id]

index 0-3    bbox (xyxy)
index 4      detection confidence
index 5      class_id, always 0 (single class: face)
```
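
Because the output always has 300 rows, a presence check has to filter by confidence first. A minimal sketch, assuming unused rows carry low confidence; `confident_faces` and the 0.5 threshold are hypothetical.

```python
def confident_faces(rows, conf_threshold=0.5):
    """Return (x1, y1, x2, y2, confidence) tuples at or above the threshold."""
    faces = []
    for x1, y1, x2, y2, conf, _class_id in rows:
        if conf >= conf_threshold:  # threshold is an assumed example value
            faces.append((x1, y1, x2, y2, conf))
    return faces
```

With the raw output in `output`, `bool(confident_faces(output[0]))` plays the same role as `has_face` in the Ultralytics example later in this README.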

### 21 Hand Keypoints

```
      8 (index tip) ← used for cursor
      |
      7
      |     12    16    20
      6     |     |     |
      |     11    15    19
      5     |     |     |
4     |     10    14    18
|     |     |     |     |
3     |     9     13    17
|     |     |     |     |
2     |     |     |     |
|     |     |     |     |
1     |     |     |     |
└─────┴─────┴─────┴─────┘
            0 (wrist)
```
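
The same numbering can be written down as data, which is handy for drawing a skeleton overlay. In this sketch the finger names follow the MediaPipe hand-landmark convention, and each finger chain is joined to the wrist as in the diagram; `FINGERS` and `skeleton_edges` are hypothetical names.

```python
# Keypoint indices per finger, following the MediaPipe hand-landmark order.
FINGERS = {
    "thumb": [1, 2, 3, 4],
    "index": [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring": [13, 14, 15, 16],
    "pinky": [17, 18, 19, 20],
}
WRIST = 0
INDEX_TIP = FINGERS["index"][-1]  # keypoint 8, used for the cursor


def skeleton_edges():
    """Return (start, end) keypoint pairs: wrist to finger base, then along each finger."""
    edges = []
    for chain in FINGERS.values():
        previous = WRIST
        for keypoint in chain:
            edges.append((previous, keypoint))
            previous = keypoint
    return edges
```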

## Training Results

### Hand Pose

| | |
|---|---|
| **Base model** | [yolo26s-pose.pt](https://docs.ultralytics.com/models/yolo26/) (Ultralytics, pretrained on COCO-pose) |
| **Architecture** | YOLO26s-pose — 10.6M params, 25.0 GFLOPs |
| **Dataset** | [Ultralytics Hand Keypoints](https://docs.ultralytics.com/datasets/pose/hand-keypoints/) — 18,776 train / 7,847 val images |
| **Keypoints** | 21 per hand (wrist + 4 per finger), generated with Google MediaPipe |
| **GPU** | NVIDIA H200 NVL (143 GB VRAM) on [RunPod](https://runpod.io) |
| **Batch size** | 512 |
| **Epochs** | 100 |
| **Optimizer** | AdamW (lr=0.002, momentum=0.9) |
| **Training time** | **1h 45min** |
| **Training cost** | **~$6** (H200 NVL @ $3.40/hr) |
| **Framework** | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |

**Final metrics (epoch 100/100):**

| Metric | Score |
|---|---|
| **Pose mAP50** | **0.942** |
| **Pose mAP50-95** | **0.843** |
| Box mAP50 | 0.993 |
| Box mAP50-95 | 0.912 |

### Face Detection

| | |
|---|---|
| **Base model** | [yolo26s.pt](https://docs.ultralytics.com/models/yolo26/) (Ultralytics, pretrained on COCO) |
| **Architecture** | YOLO26s — 9.6M params, 20.5 GFLOPs |
| **Dataset** | [WiderFace](http://shuoyang1213.me/WIDERFACE/) — 12,876 train / 3,222 val images (downloaded via [HuggingFace CUHK-CSE mirror](https://huggingface.co/datasets/CUHK-CSE/wider_face)) |
| **GPU** | NVIDIA H200 NVL (143 GB VRAM) on [RunPod](https://runpod.io) |
| **Batch size** | 64 |
| **Epochs** | 50 |
| **Optimizer** | MuSGD (lr=0.01, momentum=0.9) |
| **Training time** | **54min** |
| **Training cost** | **~$3** (H200 NVL @ $3.40/hr) |
| **Framework** | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |

**Final metrics (epoch 50/50):**

| Metric | Score |
|---|---|
| **Box mAP50** | **0.744** |
| **Box mAP50-95** | **0.413** |
| Precision | 0.861 |
| Recall | 0.656 |

> Note: WiderFace scores appear low because the dataset includes extremely small faces (crowds, distant people). For kiosk use (single person close to camera), detection is highly reliable.
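
One way to lean on this in a kiosk is to ignore small background faces and keep only the closest person. The sketch below is an illustration, not part of this repository; `nearest_face` and the 15% minimum height ratio are hypothetical.

```python
def nearest_face(faces, frame_height, min_height_ratio=0.15):
    """Return the tallest face box spanning at least `min_height_ratio` of the frame.

    `faces` is an iterable of (x1, y1, x2, y2, confidence) tuples in pixels.
    Returns None when no face is large enough.
    """
    min_height = min_height_ratio * frame_height  # ratio is an assumed example value
    candidates = [f for f in faces if (f[3] - f[1]) >= min_height]
    return max(candidates, key=lambda f: f[3] - f[1], default=None)
```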

## Total Training Cost

| Model | GPU | Time | Cost |
|---|---|---|---|
| Hand pose (100 epochs) | H200 NVL (143GB) | 1h 45min | ~$6 |
| Face detection (50 epochs) | H200 NVL (143GB) | 54min | ~$3 |
| **Total** | | **2h 39min** | **~$9** |
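
The totals follow from run time multiplied by the hourly rate. A small check, assuming flat per-minute billing at the $3.40/hr rate quoted above and ignoring storage or setup time; `training_cost` is a hypothetical helper.

```python
def training_cost(hours, minutes, usd_per_hour=3.40):
    """Cost in USD of a run billed at a flat hourly rate."""
    return (hours + minutes / 60) * usd_per_hour


hand_pose = training_cost(1, 45)   # 1h 45min
face = training_cost(0, 54)        # 54min
print(f"hand pose: ${hand_pose:.2f}, face: ${face:.2f}, total: ${hand_pose + face:.2f}")
# hand pose: $5.95, face: $3.06, total: $9.01
```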

## Quick Start

### Download Pre-trained Models

```bash
# Clone with Git LFS (install Git LFS first so the model files are fetched)
git lfs install
git clone https://github.com/marceloeatworld/yolo26-training.git
cd yolo26-training

# Models are in models/ (tracked with Git LFS)
ls -lh models/
```

### Export .pt to ONNX

```bash
pip install ultralytics onnx onnxslim

# Exports both FP32 and FP16 versions
bash scripts/export_onnx.sh checkpoints/yolo26_hand_pose.pt
bash scripts/export_onnx.sh checkpoints/yolo26_face.pt
```
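
The contents of `scripts/export_onnx.sh` are not shown here. A roughly equivalent export through the Ultralytics Python API might look like the sketch below; `export_kwargs` and `export_both` are hypothetical helpers, the `_fp32`/`_fp16` suffixes mirror the file names in `models/`, and opset 17 and the 640 input size come from the ONNX Compatibility section.

```python
from pathlib import Path


def export_kwargs(half):
    """Arguments for Ultralytics `model.export`; `half=True` requests FP16 weights."""
    return {"format": "onnx", "imgsz": 640, "opset": 17, "simplify": True, "half": half}


def export_both(checkpoint):
    """Export one checkpoint twice (FP32, then FP16) and return the two file paths."""
    from ultralytics import YOLO  # imported lazily; requires `pip install ultralytics`

    outputs = []
    for half, suffix in ((False, "fp32"), (True, "fp16")):
        exported = Path(YOLO(checkpoint).export(**export_kwargs(half)))
        # Both exports write to the same default path, so rename each one.
        target = exported.with_name(f"{exported.stem}_{suffix}.onnx")
        exported.replace(target)
        outputs.append(target)
    return outputs
```

Ultralytics may fall back to FP32 when `half=True` is requested on a CPU-only machine, so the FP16 export is best run on a GPU.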

### Inference (Python)

```python
from ultralytics import YOLO

# Hand pose
model = YOLO("checkpoints/yolo26_hand_pose.pt")
results = model.predict("image.jpg")

for r in results:
    keypoints = r.keypoints.xy  # [N, 21, 2], pixel coordinates
    boxes = r.boxes.xyxy        # [N, 4], bounding boxes
    confs = r.boxes.conf        # [N], confidence scores

# Face detection
model = YOLO("checkpoints/yolo26_face.pt")
results = model.predict("photo.jpg")
has_face = len(results[0].boxes) > 0
```

### Inference (C# / ONNX Runtime + DirectML)

```csharp
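using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Load model on the DirectML execution provider (device 0)
var options = new SessionOptions();
options.AppendExecutionProvider_DML(0);
using var session = new InferenceSession("yolo26_hand_pose.onnx", options);

// Input (1, 3, 640, 640): NCHW, RGB, normalized to [0, 1]; fill from the camera frame
var input = new DenseTensor<float>(new[] { 1, 3, 640, 640 });
var inputs = new[] { NamedOnnxValue.CreateFromTensor(session.InputMetadata.Keys.First(), input) };

// Run inference → output shape [1, 300, 69]
using var results = session.Run(inputs);
var tensor = results.First().AsTensor<float>();

// Parse each row: [x1, y1, x2, y2, conf, class, kp1_x, kp1_y, kp1_vis, ...]
for (int i = 0; i < 300; i++)
{
    var conf = tensor[0, i, 4];        // confidence
    var indexTipX = tensor[0, i, 30];  // keypoint 8 (index tip) x
    var indexTipY = tensor[0, i, 31];  // keypoint 8 (index tip) y
}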
```

## Train from Scratch

### On RunPod (or any cloud GPU)

```bash
pip install ultralytics onnxruntime-gpu onnx onnxslim

# Hand pose (~1h45 on H200, ~3h on A40)
bash scripts/train_hand_pose.sh

# Face detection (~54min on H200, ~2h on A40)
bash scripts/train_face_detect.sh
```

### With Docker

```bash
docker build -t yolo26-training -f docker/Dockerfile .
docker run --gpus all -v $(pwd)/models:/workspace/output yolo26-training
```

### Recommended GPUs

| GPU | VRAM | Hand Pose | Face Detect | Total Cost |
|---|---|---|---|---|
| **H200 NVL** | 143 GB | 1h 45min | 54min | **~$9** |
| H100 SXM | 80 GB | ~2h 30min | ~1h 15min | ~$10 |
| A100 | 80 GB | ~3h | ~1h 30min | ~$6 |
| A40 | 48 GB | ~4h | ~2h | ~$3 |

### Training Tips

- **Hand pose**: batch=512 works on 48GB+ VRAM. Lower to 128 on 24GB GPUs.
- **Face detect**: batch=64 recommended. Some WiderFace images contain 100+ faces, so larger batches can run out of memory.
- **Early stopping**: Both scripts use `patience` to stop early if metrics plateau.
- **WiderFace download**: Script auto-downloads from [HuggingFace CUHK-CSE mirror](https://huggingface.co/datasets/CUHK-CSE/wider_face) (Google Drive links are unreliable).

## Project Structure

```
yolo26-training/
├── README.md
├── .gitignore
├── .gitattributes                      # Git LFS tracking for .onnx and .pt files
├── models/                             # Exported ONNX models
│   ├── yolo26_hand_pose_fp16.onnx      # FP16 (20.5 MB), recommended
│   ├── yolo26_hand_pose_fp32.onnx      # FP32 (41.0 MB)
│   ├── yolo26_face_fp16.onnx           # FP16 (18.3 MB), recommended
│   └── yolo26_face_fp32.onnx           # FP32 (36.4 MB)
├── checkpoints/                        # PyTorch checkpoints
│   ├── yolo26_hand_pose.pt             # Hand pose (19.3 MB stripped)
│   └── yolo26_face.pt                  # Face detect (19.3 MB stripped)
├── scripts/
│   ├── train_hand_pose.sh              # Hand pose training (auto-downloads dataset)
│   ├── train_face_detect.sh            # Face detection training (auto-downloads WiderFace)
│   └── export_onnx.sh                  # PT → ONNX export (FP32 + FP16)
└── docker/
    └── Dockerfile                      # GPU training container
```

## Technical Details

### Why YOLO26?

| Feature | Benefit |
|---|---|
| **NMS-free (end-to-end)** | No post-processing, consistent latency |
| **RLE keypoints** | Residual Log-Likelihood Estimation for more accurate keypoint localization |
| **DFL removal** | Cleaner ONNX, better DirectML/TensorRT compatibility |
| **Up to 43% faster on CPU** (figure reported by Ultralytics) | Better fallback on weak GPUs |
| **Non-human keypoint support** | Better for hand keypoints (no human body bias) |

### ONNX Compatibility

| | |
|---|---|
| **Opset** | 17 |
| **FP16** | Supported natively by DirectML, TensorRT, CoreML |
| **Input** | `(1, 3, 640, 640)` NCHW, RGB, normalized [0, 1] |
| **Tested on** | ONNX Runtime 1.24 + DirectML (Windows), CPU (Linux) |
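
This README does not say how camera frames are resized to 640×640. Assuming a letterbox resize (aspect-preserving scale plus centered padding, as Ultralytics does at predict time) and output coordinates in input-pixel space, the mapping back to the original frame can be sketched as below; both function names are hypothetical.

```python
def letterbox_params(width, height, size=640):
    """Scale and padding that fit a width x height frame into a size x size square."""
    scale = min(size / width, size / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = (size - new_w) / 2, (size - new_h) / 2
    return scale, pad_x, pad_y


def to_frame_coords(x, y, scale, pad_x, pad_y):
    """Map a point from model-input pixels back to original-frame pixels."""
    return (x - pad_x) / scale, (y - pad_y) / scale
```

For a 1280×720 frame this gives a scale of 0.5 with 140 pixels of padding above and below, so the input-space point (320, 320) maps back to the frame center (640, 360).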

### FP16 vs FP32

| | FP16 | FP32 |
|---|---|---|
| **Size** | ~50% smaller | Full size |
| **Speed** | Faster on GPU (native FP16 compute) | Standard |
| **Accuracy** | <0.1% difference | Baseline |
| **Recommended** | Deployment | Debugging / fine-tuning |

## Datasets

| Dataset | Images | Annotations | Source |
|---|---|---|---|
| **Hand Keypoints** | 26,768 | 21 keypoints/hand (MediaPipe) | [Ultralytics](https://docs.ultralytics.com/datasets/pose/hand-keypoints/) |
| **WiderFace** | 32,203 | 393,703 face bboxes | [CUHK](http://shuoyang1213.me/WIDERFACE/) via [HuggingFace](https://huggingface.co/datasets/CUHK-CSE/wider_face) |

## License

- **Training code**: MIT
- **Base models**: [Ultralytics YOLO26](https://docs.ultralytics.com/models/yolo26/) — AGPL-3.0 (or [Enterprise License](https://www.ultralytics.com/license))
- **Hand Keypoints dataset**: See [Ultralytics docs](https://docs.ultralytics.com/datasets/pose/hand-keypoints/)
- **WiderFace dataset**: See [WiderFace terms](http://shuoyang1213.me/WIDERFACE/)

---

Trained on April 3, 2026 using [Ultralytics](https://ultralytics.com) 8.4.33 + PyTorch 2.8.0 + CUDA 12.8 on [RunPod](https://runpod.io).