https://github.com/marceloeatworld/yolo26-training
YOLO26 hand pose (21 keypoints) & face detection — trained for touchless kiosk interaction
- Host: GitHub
- URL: https://github.com/marceloeatworld/yolo26-training
- Owner: marceloeatworld
- Created: 2026-04-03T17:18:04.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-03T20:33:02.000Z (2 months ago)
- Last Synced: 2026-04-27T13:32:27.749Z (about 2 months ago)
- Topics: computer-vision, directml, face-detection, hand-pose, hand-tracking, keypoint-detection, onnx, photobooth, pose-estimation, ultralytics, yolo, yolo26
- Language: Shell
- Size: 11.7 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# YOLO26 Hand Pose & Face Detection Models
Custom-trained [YOLO26](https://docs.ultralytics.com/models/yolo26/) models for real-time hand tracking and face detection. Built for touchless kiosk interaction via dwell-based cursor control.
## Models
| Model | Task | Keypoints | ONNX (FP16) | ONNX (FP32) | Output Shape |
|---|---|---|---|---|---|
| `yolo26_hand_pose` | Hand detection + pose | 21 | **20.5 MB** | 41.0 MB | `(1, 300, 69)` |
| `yolo26_face` | Face detection | — | **18.3 MB** | 36.4 MB | `(1, 300, 6)` |
Both models use YOLO26's **end-to-end NMS-free** architecture. Each output row is a final detection with its box in **xyxy** format, so no NMS pass is needed; just filter rows by confidence.
### Hand Pose Output Format
Each of the 300 detections contains 69 values:
```
[x1, y1, x2, y2, confidence, class_id, kp0_x, kp0_y, kp0_vis, ..., kp20_x, kp20_y, kp20_vis]

index 0-3    bbox (xyxy)
index 4      detection confidence
index 5      class_id, always 0 (single class: hand)
index 6-68   21 keypoints × 3 (x, y, visibility); keypoint k starts at index 6 + 3k
```
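The layout above can be unpacked with a few NumPy slices. This is a minimal sketch, not code from the repository: the function name `parse_hand_pose` and the 0.5 confidence cutoff are assumptions chosen for illustration, and the input is a synthetic array rather than real model output.

```python
import numpy as np

NUM_KEYPOINTS = 21
CONF_THRESHOLD = 0.5  # hypothetical cutoff, tune for your camera and lighting


def parse_hand_pose(output, conf_threshold=CONF_THRESHOLD):
    """Split a (1, 300, 69) output into boxes, scores, and keypoints."""
    dets = np.asarray(output, dtype=np.float32)[0]          # (300, 69)
    dets = dets[dets[:, 4] >= conf_threshold]               # column 4 is confidence
    boxes = dets[:, 0:4]                                    # x1, y1, x2, y2
    scores = dets[:, 4]
    keypoints = dets[:, 6:].reshape(-1, NUM_KEYPOINTS, 3)   # (N, 21, [x, y, vis])
    return boxes, scores, keypoints


# Synthetic output with one confident detection, for illustration only
fake = np.zeros((1, 300, 69), dtype=np.float32)
fake[0, 0, :6] = [100, 120, 220, 260, 0.9, 0]
fake[0, 0, 30:32] = [150, 130]                              # keypoint 8 (index tip) x, y
boxes, scores, keypoints = parse_hand_pose(fake)
print(boxes.shape, keypoints.shape, keypoints[0, 8, :2])
```

Rows below the threshold are dropped first, so the returned arrays all share the same leading dimension `N`.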
### Face Detection Output Format
Each of the 300 detections contains 6 values:
```
[x1, y1, x2, y2, confidence, class_id]

index 0-3   bbox (xyxy)
index 4     detection confidence
index 5     class_id, always 0 (single class: face)
```
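For a kiosk, one plausible way to use this output is to keep only the largest confident face, on the assumption that it belongs to the person standing closest. The sketch below illustrates that policy; `largest_face` and the 0.5 threshold are hypothetical and not taken from the repository.

```python
import numpy as np


def largest_face(output, conf_threshold=0.5):
    """Return (x1, y1, x2, y2, conf) of the largest confident face, or None."""
    dets = np.asarray(output, dtype=np.float32)[0]           # (300, 6)
    dets = dets[dets[:, 4] >= conf_threshold]                 # drop weak detections
    if len(dets) == 0:
        return None
    areas = (dets[:, 2] - dets[:, 0]) * (dets[:, 3] - dets[:, 1])
    return tuple(float(v) for v in dets[np.argmax(areas), :5])


# Synthetic output: a small face, a large face, and a low-confidence row
fake = np.zeros((1, 300, 6), dtype=np.float32)
fake[0, 0] = [10, 10, 50, 60, 0.8, 0]
fake[0, 1] = [300, 100, 500, 400, 0.7, 0]
fake[0, 2] = [0, 0, 640, 640, 0.1, 0]
face = largest_face(fake)
print(face)
```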
### 21 Hand Keypoints
```
8 (index tip) ← used for cursor
|
7
|      12    16    20
6      |     |     |
|      11    15    19
5      |     |     |
|  4   10    14    18
|  |   |     |     |
|  3   9     13    17
|  |   |     |     |
|  2   |     |     |
|  |   |     |     |
|  1   |     |     |
└──┴───┴─────┴─────┘
      0 (wrist)
```
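The diagram follows the MediaPipe hand layout (wrist at 0, then four points per finger from thumb to pinky). A small lookup table makes the indices easier to use in application code. The names `FINGERS`, `fingertip`, and `flat_offset` are illustrative and do not come from the repository.

```python
WRIST = 0
FINGERS = {
    "thumb": (1, 2, 3, 4),
    "index": (5, 6, 7, 8),
    "middle": (9, 10, 11, 12),
    "ring": (13, 14, 15, 16),
    "pinky": (17, 18, 19, 20),
}
FIRST_KEYPOINT_OFFSET = 6  # values 0-5 of a row are bbox, confidence, class_id


def fingertip(finger):
    """Keypoint index of the tip of the given finger."""
    return FINGERS[finger][-1]


def flat_offset(keypoint):
    """Index of a keypoint's x value inside one 69-value detection row."""
    return FIRST_KEYPOINT_OFFSET + 3 * keypoint


print(fingertip("index"), flat_offset(fingertip("index")))  # → 8 30
```

The result matches the offsets 30 and 31 used for the index tip in the C# example.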
## Training Results
### Hand Pose
| | |
|---|---|
| **Base model** | [yolo26s-pose.pt](https://docs.ultralytics.com/models/yolo26/) (Ultralytics, pretrained on COCO-pose) |
| **Architecture** | YOLO26s-pose — 10.6M params, 25.0 GFLOPs |
| **Dataset** | [Ultralytics Hand Keypoints](https://docs.ultralytics.com/datasets/pose/hand-keypoints/) — 18,776 train / 7,847 val images |
| **Keypoints** | 21 per hand (wrist + 4 per finger), generated with Google MediaPipe |
| **GPU** | NVIDIA H200 NVL (143 GB VRAM) on [RunPod](https://runpod.io) |
| **Batch size** | 512 |
| **Epochs** | 100 |
| **Optimizer** | AdamW (lr=0.002, momentum=0.9) |
| **Training time** | **1h 45min** |
| **Training cost** | **~$6** (H200 NVL @ $3.40/hr) |
| **Framework** | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |
**Final metrics (epoch 100/100):**
| Metric | Score |
|---|---|
| **Pose mAP50** | **0.942** |
| **Pose mAP50-95** | **0.843** |
| Box mAP50 | 0.993 |
| Box mAP50-95 | 0.912 |
### Face Detection
| | |
|---|---|
| **Base model** | [yolo26s.pt](https://docs.ultralytics.com/models/yolo26/) (Ultralytics, pretrained on COCO) |
| **Architecture** | YOLO26s — 9.6M params, 20.5 GFLOPs |
| **Dataset** | [WiderFace](http://shuoyang1213.me/WIDERFACE/) — 12,876 train / 3,222 val images (downloaded via [HuggingFace CUHK-CSE mirror](https://huggingface.co/datasets/CUHK-CSE/wider_face)) |
| **GPU** | NVIDIA H200 NVL (143 GB VRAM) on [RunPod](https://runpod.io) |
| **Batch size** | 64 |
| **Epochs** | 50 |
| **Optimizer** | MuSGD (lr=0.01, momentum=0.9) |
| **Training time** | **54min** |
| **Training cost** | **~$3** (H200 NVL @ $3.40/hr) |
| **Framework** | Ultralytics 8.4.33, PyTorch 2.8.0, CUDA 12.8 |
**Final metrics (epoch 50/50):**
| Metric | Score |
|---|---|
| **Box mAP50** | **0.744** |
| **Box mAP50-95** | **0.413** |
| Precision | 0.861 |
| Recall | 0.656 |
> Note: WiderFace scores appear low because the validation set includes many extremely small faces (crowds, distant people). In the kiosk setting (a single person close to the camera), faces are large and should be much easier to detect.
## Total Training Cost
| Model | GPU | Time | Cost |
|---|---|---|---|
| Hand pose (100 epochs) | H200 NVL (143GB) | 1h 45min | ~$6 |
| Face detection (50 epochs) | H200 NVL (143GB) | 54min | ~$3 |
| **Total** | | **2h 39min** | **~$9** |
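The cost figures follow directly from the run times and the quoted hourly rate. As a quick check of the arithmetic, using only the numbers in the tables above:

```python
RATE_USD_PER_HOUR = 3.40  # H200 NVL rate quoted above

# Run times from the table, converted to hours
runs = {"hand_pose": 1 + 45 / 60, "face": 54 / 60}

costs = {name: hours * RATE_USD_PER_HOUR for name, hours in runs.items()}
total_hours = sum(runs.values())
total_cost = sum(costs.values())

for name, cost in costs.items():
    print(f"{name}: {runs[name]:.2f} h, ${cost:.2f}")
print(f"total: {total_hours:.2f} h, ${total_cost:.2f}")
```

Billing granularity and storage charges are not modelled here, so treat the result as a lower bound on the actual invoice.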
## Quick Start
### Download Pre-trained Models
```bash
# Clone with Git LFS (run `git lfs install` once beforehand)
git clone https://github.com/marceloeatworld/yolo26-training.git
cd yolo26-training
# Models are in models/ (tracked with Git LFS)
ls -lh models/
```
### Export .pt to ONNX
```bash
pip install ultralytics onnx onnxslim
# Exports both FP32 and FP16 versions
bash scripts/export_onnx.sh checkpoints/yolo26_hand_pose.pt
bash scripts/export_onnx.sh checkpoints/yolo26_face.pt
```
### Inference (Python)
```python
from ultralytics import YOLO
# Hand pose
model = YOLO("checkpoints/yolo26_hand_pose.pt")
results = model.predict("image.jpg")
for r in results:
    keypoints = r.keypoints.xy  # [N, 21, 2] pixel coordinates
    boxes = r.boxes.xyxy        # [N, 4] bounding boxes
    confs = r.boxes.conf        # [N] confidence scores
# Face detection
model = YOLO("checkpoints/yolo26_face.pt")
results = model.predict("photo.jpg")
has_face = len(results[0].boxes) > 0
```
### Inference (C# / ONNX Runtime + DirectML)
```csharp
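using System.Linq;
using Microsoft.ML.OnnxRuntime;

// Load model on the DirectML execution provider (needs Microsoft.ML.OnnxRuntime.DirectML)
var options = new SessionOptions();
options.AppendExecutionProvider_DML(0);
using var session = new InferenceSession("yolo26_hand_pose.onnx", options);
// Run inference → output shape [1, 300, 69]; `inputs` holds the preprocessed image tensor
using var results = session.Run(inputs);
var tensor = results.First().AsTensor<float>(); // assumes float32 output; an FP16 export may need Float16
// Parse each row i: [x1, y1, x2, y2, conf, class, then 21 × (x, y, vis)]
for (int i = 0; i < 300; i++)
{
    var conf = tensor[0, i, 4];        // confidence
    var indexTipX = tensor[0, i, 30];  // keypoint 8 (index tip) x
    var indexTipY = tensor[0, i, 31];  // keypoint 8 (index tip) y
}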
```
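The models were built for dwell-based cursor control, with the index tip (keypoint 8) driving the cursor. The repository does not include the kiosk application, so the following is only a sketch of how dwell detection could work: the `DwellClicker` class, the 30 px radius, and the 1 second dwell time are all assumptions.

```python
import math


class DwellClicker:
    """Fire a click when the cursor stays within `radius` px for `dwell_s` seconds."""

    def __init__(self, radius=30.0, dwell_s=1.0):
        self.radius = radius
        self.dwell_s = dwell_s
        self.anchor = None    # position where the current dwell started
        self.start_t = None   # time when the current dwell started

    def update(self, x, y, t):
        """Feed one cursor sample; returns True on the sample where a click fires."""
        if self.anchor is None or math.dist(self.anchor, (x, y)) > self.radius:
            self.anchor, self.start_t = (x, y), t    # moved away: restart the dwell
            return False
        if t - self.start_t >= self.dwell_s:
            self.anchor, self.start_t = None, None   # reset so one dwell gives one click
            return True
        return False


clicker = DwellClicker()
samples = [(100, 100, 0.0), (105, 102, 0.5), (103, 99, 1.0), (400, 400, 1.2)]
clicks = [clicker.update(x, y, t) for x, y, t in samples]
print(clicks)
```

In practice the fingertip position usually needs smoothing before it reaches the dwell logic, since per-frame keypoint jitter can otherwise keep restarting the timer.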
## Train from Scratch
### On RunPod (or any cloud GPU)
```bash
pip install ultralytics onnxruntime-gpu onnx onnxslim
# Hand pose (~1h45 on H200, ~3h on A40)
bash scripts/train_hand_pose.sh
# Face detection (~54min on H200, ~2h on A40)
bash scripts/train_face_detect.sh
```
### With Docker
```bash
docker build -t yolo26-training -f docker/Dockerfile .
docker run --gpus all -v "$(pwd)/models:/workspace/output" yolo26-training
```
### Recommended GPUs
| GPU | VRAM | Hand Pose | Face Detect | Total Cost |
|---|---|---|---|---|
| **H200 NVL** | 143 GB | 1h 45min | 54min | **~$9** |
| H100 SXM | 80 GB | ~2h 30min | ~1h 15min | ~$10 |
| A100 | 80 GB | ~3h | ~1h 30min | ~$6 |
| A40 | 48 GB | ~4h | ~2h | ~$3 |
### Training Tips
- **Hand pose**: batch=512 works on 48GB+ VRAM. Lower to 128 on 24GB GPUs.
- **Face detect**: batch=64 recommended. Some WiderFace images contain 100+ faces, so larger batches can run out of memory (OOM).
- **Early stopping**: Both scripts use `patience` to stop early if metrics plateau.
- **WiderFace download**: Script auto-downloads from [HuggingFace CUHK-CSE mirror](https://huggingface.co/datasets/CUHK-CSE/wider_face) (Google Drive links are unreliable).
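To make the early-stopping tip concrete, here is a simplified sketch of patience-based stopping. It is not the Ultralytics implementation (which tracks its own fitness score); the function `stopped_epoch` and the sample metric values are made up for illustration.

```python
def stopped_epoch(metric_per_epoch, patience):
    """Return the 1-based epoch where training would stop, or None if it runs to the end."""
    best = float("-inf")
    best_epoch = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:
            best, best_epoch = value, epoch       # new best: remember when it happened
        elif epoch - best_epoch >= patience:
            return epoch                          # no improvement for `patience` epochs
    return None


history = [0.50, 0.60, 0.61, 0.60, 0.61, 0.60]    # hypothetical validation mAP values
print(stopped_epoch(history, patience=3))          # → 6
print(stopped_epoch(history, patience=10))         # → None
```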
## Project Structure
```
yolo26-training/
├── README.md
├── .gitignore
├── .gitattributes # Git LFS tracking for .onnx and .pt files
├── models/ # Exported ONNX models
│ ├── yolo26_hand_pose_fp16.onnx # FP16 (20.5 MB) — recommended
│ ├── yolo26_hand_pose_fp32.onnx # FP32 (41.0 MB)
│ ├── yolo26_face_fp16.onnx # FP16 (18.3 MB) — recommended
│ └── yolo26_face_fp32.onnx # FP32 (36.4 MB)
├── checkpoints/ # PyTorch checkpoints
│ ├── yolo26_hand_pose.pt # Hand pose (19.3 MB stripped)
│ └── yolo26_face.pt # Face detect (19.3 MB stripped)
├── scripts/
│ ├── train_hand_pose.sh # Hand pose training (auto-downloads dataset)
│ ├── train_face_detect.sh # Face detection training (auto-downloads WiderFace)
│ └── export_onnx.sh # PT → ONNX export (FP32 + FP16)
└── docker/
└── Dockerfile # GPU training container
```
## Technical Details
### Why YOLO26?
| Feature | Benefit |
|---|---|
| **NMS-free (end-to-end)** | No post-processing, consistent latency |
| **RLE keypoints** | Residual Log-Likelihood Estimation gives more accurate keypoint localization |
| **DFL removal** | Cleaner ONNX, better DirectML/TensorRT compatibility |
| **Up to 43% faster on CPU** | Better fallback on machines with weak GPUs |
| **Non-human keypoint support** | Better for hand keypoints (no human body bias) |
### ONNX Compatibility
| | |
|---|---|
| **Opset** | 17 |
| **FP16** | Supported natively by DirectML, TensorRT, CoreML |
| **Input** | `(1, 3, 640, 640)` NCHW, RGB, normalized [0, 1] |
| **Tested on** | ONNX Runtime 1.24 + DirectML (Windows), CPU (Linux) |
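The input row above implies a preprocessing step before calling the ONNX model directly. The sketch below builds a `(1, 3, 640, 640)` float32 tensor from an RGB image using only NumPy. Two points are assumptions rather than facts from this README: that the image is letterboxed with gray padding (value 114, as Ultralytics does by default) instead of stretched, and that output coordinates are expressed in the 640×640 input space. The helper names `preprocess` and `to_original` are hypothetical.

```python
import numpy as np


def preprocess(image_rgb, size=640, pad_value=114):
    """Letterbox an HxWx3 uint8 RGB image into a (1, 3, size, size) float32 tensor."""
    h, w = image_rgb.shape[:2]
    scale = min(size / h, size / w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Nearest-neighbour resize via index lookup (swap in cv2.resize for better quality)
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image_rgb[rows][:, cols]
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
    canvas[top:top + new_h, left:left + new_w] = resized
    tensor = canvas.astype(np.float32) / 255.0           # normalize to [0, 1]
    tensor = tensor.transpose(2, 0, 1)[np.newaxis]       # HWC -> NCHW
    return np.ascontiguousarray(tensor), scale, (left, top)


def to_original(x, y, scale, pad):
    """Map a point from model input space back to original image pixels."""
    return (x - pad[0]) / scale, (y - pad[1]) / scale


frame = np.zeros((480, 640, 3), dtype=np.uint8)           # stand-in for a camera frame
tensor, scale, pad = preprocess(frame)
print(tensor.shape, tensor.dtype, scale, pad)
print(to_original(320, 320, scale, pad))
```

Keeping `scale` and `pad` around lets the application convert boxes and keypoints back to camera or screen coordinates after inference.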
### FP16 vs FP32
| | FP16 | FP32 |
|---|---|---|
| **Size** | ~50% smaller | Full size |
| **Speed** | Faster on GPU (native FP16 compute) | Standard |
| **Accuracy** | <0.1% difference | Baseline |
| **Recommended** | Deployment | Debugging / fine-tuning |
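The size row can be illustrated with NumPy alone: casting float32 values to float16 halves the storage and introduces a small rounding error. This sketch uses random numbers as a stand-in for model weights, so it shows the numeric format only and does not reproduce the accuracy figure in the table.

```python
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal(1_000_000).astype(np.float32)  # stand-in for weights
weights_fp16 = weights_fp32.astype(np.float16)

size_ratio = weights_fp16.nbytes / weights_fp32.nbytes

# Relative rounding error, ignoring values too close to zero to be meaningful
mask = np.abs(weights_fp32) > 1e-3
rel_error = np.abs(weights_fp16.astype(np.float32)[mask] - weights_fp32[mask]) / np.abs(weights_fp32[mask])
max_rel_error = float(rel_error.max())

print(f"size ratio: {size_ratio:.2f}, max relative rounding error: {max_rel_error:.2e}")
```

float16 keeps 10 mantissa bits, so the per-value relative rounding error in the normal range stays below about 2^-11.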
## Datasets
| Dataset | Images | Annotations | Source |
|---|---|---|---|
| **Hand Keypoints** | 26,768 | 21 keypoints/hand (MediaPipe) | [Ultralytics](https://docs.ultralytics.com/datasets/pose/hand-keypoints/) |
| **WiderFace** | 32,203 | 393,703 face bboxes | [CUHK](http://shuoyang1213.me/WIDERFACE/) via [HuggingFace](https://huggingface.co/datasets/CUHK-CSE/wider_face) |
## License
- **Training code**: MIT
- **Base models**: [Ultralytics YOLO26](https://docs.ultralytics.com/models/yolo26/) — AGPL-3.0 (or [Enterprise License](https://www.ultralytics.com/license))
- **Hand Keypoints dataset**: See [Ultralytics docs](https://docs.ultralytics.com/datasets/pose/hand-keypoints/)
- **WiderFace dataset**: See [WiderFace terms](http://shuoyang1213.me/WIDERFACE/)
---
Trained on April 3, 2026 using [Ultralytics](https://ultralytics.com) 8.4.33 + PyTorch 2.8.0 + CUDA 12.8 on [RunPod](https://runpod.io).