https://github.com/nasim-raj-laskar/sagemaker-lane-segmentation

U-Net lane segmentation pipeline with SageMaker training, MLflow tracking, and registry-gated deployment.

amazon-sagemaker

Host: GitHub
URL: https://github.com/nasim-raj-laskar/sagemaker-lane-segmentation
Owner: nasim-raj-laskar
Created: 2026-03-10T19:02:13.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-04-30T18:25:56.000Z (about 2 months ago)
Last Synced: 2026-04-30T20:13:39.441Z (about 2 months ago)
Topics: amazon-sagemaker
Language: Python
Homepage:
Size: 23.9 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

Lane Segmentation MLOps Pipeline on Amazon SageMaker

Pixel-level binary semantic segmentation of lane boundaries using a fully-convolutional U-Net encoder-decoder trained on 289 annotated road images. The pipeline integrates SageMaker Training Jobs, SageMaker Model Registry with threshold-gated approval, MLflow experiment tracking, and a Streamlit inference frontend backed by TFSMLayer-wrapped SavedModel artifacts.

Architecture

Model

Symmetric encoder-decoder (U-Net) with lateral skip connections between mirrored resolution stages. Skip connections concatenate encoder feature maps directly into the decoder path, preserving high-frequency spatial detail lost during max-pooling downsampling.

| Component | Specification |
|---|---|
| Input tensor | `(N, 256, 832, 3)` — float32, normalized to `[0, 1]` |
| Encoder depth | 4 stages — filter progression `[64, 128, 256, 512]` |
| Bottleneck | 1024 filters, no spatial downsampling |
| Decoder depth | 4 stages — filter progression `[512, 256, 128, 64]` |
| Upsampling | Bilinear interpolation (`unpool='bilinear'`) |
| Output | `(N, 256, 832, 1)` — sigmoid activation, binary mask |
| Loss | Dice loss (one minus the Sørensen–Dice coefficient): `L = 1 - (2·\|X∩Y\| + ε) / (\|X\| + \|Y\| + ε)` |
| Optimizer | Adam, `lr=1e-4`, default β₁=0.9, β₂=0.999 |
| Metrics | Binary accuracy, Mean IoU (`num_classes=2`) |
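
The resolution schedule implied by the table can be checked with a few lines of arithmetic. The sketch below assumes a 2×2 max-pool after each of the four encoder stages (the usual U-Net layout; the pool size is not stated here), and the helper name `encoder_shapes` is purely illustrative.

```python
def encoder_shapes(height=256, width=832, filters=(64, 128, 256, 512), bottleneck=1024):
    """Return (height, width, channels) for each encoder stage plus the bottleneck.

    Assumes every encoder stage is followed by a 2x2 max-pool that halves both
    spatial dimensions, and that the bottleneck itself does not downsample.
    """
    shapes = []
    for f in filters:
        shapes.append((height, width, f))  # feature map kept for the skip connection
        if height % 2 or width % 2:
            raise ValueError("spatial dims must stay even to mirror cleanly in the decoder")
        height, width = height // 2, width // 2  # effect of the assumed 2x2 max-pool
    shapes.append((height, width, bottleneck))
    return shapes


if __name__ == "__main__":
    for shape in encoder_shapes():
        print(shape)
```

Under that assumption the bottleneck works on a 16×52 grid. Both 256 and 832 are divisible by 16, so the four skip connections line up without cropping or padding.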

Dice loss is preferred over binary cross-entropy here because of severe foreground/background class imbalance: lane pixels make up a small fraction of the image area, so BCE can be driven toward a degenerate all-background solution that still scores high pixel accuracy.
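
To make the imbalance argument concrete, here is a minimal NumPy sketch of the Dice loss from the table next to plain pixel accuracy. It is not the repository's training code (which runs in TensorFlow); the function names and the toy 100-pixel mask are assumptions chosen for illustration.

```python
import numpy as np


def dice_loss(y_true, y_pred, eps=1e-6):
    """Soft Dice loss: 1 - (2*sum(x*y) + eps) / (sum(x) + sum(y) + eps)."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    intersection = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + eps) / (y_true.sum() + y_pred.sum() + eps)


def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of pixels whose thresholded prediction matches the mask."""
    y_true = np.asarray(y_true).ravel() > 0.5
    y_hat = np.asarray(y_pred).ravel() > threshold
    return float(np.mean(y_true == y_hat))


# Toy mask: 2 lane pixels out of 100 (2% foreground).
mask = np.zeros(100)
mask[:2] = 1.0
all_background = np.zeros(100)

print(binary_accuracy(mask, all_background))  # 0.98, despite finding no lane pixels
print(dice_loss(mask, all_background))        # close to 1.0, the worst case
print(dice_loss(mask, mask))                  # 0.0 for a perfect prediction
```

An all-background prediction looks good by accuracy but is penalised almost maximally by Dice, which is the behaviour the paragraph above relies on.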

Infrastructure

```yaml
Compute:
  instance_type: ml.g4dn.xlarge   # 4 vCPU, 16 GiB RAM, 1x NVIDIA T4 (16 GiB VRAM)
  framework_version: "2.11.0"     # TF 2.11 ships the Keras 2 API (Keras 3 became the default in TF 2.16)
  container: AWS Deep Learning Container (763104351884.dkr.ecr..amazonaws.com)

Storage:
  training_data: s3:///raw-data/            # 289 RGB images + 289 binary masks
  model_artifacts: s3:///model-artifacts/   # versioned tar.gz SavedModel archives
  experiment_logs: SageMaker MLflow Tracking Server

Orchestration:
  training: SageMaker Training Jobs (managed spot optional)
  registry: SageMaker Model Registry (ModelPackageGroup: lane-segmentation-models)
  tracking: SageMaker MLflow Apps (OIDC-authenticated tracking server)
```

Repository Structure

```
lane-segmentation-pipeline/
├── src/
│   ├── train.py              # Training loop, checkpointing, S3 artifact upload, registry registration
│   ├── model.py              # U-Net graph construction via keras-unet-collection
│   ├── data_loader.py        # tf.data pipeline: decode → resize → normalize → augment → batch
│   ├── mlflow_config.py      # MLflow client init, run context manager, param/metric logging
│   ├── model_registry.py     # SageMaker boto3 calls: create_model_package, list_model_packages
│   └── requirements.txt
├── config/
│   ├── model.yaml            # Hyperparameters, data config, approval threshold
│   └── train.yaml            # SageMaker instance config, S3 paths, job name prefix
├── dataset/
│   ├── image/                # 289 × RGB road frames (variable resolution, resized to 256×832)
│   └── mask/                 # 289 × binary lane masks (uint8, values ∈ {0, 255})
├── assets/
│   ├── ui.png
│   └── output.mp4
├── models/                   # Local SavedModel cache (populated by app.py on first load)
├── app.py                    # Streamlit frontend: inference, registry status, job launcher
├── main.py                   # SageMaker Estimator configuration and .fit() invocation
├── model_registry_utils.py   # CLI wrapper: list packages, patch approval status
└── MODEL_REGISTRY.md
```

Environment Setup

**Requirements:** AWS account with `sagemaker:*`, `s3:*`, `iam:PassRole` permissions; Python ≥ 3.9; AWS CLI v2.

```bash
git clone https://github.com/nasim-raj-laskar/sagemaker-lane-segmentation.git
cd sagemaker-lane-segmentation/
pip install -r src/requirements.txt
```

Create a `.env` file:

```bash
AWS_ACCOUNT_ID=
AWS_REGION=
S3_BUCKET=
SAGEMAKER_ROLE=SageMakerExecutionRole
MLFLOW_ARN=arn:aws:sagemaker:::mlflow-tracking-server/
```

```bash
# Sync raw dataset to S3 input channel
aws s3 sync dataset/ s3:///raw-data/
```

Training

Launch SageMaker Training Job

```bash
python main.py
```

This instantiates a `sagemaker.tensorflow.TensorFlow` estimator targeting `ml.g4dn.xlarge`, passes the values from `config/model.yaml` through the estimator's `hyperparameters` argument (SageMaker forwards them to the training script as command-line arguments), and calls `.fit()` with the S3 data channel. Training artifacts are written to `/opt/ml/model/` inside the container and uploaded to S3 automatically when the job completes.
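
On the container side, a script-mode entry point typically receives each hyperparameter as a `--name value` pair and reads container paths from environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_<NAME>`. The sketch below shows how `src/train.py` might parse them. The function name `parse_args`, the argument names `--sm-model-dir` and `--train`, and the channel name `training` are assumptions, not taken from the repository.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker script-mode entry point receives them."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as "--name value" pairs on the command line.
    parser.add_argument("--epochs", type=int, default=15)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--accuracy_threshold", type=float, default=0.85)
    # Container paths are exposed through environment variables.
    parser.add_argument("--sm-model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAINING",
                                               "/opt/ml/input/data/training"))
    # Tolerate extra arguments the platform may append (for example --model_dir).
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    print(parse_args())
```

Using `parse_known_args` keeps the script from failing when the estimator adds arguments the script does not declare.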

Hyperparameter Reference (config/model.yaml)

```yaml
epochs: 15
batch_size: 4 # constrained by T4 VRAM at 256×832 resolution
learning_rate: 0.0001
accuracy_threshold: 0.85 # minimum val_binary_accuracy for auto-approval
img_height: 256
img_width: 832
normalization_factor: 255.0
mask_threshold: 255 # binarization cutoff for mask preprocessing
test_size: 0.2
random_state: 42
s3_bucket:
s3_model_prefix: model-artifacts/lane_segmentation_model
timestamp_format: '%Y%m%d_%H%M%S'
```
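
As a rough illustration of how `normalization_factor`, `mask_threshold`, and `test_size` could be applied, here is a NumPy sketch. The repository's loader is a `tf.data` pipeline, so this is only an approximation of the arithmetic: the `>=` comparison, the rounding rule for the split, and all function names are assumptions.

```python
import math

import numpy as np

NORMALIZATION_FACTOR = 255.0  # normalization_factor in config/model.yaml
MASK_THRESHOLD = 255          # mask_threshold in config/model.yaml


def preprocess_image(image_uint8):
    """Scale uint8 RGB pixels into float32 values in [0, 1]."""
    return image_uint8.astype(np.float32) / NORMALIZATION_FACTOR


def preprocess_mask(mask_uint8):
    """Map a {0, 255} mask to {0.0, 1.0}; assumes pixels >= threshold are lane."""
    return (mask_uint8 >= MASK_THRESHOLD).astype(np.float32)


def split_sizes(n_samples, test_size=0.2):
    """Train/validation counts, assuming the validation share is rounded up."""
    n_val = math.ceil(n_samples * test_size)
    return n_samples - n_val, n_val


image = np.array([[[0, 128, 255]]], dtype=np.uint8)
mask = np.array([[0, 255]], dtype=np.uint8)
print(preprocess_image(image).max())  # 1.0
print(preprocess_mask(mask))          # [[0. 1.]]
print(split_sizes(289))               # (231, 58)
```

With 289 samples and `test_size: 0.2`, this rounding leaves 231 images for training and 58 for validation.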

Infrastructure Configuration (config/train.yaml)

```yaml
sagemaker:
  instance_type: ml.g4dn.xlarge
  instance_count: 1
  framework_version: "2.11.0"
  py_version: py39

s3:
  bucket:
  data_path: raw-data
  model_artifacts_path: model-artifacts
  code_location: code

training:
  job_name_prefix: lane-segmentation-training
```
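
One way `main.py` could combine the two config files into estimator arguments is sketched below. The helper names are hypothetical, the `entry_point` and `source_dir` values are inferred from the repository tree, and the channel name `training` is an assumption. The keyword names themselves are standard arguments of `sagemaker.tensorflow.TensorFlow`.

```python
def build_estimator_kwargs(train_cfg, hyperparameters, role):
    """Translate config/train.yaml and config/model.yaml into TensorFlow estimator arguments."""
    sm = train_cfg["sagemaker"]
    s3 = train_cfg["s3"]
    bucket = s3["bucket"]
    return {
        "entry_point": "train.py",   # assumed: src/train.py is the training script
        "source_dir": "src",
        "role": role,
        "instance_type": sm["instance_type"],
        "instance_count": sm["instance_count"],
        "framework_version": sm["framework_version"],
        "py_version": sm["py_version"],
        "hyperparameters": hyperparameters,
        "output_path": f"s3://{bucket}/{s3['model_artifacts_path']}",
        "code_location": f"s3://{bucket}/{s3['code_location']}",
        "base_job_name": train_cfg["training"]["job_name_prefix"],
    }


def data_channel(train_cfg):
    """S3 URI of the raw dataset used as the training input channel."""
    s3 = train_cfg["s3"]
    return f"s3://{s3['bucket']}/{s3['data_path']}"


# Usage (requires the sagemaker SDK and AWS credentials):
#   from sagemaker.tensorflow import TensorFlow
#   estimator = TensorFlow(**build_estimator_kwargs(train_cfg, hparams, role))
#   estimator.fit({"training": data_channel(train_cfg)})
```

Keeping the translation in a plain function makes it easy to inspect the resulting arguments before any billable job is launched.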

Model Registry

Approval Gate Logic

Post-training, `src/model_registry.py` calls `sagemaker:CreateModelPackage`. Approval status is determined by comparing `final_val_accuracy` against `accuracy_threshold`:

```python
def register_model(self, model_s3_uri, metrics, accuracy_threshold=0.8):
    # Auto-approve only when validation accuracy clears the threshold.
    val_accuracy = metrics.get('final_val_accuracy', 0)
    approval_status = "Approved" if val_accuracy >= accuracy_threshold else "PendingManualApproval"

    # boto3 clients take keyword arguments, not a single request dict.
    # `region` is assumed to be defined in the enclosing scope.
    self.sagemaker.create_model_package(
        ModelPackageGroupName='lane-segmentation-models',
        ModelApprovalStatus=approval_status,
        InferenceSpecification={
            'Containers': [{
                'Image': f'763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-inference:2.12-cpu',
                'ModelDataUrl': model_s3_uri
            }]
        }
    )
```

For stricter multi-criteria gating (val_loss + Mean IoU):

```python
approval_status = "Approved" if (
    val_accuracy >= accuracy_threshold and
    val_loss < 0.3 and
    mean_iou > 0.7
) else "PendingManualApproval"
```
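
The gate can also be written as a pure function, which makes it easy to test without touching AWS. This is a sketch rather than the repository's code: the function name and the `final_val_mean_iou` key are hypothetical, and the optional limits mirror the stricter criteria above.

```python
def approval_status(metrics, accuracy_threshold=0.85, max_val_loss=None, min_mean_iou=None):
    """Return the ModelApprovalStatus implied by a metrics dict.

    Accuracy is always checked; loss and IoU limits apply only when provided.
    """
    checks = [metrics.get("final_val_accuracy", 0.0) >= accuracy_threshold]
    if max_val_loss is not None:
        checks.append(metrics.get("final_val_loss", float("inf")) < max_val_loss)
    if min_mean_iou is not None:
        # "final_val_mean_iou" is an assumed key name, not confirmed by the README.
        checks.append(metrics.get("final_val_mean_iou", 0.0) > min_mean_iou)
    return "Approved" if all(checks) else "PendingManualApproval"


run = {"final_val_accuracy": 0.8366, "final_val_loss": 0.4382}
print(approval_status(run))                          # PendingManualApproval
print(approval_status(run, accuracy_threshold=0.8))  # Approved
```

A run with validation accuracy 0.8366 passes the function default of 0.8 but not the configured `accuracy_threshold` of 0.85, so the value actually passed in decides whether manual review is needed.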

CLI Operations

```bash
# Enumerate all model package versions with approval status and metrics
python model_registry_utils.py

# Patch approval status on a specific model package ARN
python model_registry_utils.py approve \
arn:aws:sagemaker:::model-package/lane-segmentation-models/1
```
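
Under the hood, patching an approval status maps onto the SageMaker `UpdateModelPackage` API. A minimal sketch of what the `approve` command might do is shown below. The function name is hypothetical and the client is passed in so the logic can be exercised with a stub.

```python
ALLOWED_STATUSES = {"Approved", "Rejected", "PendingManualApproval"}


def set_approval_status(sagemaker_client, model_package_arn, status="Approved", description=None):
    """Patch ModelApprovalStatus on one model package version."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unsupported approval status: {status!r}")
    request = {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": status,
    }
    if description:
        request["ApprovalDescription"] = description
    return sagemaker_client.update_model_package(**request)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("sagemaker")
#   set_approval_status(client, "<model-package-arn>", "Approved", "Reviewed manually")
```

Validating the status locally avoids a round trip to the API for an obviously invalid value.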

S3 Artifact Layout

```
s3:///model-artifacts/lane_segmentation_model/
├── v1/
│   ├── 20240430_143022.tar.gz   # TensorFlow SavedModel (saved_model.pb + variables/)
│   └── metrics.json
├── v2/
│   ├── 20240430_150145.tar.gz
│   └── metrics.json
└── v3/
    ├── 20240430_152301.tar.gz
    └── metrics.json
```
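
The key layout above follows from `s3_model_prefix` and `timestamp_format` in `config/model.yaml`. The sketch below shows one way such keys could be built and the next version number derived from existing keys. Both helper names are hypothetical, and the versioning rule (highest existing `vN` plus one) is an assumption.

```python
import re
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"                      # timestamp_format in config/model.yaml
MODEL_PREFIX = "model-artifacts/lane_segmentation_model"  # s3_model_prefix in config/model.yaml


def artifact_keys(version, when, prefix=MODEL_PREFIX):
    """Return the (archive key, metrics key) pair for one model version."""
    stamp = when.strftime(TIMESTAMP_FORMAT)
    base = f"{prefix}/v{version}"
    return f"{base}/{stamp}.tar.gz", f"{base}/metrics.json"


def next_version(existing_keys, prefix=MODEL_PREFIX):
    """Highest vN found under the prefix plus one; 1 when nothing exists yet."""
    pattern = re.compile(re.escape(prefix) + r"/v(\d+)/")
    versions = [int(m.group(1)) for m in map(pattern.match, existing_keys) if m]
    return max(versions, default=0) + 1


archive, metrics = artifact_keys(1, datetime(2024, 4, 30, 14, 30, 22))
print(archive)  # model-artifacts/lane_segmentation_model/v1/20240430_143022.tar.gz
print(metrics)  # model-artifacts/lane_segmentation_model/v1/metrics.json
```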

Example `metrics.json`:

```json
{
  "final_train_loss": 0.6542,
  "final_val_loss": 0.4382,
  "final_train_accuracy": 0.3605,
  "final_val_accuracy": 0.8366,
  "epochs": 15,
  "batch_size": 4,
  "learning_rate": 0.0001,
  "timestamp": "20240430_164055",
  "version": 3
}
```

Inference Application

Model Loading

`app.py` resolves the latest `Approved` model package ARN via `list_model_packages`, downloads the SavedModel artifact from S3, and wraps it in a `TFSMLayer` so it can be used with the Keras 3 functional API (Keras 3's `load_model` no longer reads the TensorFlow SavedModel format; `TFSMLayer` loads one for inference only):

```python
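import streamlit as st
import tensorflow as tf


@st.cache_resource
def load_model():
    # ModelRegistry presumably lives in src/model_registry.py (import omitted in this excerpt).
    registry = ModelRegistry()
    model_s3_uri, model_package_arn = registry.get_latest_approved_model()

    # The archive at model_s3_uri is expected to be downloaded and extracted
    # to models/approved/1 before this point (step omitted in this excerpt).
    model_layer = tf.keras.layers.TFSMLayer(
        'models/approved/1',
        call_endpoint='serving_default'
    )
    inputs = tf.keras.Input(shape=(256, 832, 3))
    outputs = model_layer(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs), model_package_arn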
```
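
The registry lookup that precedes model loading can be sketched with two standard SageMaker API calls, `list_model_packages` and `describe_model_package`. This is an illustration of how `get_latest_approved_model` might work, not the repository's implementation. The standalone function signature is an assumption, and the client is injected so the logic can be tested with a stub.

```python
def get_latest_approved_model(sagemaker_client, group_name="lane-segmentation-models"):
    """Return (model_s3_uri, model_package_arn) for the newest Approved package."""
    response = sagemaker_client.list_model_packages(
        ModelPackageGroupName=group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response.get("ModelPackageSummaryList", [])
    if not summaries:
        raise LookupError(f"no Approved model package in group {group_name!r}")

    arn = summaries[0]["ModelPackageArn"]
    # The list call returns summaries only; the artifact URI needs a describe call.
    details = sagemaker_client.describe_model_package(ModelPackageName=arn)
    model_s3_uri = details["InferenceSpecification"]["Containers"][0]["ModelDataUrl"]
    return model_s3_uri, arn


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   uri, arn = get_latest_approved_model(boto3.client("sagemaker"))
```

Sorting by creation time in descending order and taking a single result avoids paginating through older versions.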

Launch

```bash
streamlit run app.py --server.port 8501 --server.address 0.0.0.0
```

Streamlit frontend: image upload → TFSMLayer inference → binary mask overlay. Registry status and training job launcher rendered in the sidebar.
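
The mask overlay step can be approximated in a few lines of NumPy: threshold the sigmoid output, then alpha-blend a colour onto the lane pixels. This is a sketch only. The function name, the default threshold of 0.5, the green colour, and the blend factor are assumptions, not values taken from `app.py`.

```python
import numpy as np


def overlay_mask(image, probabilities, threshold=0.5, color=(0, 255, 0), alpha=0.4):
    """Blend `color` onto pixels whose predicted probability exceeds `threshold`.

    image: uint8 array of shape (H, W, 3).
    probabilities: float array of shape (H, W) or (H, W, 1), e.g. sigmoid output.
    Returns the blended uint8 image and the boolean lane mask.
    """
    probs = np.asarray(probabilities)
    if probs.ndim == 3:
        probs = probs[..., 0]  # drop the trailing channel axis
    mask = probs > threshold
    blended = image.astype(np.float32)
    blended[mask] = (1.0 - alpha) * blended[mask] + alpha * np.asarray(color, dtype=np.float32)
    return blended.round().astype(np.uint8), mask


frame = np.zeros((2, 2, 3), dtype=np.uint8)
probs = np.array([[0.9, 0.1], [0.2, 0.8]], dtype=np.float32)[..., None]
out, lane = overlay_mask(frame, probs)
print(int(lane.sum()))  # 2 pixels classified as lane
```

Returning the boolean mask alongside the blended image lets the UI expose an adjustable threshold without re-running inference.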

MLOps Lifecycle

```mermaid
graph TD
A[SageMaker Training Job] --> B[Epoch Metrics Logged to MLflow]
B --> C[val_binary_accuracy evaluated against threshold]
C -->|>= threshold| D[ModelApprovalStatus: Approved]
C -->|< threshold| E[ModelApprovalStatus: PendingManualApproval]
D --> F[app.py resolves latest Approved ARN]
E --> G[Manual review via model_registry_utils.py]
G -->|approve| F
G -->|reject| H[Package remains in PendingManualApproval]
F --> I[TFSMLayer inference serving]
```

Experiment Tracking

MLflow run context is opened in `src/train.py` before the Keras `.fit()` call. Hyperparameters are logged once; per-epoch metrics are logged with `step=epoch` for time-series visualization in the MLflow UI:

```python
import mlflow

# `config`, `timestamp`, and `epochs` are defined earlier in src/train.py.
with mlflow.start_run(run_name=f"lane_seg_{timestamp}"):
    mlflow.log_params({
        'epochs': config['epochs'],
        'batch_size': config['batch_size'],
        'learning_rate': config['learning_rate']
    })
    # model.fit(...) runs at this point and returns `history` (call elided in this excerpt).
    for epoch in range(epochs):
        mlflow.log_metrics({
            'train_loss': history.history['loss'][epoch],
            'val_loss': history.history['val_loss'][epoch],
            'train_accuracy': history.history['binary_accuracy'][epoch],
            'val_accuracy': history.history['val_binary_accuracy'][epoch]
        }, step=epoch)
```
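
A small variation on the logging loop, sketched below, iterates over the epochs actually recorded in the history rather than the configured count. That keeps the loop from indexing past the end if training stops early, for example through an early-stopping callback. The helper name `log_history` is hypothetical, and the logging function is passed in so the example runs without an MLflow server.

```python
# Keras history key -> metric name used in MLflow
METRIC_NAMES = {
    "loss": "train_loss",
    "val_loss": "val_loss",
    "binary_accuracy": "train_accuracy",
    "val_binary_accuracy": "val_accuracy",
}


def log_history(history_dict, log_metrics):
    """Log one metrics dict per completed epoch and return the epoch count.

    history_dict: mapping shaped like Keras `History.history`.
    log_metrics: callable compatible with `mlflow.log_metrics(metrics, step=...)`.
    """
    n_epochs = len(history_dict["loss"])
    for epoch in range(n_epochs):
        metrics = {
            alias: float(history_dict[key][epoch])
            for key, alias in METRIC_NAMES.items()
            if key in history_dict
        }
        log_metrics(metrics, step=epoch)
    return n_epochs


# Usage inside an active run:
#   import mlflow
#   log_history(history.history, mlflow.log_metrics)
```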

Tracked metrics: Dice loss, binary cross-entropy, binary accuracy, Mean IoU, epoch wall-clock time, GPU utilization, peak VRAM allocation, total parameter count, SavedModel size on disk, and batch inference latency (p50/p95).

Interactive Demo (Hugging Face Spaces)

Lightweight interactive dashboard deployed on Hugging Face Spaces for real-time inference and visualization. Includes adjustable thresholding, overlay tuning, and performance metrics.

👉 Try Live Demo

References

- [SageMaker `CreateModelPackage` API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html)
- [MLflow Tracking Server](https://mlflow.org/docs/latest/tracking.html#mlflow-tracking-servers)
- [TensorFlow SavedModel format](https://www.tensorflow.org/guide/saved_model)
- [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597)
- [V-Net / Dice Loss](https://arxiv.org/abs/1606.04797)
