https://github.com/nasim-raj-laskar/sagemaker-lane-segmentation
U-Net lane segmentation pipeline with SageMaker training, MLflow tracking, and registry-gated deployment.
- Host: GitHub
- URL: https://github.com/nasim-raj-laskar/sagemaker-lane-segmentation
- Owner: nasim-raj-laskar
- Created: 2026-03-10T19:02:13.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-30T18:25:56.000Z (about 2 months ago)
- Last Synced: 2026-04-30T20:13:39.441Z (about 2 months ago)
- Topics: amazon-sagemaker
- Language: Python
- Homepage:
- Size: 23.9 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Lane Segmentation MLOps Pipeline on Amazon SageMaker
Pixel-level binary semantic segmentation of lane boundaries using a fully-convolutional U-Net encoder-decoder trained on 289 annotated road images. The pipeline integrates SageMaker Training Jobs, SageMaker Model Registry with threshold-gated approval, MLflow experiment tracking, and a Streamlit inference frontend backed by TFSMLayer-wrapped SavedModel artifacts.
Architecture
Model
Symmetric encoder-decoder (U-Net) with lateral skip connections between mirrored resolution stages. Skip connections concatenate encoder feature maps directly into the decoder path, preserving high-frequency spatial detail lost during max-pooling downsampling.
| Component | Specification |
|---|---|
| Input tensor | `(N, 256, 832, 3)` — float32, normalized to `[0, 1]` |
| Encoder depth | 4 stages — filter progression `[64, 128, 256, 512]` |
| Bottleneck | 1024 filters, no spatial downsampling |
| Decoder depth | 4 stages — filter progression `[512, 256, 128, 64]` |
| Upsampling | Bilinear interpolation (`unpool='bilinear'`) |
| Output | `(N, 256, 832, 1)` — sigmoid activation, binary mask |
| Loss | Dice loss (1 − Sørensen–Dice coefficient): `L = 1 - (2·\|X∩Y\| + ε) / (\|X\| + \|Y\| + ε)` |
| Optimizer | Adam, `lr=1e-4`, default β₁=0.9, β₂=0.999 |
| Metrics | Binary accuracy, Mean IoU (`num_classes=2`) |
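
For orientation, the spatial sizes implied by the table can be traced by hand. The sketch below assumes each encoder stage ends in a 2×2 max-pool that halves height and width (the usual U-Net layout; the table does not state the pool size). `trace_shapes` is an illustrative helper, not part of the repository.

```python
def trace_shapes(height=256, width=832, filters=(64, 128, 256, 512), bottleneck=1024):
    """Return (stage, height, width, channels) for each encoder stage and the bottleneck.

    Assumes every encoder stage halves height and width once (2x2 max-pool).
    """
    shapes = []
    h, w = height, width
    for i, f in enumerate(filters, start=1):
        shapes.append((f"encoder_{i}", h, w, f))  # feature map kept for the skip connection
        h, w = h // 2, w // 2                      # resolution after pooling
    shapes.append(("bottleneck", h, w, bottleneck))
    return shapes

for name, h, w, c in trace_shapes():
    print(f"{name:>10}: {h:>3} x {w:>3} x {c}")
```

Under that assumption the bottleneck sees a 16 × 52 feature map. Both input dimensions are divisible by 2⁴ = 16, so four upsampling stages can return to 256 × 832 without cropping or padding.
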
Dice loss is preferred over binary cross-entropy here because of severe foreground/background class imbalance: lane pixels make up a small fraction of the image area, so a BCE-trained model can reach a low loss while collapsing to a degenerate all-background prediction.
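
To make the imbalance argument concrete, here is a minimal, framework-free sketch of the Dice loss from the table next to mean binary cross-entropy. The function names, the smoothing value `eps=1.0`, and the toy mask with 5% foreground are illustrative assumptions; the repository's actual loss operates on TensorFlow tensors.

```python
import math

def dice_loss(y_true, y_pred, eps=1.0):
    """Soft Dice loss: 1 - (2*|X∩Y| + eps) / (|X| + |Y| + eps) over flat sequences."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return 1.0 - (2.0 * intersection + eps) / (sum(y_true) + sum(y_pred) + eps)

def bce_loss(y_true, y_pred, clip=1e-7):
    """Mean binary cross-entropy with predictions clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, clip), 1.0 - clip)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

# Toy mask: 5 lane pixels out of 100 (hypothetical numbers).
mask = [1.0] * 5 + [0.0] * 95
all_background = [0.01] * 100   # predicts "no lane" almost everywhere
perfect = list(mask)

for name, pred in [("all-background", all_background), ("perfect", perfect)]:
    print(f"{name:>15}: dice={dice_loss(mask, pred):.3f}  bce={bce_loss(mask, pred):.3f}")
```

On this toy input the all-background prediction gets a BCE of about 0.24 but a Dice loss of about 0.84, so missing every lane pixel is penalized far more heavily under Dice.
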
Infrastructure
```yaml
Compute:
  instance_type: ml.g4dn.xlarge     # 4 vCPU, 16 GiB RAM, 1x NVIDIA T4 (16 GiB VRAM)
  framework_version: "2.11.0"       # TF 2.11, Keras 2 API (Keras 3 became the default in TF 2.16)
  container: AWS Deep Learning Container (763104351884.dkr.ecr..amazonaws.com)
Storage:
  training_data: s3:///raw-data/            # 289 RGB images + 289 binary masks
  model_artifacts: s3:///model-artifacts/   # versioned tar.gz SavedModel archives
  experiment_logs: SageMaker MLflow Tracking Server
Orchestration:
  training: SageMaker Training Jobs (managed spot optional)
  registry: SageMaker Model Registry (ModelPackageGroup: lane-segmentation-models)
  tracking: SageMaker MLflow Apps (OIDC-authenticated tracking server)
```
Repository Structure
```
lane-segmentation-pipeline/
├── src/
│   ├── train.py              # Training loop, checkpointing, S3 artifact upload, registry registration
│   ├── model.py              # U-Net graph construction via keras-unet-collection
│   ├── data_loader.py        # tf.data pipeline: decode → resize → normalize → augment → batch
│   ├── mlflow_config.py      # MLflow client init, run context manager, param/metric logging
│   ├── model_registry.py     # SageMaker boto3 calls: create_model_package, list_model_packages
│   └── requirements.txt
├── config/
│   ├── model.yaml            # Hyperparameters, data config, approval threshold
│   └── train.yaml            # SageMaker instance config, S3 paths, job name prefix
├── dataset/
│   ├── image/                # 289 × RGB road frames (variable resolution, resized to 256×832)
│   └── mask/                 # 289 × binary lane masks (uint8, values ∈ {0, 255})
├── assets/
│   ├── ui.png
│   └── output.mp4
├── models/                   # Local SavedModel cache (populated by app.py on first load)
├── app.py                    # Streamlit frontend: inference, registry status, job launcher
├── main.py                   # SageMaker Estimator configuration and .fit() invocation
├── model_registry_utils.py   # CLI wrapper: list packages, patch approval status
└── MODEL_REGISTRY.md
```
Environment Setup
**Requirements:** AWS account with `sagemaker:*`, `s3:*`, `iam:PassRole` permissions; Python ≥ 3.9; AWS CLI v2.
```bash
git clone https://github.com/nasim-raj-laskar/sagemaker-lane-segmentation.git
cd sagemaker-lane-segmentation/
pip install -r src/requirements.txt
```
Create a `.env` file:
```bash
AWS_ACCOUNT_ID=
AWS_REGION=
S3_BUCKET=
SAGEMAKER_ROLE=SageMakerExecutionRole
MLFLOW_ARN=arn:aws:sagemaker:::mlflow-tracking-server/
```
```bash
# Sync raw dataset to S3 input channel
aws s3 sync dataset/ s3:///raw-data/
```
Training
Launch SageMaker Training Job
```bash
python main.py
```
This instantiates a `sagemaker.tensorflow.TensorFlow` estimator targeting `ml.g4dn.xlarge`, passes the values from `config/model.yaml` through the estimator's `hyperparameters` argument (SageMaker forwards them to the training script as command-line arguments), and calls `.fit()` with the S3 data channel. Artifacts written to `/opt/ml/model/` inside the container are packaged and uploaded to S3 automatically when the job completes.
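
A minimal sketch of what such an estimator setup might look like, assuming the SageMaker Python SDK v2. The entry point `train.py` under `src/`, the channel name `training`, the IAM role ARN format, the placeholder account ID and bucket, and the helper names are assumptions for illustration; the repository's `main.py` may differ. The SDK import is deferred so the argument-building part runs without AWS access.

```python
def build_estimator_kwargs(role_arn, bucket, hyperparameters):
    """Collect TensorFlow estimator arguments from values mirroring config/train.yaml."""
    return {
        "entry_point": "train.py",           # assumed script name inside source_dir
        "source_dir": "src",
        "role": role_arn,
        "instance_count": 1,
        "instance_type": "ml.g4dn.xlarge",
        "framework_version": "2.11.0",
        "py_version": "py39",
        "base_job_name": "lane-segmentation-training",
        "output_path": f"s3://{bucket}/model-artifacts",
        "code_location": f"s3://{bucket}/code",
        # Each key reaches the training script as a command-line argument, e.g. --epochs 15.
        "hyperparameters": hyperparameters,
    }

def launch(role_arn, bucket, hyperparameters):
    """Start the training job (needs AWS credentials and the sagemaker package)."""
    from sagemaker.tensorflow import TensorFlow  # deferred: only needed at launch time

    estimator = TensorFlow(**build_estimator_kwargs(role_arn, bucket, hyperparameters))
    estimator.fit({"training": f"s3://{bucket}/raw-data"})  # channel name is an assumption
    return estimator

kwargs = build_estimator_kwargs(
    role_arn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder account ID
    bucket="example-bucket",                                            # placeholder bucket
    hyperparameters={"epochs": 15, "batch_size": 4, "learning_rate": 0.0001},
)
print(kwargs["output_path"])
```

Keeping the argument dictionary separate from the `launch` call makes the configuration easy to inspect or log before any billable job is started.
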
Hyperparameter Reference (config/model.yaml)
```yaml
epochs: 15
batch_size: 4 # constrained by T4 VRAM at 256×832 resolution
learning_rate: 0.0001
accuracy_threshold: 0.85 # minimum val_binary_accuracy for auto-approval
img_height: 256
img_width: 832
normalization_factor: 255.0
mask_threshold: 255 # binarization cutoff for mask preprocessing
test_size: 0.2
random_state: 42
s3_bucket:
s3_model_prefix: model-artifacts/lane_segmentation_model
timestamp_format: '%Y%m%d_%H%M%S'
```
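
As a rough worked example of how these values play out, the sketch below binarizes a mask row with `mask_threshold`, scales pixel values with `normalization_factor`, and computes the split sizes. It assumes the cutoff is applied as `>=` and that the split rounds the validation count up, as scikit-learn's `train_test_split` does; the `test_size` and `random_state` names suggest that function, but the README does not confirm it. The helper names are hypothetical.

```python
import math

def binarize_mask(row, mask_threshold=255):
    """Map uint8 mask values to {0.0, 1.0}; only pixels at or above the cutoff count as lane."""
    return [1.0 if v >= mask_threshold else 0.0 for v in row]

def normalize_pixels(row, normalization_factor=255.0):
    """Scale uint8 intensities into [0, 1]."""
    return [v / normalization_factor for v in row]

def split_sizes(n_samples=289, test_size=0.2, batch_size=4):
    """Return (n_train, n_val, train_steps_per_epoch), rounding the validation count up."""
    n_val = math.ceil(n_samples * test_size)
    n_train = n_samples - n_val
    return n_train, n_val, math.ceil(n_train / batch_size)

print(binarize_mask([0, 128, 255]))
print(normalize_pixels([0, 51, 255]))
print(split_sizes())
```

Under these assumptions the 289 images split into 231 training and 58 validation samples, giving 58 training steps per epoch at a batch size of 4. With a cutoff of 255, any intermediate mask values (for example anti-aliased edges) fall into the background class.
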
Infrastructure Configuration (config/train.yaml)
```yaml
sagemaker:
  instance_type: ml.g4dn.xlarge
  instance_count: 1
  framework_version: "2.11.0"
  py_version: py39
s3:
  bucket:
  data_path: raw-data
  model_artifacts_path: model-artifacts
  code_location: code
training:
  job_name_prefix: lane-segmentation-training
```
Model Registry
Approval Gate Logic
Post-training, `src/model_registry.py` calls `sagemaker:CreateModelPackage`. Approval status is determined by comparing `final_val_accuracy` against `accuracy_threshold`:
```python
def register_model(self, model_s3_uri, metrics, accuracy_threshold=0.8):
    # Auto-approve only when validation accuracy clears the threshold.
    val_accuracy = metrics.get('final_val_accuracy', 0)
    approval_status = "Approved" if val_accuracy >= accuracy_threshold else "PendingManualApproval"
    # Assumes self.sagemaker is a boto3 SageMaker client; reuse its region for the image URI.
    region = self.sagemaker.meta.region_name
    # boto3 expects keyword arguments rather than a single dict.
    return self.sagemaker.create_model_package(
        ModelPackageGroupName='lane-segmentation-models',
        ModelApprovalStatus=approval_status,
        InferenceSpecification={
            'Containers': [{
                'Image': f'763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-inference:2.12-cpu',
                'ModelDataUrl': model_s3_uri
            }]
        }
    )
```
For stricter multi-criteria gating (val_loss + Mean IoU):
```python
approval_status = "Approved" if (
    val_accuracy >= accuracy_threshold and
    val_loss < 0.3 and
    mean_iou > 0.7
) else "PendingManualApproval"
```
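
The gate can also be written as a small pure function, which makes it easy to unit-test without touching SageMaker. This is a sketch: the function name, the metric keys `final_val_loss` and `mean_iou`, and the default cutoffs mirror the snippets above but are otherwise assumptions.

```python
def decide_approval(metrics, accuracy_threshold=0.85, max_val_loss=None, min_mean_iou=None):
    """Return 'Approved' only if every configured criterion passes.

    Missing metrics fail closed: an absent value never auto-approves.
    """
    checks = [metrics.get("final_val_accuracy", 0.0) >= accuracy_threshold]
    if max_val_loss is not None:
        checks.append(metrics.get("final_val_loss", float("inf")) < max_val_loss)
    if min_mean_iou is not None:
        checks.append(metrics.get("mean_iou", 0.0) > min_mean_iou)
    return "Approved" if all(checks) else "PendingManualApproval"

# Hypothetical runs: one below the accuracy threshold, one passing all three criteria.
below = decide_approval({"final_val_accuracy": 0.8366})
passing = decide_approval(
    {"final_val_accuracy": 0.91, "final_val_loss": 0.25, "mean_iou": 0.74},
    max_val_loss=0.3,
    min_mean_iou=0.7,
)
print(below, passing)
```

Failing closed on missing metrics means a run that never reported Mean IoU lands in `PendingManualApproval` instead of slipping through the stricter gate.
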
CLI Operations
```bash
# Enumerate all model package versions with approval status and metrics
python model_registry_utils.py
# Patch approval status on a specific model package ARN
python model_registry_utils.py approve \
arn:aws:sagemaker:::model-package/lane-segmentation-models/1
```
S3 Artifact Layout
```
s3:///model-artifacts/lane_segmentation_model/
├── v1/
│   ├── 20240430_143022.tar.gz   # TensorFlow SavedModel (saved_model.pb + variables/)
│   └── metrics.json
├── v2/
│   ├── 20240430_150145.tar.gz
│   └── metrics.json
└── v3/
    ├── 20240430_152301.tar.gz
    └── metrics.json
```
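
One way to derive the next version prefix and the object keys for a run, sketched in plain Python. The helper names and the rule "next version = highest existing `vN` + 1" are assumptions; the README shows only the resulting layout.

```python
import re
from datetime import datetime

PREFIX = "model-artifacts/lane_segmentation_model"
_VERSION_RE = re.compile(rf"^{re.escape(PREFIX)}/v(\d+)/")

def next_version(existing_keys):
    """Return 1 + the highest vN found under PREFIX (1 if none exist)."""
    versions = [int(m.group(1)) for k in existing_keys if (m := _VERSION_RE.match(k))]
    return max(versions, default=0) + 1

def artifact_keys(version, when, timestamp_format="%Y%m%d_%H%M%S"):
    """Build the archive key and the metrics key for one training run."""
    stamp = when.strftime(timestamp_format)
    base = f"{PREFIX}/v{version}"
    return f"{base}/{stamp}.tar.gz", f"{base}/metrics.json"

keys = [
    f"{PREFIX}/v1/20240430_143022.tar.gz",
    f"{PREFIX}/v2/20240430_150145.tar.gz",
]
version = next_version(keys)
print(artifact_keys(version, datetime(2024, 4, 30, 15, 23, 1)))
```

In practice `existing_keys` would come from an S3 listing of the prefix; that call is left out so the sketch runs offline.
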
Example `metrics.json`:
```json
{
  "final_train_loss": 0.6542,
  "final_val_loss": 0.4382,
  "final_train_accuracy": 0.3605,
  "final_val_accuracy": 0.8366,
  "epochs": 15,
  "batch_size": 4,
  "learning_rate": 0.0001,
  "timestamp": "20240430_164055",
  "version": 3
}
```
Inference Application
Model Loading
`app.py` resolves the latest `Approved` model package ARN via `list_model_packages`, downloads the SavedModel artifact from S3, and wraps it in a `TFSMLayer` so it can be called inside a Keras 3 functional model (Keras 3 no longer loads TensorFlow SavedModel directories through `keras.models.load_model`; `TFSMLayer` is its inference-only replacement):
```python
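import streamlit as st
import tensorflow as tf

@st.cache_resource  # build the model once per Streamlit server process
def load_model():
    registry = ModelRegistry()
    model_s3_uri, model_package_arn = registry.get_latest_approved_model()
    # Downloading and extracting model_s3_uri into models/approved/1 is omitted here.
    model_layer = tf.keras.layers.TFSMLayer(
        'models/approved/1',
        call_endpoint='serving_default'
    )
    inputs = tf.keras.Input(shape=(256, 832, 3))
    # A signature endpoint such as serving_default typically returns a dict keyed by output name.
    outputs = model_layer(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs), model_package_arn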
```
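
The registry lookup behind `get_latest_approved_model()` is not shown in the README. A plausible sketch using a boto3-style SageMaker client is below; the function body is an assumption built on the `ListModelPackages` and `DescribeModelPackage` calls, and the stub client with its canned ARN and S3 URL exists only so the example runs offline.

```python
def get_latest_approved_model(sagemaker_client, group_name="lane-segmentation-models"):
    """Return (model_data_url, model_package_arn) for the newest Approved version, or None."""
    listing = sagemaker_client.list_model_packages(
        ModelPackageGroupName=group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = listing.get("ModelPackageSummaryList", [])
    if not summaries:
        return None  # nothing has been approved yet
    arn = summaries[0]["ModelPackageArn"]
    detail = sagemaker_client.describe_model_package(ModelPackageName=arn)
    url = detail["InferenceSpecification"]["Containers"][0]["ModelDataUrl"]
    return url, arn

class _StubClient:
    """Offline stand-in for boto3.client('sagemaker') with canned responses."""
    def list_model_packages(self, **kwargs):
        return {"ModelPackageSummaryList": [{"ModelPackageArn": "arn:example:model-package/3"}]}

    def describe_model_package(self, ModelPackageName):
        return {"InferenceSpecification": {
            "Containers": [{"ModelDataUrl": "s3://example-bucket/v3/model.tar.gz"}]}}

latest = get_latest_approved_model(_StubClient())
print(latest)
```

Returning `None` when no version is approved lets the frontend show a clear "no approved model" state instead of failing on an index error.
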
Launch
```bash
streamlit run app.py --server.port 8501 --server.address 0.0.0.0
```
The Streamlit frontend handles image upload → TFSMLayer inference → binary mask overlay. Registry status and a training job launcher are rendered in the sidebar.
MLOps Lifecycle
```mermaid
graph TD
    A[SageMaker Training Job] --> B[Epoch Metrics Logged to MLflow]
    B --> C[val_binary_accuracy evaluated against threshold]
    C -->|>= threshold| D[ModelApprovalStatus: Approved]
    C -->|< threshold| E[ModelApprovalStatus: PendingManualApproval]
    D --> F[app.py resolves latest Approved ARN]
    E --> G[Manual review via model_registry_utils.py]
    G -->|approve| F
    G -->|reject| H[Package remains in PendingManualApproval]
    F --> I[TFSMLayer inference serving]
```
Experiment Tracking
MLflow run context is opened in `src/train.py` before the Keras `.fit()` call. Hyperparameters are logged once; per-epoch metrics are logged with `step=epoch` for time-series visualization in the MLflow UI:
```python
import mlflow

with mlflow.start_run(run_name=f"lane_seg_{timestamp}"):
    # Hyperparameters are logged once per run.
    mlflow.log_params({
        'epochs': config['epochs'],
        'batch_size': config['batch_size'],
        'learning_rate': config['learning_rate']
    })
    # `history` is the object returned by model.fit(...), which runs inside this context (not shown).
    for epoch in range(len(history.history['loss'])):
        mlflow.log_metrics({
            'train_loss': history.history['loss'][epoch],
            'val_loss': history.history['val_loss'][epoch],
            'train_accuracy': history.history['binary_accuracy'][epoch],
            'val_accuracy': history.history['val_binary_accuracy'][epoch]
        }, step=epoch)
```
Tracked metrics: Dice loss, binary cross-entropy, binary accuracy, Mean IoU, epoch wall-clock time, GPU utilization, peak VRAM allocation, total parameter count, SavedModel size on disk, and batch inference latency (p50/p95).
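
Of these, Mean IoU is the least obvious to compute by hand. The sketch below averages per-class IoU over background and foreground after thresholding predictions at 0.5. The threshold, the function name, and the convention of skipping classes with an empty union are assumptions for illustration, not the repository's metric code.

```python
def mean_iou(y_true, y_pred, threshold=0.5):
    """Mean of per-class IoU for classes {0, 1} over flat sequences."""
    labels = [1 if p >= threshold else 0 for p in y_pred]
    ious = []
    for cls in (0, 1):
        intersection = sum(1 for t, l in zip(y_true, labels) if t == cls and l == cls)
        union = sum(1 for t, l in zip(y_true, labels) if t == cls or l == cls)
        if union:  # skip classes that appear in neither mask nor prediction
            ious.append(intersection / union)
    return sum(ious) / len(ious)

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of pixels whose thresholded prediction matches the mask."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == (1 if p >= threshold else 0))
    return hits / len(y_true)

# Toy mask with 2 lane pixels out of 10 (hypothetical numbers).
mask = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_background = [0.1] * 10
print(binary_accuracy(mask, all_background), mean_iou(mask, all_background))
```

An all-background prediction on this toy mask scores 0.8 binary accuracy but only 0.4 Mean IoU, which is one reason to look at IoU alongside accuracy when judging a run.
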
Interactive Demo (Hugging Face Spaces)
Lightweight interactive dashboard deployed on Hugging Face Spaces for real-time inference and visualization. Includes adjustable thresholding, overlay tuning, and performance metrics.
References
- [SageMaker `CreateModelPackage` API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModelPackage.html)
- [MLflow Tracking Server](https://mlflow.org/docs/latest/tracking.html#mlflow-tracking-servers)
- [TensorFlow SavedModel format](https://www.tensorflow.org/guide/saved_model)
- [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597)
- [V-Net / Dice Loss](https://arxiv.org/abs/1606.04797)