{"id":29093236,"url":"https://github.com/nvidia-ai-iot/nanosam","last_synced_at":"2025-06-28T08:07:49.173Z","repository":{"id":194192601,"uuid":"688239237","full_name":"NVIDIA-AI-IOT/nanosam","owner":"NVIDIA-AI-IOT","description":"A distilled Segment Anything (SAM) model capable of running real-time with NVIDIA TensorRT","archived":false,"fork":false,"pushed_at":"2023-11-20T09:10:27.000Z","size":99971,"stargazers_count":728,"open_issues_count":29,"forks_count":65,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-03-20T11:41:29.136Z","etag":null,"topics":["jetson-orin","jetson-orin-nano","nvidia","real-time","segment-anything","tensorrt"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA-AI-IOT.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-07T00:16:18.000Z","updated_at":"2025-03-19T06:04:36.000Z","dependencies_parsed_at":null,"dependency_job_id":"85658acc-adbf-43ee-9bab-13c570ab1c67","html_url":"https://github.com/NVIDIA-AI-IOT/nanosam","commit_stats":null,"previous_names":["nvidia-ai-iot/nanosam"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/NVIDIA-AI-IOT/nanosam","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-AI-IOT%2Fnanosam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-AI-IOT%2Fnanosam/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-AI-IOT%2Fnanosam/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-AI-IOT%2Fnanosam/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA-AI-IOT","download_url":"https://codeload.github.com/NVIDIA-AI-IOT/nanosam/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-AI-IOT%2Fnanosam/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262396520,"owners_count":23304447,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jetson-orin","jetson-orin-nano","nvidia","real-time","segment-anything","tensorrt"],"created_at":"2025-06-28T08:07:39.391Z","updated_at":"2025-06-28T08:07:49.166Z","avatar_url":"https://github.com/NVIDIA-AI-IOT.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\u003cspan\u003eNanoSAM\u003c/span\u003e\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\u003ca href=\"#usage\"/\u003e👍 Usage\u003c/a\u003e - \u003ca href=\"#performance\"/\u003e⏱️ Performance\u003c/a\u003e - \u003ca href=\"#setup\"\u003e🛠️ Setup\u003c/a\u003e - \u003ca href=\"#examples\"\u003e🤸 Examples\u003c/a\u003e - \u003ca href=\"#training\"\u003e🏋️ Training\u003c/a\u003e \u003cbr\u003e- \u003ca href=\"#evaluation\"\u003e🧐 Evaluation\u003c/a\u003e - \u003ca href=\"#acknowledgement\"\u003e👏 Acknowledgment\u003c/a\u003e - \u003ca href=\"#see-also\"\u003e🔗 See also\u003c/a\u003e\u003c/p\u003e\n\nNanoSAM is a [Segment Anything (SAM)](https://github.com/facebookresearch/segment-anything) model variant that is capable of running in 🔥 
***real-time*** 🔥 on [NVIDIA Jetson Orin Platforms](https://store.nvidia.com/en-us/jetson/store) with [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt).  \n\n\u003c!-- \u003cimg src=\"assets/tshirt_gif_compressed_v2.gif\" height=\"20%\" width=\"20%\"/\u003e   --\u003e\n\u003cp align=\"center\"\u003e\u003cimg src=\"assets/basic_usage_out.jpg\" height=\"256px\"/\u003e\u003c/p\u003e\n\u003c!--\u003cimg src=\"assets/mouse_gif_compressed.gif\"  height=\"50%\" width=\"50%\"/\u003e --\u003e\n\n\u003e NanoSAM is trained by distilling the [MobileSAM](https://github.com/ChaoningZhang/MobileSAM) image encoder\n\u003e on unlabeled images.  For an introduction to knowledge distillation, we recommend checking out [this tutorial](https://github.com/NVIDIA-AI-IOT/jetson-intro-to-distillation).\n\n\u003ca id=\"usage\"\u003e\u003c/a\u003e\n## 👍 Usage\n\nUsing NanoSAM from Python looks like this:\n\n```python3\nimport numpy as np\nimport PIL.Image\n\nfrom nanosam.utils.predictor import Predictor\n\npredictor = Predictor(\n    image_encoder=\"data/resnet18_image_encoder.engine\",\n    mask_decoder=\"data/mobile_sam_mask_decoder.engine\"\n)\n\nimage = PIL.Image.open(\"dog.jpg\")\n\npredictor.set_image(image)\n\n# (x, y) is the pixel coordinate of the prompt point; label 1 marks it as foreground\nmask, _, _ = predictor.predict(np.array([[x, y]]), np.array([1]))\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eNotes\u003c/summary\u003e\nThe point labels may be\n\n| Point Label | Description |\n|:--------------------:|-------------|\n| 0 | Background point |\n| 1 | Foreground point |\n| 2 | Bounding box top-left |\n| 3 | Bounding box bottom-right |\n\u003c/details\u003e\n\n\u003e Follow the instructions below for how to build the engine files.\n\n\u003ca id=\"performance\"\u003e\u003c/a\u003e\n## ⏱️ Performance\n\nNanoSAM runs in real time on Jetson Orin Nano.\n\n\u003ctable style=\"border-top: solid 1px; border-left: solid 1px; border-right: solid 1px; border-bottom: solid 1px\"\u003e\n    \u003cthead\u003e\n        \u003ctr\u003e\n            \u003cth rowspan=2 style=\"text-align: center; border-right: solid 
1px\"\u003eModel †\u003c/th\u003e\n            \u003cth colspan=2 style=\"text-align: center; border-right: solid 1px\"\u003e:stopwatch: Jetson Orin Nano (ms)\u003c/th\u003e\n            \u003cth colspan=2 style=\"text-align: center; border-right: solid 1px\"\u003e:stopwatch: Jetson AGX Orin (ms)\u003c/th\u003e\n            \u003cth colspan=4 style=\"text-align: center; border-right: solid 1px\"\u003e :dart: Accuracy (mIoU) ‡\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003cth style=\"text-align: center; border-right: solid 1px\"\u003eImage Encoder\u003c/th\u003e\n            \u003cth style=\"text-align: center; border-right: solid 1px\"\u003eFull Pipeline\u003c/th\u003e\n            \u003cth style=\"text-align: center; border-right: solid 1px\"\u003eImage Encoder\u003c/th\u003e\n            \u003cth style=\"text-align: center; border-right: solid 1px\"\u003eFull Pipeline\u003c/th\u003e\n            \u003cth style=\"text-align: center; border-right: solid 1px\"\u003eAll\u003c/th\u003e\n            \u003cth style=\"text-align: center; border-right: solid 1px\"\u003eSmall\u003c/th\u003e\n            \u003cth style=\"text-align: center; border-right: solid 1px\"\u003eMedium\u003c/th\u003e\n            \u003cth style=\"text-align: center; border-right: solid 1px\"\u003eLarge\u003c/th\u003e\n        \u003c/tr\u003e\n    \u003c/thead\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003eMobileSAM\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003eTBD\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e146\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e35\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e39\u003c/td\u003e\n            \u003ctd style=\"text-align: center; 
border-right: solid 1px\"\u003e0.728\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e0.658\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e0.759\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e0.804\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003eNanoSAM (ResNet18)\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003eTBD\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e27\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e4.2\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e8.1\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e0.706\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e0.624\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e0.738\u003c/td\u003e\n            \u003ctd style=\"text-align: center; border-right: solid 1px\"\u003e0.796\u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n\u003c/table\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eNotes\u003c/summary\u003e\n\n† The MobileSAM image encoder is optimized with FP32 precision because it produced erroneous results when built for FP16 precision with TensorRT.  The NanoSAM image encoder\nis built with FP16 precision as we did not notice a significant accuracy degradation.  Both pipelines use the same mask decoder, which is built with FP32 precision.  
For all models, the accuracy reported uses the same model configuration used to measure latency.\n\n‡ Accuracy is computed by prompting SAM with ground-truth object bounding box annotations from the COCO 2017 validation dataset.  The IoU is then computed between the mask output of the SAM model for the object and the ground-truth COCO segmentation mask for the object.  The mIoU is the average IoU over all objects in the COCO 2017 validation set matching the target object size (small, medium, large).  \n\n\u003c/details\u003e\n\n\u003ca id=\"setup\"\u003e\u003c/a\u003e\n## 🛠️ Setup\n\nNanoSAM is fairly easy to get started with.\n\n1. Install the dependencies\n\n    1. Install PyTorch\n\n    2. Install [torch2trt](https://github.com/NVIDIA-AI-IOT/torch2trt)\n    3. Install NVIDIA TensorRT\n    4. (optional) Install [TRTPose](https://github.com/NVIDIA-AI-IOT/trt_pose) - For the pose example.\n        \n        ```bash\n        git clone https://github.com/NVIDIA-AI-IOT/trt_pose\n        cd trt_pose\n        python3 setup.py develop --user\n        ```\n\n    5. (optional) Install the Transformers library - For the OWL ViT example.\n\n        ```bash\n        python3 -m pip install transformers\n        ```\n\n2. Install the NanoSAM Python package\n    \n    ```bash\n    git clone https://github.com/NVIDIA-AI-IOT/nanosam\n    cd nanosam\n    python3 setup.py develop --user\n    ```\n\n3. Build the TensorRT engine for the mask decoder\n\n    1. Export the MobileSAM mask decoder ONNX file (or download directly from [here](https://drive.google.com/file/d/1jYNvnseTL49SNRx9PDcbkZ9DwsY8up7n/view?usp=drive_link))\n    \n        ```bash\n        python3 -m nanosam.tools.export_sam_mask_decoder_onnx \\\n            --model-type=vit_t \\\n            --checkpoint=assets/mobile_sam.pt \\\n            --output=data/mobile_sam_mask_decoder.onnx\n        ```\n\n    2. 
Build the TensorRT engine\n\n        ```bash\n        trtexec \\\n            --onnx=data/mobile_sam_mask_decoder.onnx \\\n            --saveEngine=data/mobile_sam_mask_decoder.engine \\\n            --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n            --optShapes=point_coords:1x1x2,point_labels:1x1 \\\n            --maxShapes=point_coords:1x10x2,point_labels:1x10\n        ```\n\n        \u003e This assumes the mask decoder ONNX file is downloaded to ``data/mobile_sam_mask_decoder.onnx``\n\n        \u003cdetails\u003e\n        \u003csummary\u003eNotes\u003c/summary\u003e\n        This command builds the engine to support up to 10 keypoints.  You can increase\n        this limit as needed by specifying a different max shape.\n        \u003c/details\u003e\n\n4. Build the TensorRT engine for the NanoSAM image encoder\n\n    1. Download the image encoder: [resnet18_image_encoder.onnx](https://drive.google.com/file/d/14-SsvoaTl-esC3JOzomHDnI9OGgdO2OR/view?usp=drive_link)\n    \n    2. Build the TensorRT engine\n\n        ```bash\n        trtexec \\\n            --onnx=data/resnet18_image_encoder.onnx \\\n            --saveEngine=data/resnet18_image_encoder.engine \\\n            --fp16\n        ```\n\n5. Run the basic usage example\n\n    ```bash\n    python3 examples/basic_usage.py \\\n        --image_encoder=data/resnet18_image_encoder.engine \\\n        --mask_decoder=data/mobile_sam_mask_decoder.engine\n    ```\n\n    \u003e This outputs a result to ``data/basic_usage_out.jpg``\n\n\nThat's it!  From there, you can read the example code to see how\nto use NanoSAM from Python, or try running the more advanced examples below.\n\n\u003ca id=\"examples\"\u003e\u003c/a\u003e\n## 🤸 Examples\n\nNanoSAM can be applied in many creative ways.\n\n### Example 1 - Segment with bounding box\n\n\u003cimg src=\"assets/basic_usage_out.jpg\" height=\"256\"/\u003e\n\nThis example uses a known image with a fixed bounding box to control NanoSAM\nsegmentation.  
\n\nTo run the example, call\n\n```bash\npython3 examples/basic_usage.py \\\n    --image_encoder=\"data/resnet18_image_encoder.engine\" \\\n    --mask_decoder=\"data/mobile_sam_mask_decoder.engine\"\n```\n\n### Example 2 - Segment with bounding box (using OWL-ViT detections)\n\n\u003cimg src=\"assets/owl_out.png\"  height=\"256\"/\u003e\n\nThis example demonstrates using OWL-ViT to detect objects from one or more text prompts,\nand then segmenting these objects using NanoSAM.\n\nTo run the example, call\n\n```bash\npython3 examples/segment_from_owl.py \\\n    --prompt=\"A tree\" \\\n    --image_encoder=\"data/resnet18_image_encoder.engine\" \\\n    --mask_decoder=\"data/mobile_sam_mask_decoder.engine\"\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eNotes\u003c/summary\u003e\n- While OWL-ViT does not run in real time on Jetson Orin Nano (3 s/image), it is nice for experimentation\nas it allows you to detect a wide variety of objects.  You could substitute any\nother real-time pre-trained object detector to take full advantage of NanoSAM's \nspeed.\n\u003c/details\u003e\n\n### Example 3 - Segment with keypoints (offline using TRTPose detections)\n\n\u003cimg src=\"assets/pose_out.png\"  height=\"256\"/\u003e\n\nThis example demonstrates how to use human pose keypoints from [TRTPose](https://github.com/NVIDIA-AI-IOT/trt_pose) to control NanoSAM segmentation.\n\nTo run the example, call\n\n```bash\npython3 examples/segment_from_pose.py\n```\n\nThis will save an output figure to ``data/segment_from_pose_out.png``.\n\n### Example 4 - Segment with keypoints (online using TRTPose detections)\n\n\u003cimg src=\"assets/tshirt_gif_compressed_v2.gif\"  height=\"40%\" width=\"40%\"/\u003e\n\nThis example demonstrates how to use human pose to control segmentation on\na live camera feed.  
This example requires an attached display and camera.\n\nTo run the example, call\n\n```bash\npython3 examples/demo_pose_tshirt.py\n```\n\n### Example 5 - Segment and track (experimental)\n\n\u003cimg src=\"assets/mouse_gif_compressed.gif\"  height=\"40%\" width=\"40%\"/\u003e\n\nThis example demonstrates rudimentary segmentation tracking with NanoSAM.\nThis example requires an attached display and camera.\n\nTo run the example, call\n\n```bash\npython3 examples/demo_click_segment_track.py \u003cimage_encoder_engine\u003e \u003cmask_decoder_engine\u003e\n```\n\nOnce the example is running, **double click** an object you want to track.\n\n\u003cdetails\u003e\n\u003csummary\u003eNotes\u003c/summary\u003e\nThis tracking method is very simple and can get lost easily.  It is intended to\ndemonstrate creative ways you can use NanoSAM, but could likely be improved with\nmore work.\n\u003c/details\u003e\n\n\u003ca id=\"training\"\u003e\u003c/a\u003e\n## 🏋️ Training\n\nYou can train NanoSAM on a single GPU.\n\n1. Download and extract the COCO 2017 train images\n\n    ```bash\n    mkdir -p data/coco\n    cd data/coco\n    wget http://images.cocodataset.org/zips/train2017.zip\n    unzip train2017.zip\n    cd ../..\n    ```\n\n2. Build the MobileSAM image encoder (used as teacher model)\n\n    1. Export to ONNX\n\n        ```bash\n        python3 -m nanosam.tools.export_sam_image_encoder_onnx \\\n            --checkpoint=\"assets/mobile_sam.pt\" \\\n            --output=\"data/mobile_sam_image_encoder_bs16.onnx\" \\\n            --model_type=vit_t \\\n            --batch_size=16\n        ```\n\n    2. Build the TensorRT engine with batch size 16\n\n        ```bash\n        trtexec \\\n            --onnx=data/mobile_sam_image_encoder_bs16.onnx \\\n            --shapes=image:16x3x1024x1024 \\\n            --saveEngine=data/mobile_sam_image_encoder_bs16.engine\n        ```\n\n3. 
Train the NanoSAM image encoder by distilling MobileSAM\n\n    ```bash\n    python3 -m nanosam.tools.train \\\n        --images=data/coco/train2017 \\\n        --output_dir=data/models/resnet18 \\\n        --model_name=resnet18 \\\n        --teacher_image_encoder_engine=data/mobile_sam_image_encoder_bs16.engine \\\n        --batch_size=16\n    ```\n\n    \u003cdetails\u003e\n    \u003csummary\u003eNotes\u003c/summary\u003e\n    Once training starts, visualizations of progress and checkpoints will be saved to\n    the specified output directory.  You can stop training and resume from the last\n    saved checkpoint if needed.\n\n    For a list of arguments, run\n\n    ```bash\n    python3 -m nanosam.tools.train --help\n    ```\n    \u003c/details\u003e\n\n4. Export the trained NanoSAM image encoder to ONNX\n\n    ```bash\n    python3 -m nanosam.tools.export_image_encoder_onnx \\\n        --model_name=resnet18 \\\n        --checkpoint=\"data/models/resnet18/checkpoint.pth\" \\\n        --output=\"data/resnet18_image_encoder.onnx\"\n    ```\n\nYou can then build the TensorRT engine as detailed in the [Setup](#setup) section.\n\n\u003ca id=\"evaluation\"\u003e\u003c/a\u003e\n## 🧐 Evaluation\n\nYou can reproduce the accuracy results above by evaluating against COCO\nground-truth masks.\n\n\n1. Download and extract the COCO 2017 validation set.\n\n    ```bash\n    mkdir -p data/coco\n    cd data/coco\n    wget http://images.cocodataset.org/zips/val2017.zip\n    wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip\n    unzip val2017.zip\n    unzip annotations_trainval2017.zip\n    cd ../..\n    ```\n\n2. 
Compute the IoU of NanoSAM mask predictions against the ground-truth COCO mask annotations.\n\n    ```bash\n    python3 -m nanosam.tools.eval_coco \\\n        --coco_root=data/coco/val2017 \\\n        --coco_ann=data/coco/annotations/instances_val2017.json \\\n        --image_encoder=data/resnet18_image_encoder.engine \\\n        --mask_decoder=data/mobile_sam_mask_decoder.engine \\\n        --output=data/resnet18_coco_results.json\n    ```\n\n    \u003e This uses the COCO ground-truth bounding boxes as inputs to NanoSAM\n\n3. Compute the average IoU over a selected category or size\n\n    ```bash\n    python3 -m nanosam.tools.compute_eval_coco_metrics \\\n        data/resnet18_coco_results.json \\\n        --size=\"all\"\n    ```\n\n    \u003cdetails\u003e\n    \u003csummary\u003eNotes\u003c/summary\u003e\n    For all options, run ``python3 -m nanosam.tools.compute_eval_coco_metrics --help``.\n\n    To compute the mIoU for a specific category id, pass ``--category_id``:\n\n    ```bash\n    python3 -m nanosam.tools.compute_eval_coco_metrics \\\n        data/resnet18_coco_results.json \\\n        --category_id=1\n    ```\n    \u003c/details\u003e\n\n\n\u003ca id=\"acknowledgement\"\u003e\u003c/a\u003e\n## 👏 Acknowledgement\n\nThis project is enabled by the great projects below.\n\n- [SAM](https://github.com/facebookresearch/segment-anything) - The original Segment Anything model.\n- [MobileSAM](https://github.com/ChaoningZhang/MobileSAM) - The distilled Tiny ViT Segment Anything model.\n\n\u003ca id=\"see-also\"\u003e\u003c/a\u003e\n## 🔗 See also\n\n- [Jetson Introduction to Knowledge Distillation Tutorial](https://github.com/NVIDIA-AI-IOT/jetson-intro-to-distillation) - For an introduction to knowledge distillation as a model optimization technique.\n- [Jetson Generative AI Playground](https://nvidia-ai-iot.github.io/jetson-generative-ai-playground/) - For instructions and tips for using a variety of LLMs and transformers on Jetson.\n- [Jetson 
Containers](https://github.com/dusty-nv/jetson-containers) - For a variety of easily deployable and modular Jetson Containers\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia-ai-iot%2Fnanosam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvidia-ai-iot%2Fnanosam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia-ai-iot%2Fnanosam/lists"}