{"id":22068585,"url":"https://github.com/ChenDelong1999/subobjects","last_synced_at":"2025-07-24T06:31:49.536Z","repository":{"id":223970127,"uuid":"761621792","full_name":"ChenDelong1999/subobjects","owner":"ChenDelong1999","description":"Official repository of paper \"Subobject-level Image Tokenization\"","archived":false,"fork":false,"pushed_at":"2024-02-23T01:32:42.000Z","size":729,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-02-23T02:31:23.531Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ChenDelong1999.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-22T07:09:07.000Z","updated_at":"2024-02-23T02:31:25.354Z","dependencies_parsed_at":"2024-02-23T02:41:26.302Z","dependency_job_id":null,"html_url":"https://github.com/ChenDelong1999/subobjects","commit_stats":null,"previous_names":["chendelong1999/subobjects"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenDelong1999%2Fsubobjects","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenDelong1999%2Fsubobjects/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenDelong1999%2Fsubobjects/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ChenDelong1999%2Fsubobjects/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ChenDelong1999","download_url":"https://codeload.github.com/ChenDelong1999/subobjects/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227421313,"owners_count":17775009,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-30T20:04:10.504Z","updated_at":"2025-07-24T06:31:49.527Z","avatar_url":"https://github.com/ChenDelong1999.png","language":null,"readme":"\u003cdiv align=\"center\"\u003e\n\n## [Subobject-level Image Tokenization](https://arxiv.org/abs/2402.14327)\n\n[Delong Chen (陈德龙)](https://chendelong.world/)\n\u003cimg src=\"assets/meta_logo.png\" alt=\"Logo\" width=\"12\"\u003e\n\u003cimg src=\"assets/hkust_logo.png\" alt=\"Logo\" width=\"8\"\u003e, \u0026nbsp; \n[Samuel Cahyawijaya](https://samuelcahyawijaya.github.io/)\n\u003cimg src=\"assets/hkust_logo.png\" alt=\"Logo\" width=\"8\"\u003e, \u0026nbsp; \n[Jianfeng Liu (刘剑锋)](https://www.linkedin.com/in/jianfeng-liu-9539897b/) \n\u003cimg src=\"assets/xiaobing_logo.jpg\" alt=\"Logo\" width=\"10\"\u003e, \u0026nbsp; \n\n[Baoyuan Wang (王宝元)](https://sites.google.com/site/zjuwby/)\n\u003cimg src=\"assets/xiaobing_logo.jpg\" alt=\"Logo\" 
width=\"10\"\u003e, \u0026nbsp; \n[Pascale Fung](https://pascale.home.ece.ust.hk/)\n\u003cimg src=\"assets/meta_logo.png\" alt=\"Logo\" width=\"12\"\u003e\n\u003cimg src=\"assets/hkust_logo.png\" alt=\"Logo\" width=\"8\"\u003e \u0026nbsp; \n\n\u003cimg src=\"assets/meta_logo.png\" alt=\"Logo\" width=\"16\"\u003e Meta FAIR Paris\u0026nbsp; \u0026nbsp; \n\u003cimg src=\"assets/hkust_logo.png\" alt=\"Logo\" width=\"10\"\u003e Hong Kong University of Science and Technology \u0026nbsp; \u0026nbsp; \n\u003cimg src=\"assets/xiaobing_logo.jpg\" alt=\"Logo\" width=\"15\"\u003e Xiaobing.AI\n\n\n\u003c/div\u003e\n\n![teaser](assets/teaser.png)\n\n\n## Updates\n\n- **2025/07/04**: Our paper is accepted to **ICML 2025**. We released a [notebook](https://github.com/ChenDelong1999/subobjects/blob/main/segmentation.ipynb) for EPOC token segmentation.\n\n- **2025/03/12** (arXiv v3): We introduce a lightweight 🤗[DirectSAM-b0](https://huggingface.co/chendelong/DirectSAM-b0-1024px-sa1b-2ep-1017) (only 3.7M parameters) and combined it with the [Watershed algorithm](https://en.wikipedia.org/wiki/Watershed_(image_processing)), deriving the **E**fficient and **P**an**O**pti**C** (**EPOC**) tokenizer (EPOC = DirectSAM + Watershed). We provide both 🤗[intrinsic evaluations](https://huggingface.co/datasets/chendelong/HEIT) and extensive VLM experiments to demonstrate the advantages of adaptive image tokenization.\n\n- **2024/04/24** (arXiv v2): We updated our paper with the Direct Segment Anything Model (DirectSAM), which efficiently generates comprehensive subobject segmentations with a single forward pass! Checkout our 🎬 demo video on [YouTube](https://www.youtube.com/watch?v=tlNs7xUQ0x4) or [bilibili](https://www.bilibili.com/video/BV1yH4y1A7V3/). The pretrained DirectSAM model is released on HuggingFace: 🤗[DirectSAM-1800px-0424](https://huggingface.co/chendelong/DirectSAM-1800px-0424), and the training code is also available in this repo.\n\n- **2024/02/23** (arXiv v1): Our paper is featured in AK's 🤗[Huggingface Daily Papers](https://huggingface.co/papers/2402.14327).\n\n\n## Visualizations\n\n![compare segmentations](assets/compare_segmentations.png)\n\n![DirectSAM visualizations](assets/DirectSAM_visualizations.jpg)\n\n\n## DirectSAM Inferece\n\n- Clone the repository \n\n    ```bash\n    git clone https://github.com/ChenDelong1999/subobjects.git\n    cd subobjects\n    ```\n\n- Install dependencies\n\n    ```bash\n    conda create -n subobjects python=3.11 -y\n    conda activate subobjects\n    pip install -r requirements.txt\n    ```\n\n- Run DirectSAM on an example image\n\n    ```python\n    import requests\n    from PIL import Image\n    from transformers import AutoModelForSemanticSegmentation, AutoImageProcessor\n    from utils import inference_single_image, visualize_direct_sam_result\n\n    checkpoint = \"chendelong/DirectSAM-1800px-0424\"\n\n    image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=True)\n    model = AutoModelForSemanticSegmentation.from_pretrained(checkpoint).to('cuda').eval()\n\n    url = \"http://images.cocodataset.org/val2017/000000002149.jpg\"\n    image = Image.open(requests.get(url, stream=True).raw).convert(\"RGB\")\n\n    probs = inference_single_image(image, image_processor, model, resolution=None, pyramid_layers=0)\n    visualize_direct_sam_result(probs, image, threshold=0.25)\n    ```\n\nThe `probs` is the predicted boundary probabilities of the image, which is an ndarray of shape (height, width) between 0 and 1. 
The quality of the segmentation can be improved by increasing the input resolution and the number of pyramid layers. The two groups of figures above are generated with `resolution=3600`, `pyramid_layers=1`/`pyramid_layers=2`, and `threshold=0.03`.

Using half precision via `model.half()` speeds up inference and reduces the GPU memory requirement.


## DirectSAM Training

We provide an example script to fine-tune DirectSAM on the [ADE20K dataset](https://huggingface.co/datasets/scene_parse_150). The implementation is based on the 🤗 HuggingFace Trainer; see [this blog](https://huggingface.co/docs/transformers/tasks/semantic_segmentation) for a detailed tutorial.

The following command starts a distributed training run with 512x512 input resolution and half-precision training, which takes around 9 GB of memory per GPU.

```bash
cd DirectSAM
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 trainer.py
```

The following figures compare the segmentation results of DirectSAM before and after the above fine-tuning on ADE20K.

![DirectSAM finetuning](assets/ade20k_finetuning_visualization.jpg)


## Acknowledgements

Check out these amazing follow-up works that use our model:
- [DirectSAM-RS](https://github.com/StevenMsy/DirectSAM-RS): Prompting DirectSAM for Semantic Contour Extraction in Remote Sensing Images
- [RemoteSAM](https://github.com/1e12Leon/RemoteSAM): Towards Segment Anything for Earth Observation
- [Subobject Video Tokenization](https://arxiv.org/abs/2505.23617): Grounded Video Tokenization via Panoptic Sub-object Trajectory

If you find our work useful, please consider citing:

```bibtex
@article{chen2024subobject,
  author       = {Delong Chen and
                  Samuel Cahyawijaya and
                  Jianfeng Liu and
                  Baoyuan Wang and
                  Pascale Fung},
  title        = {Subobject-level Image Tokenization},
  journal      = {CoRR},
  volume       = {abs/2402.14327},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2402.14327},
  doi          = {10.48550/ARXIV.2402.14327},
  eprinttype   = {arXiv},
  eprint       = {2402.14327}
}
```

![DirectSAM qingming](assets/DirectSAM_qingming.jpg)

> This repository is not released by Meta. The code and models are for research purposes only.