{"id":27915272,"url":"https://github.com/apple/ml-cubifyanything","last_synced_at":"2025-05-06T15:45:35.979Z","repository":{"id":291311098,"uuid":"952214741","full_name":"apple/ml-cubifyanything","owner":"apple","description":null,"archived":false,"fork":false,"pushed_at":"2025-03-21T01:04:20.000Z","size":1642,"stargazers_count":254,"open_issues_count":4,"forks_count":5,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-05-03T20:02:42.559Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apple.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-20T23:15:48.000Z","updated_at":"2025-05-03T12:46:48.000Z","dependencies_parsed_at":"2025-05-03T20:02:45.913Z","dependency_job_id":"d0504c43-d9fb-4554-8acb-041652576914","html_url":"https://github.com/apple/ml-cubifyanything","commit_stats":null,"previous_names":["apple/ml-cubifyanything"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-cubifyanything","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-cubifyanything/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-cubifyanything/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-cubifyanything/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apple","
download_url":"https://codeload.github.com/apple/ml-cubifyanything/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252714596,"owners_count":21792699,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-06T15:45:30.383Z","updated_at":"2025-05-06T15:45:35.635Z","avatar_url":"https://github.com/apple.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CA-1M and Cubify Anything\n\nThis repository includes the public implementation of Cubify Transformer and the\nassociated CA-1M dataset.\n\n## Paper\n\n**Apple**\n\n[Cubify Anything: Scaling Indoor 3D Object Detection](https://arxiv.org/abs/2412.04458)\n\nJustin Lazarow, David Griffiths, Gefen Kohavi, Francisco Crespo, Afshin Dehghan\n\n**CVPR 2025**\n\n![Teaser](teaser.jpg?raw=true \"Teaser\")\n\n## Repository Overview\n\nThis repository includes:\n\n1. Links to the underlying data and annotations of the CA-1M dataset.\n2. Links to released models of the Cubify Transformer (CuTR) model from the Cubify Anything paper.\n3. Basic readers and inference code to run CuTR on the provided data.\n4. Basic support for using images captured from your own device using the NeRF Capture app.\n\n## Installation\n\nWe recommend Python 3.10 and a recent 2.x build of PyTorch. We include a `requirements.txt` which should encapsulate\nall necessary dependencies. 
Please make sure you have `torch` installed first, e.g.:\n\n```\npip install torch torchvision\n```\n\nThen, within the root of the repository:\n\n```\npip install -r requirements.txt\npip install -e .\n```\n\n## CA-1M versus ARKitScenes?\n\nThis work is related to [ARKitScenes](https://machinelearning.apple.com/research/arkitscenes). We generally share\nthe same underlying captures. Some notable differences in CA-1M:\n\n1. Each scene has been exhaustively annotated with class-agnostic 3D boxes. We release these in the laser scanner's coordinate frame.\n2. For each frame in each capture, we include \"per-frame\" 3D box ground-truth which was produced using the rendering\n   process outlined in the Cubify Anything paper. These annotations are, therefore, *independent* of any pose.\n\nSome other nice things:\n\n1. We release the GT poses (registered to the laser scanner) for every frame in each capture.\n2. We release the GT depth (rendered from the laser scanner) at 512 x 384 for every frame in each capture.\n3. Each frame has already been oriented into an upright position.\n\n**NOTE:** CA-1M will only include captures which were successfully registered to the laser scanner. Therefore\nnot every capture included in ARKitScenes will be present in CA-1M.\n\n## Downloading and using the CA-1M data\n\n### Data License\n\n**All data is released under [CC-by-NC-ND](data/LICENSE_DATA).**\n\nAll links to the data are contained in `data/train.txt` and `data/val.txt`. You can use `curl` to download all files\nlisted. If you don't need the whole dataset in advance, you can either pass these\nlinks explicitly or pass the split's `txt` file itself and use the `--video-ids` argument to filter the desired videos.\n\nIf you pass the `txt` file, please note that the file will be cached under `data/[split]`.\n\n## Understanding the CA-1M data\n\nCA-1M is released in WebDataset format. Therefore, it is essentially a fancy tar archive\n*per* capture (i.e., a video). 
That is, a single archive `ca1m-[split]-XXXXXXXX.tar` corresponds to all data\nof capture XXXXXXXX.\n\nBoth splits are released at full frame rate.\n\nAll data should be neatly loaded by `CubifyAnythingDataset`. Please refer to `dataset.py` for more\nspecifics on how to read/parse data on disk. Some general pointers:\n\n```python\n[video_id]/[integer_timestamp].wide/image.png               # A 1024x768 RGB image corresponding to the main camera.\n[video_id]/[integer_timestamp].wide/depth.png               # A 256x192 depth image stored as a UInt16 (as millimeters) derived from the capture device's onboard LiDAR (ARKit depth).\n[video_id]/[integer_timestamp].wide/depth/confidence.tiff   # A 256x192 confidence image storing the [0, 1] confidence value of each depth measurement (currently unused).\n[video_id]/[integer_timestamp].wide/instances.json          # A list of GT instances alongside their 3D boxes (i.e., the result of the GT rendering process).\n[video_id]/[integer_timestamp].wide/T_gravity.json          # A rotation matrix which encodes the pitch/roll of the camera, which we assume is known (e.g., IMU).\n\n[video_id]/[integer_timestamp].gt/RT.json                   # A 4x4 (row major) JSON-encoded matrix corresponding to the registered pose in the laser-scanner space.\n[video_id]/[integer_timestamp].gt/depth.png                 # A 512x384 depth image stored as a UInt16 (as millimeters) derived from the FARO laser scanner registration.\n\n```\n\nNote that since we have already oriented the images, these dimensions may be transposed. GT depth may have 0 values which correspond to unregistered points.\n\nAn additional file is included as `[video_id]/world.gt/instances.json` which corresponds to the full world set of 3D annotations from which\nthe per-frame labels are generated. 
These instances include some structural labels: `wall`, `floor`, `ceiling`, and `door_frame`, which\nmight aid in rendering.\n\n## Visualization\n\nWe include visualization support using [rerun](https://rerun.io). Visualization should happen\nautomatically. If you do not wish to run any models and only want to visualize the data, use `--viz-only`.\n\nDuring inference, you may wish to inspect the 3D accuracy of the predictions. We support\nvisualizing the predictions on the GT point cloud (derived from FARO depth) when using\nthe `--viz-on-gt-points` flag.\n\n### Sample Command\n\n``` bash\npython tools/demo.py [path_to_downloaded_data]/ca1m-val-42898570.tar --viz-only\n```\n\n``` bash\npython tools/demo.py data/train.txt --viz-only --video-ids 45261548\n```\n\n## Skipping Frames\n\nThe data is provided at a high frame rate, so using `--every-nth-frame N` will\nprocess only every Nth frame.\n\n## Running the CuTR models\n\n**All models are released under the Apple ML Research Model Terms of Use in [LICENSE_MODEL](LICENSE_MODEL).**\n\n1. [RGB-D](https://ml-site.cdn-apple.com/models/cutr/cutr_rgbd.pth)\n2. [RGB](https://ml-site.cdn-apple.com/models/cutr/cutr_rgb.pth)\n\nModels can be provided to `demo.py` using the `--model-path` argument. We detect whether this is an RGB\nor RGB-D model and disable depth accordingly.\n\n### RGB-D\n\nThe first variant of CuTR expects an RGB image and a metric depth map. 
We train on ARKit depth,\nalthough you may find it works with other metric depth estimators as well.\n\n#### Sample Command\n\nIf your computer is MPS enabled:\n\n``` bash\npython tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgbd.pth --viz-on-gt-points --device mps\n```\n\nIf your computer is CUDA enabled:\n\n``` bash\npython tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgbd.pth --viz-on-gt-points --device cuda\n```\n\nOtherwise:\n\n``` bash\npython tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgbd.pth --viz-on-gt-points --device cpu\n```\n\n### RGB Only\n\nThe second variant of CuTR expects an RGB image alone and attempts to derive the metric scale of\nthe scene from the image itself.\n\n#### Sample Command\n\nIf your device is MPS enabled:\n\n``` bash\npython tools/demo.py data/val.txt --video-ids 42898570 --model-path [path_to_models]/cutr_rgb.pth --viz-on-gt-points --device mps\n```\n\n## Run on captures from your own device\n\nWe also have basic support for running on RGB/Depth captured from your own device.\n\n1. Make sure you have [NeRF Capture](https://apps.apple.com/au/app/nerfcapture/id6446518379) installed on your device.\n2. Start the NeRF Capture app *before* running `demo.py` (force quit and reopen if for some reason things stop working or a connection is not made).\n3. Run the normal commands but pass \"stream\" instead of the usual tar/folder path.\n4. Hit \"Send\" in the app to send a frame for inference. 
This will be visualized in the rerun window.\n\nWe will continue to print \"Still waiting\" to show liveness.\n\nIf you have a device equipped with LiDAR, you can use it with the RGB-D model; otherwise, you can\nonly use the RGB-only model.\n\n#### RGB-D (on MPS)\n\n``` bash\npython tools/demo.py stream --model-path [path_to_models]/cutr_rgbd.pth --device mps\n```\n\n#### RGB (on MPS)\n\n``` bash\npython tools/demo.py stream --model-path [path_to_models]/cutr_rgb.pth --device mps\n```\n\n## Citation\n\nIf you use CA-1M or CuTR in your research, please use the following entry:\n\n```\n@article{lazarow2024cubify,\n  title={Cubify Anything: Scaling Indoor 3D Object Detection},\n  author={Lazarow, Justin and Griffiths, David and Kohavi, Gefen and Crespo, Francisco and Dehghan, Afshin},\n  journal={arXiv preprint arXiv:2412.04458},\n  year={2024}\n}\n```\n\n## Licenses\n\nThe sample code is released under [Apple Sample Code License](LICENSE).\n\nThe data is released under [CC-by-NC-ND](data/LICENSE_DATA).\n\nThe models are released under [Apple ML Research Model Terms of Use](LICENSE_MODEL).\n\n## Acknowledgements\n\nWe use and acknowledge contributions from multiple open-source projects in [ACKNOWLEDGEMENTS](ACKNOWLEDGEMENTS).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapple%2Fml-cubifyanything","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapple%2Fml-cubifyanything","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapple%2Fml-cubifyanything/lists"}