{"id":13724157,"url":"https://github.com/see2sound/see2sound","last_synced_at":"2026-04-07T07:02:33.303Z","repository":{"id":243143085,"uuid":"790801095","full_name":"see2sound/see2sound","owner":"see2sound","description":"Official code for SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound","archived":false,"fork":false,"pushed_at":"2025-03-28T22:24:46.000Z","size":3368,"stargazers_count":129,"open_issues_count":4,"forks_count":10,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-08-16T06:20:45.625Z","etag":null,"topics":["audio-processing","computer-vision","see-2-sound"],"latest_commit_sha":null,"homepage":"https://see2sound.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/see2sound.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-04-23T14:54:28.000Z","updated_at":"2025-08-10T18:38:32.000Z","dependencies_parsed_at":"2024-06-07T00:23:05.845Z","dependency_job_id":"707590dc-5850-43d2-ab84-fe5b94eb25d5","html_url":"https://github.com/see2sound/see2sound","commit_stats":null,"previous_names":["see2sound/see2sound"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/see2sound/see2sound","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/see2sound%2Fsee2sound","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/see2sound%2Fsee2sound/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/see2sound%2Fsee2sound/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/see2sound%2Fsee2sound/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/see2sound","download_url":"https://codeload.github.com/see2sound/see2sound/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/see2sound%2Fsee2sound/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31503394,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T03:10:19.677Z","status":"ssl_error","status_checked_at":"2026-04-07T03:10:13.982Z","response_time":105,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-processing","computer-vision","see-2-sound"],"created_at":"2024-08-03T01:01:51.254Z","updated_at":"2026-04-07T07:02:33.287Z","avatar_url":"https://github.com/see2sound.png","language":"Python","funding_links":[],"categories":["\u003cspan id=\"audio\"\u003eAudio\u003c/span\u003e"],"sub_categories":["\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e"],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch2\u003eSEE-2-SOUND🔊: Zero-Shot Spatial Environment-to-Spatial Sound\u003c/h2\u003e\n\n[**Rishit Dagli**](https://rishitdagli.com/)\u003csup\u003e1\u003c/sup\u003e · [**Shivesh Prakash**](https://shivesh777.github.io/)\u003csup\u003e1\u003c/sup\u003e · [**Rupert 
Wu**](https://www.cs.toronto.edu/~rupert/)\u003csup\u003e1\u003c/sup\u003e · [**Houman Khosravani**](https://scholar.google.ca/citations?user=qzhk98YAAAAJ\u0026hl=en)\u003csup\u003e1,2,3\u003c/sup\u003e\n\n\u003csup\u003e1\u003c/sup\u003eUniversity of Toronto\u0026emsp;\u0026emsp;\u0026emsp;\u0026emsp;\u003csup\u003e2\u003c/sup\u003eTemerty Centre for Artificial Intelligence Research and Education in Medicine\u0026emsp;\u0026emsp;\u0026emsp;\u0026emsp;\u003csup\u003e3\u003c/sup\u003eSunnybrook Research Institute\n\n\u003ca href=\"https://twitter.com/intent/tweet?text=Wow:\u0026url=https%3A%2F%2Fgithub.com%2Fsee2sound%2Fsee2sound\"\u003e\n  \u003cimg src=\"https://img.shields.io/twitter/url?style=social\u0026url=https%3A%2F%2Fgithub.com%2Fsee2sound%2Fsee2sound\" alt=\"Twitter\"\u003e\n\u003c/a\u003e\n\u003ca href=\"https://arxiv.org/abs/2406.06612\"\u003e\u003cimg src='https://img.shields.io/badge/arXiv-See2Sound-red' alt='Paper PDF'\u003e\u003c/a\u003e\n\u003ca href='https://see2sound.github.io'\u003e\u003cimg src='https://img.shields.io/badge/Project_Page-See2Sound-green' alt='Project Page'\u003e\u003c/a\u003e\n\u003ca href=\"https://huggingface.co/spaces/rishitdagli/see-2-sound\"\u003e\u003cimg src=\"https://img.shields.io/badge/%F0%9F%A4%97%20Gradio%20Demo-Huggingface-orange\"\u003e\u003c/a\u003e\n\u003ca href='https://huggingface.co/papers/2406.06612'\u003e\u003cimg src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Paper-yellow'\u003e\u003c/a\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\nThis work presents **SEE-2-SOUND**, a method to generate spatial audio from images, animated images, and videos to accompany the visual content. 
Check out our [website](https://see2sound.github.io) to view some results of this work.\n\n![teaser](assets/teaser.png)\n\n## Installation\n\nYou could also skip this section and run this entirely in a docker container, for which you can find the instructions in [Run in Docker](#run-in-docker), or using [Gradio](#build-using-gradio) (for any HF/Gradio issues cc [@jadechoghari](https://github.com/jadechoghari) 🤗).\n\nFirst, install the pip package by running:\n\n```sh\npip install -e git+https://github.com/see2sound/see2sound.git#egg=see2sound\n```\n\nNow, install all the required packages:\n\n```sh\ngit clone https://github.com/see2sound/see2sound\ncd see2sound\npip install -r requirements.txt\n```\n\nEvaluating the code (not inference though) requires the [Visual Acoustic Matching](https://github.com/facebookresearch/visual-acoustic-matching) (AViTAR) codebase. However, due to the many changes required to run AViTAR, you should install the codebase through a [fork](https://github.com/Rishit-dagli/visual-acoustic-matching-s2s) we host. Install this by running:\n\n```sh\npip install git+https://github.com/Rishit-dagli/visual-acoustic-matching-s2s\n```\n\n\u003cdiv align=\"center\"\u003eOR\u003c/div\u003e\n\n```sh\ngit clone https://github.com/Rishit-dagli/visual-acoustic-matching-s2s\ncd visual-acoustic-matching-s2s\npip install -e .\n```\n\nCheck out the [Tips](#tips) section for tips on installing the requirements.\n\n## Overview of codebase\n\nSEE-2-SOUND consists of three main components: source estimation, audio generation, and surround sound spatial audio generation.\n\n![methods](assets/methods.png)\n\nIn the source estimation phase, the model identifies regions of interest in the input media and estimates their 3D positions on a viewing sphere. It also estimates the monocular depth map of the input image.\n\nNext, in the audio generation phase, the model generates mono audio clips for each identified region of interest, leveraging a pre-trained CoDi model. 
These audio clips are then combined with the spatial information to create a 4D representation for each region.\n\nFinally, the model generates 5.1 surround sound spatial audio by placing sound sources in a virtual room and computing Room Impulse Responses (RIRs) for each source-microphone pair. Microphones are positioned according to the 5.1 channel configuration, ensuring compatibility with prevalent audio systems and enhancing the immersive quality of the audio output.\n\nFor evaluation, we propose a new quantitative evaluation technique and also conduct a user study. We propose a new method due to the difficulty in evaluating such a system, especially in the absence of any baselines. For quantitative evaluation, we produce outputs from an image-to-audio system ([CoDi](http://arxiv.org/abs/2305.11846)), which serves as a baseline, and from our system. We then run both sets of outputs through [AViTAR](https://arxiv.org/abs/2202.06875), which edits the audio to match the visual content, and compute similarity scores between the resulting pairs of audio for each image.\n\n### inference\n\nThe `See2Sound` class has a few main methods:\n\n- `setup` that downloads and loads models into memory (in high memory mode)\n- `adjust_audio` that simulates a room and computes the spatial audio\n- `run` that puts together inference code\n\n### evaluation\n\nThe `eval_See2Sound` class has a few main methods:\n\n- `setup` to download and load models into memory (in high memory mode)\n- `generate_audio` to generate mono audio\n- `run_avitar` to run the AViTAR model\n- `compute_acoustic_similarity` to compute quantitative metrics\n\n## Usage\n\nHere is a guide on using this codebase. In general, it should be relatively quick to get started, since the code is packaged as a `pip` package.\n\nAll of the code is designed to be run from the root of this repository.\n\n### inference\n\n```py\nimport see2sound\n\n\nconfig_file_path = \"default_config.yaml\"\n\nmodel = see2sound.See2Sound(config_path = 
config_file_path)\nmodel.setup()\nmodel.run(path = \"test.png\", output_path = \"test.wav\")\n```\n\n### evaluation\n\nRun evaluation only if you want to perform quantitative evaluation with the method we propose in our work.\n\n```py\nfrom see2sound.evaluation import eval_See2Sound\n\n\n# config file, as in the inference example\nconfig_file_path = \"default_config.yaml\"\nimage_dir_path = \"path to a directory with all images\"\n\nevaluator = eval_See2Sound(config_path = config_file_path)\nevaluator.setup()\nevaluator.evaluate(image_dir_path)\n```\n\n## Run in Docker\n\nYou can run inference and evaluation in a container. This guide uses Docker, but you should be able to use any other container runtime too.\n\nStart by building the container image by running:\n\n```sh\ndocker build . -t rishitdagli/see2sound:latest\n```\n\nAlternatively, you can directly pull the prebuilt image (41 GB compressed):\n\n```sh\ndocker pull rishitdagli/see2sound:latest\n```\n\nYou can now use `docker run` and start running inference or evaluation in the container, with the environment set up and the models pre-downloaded for you.\n\n## Build using Gradio\n\nYou could also set up the app using Gradio.\n\n```sh\npip install -r gradio-req.txt\npython app.py\n```\n\n## Tips\n\nWe share some tips on running the code and reproducing our results.\n\n### on installing required packages\n\n- If you just want to run inference, we recommend using `torch==2.3.0` and this slim [`requirements.txt` file](https://huggingface.co/spaces/rishitdagli/see-2-sound/blob/main/requirements.txt).\n- You could perform the quantitative evaluation with the original [Visual Acoustic Matching](https://github.com/facebookresearch/visual-acoustic-matching) repository. However, we suggest using the [fork](https://github.com/Rishit-dagli/visual-acoustic-matching-s2s), which has some additional features that are required if you want to run our code from the `pip` package.\n- The repository has the dependency 
`tensorflow`, which is required by `speech_metrics` and `vam`. However, this is only needed for the quantitative evaluations in our work and not for inference.\n- Our codebase works with PyTorch 2.x, and all of our code and results were produced with PyTorch 2.3.0. However, our `requirements.txt` file lists PyTorch 1.13.1, since PyTorch 1.x is required for our quantitative evaluation.\n\n### on downloading models\n\n- If you are running inference, the code downloads a few artifacts: Segment Anything weights, Depth Anything weights, CoDi weights, and CLIP ViT-H Tokenizer.\n- If you are running evaluation, the code downloads a few artifacts: Segment Anything weights, Depth Anything weights, CoDi weights, CLIP ViT-H Tokenizer, and AViTAR weights.\n- The paths to all of these files can be set in the config `yaml`; the package will look for the files at those paths, or download them there. Thus, once the files are in place, inference or evaluation can run without any network connection.\n\n### on compute\n\n- We have optimized the code for, and run all of the experiments on, an A100 - 80 GB GPU. We have also tested the code on an A100 - 40 GB GPU, an H100 - 80 GB GPU, and a V100 - 32 GB GPU (run with the low memory mode), where inference and evaluation seem to work pretty fast.\n- In general, we would recommend a GPU with more than 40 GB of vRAM. You could, however, run this on a GPU with 24 GB or more vRAM in the low memory mode (which trades off peak vRAM usage against the time taken).\n- We would recommend having at least 24 GB of CPU RAM for the code to work well; ideally, we would recommend 32 GB.\n\n### on running inference\n\n- All of our experiments were run with Segment Anything ViT-H and Depth Anything ViT-L. 
However, any of the models can be replaced with smaller variants, or with different models altogether, through the config `yaml` file.\n- We would suggest running inference with at least 500 diffusion steps and setting `num_audios` to between 3 and 5.\n\n## Credits\n\nThis codebase is built on top of the following repositories, and we thank their authors for maintaining them:\n\n- [CoDi](https://github.com/microsoft/i-Code/tree/main/i-Code-V3)\n- [Segment Anything](https://github.com/facebookresearch/segment-anything)\n- [Depth Anything](https://github.com/LiheYoung/Depth-Anything/tree/main)\n- [Visual Acoustic Matching](https://github.com/facebookresearch/visual-acoustic-matching)\n\n## Citation\n\nIf you find SEE-2-SOUND helpful, please consider citing:\n\n```bibtex\n@misc{dagli2024see2sound,\n      title={SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound}, \n      author={Rishit Dagli and Shivesh Prakash and Robert Wu and Houman Khosravani},\n      year={2024},\n      eprint={2406.06612},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsee2sound%2Fsee2sound","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsee2sound%2Fsee2sound","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsee2sound%2Fsee2sound/lists"}