{"id":22988757,"url":"https://github.com/aadit3003/genai-multilingual-tti","last_synced_at":"2026-05-20T14:06:38.851Z","repository":{"id":266936801,"uuid":"899811803","full_name":"Aadit3003/genai-multilingual-tti","owner":"Aadit3003","description":"Training our own Multilingual Stable Diffusion model (French and German) on subsets of the WIT and CC-12 datasets.","archived":false,"fork":false,"pushed_at":"2024-12-14T18:34:59.000Z","size":36513,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-02T11:49:28.803Z","etag":null,"topics":["diffusion-models","multilingual","stable-diffusion"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Aadit3003.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-07T04:41:56.000Z","updated_at":"2024-12-14T18:35:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"b6dea808-89a1-4e09-b6f7-d523b251e159","html_url":"https://github.com/Aadit3003/genai-multilingual-tti","commit_stats":null,"previous_names":["aadit3003/genai-multilingual-tti"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Aadit3003/genai-multilingual-tti","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aadit3003%2Fgenai-multilingual-tti","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aadit3003%2Fgenai-multilingual-tti/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aadit3003%2Fgenai-mul
tilingual-tti/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aadit3003%2Fgenai-multilingual-tti/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Aadit3003","download_url":"https://codeload.github.com/Aadit3003/genai-multilingual-tti/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aadit3003%2Fgenai-multilingual-tti/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269713884,"owners_count":24463244,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-10T02:00:08.965Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion-models","multilingual","stable-diffusion"],"created_at":"2024-12-15T04:13:58.297Z","updated_at":"2026-05-20T14:06:38.822Z","avatar_url":"https://github.com/Aadit3003.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Do captions in different languages produce different images?: Efficiently training Multilingual Diffusion 🇬🇧🇩🇪🇫🇷\n\nIn this project, we ask: is it worth adding non-English support to monolingual text-to-image models, or can we simply get away with translating non-English prompts to English? We train a multilingual diffusion model, based on Stable Diffusion v2.1, to support German and French in addition to English. 
To do so, we first construct two high-quality training datasets for our proposed training method by filtering the [WIT dataset](https://github.com/google-research-datasets/wit) and the [Conceptual Captions 12M](https://ai.google.com/research/ConceptualCaptions/) dataset. Next, we perform two stages of training: \n* 1). Teacher Learning (to add multilingual capabilities to the CLIP ViT-H/14 text encoder model)\n* 2). Concept Alignment (to align Stable Diffusion with the new text encoder by fine-tuning the U-Net with LoRA rank 4)\n\nFinally, we test our multilingual diffusion model (which we dub **RKS-Diffusion**) and the standard Stable Diffusion v2.1 model on our high-quality WIT test subset (see results below).\n  \nThis project was completed as part of 10-623 under the guidance of Prof. Matt Gormley and Henry Chai at CMU. For more details, refer to our [poster](https://github.com/Aadit3003/genai-multilingual-tti/blob/81dd1af650e2620a808335af1d819b7823cf94db/Gen_AI_Poster_Final.pdf).\n\n## **Main Contributions**\n* Our Fine-tuned Multilingual Diffusion Model: [RKS-Diffusion](https://huggingface.co/AaditD/rks-diffusion) (Supports English, French, and German)\n* Our Fine-tuned Multilingual CLIP Text Model: [RKS-CLIP-Text-Encoder](https://huggingface.co/AaditD/rks-clip-text-encoder)\n* Our high-quality subset of WIT: [Multilingual-RKS-WIT](https://huggingface.co/datasets/AaditD/multilingual_rks) (train/test split: 6k/1.5k image-caption pairs, with equal English, German, and French representation in captions).\n\n_Note: RKS is a reference to Arceus_\n## **Results**\nFID (Fréchet Inception Distance, lower is better) and IS (Inception Score, higher is better) scores for the images generated by the two models in three languages: English (EN), German (DE), and French (FR).\n\n\u003ctable class=\"tg\"\u003e\u003cthead\u003e\n  \u003ctr\u003e\n    \u003cth class=\"tg-0pky\" rowspan=\"2\"\u003e\u003cspan style=\"font-weight:bold\"\u003eModel\u003c/span\u003e\u003c/th\u003e\n    \u003cth class=\"tg-c3ow\" colspan=\"3\"\u003e\u003cspan 
style=\"font-weight:bold\"\u003eFID(↓)\u003c/span\u003e\u003c/th\u003e\n    \u003cth class=\"tg-c3ow\" colspan=\"3\"\u003e\u003cspan style=\"font-weight:bold\"\u003eIS(↑)\u003c/span\u003e\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003cth class=\"tg-c3ow\"\u003eEN\u003c/th\u003e\n    \u003cth class=\"tg-c3ow\"\u003eDE\u003c/th\u003e\n    \u003cth class=\"tg-c3ow\"\u003eFR\u003c/th\u003e\n    \u003cth class=\"tg-c3ow\"\u003eEN\u003c/th\u003e\n    \u003cth class=\"tg-c3ow\"\u003eDE\u003c/th\u003e\n    \u003cth class=\"tg-c3ow\"\u003eFR\u003c/th\u003e\n  \u003c/tr\u003e\u003c/thead\u003e\n\u003ctbody\u003e\n  \u003ctr\u003e\n    \u003ctd class=\"tg-0pky\"\u003eStable Diffusion v2.1 (Baseline)\u003c/td\u003e\n    \u003ctd class=\"tg-6ic8\"\u003e1.08\u003c/td\u003e\n    \u003ctd class=\"tg-dvpl\"\u003e1.16\u003c/td\u003e\n    \u003ctd class=\"tg-dvpl\"\u003e1.3\u003c/td\u003e\n    \u003ctd class=\"tg-dvpl\"\u003e11.33\u003c/td\u003e\n    \u003ctd class=\"tg-6ic8\"\u003e11.45\u003c/td\u003e\n    \u003ctd class=\"tg-dvpl\"\u003e11.17\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd class=\"tg-0pky\"\u003eRKS-Diffusion (Ours)\u003c/td\u003e\n    \u003ctd class=\"tg-dvpl\"\u003e0.99\u003c/td\u003e\n    \u003ctd class=\"tg-dvpl\"\u003e1.04\u003c/td\u003e\n    \u003ctd class=\"tg-6ic8\"\u003e0.95\u003c/td\u003e\n    \u003ctd class=\"tg-dvpl\"\u003e11.73\u003c/td\u003e\n    \u003ctd class=\"tg-6ic8\"\u003e12.42\u003c/td\u003e\n    \u003ctd class=\"tg-dvpl\"\u003e11.63\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/tbody\u003e\u003c/table\u003e\n\nFor the baseline, English achieves the best FID score and, surprisingly, German gets the best IS score (perhaps due to its similarity to English). **RKS-Diffusion outperforms the baseline on all metrics for French and German, without sacrificing English performance**. 
French gets the biggest improvement in FID score, and German gets the biggest improvement in IS score.\n\n## Directory Structure\n* **data**\n    * utils\n        * ```clip.py```: CLIP Score generator (Used in wit_dataset_filtering.py)\n        * ```translator.py```: Translates German and French captions with NLLB-200 (Used in wit_dataset_filtering.py)\n    * ```cc_dataset_filtering.py```: Filter the CC-12 dataset\n    * ```wit_dataset_filtering.py```: Filter the WIT dataset\n    * final_dataset_translated.csv: The complete high-quality filtered WIT sample\n    * final_test.csv: Test split of WIT sample (Used for Evaluation)\n    * final_train.csv: Train split of WIT sample (Used for Stage-2 Training of U-Net)\n    * teacher_set.csv: Train set using CC-12 (Used for Stage-1 Training of CLIP Text Encoder)\n* **scripts**\n    * ```evaluation.py```: Evaluation (FID and IS) code for the generated images\n    * ```lora_inference.py```: Inference code for the trained RKS-Diffusion model with the trained RKS-CLIP-Text-Encoder (using pipeline)\n    * ```manual_inference.py```: Inference code for Stable Diffusion v2.1 (from scratch, i.e. 
manually performing the reverse denoising process) \n    * ```teacher_learning.py```: Training Stage-1 code for the RKS-CLIP-Text-Encoder\n    * ```train_text_to_image_lora_rks.py```: Training Stage-2 code for fine-tuning Stable Diffusion with LoRA (adapted from this blog by [Hugging Face](https://huggingface.co/blog/lora))\n    * ```lora.sh```: The hyperparameters for the LoRA fine-tuning\n    * ```visualize.py```: Code to compare and visualize the output of the same prompt in three languages with the baseline and our model\n\n## Reproduce the Results\n\nRecreate the environment (importantly, you need to install 'diffusers' from source):\n```\nconda env create --file requirements.txt -n genai\nconda activate genai\npip install git+https://github.com/huggingface/diffusers\n```\n\n### Recreate Dataset\nFirst, download data files from: [CC12](https://huggingface.co/datasets/flax-community/conceptual-12m-multilingual-marian) and [WIT](https://github.com/google-research-datasets/wit/blob/main/DATA.md)\n```\ncd data\npython cc_dataset_filtering.py\npython wit_dataset_filtering.py\n```\n\n### Training\n```\ncd scripts\npython teacher_learning.py  # Teacher Learning\nbash lora.sh  # Concept Alignment\n```\n\n### Evaluation\n```\npython manual_inference.py --output_dir \u003cbaseline_output_dir\u003e \npython lora_inference.py --checkpoint_path \u003cyour_checkpoint_path\u003e --output_dir \u003crks_output_dir\u003e \npython evaluation.py --generated_image_dir \u003cbaseline_output_dir\u003e \u003e baseline_results.txt\npython evaluation.py --generated_image_dir \u003crks_output_dir\u003e \u003e rks_diffusion_results.txt\n```\n\n### Visualize\n```\npython visualize.py --checkpoint_path \u003cyour_checkpoint_path\u003e --step_size \u003cthe step size to iterate over: 1000, 2000, ..\u003e\n```\n\n## Training Runs\nYou can find W\u0026B dashboards of our training runs here:\n* Training Stage 1: [Teacher 
Learning](https://wandb.ai/aadit/Gen-AI-Multilingual-TTI/runs/0p2fhqio/overview)\n* Training Stage 2: [Concept Alignment](https://wandb.ai/aadit/text2image-fine-tune/runs/1nkio8i8?nw=nwuseraaditd)\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faadit3003%2Fgenai-multilingual-tti","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faadit3003%2Fgenai-multilingual-tti","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faadit3003%2Fgenai-multilingual-tti/lists"}