{"id":22066388,"url":"https://github.com/jaketae/koclip","last_synced_at":"2025-05-13T02:06:50.294Z","repository":{"id":47721003,"uuid":"382674846","full_name":"jaketae/koclip","owner":"jaketae","description":"KoCLIP: Korean port of OpenAI CLIP, in Flax","archived":false,"fork":false,"pushed_at":"2023-08-22T03:35:17.000Z","size":29254,"stargazers_count":151,"open_issues_count":1,"forks_count":18,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-05-13T02:06:21.270Z","etag":null,"topics":["flax","jax","openai-clip","roberta","vision-transformer"],"latest_commit_sha":null,"homepage":"https://tinyurl.com/koclip-app","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaketae.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-03T17:32:47.000Z","updated_at":"2025-05-12T20:19:13.000Z","dependencies_parsed_at":"2024-11-30T19:38:09.612Z","dependency_job_id":null,"html_url":"https://github.com/jaketae/koclip","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Fkoclip","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Fkoclip/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Fkoclip/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Fkoclip/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaketae","download_url":"https://cod
eload.github.com/jaketae/koclip/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253856655,"owners_count":21974581,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flax","jax","openai-clip","roberta","vision-transformer"],"created_at":"2024-11-30T19:27:56.567Z","updated_at":"2025-05-13T02:06:50.274Z","avatar_url":"https://github.com/jaketae.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# KoCLIP\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1nPY78-vjBarhkYxHshM5gSSEGAm-BXsn?usp=sharing)\n\nThis repository contains code for KoCLIP, a Korean port of OpenAI's CLIP. 
This project was conducted as part of Hugging Face's [Flax/JAX community week](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md#quickstart-flax-and-jax) co-organized with Google's Flax, JAX, and Cloud teams ([announcement](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104)).\n\n\n## Quickstart\n\nTo follow along with the code snippets below, we recommend that you refer to the [Colab notebook](./inference.ipynb).\n\n### PyTorch\n\nKoCLIP is now available through the Hugging Face Auto API.\n\n```python\n\u003e\u003e\u003e from transformers import AutoProcessor, AutoModel\n\u003e\u003e\u003e processor = AutoProcessor.from_pretrained(\"koclip/koclip-base-pt\")\n\u003e\u003e\u003e model = AutoModel.from_pretrained(\"koclip/koclip-base-pt\")\n```\n\n### JAX\n\nKoCLIP can also be loaded through the current `koclip` library.\n\n```python\nimport requests\nimport jax\nfrom PIL import Image\n\nfrom koclip import load_koclip\n\nmodel, processor = load_koclip(\"koclip-base\")\n```\n\n### Inference\n\n1. Prepare image and text captions.\n\n```python\nurl = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n# Korean captions: \"cat on a sofa\", \"a dog and its owner\", \"a hamster running on a wheel\", \"car\"\ntext = [\"소파 위에 고양이\", \"강아지와 강아지 주인\", \"쳇바퀴를 달리는 햄스터\", \"자동차\"]\nimage\n```\n\n2. Run inference.\n\n```python\ninputs = processor(\n    text=text,\n    images=image,\n    return_tensors=\"jax\",  # could also be \"pt\"\n    padding=True\n)\n\noutputs = model(**inputs)\n# logits_per_image has shape (num_images, num_captions); softmax over captions\nprobs = jax.nn.softmax(outputs.logits_per_image, axis=1)\n\n# Only one image was passed, so rank the captions using its single row\nfor idx, prob in sorted(enumerate(probs[0]), key=lambda x: x[1], reverse=True):\n    print(text[idx], prob)\n```\n\n## Models\n\nWe trained a total of two models, `koclip-base` and `koclip-large`. Both models use RoBERTa-large. 
The decision to use a somewhat large language model was motivated by the intuition that annotated Korean datasets are rare; a well-trained, performant LM would be key to a good multimodal pipeline given limited data.\n\n| KoCLIP         | LM                   | ViT                            |\n| -------------- | -------------------- | ------------------------------ |\n| `koclip-base`  | `klue/roberta-large` | `openai/clip-vit-base-patch32` |\n| `koclip-large` | `klue/roberta-large` | `google/vit-large-patch16-224` |\n\n## Training\n\nKoCLIP was fine-tuned using 82,783 images from the [MSCOCO](https://cocodataset.org/#home) 2014 image captioning dataset. Korean translations of image captions were obtained from [AI Hub](https://aihub.or.kr/keti_data_board/visual_intelligence), an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the aforementioned dataset.\n\nKoCLIP was trained on a TPU v3-8 VM. Both text and image encoder backbones were loaded from their pretrained checkpoints. KoCLIP was trained to maximize the similarity score between matching pairs of images and captions.\n\n## Findings\n\nIn this section, we detail some interesting findings we made throughout the project.\n\n### Prompting\n\nWe found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Namely, instead of feeding it a single word or short phrase, wrapping the query in a template such as\n\n```\n이것은 {{}} 이다.\n```\n\n(roughly, \"This is {{}}.\") noticeably helped the model produce more reliable results. We hypothesize that this is due to the nature of captions in the MSCOCO dataset, which are most often full sentences, albeit sometimes short in length.\n\n### Multilinguality\n\nAlthough KoCLIP was trained exclusively on a Korean dataset, we found that English queries also work surprisingly well for simple words (e.g. \"dog\", \"car\"). 
This could be due to one of two reasons, or a combination thereof:\n\n* *ViT Pretraining*: The ViT backbone for `koclip-base`, `openai/clip-vit-base-patch32`, was already pretrained on an English dataset. Hence, it is possible that its embeddings still lie in a latent space where vector arithmetic can be performed with English text embeddings. One reason against this hypothesis is that `koclip-large` also demonstrates similar multilingual behavior, even though its ViT backbone, `google/vit-large-patch16-224`, was not pretrained jointly with an English text encoder.\n\n* *LM Knowledge Bleed*: `klue/roberta-large` was trained on a large corpus of Korean text in a self-supervised fashion. One might reasonably suspect that English words were included in parts of the corpus, especially given the high frequency of English word transliterations in contemporary conversational Korean. This might also explain why English queries work for both `koclip-base` and `koclip-large`. One reason against this hypothesis is that the authors of KLUE explicitly state in their paper that one criterion for text selection was that \"the corpus must be written in contemporary Korean.\"\n\nAt the end of the day, we still found it intriguing that a model that was fine-tuned exclusively on Korean managed to produce semantic embeddings from English queries that work well with ViT.\n\n## Team\n\n* [GUIJIN SON](https://github.com/guijinSON)\n* [Hansol Park](https://github.com/tree-park)\n* [Jake Tae](https://github.com/jaketae)\n* [Trent Oh](https://github.com/trent-dev)\n\n## Acknowledgement\n\nThe `FlaxHybridCLIP` model was adapted from the Hugging Face `transformers` repository, under [jax-projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects/hybrid_clip). We also express gratitude to the teams at Google for generously offering TPU VMs for this project. 
Last but not least, we thank the [KLUE team](https://github.com/KLUE-benchmark) for making pretrained Korean RoBERTa-large weights publicly available.\n\n## References\n\n```bibtex\n@misc{park2021klue,\n      title={KLUE: Korean Language Understanding Evaluation}, \n      author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},\n      year={2021},\n      eprint={2105.09680},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n```bibtex\n@misc{radford2021learning,\n      title={Learning Transferable Visual Models From Natural Language Supervision}, \n      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},\n      year={2021},\n      eprint={2103.00020},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\n```bibtex\n@misc{lin2015microsoft,\n      title={Microsoft COCO: Common Objects in Context}, \n      author={Tsung-Yi Lin and Michael Maire and Serge Belongie and Lubomir Bourdev and Ross Girshick and James Hays and Pietro Perona and Deva Ramanan and C. 
Lawrence Zitnick and Piotr Dollár},\n      year={2015},\n      eprint={1405.0312},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\n```bibtex\n@misc{srinivasan2021wit,\n      title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning}, \n      author={Krishna Srinivasan and Karthik Raman and Jiecao Chen and Michael Bendersky and Marc Najork},\n      year={2021},\n      eprint={2103.01913},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaketae%2Fkoclip","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaketae%2Fkoclip","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaketae%2Fkoclip/lists"}