{"id":13510712,"url":"https://github.com/unum-cloud/uform","last_synced_at":"2025-05-14T05:10:40.647Z","repository":{"id":78353049,"uuid":"604556121","full_name":"unum-cloud/uform","owner":"unum-cloud","description":"Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ \u0026 🖋️","archived":false,"fork":false,"pushed_at":"2025-01-03T23:12:16.000Z","size":685,"stargazers_count":1131,"open_issues_count":14,"forks_count":66,"subscribers_count":15,"default_branch":"main","last_synced_at":"2025-05-11T14:01:59.622Z","etag":null,"topics":["bert","clip","clustering","contrastive-learning","cross-attention","huggingface-transformers","image-search","language-vision","llava","multi-lingual","multimodal","neural-network","openai","openclip","pretrained-models","pytorch","representation-learning","semantic-search","transformer","vector-search"],"latest_commit_sha":null,"homepage":"https://unum-cloud.github.io/uform/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/unum-cloud.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-02-21T10:04:40.000Z","updated_at":"2025-05-08T00:13:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"98c667d7-0cce-457d-b4a6-89c11ffaef05","html_url":"https://github.com/unum-cloud/uform","commit_stats":{"total_commits":215,"total_committers":15,"mean_commits":"14.333333333333334","dds":0.4418604651162791,"last_synced_commit":"e6c7b427f438eb9128d262bea06771c1c6b06caf"},"previous_names":[],"tags_count":40,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unum-cloud%2Fuform","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unum-cloud%2Fuform/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unum-cloud%2Fuform/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unum-cloud%2Fuform/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/unum-cloud","download_url":"https://codeload.github.com/unum-cloud/uform/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254076850,"owners_count":22010611,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","clip","clustering","contrastive-learning","cross-attention","huggingface-transformers","image-search","language-vision","llava","multi-lingual","multimodal","neural-network","openai","openclip","pretrained-models","pytorch","representation-learning","semantic-search","transformer","vector-search"],"created_at":"2024-08-01T02:01:51.077Z","updated_at":"2025-05-14T05:10:40.605Z","avatar_url":"https://github.com/unum-cloud.png","language":"Python","readme":"\u003ch1 align=\"center\"\u003eUForm\u003c/h1\u003e\n\u003ch3 align=\"center\"\u003e\nPocket-Sized Multimodal AI\u003cbr/\u003e\nFor Content Understanding and Generation\u003cbr/\u003e\n\u003c/h3\u003e\n\u003cbr/\u003e\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://discord.gg/jsMURnSFM2\"\u003e\u003cimg height=\"25\" src=\"https://github.com/unum-cloud/.github/raw/main/assets/discord.svg\" alt=\"Discord\"\u003e\u003c/a\u003e\n\u0026nbsp; \u0026nbsp; \u0026nbsp;\n\u003ca href=\"https://www.linkedin.com/company/unum-cloud/\"\u003e\u003cimg height=\"25\" src=\"https://github.com/unum-cloud/.github/raw/main/assets/linkedin.svg\" alt=\"LinkedIn\"\u003e\u003c/a\u003e\n\u0026nbsp; \u0026nbsp; \u0026nbsp;\n\u003ca href=\"https://twitter.com/unum_cloud\"\u003e\u003cimg height=\"25\" src=\"https://github.com/unum-cloud/.github/raw/main/assets/twitter.svg\" alt=\"Twitter\"\u003e\u003c/a\u003e\n\u0026nbsp; \u0026nbsp; \u0026nbsp;\n\u003ca href=\"https://unum.cloud/post\"\u003e\u003cimg height=\"25\" src=\"https://github.com/unum-cloud/.github/raw/main/assets/blog.svg\" alt=\"Blog\"\u003e\u003c/a\u003e\n\u0026nbsp; \u0026nbsp; \u0026nbsp;\n\u003ca href=\"https://github.com/unum-cloud/uform\"\u003e\u003cimg height=\"25\" src=\"https://github.com/unum-cloud/.github/raw/main/assets/github.svg\" alt=\"GitHub\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\nMultimodal Embeddings from 64 to 768 Dimensions • 1B Parameter Chat\n\u003cbr/\u003e\nShort Texts • Images • 🔜 Video Clips • 🔜 Long Documents\n\u003cbr/\u003e\nONNX • CoreML • PyTorch\n\u003cbr/\u003e\n\u003ca href=\"https://github.com/unum-cloud/uform/blob/main/python/README.md\"\u003ePython\u003c/a\u003e\n • \n\u003ca href=\"https://github.com/unum-cloud/uform/blob/main/javascript/README.md\"\u003eJavaScript\u003c/a\u003e\n • \n\u003ca href=\"https://github.com/unum-cloud/uform/blob/main/swift/README.md\"\u003eSwift\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n![UForm Chat Preview](https://github.com/ashvardanian/usearch-images/blob/main/assets/uform-gen-preview.jpg?raw=true)\n\nWelcome to UForm, a __multimodal__ AI library that's as versatile as it is efficient.\nUForm [tiny embedding models](#encoder) will help you understand and search visual and textual content across various languages.\nUForm [small generative models](#decoder), on the other hand, don't only support conversational and chat use-cases, but are great for fast image captioning and Visual Question Answering (VQA).\nWith compact __custom pre-trained transformer models__, this can run anywhere from your server farm down to your smartphone.\n\n## Features\n\n- __Tiny Embeddings__: 64-dimensional [Matryoshka][matryoshka]-style embeddings for extremely fast [search][usearch].\n- __Throughput__: Thanks to the small size, the inference speed is [2-4x faster](#speed) than competitors.\n- __Portable__: Models come with native ONNX support, making them easy to deploy on any platform.\n- __Quantization Aware__: Down-cast embeddings from `f32` to `i8` without losing much recall.\n- __Multilingual__: Trained on a balanced dataset, the recall is great across over 20 languages.\n\n[usearch]: https://github.com/unum-cloud/usearch\n[matryoshka]: https://arxiv.org/abs/2205.13147\n\n## Models\n\nFor accuracy and speed benchmarks refer to the [evaluation page](https://github.com/unum-cloud/uform/blob/main/BENCHMARKS.md).\n\n### Embedding Models\n\n\u003ctable style=\"width:100%; border-collapse:collapse;\"\u003e\n    \u003cthead\u003e\n        \u003ctr\u003e\n            \u003cth\u003eModel\u003c/th\u003e\n            \u003cth style=\"text-align:right;\"\u003eParameters\u003c/th\u003e\n            \u003cth style=\"text-align:right;\"\u003eLanguages\u003c/th\u003e\n            \u003cth style=\"text-align:right;\"\u003eArchitecture\u003c/th\u003e\n        \u003c/tr\u003e\n    \u003c/thead\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003e\u003ca href=\"https://huggingface.co/unum-cloud/uform-vl-english-large/\"\u003euform3-image-text-english-large\u003c/a\u003e\u003c/code\u003e  🆕\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e365 M\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e1\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e12 layer BERT, ViT-L/14\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003e\u003ca href=\"https://huggingface.co/unum-cloud/uform-vl-english/\"\u003euform3-image-text-english-base\u003c/a\u003e\u003c/code\u003e\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e143 M\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e1\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e4 layer BERT, ViT-B/16\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003e\u003ca href=\"https://huggingface.co/unum-cloud/uform-vl-english-small/\"\u003euform3-image-text-english-small\u003c/a\u003e\u003c/code\u003e  🆕\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e79 M\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e1\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e4 layer BERT, ViT-S/16\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003e\u003ca href=\"https://huggingface.co/unum-cloud/uform-vl-multilingual-v2/\"\u003euform3-image-text-multilingual-base\u003c/a\u003e\u003c/code\u003e\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e206M\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e21\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e12 layer BERT, ViT-B/16\u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n\u003c/table\u003e\n\n### Generative Models\n\n\u003ctable style=\"width:100%; border-collapse:collapse;\"\u003e\n    \u003cthead\u003e\n        \u003ctr\u003e\n            \u003cth\u003eModel\u003c/th\u003e\n            \u003cth style=\"text-align:right;\"\u003eParameters\u003c/th\u003e\n            \u003cth style=\"text-align:right;\"\u003ePurpose\u003c/th\u003e\n            \u003cth style=\"text-align:right;\"\u003eArchitecture\u003c/th\u003e\n        \u003c/tr\u003e\n    \u003c/thead\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003e\u003ca href=\"https://huggingface.co/unum-cloud/uform-gen2-dpo/\"\u003euform-gen2-dpo\u003c/a\u003e\u003c/code\u003e  🆕\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e1.2 B\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003eChat, Image Captioning, VQA\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003eqwen1.5-0.5B, ViT-H/14\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003e\u003ca href=\"https://huggingface.co/unum-cloud/uform-gen2-qwen-500m/\"\u003euform-gen2-qwen-500m\u003c/a\u003e\u003c/code\u003e\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e1.2 B\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003eChat, Image Captioning, VQA\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003eqwen1.5-0.5B, ViT-H/14\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003e\u003ca href=\"https://huggingface.co/unum-cloud/uform-gen/\"\u003euform-gen\u003c/a\u003e\u003c/code\u003e ⚠️\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003e1.5 B\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003eImage Captioning, VQA\u003c/td\u003e\n            \u003ctd style=\"text-align:right;\"\u003ellama-1.3B, ViT-B/16\u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n\u003c/table\u003e\n\n## Quick Start Examples\n\n### Embedding Models\n\nFirst, `pip install uform`.\nThen, load the model:\n\n```py\nfrom uform import get_model, Modality\n\nprocessors, models = get_model('unum-cloud/uform3-image-text-english-small')\n\nmodel_text = models[Modality.TEXT_ENCODER]\nmodel_image = models[Modality.IMAGE_ENCODER]\nprocessor_text = processors[Modality.TEXT_ENCODER]\nprocessor_image = processors[Modality.IMAGE_ENCODER]\n```\n\nEmbed images:\n\n```py\nimport requests\nfrom io import BytesIO\nfrom PIL import Image\n\nimage_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'\nimage = Image.open(BytesIO(requests.get(image_url).content))\nimage_data = processor_image(image)\nimage_features, image_embedding = model_image.encode(image_data, return_features=True)\n```\n\nEmbed queries:\n\n```py\ntext = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'\ntext_data = processor_text(text)\ntext_features, text_embedding = model_text.encode(text_data, return_features=True)\n```\n\nFor more details check out:\n\n- Python docs on embedding models in [python/README.md](https://github.com/unum-cloud/uform/blob/main/python/README.md#embedding-models)\n- JavaScript docs on embedding models in [javascript/README.md](https://github.com/unum-cloud/uform/blob/main/javascript/README.md#embedding-models)\n- Swift docs on embedding models in [swift/README.md](https://github.com/unum-cloud/uform/blob/main/swift/README.md#embedding-models)\n\n### Generative Models\n\nThe generative models are natively compatible with \n\n```python\nfrom transformers import AutoModel, AutoProcessor\n\nmodel = AutoModel.from_pretrained('unum-cloud/uform-gen2-dpo', trust_remote_code=True)\nprocessor = AutoProcessor.from_pretrained('unum-cloud/uform-gen2-dpo', trust_remote_code=True)\n\nprompt = 'Question or Instruction'\nimage = Image.open('image.jpg')\n\ninputs = processor(text=[prompt], images=[image], return_tensors='pt')\n\nwith torch.inference_mode():\n     output = model.generate(\n        **inputs,\n        do_sample=False,\n        use_cache=True,\n        max_new_tokens=256,\n        eos_token_id=151645,\n        pad_token_id=processor.tokenizer.pad_token_id\n    )\nprompt_len = inputs['input_ids'].shape[1]\ndecoded_text = processor.batch_decode(output[:, prompt_len:])[0]\n```\n\nFor more details check out:\n\n- Python docs on generative models in [python/README.md](https://github.com/unum-cloud/uform/blob/main/python/README.md#generative-models)\n- JavaScript docs on generative models 🔜\n- Swift docs on generative models 🔜\n\n## Technical Details\n\n### Down-casting, Quantization, Matryoshka, and Slicing\n\nDepending on the application, the embeddings can be down-casted to smaller numeric representations without losing much recall.\nSwitching from `f32` to `f16` is recommended in almost all cases, unless you are running on very old hardware without half-precision support.\nSwitching to `i8` with linear scaling is also possible, but will be noticeable in the recall on larger collections with millions of searchable entries.\nSimilarly, for higher-dimensional embeddings (512 or 768), a common strategy is to quantize them into single-bit representations for faster search.\n\n```python\nimport numpy as np\n\nf32_embedding: np.ndarray = model.encode_text(text_data, return_features=False)\nf16_embedding: np.ndarray = f32_embedding.astype(np.float16)\ni8_embedding: np.ndarray = (f32_embedding * 127).astype(np.int8)\nb1_embedding: np.ndarray = np.packbits((f32_embedding \u003e 0).astype(np.uint8))\n```\n\nAlternative approach to quantization is to use the Matryoshka embeddings, where the embeddings are sliced into smaller parts, and the search is performed in a hierarchical manner.\n\n```python\nimport numpy as np\n\nlarge_embedding: np.ndarray = model.encode_text(text_data, return_features=False)\nsmall_embedding: np.ndarray = large_embedding[:, :256]\ntiny_embedding: np.ndarray = large_embedding[:, :64]\n```\n\nBoth approaches are natively supported by the [USearch][github-usearch] vector-search engine and the [SimSIMD][github-simsimd] numerics libraries.\nWhen dealing with small collections (up to millions of entries) and looking for low-latency cosine distance calculations, you can [achieve 5x-2500x performance improvement][report-simsimd] over Torch, NumPy, SciPy, and vanilla Python using SimSIMD.\n\n```python\nfrom simsimd import cosine, hamming\n\ndistance: float = cosine(f32_embedding, f32_embedding) # 32x SciPy performance on Apple M2 CPU\ndistance: float = cosine(f16_embedding, f16_embedding) # 79x SciPy performance on Apple M2 CPU\ndistance: float = cosine(i8_embedding, i8_embedding) # 133x SciPy performance on Apple M2 CPU\ndistance: float = hamming(b1_embedding, b1_embedding) # 17x SciPy performance on Apple M2 CPU\n```\n\nSimilarly, when dealing with large collections (up to billions of entries per server) and looking for high-throughput search, you can [achieve 100x performance improvement][report-usearch] over FAISS and other vector-search solutions using USearch.\nHere are a couple of examples:\n\n```python\nfrom usearch.index import Index\n\nf32_index = Index(ndim=64, metric='cos', dtype='f32') # for Matryoshka embeddings\nf16_index = Index(ndim=64, metric='cos', dtype='f16') # for Matryoshka embeddings\ni8_index = Index(ndim=256, metric='cos', dtype='i8') # for quantized embeddings\nb1_index = Index(ndim=768, metric='hamming', dtype='b1') # for binary embeddings\n```\n\n[github-usearch]: https://github.com/unum-cloud/usearch\n[github-simsimd]: https://github.com/ashvardanian/simsimd\n[report-usearch]: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search-with-intel\n[report-simsimd]: https://ashvardanian.com/posts/python-c-assembly-comparison/\n\n### Compact Packaging\n\nPyTorch is a heavy dependency to carry, especially if you run on Edge or IoT devices.\nUsing vanilla ONNX runtime, one can significantly reduce memory consumption and deployment latency.\n\n```sh\n$ conda create -n uform_torch python=3.10 -y\n$ conda create -n uform_onnx python=3.10 -y\n$ conda activate uform_torch \u0026\u0026 pip install -e \".[torch]\" \u0026\u0026 conda deactivate\n$ conda activate uform_onnx \u0026\u0026 pip install -e \".[onnx]\" \u0026\u0026 conda deactivate\n$ du -sh $(conda info --envs | grep 'uform_torch' | awk '{print $2}')\n\u003e 5.2G    ~/conda/envs/uform_torch\n$ du -sh $(conda info --envs | grep 'uform_onnx' | awk '{print $2}')\n\u003e 461M    ~/conda/envs/uform_onnx\n```\n\nMost of that weight can be further reduced down to 100 MB for both the model and the runtime.\nYou can pick one of many supported [ONNX execution providers][onnx-providers], which includes XNNPACK, CUDA and TensorRT for Nvidia GPUs, OpenVINO on Intel, DirectML on Windows, ROCm on AMD, CoreML on Apple devices, and more to come.\n\n[onnx-providers]: https://onnxruntime.ai/docs/execution-providers/\n\n### Multimodal Chat in CLI\n\nThe generative models can be used for chat-like experiences in the command line.\nFor that, you can use the `uform-chat` CLI tool, which is available in the UForm package.\n\n```bash\n$ pip install uform\n$ uform-chat --model unum-cloud/uform-gen2-dpo --image=zebra.jpg\n$ uform-chat --model unum-cloud/uform-gen2-dpo \\\n\u003e     --image=\"https://bit.ly/3tIVg9M\" \\\n\u003e     --device=\"cuda:0\" \\\n\u003e     --fp16\n```\n","funding_links":[],"categories":["多模态大模型","Python","CLIs","pytorch"],"sub_categories":["网络服务_其他"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funum-cloud%2Fuform","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funum-cloud%2Fuform","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funum-cloud%2Fuform/lists"}