# microsoft/unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

https://github.com/microsoft/unilm

## [aka.ms/GeneralAI](https://aka.ms/GeneralAI)
# Hiring
We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Foundation Models (aka large-scale pre-trained models), General AI, NLP, MT, Speech, Document AI, and Multimodal AI, please send your resume to [email protected].

# Foundation Architecture
### TorchScale - A Library of Foundation Architectures ([repo](https://github.com/microsoft/torchscale))

Fundamental research to develop new architectures for foundation models and AI, focusing on modeling generality and capability, as well as training stability and efficiency. A minimal usage sketch follows the list below.

> Stability - [**DeepNet**](https://github.com/microsoft/unilm/tree/master/deepnet): scaling Transformers to 1,000 Layers and beyond

> Generality - [**Foundation Transformers (Magneto)**](https://arxiv.org/abs/2210.06423): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)

> Capability - A [**Length-Extrapolatable**](https://arxiv.org/abs/2212.10554) Transformer

> Efficiency & Transferability - [**X-MoE**](https://github.com/microsoft/unilm/tree/master/xmoe): scalable & finetunable sparse Mixture-of-Experts (MoE)
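
As a sketch of how these architectures are exposed, TorchScale instantiates them through config flags. The snippet below follows the TorchScale README; the flag names (`deepnorm`, `subln`, `use_xmoe`) are assumptions from that documentation and may differ across releases:

```python
# pip install torchscale  (https://github.com/microsoft/torchscale)
from torchscale.architecture.config import EncoderConfig
from torchscale.architecture.encoder import Encoder

# Vanilla encoder, as in the TorchScale README.
model = Encoder(EncoderConfig(vocab_size=64000))

# DeepNet-style stability via DeepNorm residual scaling.
deepnet = Encoder(EncoderConfig(vocab_size=64000, deepnorm=True))

# Magneto-style generality via Sub-LayerNorm.
magneto = Encoder(EncoderConfig(vocab_size=64000, subln=True))

# X-MoE efficiency via sparse Mixture-of-Experts feed-forward layers.
xmoe = Encoder(EncoderConfig(vocab_size=64000, use_xmoe=True))
```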

### The Revolution of Model Architecture

> [**BitNet**](https://arxiv.org/abs/2310.11453): 1-bit Transformers for Large Language Models (illustrative sketch after this list)

> [**RetNet**](https://arxiv.org/abs/2307.08621): Retentive Network: A Successor to Transformer for Large Language Models

> [**LongNet**](https://arxiv.org/abs/2307.02486): Scaling Transformers to 1,000,000,000 Tokens
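
For intuition on the BitNet direction, here is a minimal, illustrative PyTorch sketch of a 1-bit linear layer: weights are binarized to ±1 with a per-tensor scale and trained with a straight-through estimator. `BitLinearSketch` is a hypothetical name, and the sketch omits the paper's activation quantization and normalization details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Module):
    """Hypothetical, simplified take on BitNet's 1-bit linear layer.

    Weights are binarized to {-1, +1} and rescaled by their mean absolute
    value; a straight-through estimator routes gradients to the latent
    full-precision weights. The paper's activation quantization and
    normalization details are omitted here.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean()             # per-tensor scaling factor
        w_bin = torch.sign(w) * scale      # 1-bit weights {-scale, +scale}
        w_ste = w + (w_bin - w).detach()   # straight-through estimator
        return F.linear(x, w_ste)

layer = BitLinearSketch(16, 32)
print(layer(torch.randn(2, 16)).shape)     # torch.Size([2, 32])
```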

# Foundation Models

### The Evolution of (M)LLM (Multimodal LLM)

> [**Kosmos-2.5**](https://github.com/microsoft/unilm/tree/master/kosmos-2.5): **A Multimodal Literate Model**

> [**Kosmos-2**](https://github.com/microsoft/unilm/tree/master/kosmos-2): **Grounding Multimodal Large Language Models to the World**

> [**Kosmos-1**](https://arxiv.org/abs/2302.14045): **A Multimodal Large Language Model (MLLM)**

> [**MetaLM**](https://github.com/microsoft/unilm/tree/master/metalm): **Language Models are General-Purpose Interfaces**

**The Big Convergence** - Large-scale self-supervised pre-training across ```tasks``` (predictive and generative), ```languages``` (100+ languages), and ```modalities``` (language, image, audio, layout/format + language, vision + language, audio + language, etc.)

### Language & Multilingual
> [**UniLM**](https://github.com/microsoft/unilm/tree/master/unilm): unified pre-training for language understanding and generation

> [**InfoXLM/XLM-E**](https://github.com/microsoft/unilm/tree/master/infoxlm): multilingual/cross-lingual pre-trained models for 100+ languages

> [**DeltaLM/mT6**](https://github.com/microsoft/unilm/tree/master/deltalm): encoder-decoder pre-training for language generation and translation for 100+ languages

> [**MiniLM**](https://github.com/microsoft/unilm/tree/master/minilm): small and fast pre-trained models for language understanding and generation

> [**AdaLM**](https://github.com/microsoft/unilm/tree/master/adalm): domain, language, and task adaptation of pre-trained models

> [**EdgeLM**](https://github.com/microsoft/unilm/tree/master/edgelm) (```NEW```): small pre-trained models on edge/client devices

> [**SimLM**](https://github.com/microsoft/unilm/tree/master/simlm) (```NEW```): large-scale pre-training for similarity matching

> [**E5**](https://github.com/microsoft/unilm/tree/master/e5) (```NEW```): text embeddings by weakly-supervised contrastive pre-training (usage sketch after this list)

> [**MiniLLM**](https://arxiv.org/abs/2306.08543) (```NEW```): Knowledge Distillation of Large Language Models
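
As a usage sketch for the E5 text embeddings above, the recipe below follows the E5 model card on Hugging Face (mean pooling plus `query:`/`passage:` prefixes); the checkpoint name `intfloat/e5-base-v2` is taken from the hub and should be adjusted to the release you use:

```python
# pip install torch transformers
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
model = AutoModel.from_pretrained("intfloat/e5-base-v2")

# E5 expects "query: " / "passage: " prefixes on input text.
texts = ["query: how do text embeddings work?",
         "passage: E5 maps text to dense vectors for retrieval."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch).last_hidden_state  # (batch, seq, hidden)

# Mean-pool over non-padding tokens, then L2-normalize.
mask = batch["attention_mask"].unsqueeze(-1)
emb = (out * mask).sum(dim=1) / mask.sum(dim=1)
emb = F.normalize(emb, p=2, dim=1)
print(emb @ emb.T)  # cosine similarities
```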

### Vision
> [**BEiT**](https://github.com/microsoft/unilm/tree/master/beit)/[**BEiT-2**](https://github.com/microsoft/unilm/tree/master/beit2): generative self-supervised pre-training for vision / BERT Pre-Training of Image Transformers (usage sketch after this list)

> [**DiT**](https://github.com/microsoft/unilm/tree/master/dit): self-supervised pre-training for Document Image Transformers

> [**TextDiffuser**](https://github.com/microsoft/unilm/tree/master/textdiffuser)/[**TextDiffuser-2**](https://github.com/microsoft/unilm/tree/master/textdiffuser-2) (```NEW```): Diffusion Models as Text Painters
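
A quick-start sketch for BEiT image classification via Hugging Face `transformers`; the checkpoint name and processor class follow recent `transformers` releases and should be checked against the installed version:

```python
# pip install torch transformers pillow
import torch
from PIL import Image
from transformers import BeitForImageClassification, BeitImageProcessor

processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

image = Image.open("example.jpg").convert("RGB")   # any RGB image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])  # ImageNet-1k label
```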

### Speech
> [**WavLM**](https://github.com/microsoft/unilm/tree/master/wavlm): speech pre-training for full stack tasks (usage sketch after this list)

> [**VALL-E**](https://github.com/microsoft/unilm/tree/master/valle): a neural codec language model for TTS
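
As a usage sketch, WavLM is also packaged in Hugging Face `transformers`; the snippet below extracts frame-level speech representations from raw 16 kHz audio (the checkpoint name is assumed from the hub):

```python
# pip install torch transformers
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")
model = WavLMModel.from_pretrained("microsoft/wavlm-base")

# One second of dummy 16 kHz audio; replace with a real waveform.
waveform = torch.randn(16000).numpy()
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, frames, hidden)
print(hidden.shape)
```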

### Multimodal (X + Language)
> [**LayoutLM**](https://github.com/microsoft/unilm/tree/master/layoutlm)/[**LayoutLMv2**](https://github.com/microsoft/unilm/tree/master/layoutlmv2)/[**LayoutLMv3**](https://github.com/microsoft/unilm/tree/master/layoutlmv3): multimodal (text + layout/format + image) **Document Foundation Model** for [Document AI](https://www.microsoft.com/en-us/research/project/document-ai/) (e.g. scanned documents, PDF, etc.); usage sketch after this list

> [**LayoutXLM**](https://github.com/microsoft/unilm/tree/master/layoutxlm): multimodal (text + layout/format + image) **Document Foundation Model** for multilingual Document AI

> [**MarkupLM**](https://github.com/microsoft/unilm/tree/master/markuplm): markup language model pre-training for visually-rich document understanding

> [**XDoc**](https://github.com/microsoft/unilm/tree/master/xdoc): unified pre-training for cross-format document understanding

> [**UniSpeech**](https://arxiv.org/abs/2101.07597): unified pre-training for self-supervised learning and supervised learning for ASR

> [**UniSpeech-SAT**](https://arxiv.org/pdf/2110.05752.pdf): universal speech representation learning with speaker-aware pre-training

> [**SpeechT5**](https://arxiv.org/abs/2110.07205): encoder-decoder pre-training for spoken language processing

> [**SpeechLM**](https://arxiv.org/abs/2209.15329): Enhanced Speech Pre-Training with Unpaired Textual Data

> [**VLMo**](https://github.com/microsoft/unilm/tree/master/vlmo): Unified vision-language pre-training

> [**VL-BEiT**](https://github.com/microsoft/unilm/tree/master/vl-beit) (```NEW```): Generative Vision-Language Pre-training - evolution of **BEiT** to multimodal

> [**BEiT-3**](https://github.com/microsoft/unilm/tree/master/beit3) (```NEW```): a general-purpose multimodal foundation model, and a major milestone of **The Big Convergence** of Large-scale Pre-training Across Tasks, Languages, and Modalities.
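
As a usage sketch for the LayoutLM family above, the snippet below runs LayoutLMv3 token classification through Hugging Face `transformers`. The checkpoint name follows the hub; the label count, words, and boxes are toy values for illustration:

```python
# pip install torch transformers pillow
import torch
from PIL import Image
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

# apply_ocr=False means we supply our own words and boxes (no Tesseract needed).
processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=7)  # toy label count

image = Image.open("invoice.png").convert("RGB")
words = ["Invoice", "Total:", "$42.00"]  # toy OCR output
boxes = [[10, 10, 90, 30], [10, 40, 70, 60], [80, 40, 150, 60]]  # 0-1000 scale

inputs = processor(image, words, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, sequence, num_labels)
print(logits.argmax(-1))
```
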
### Toolkits
> [**s2s-ft**](https://github.com/microsoft/unilm/tree/master/s2s-ft): sequence-to-sequence fine-tuning toolkit

> [**Aggressive Decoding**](https://arxiv.org/pdf/2205.10350.pdf) (```NEW```): lossless and efficient sequence-to-sequence decoding algorithm

### Applications
> [**TrOCR**](https://github.com/microsoft/unilm/tree/master/trocr): transformer-based OCR w/ pre-trained models (usage sketch after this list)

> [**LayoutReader**](https://github.com/microsoft/unilm/tree/master/layoutreader): pre-training of text and layout for reading order detection

> [**XLM-T**](https://github.com/microsoft/unilm/tree/master/xlmt): multilingual NMT w/ pretrained cross-lingual encoders
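
As a usage sketch, TrOCR checkpoints ship with Hugging Face `transformers` as a vision-encoder-decoder model; the handwritten checkpoint name below is taken from the hub:

```python
# pip install torch transformers pillow
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# TrOCR expects a cropped image of a single text line.
image = Image.open("handwriting.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```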

## Links
### LLMOps ([repo](https://github.com/microsoft/lmops))
General technology for enabling AI capabilities w/ LLMs and MLLMs.

## News
- December, 2023: [**LongNet**](https://github.com/microsoft/unilm/tree/master/longnet) and [**LongViT**](https://github.com/microsoft/unilm/tree/master/longvit) released
- [Model Release] Dec, 2023: [**TextDiffuser-2**](https://github.com/microsoft/unilm/tree/master/textdiffuser-2) models, code and [demo](https://huggingface.co/spaces/JingyeChen22/TextDiffuser-2).
- Sep, 2023: [**Kosmos-2.5**](https://arxiv.org/abs/2309.11419) - a multimodal literate model for machine reading of text-intensive images.
- [Model Release] May, 2023: [**TextDiffuser**](https://github.com/microsoft/unilm/tree/master/textdiffuser) models and code.
- [Model Release] March, 2023: [**BEiT-3**](https://github.com/microsoft/unilm/tree/master/beit3) pretrained models and code.
- March, 2023: [**Kosmos-1**](https://arxiv.org/abs/2302.14045) - a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot).
- January, 2023: [**VALL-E**](https://arxiv.org/abs/2301.02111) - a language modeling approach for text-to-speech synthesis (TTS), which achieves state-of-the-art zero-shot TTS performance. See https://aka.ms/valle for demos of our work.
- [Model Release] January, 2023: [**E5**](https://github.com/microsoft/unilm/tree/master/e5) - Text Embeddings by Weakly-Supervised Contrastive Pre-training.
- November, 2022: [**TorchScale 0.1.1**](https://github.com/microsoft/torchscale) was released!
- November, 2022: [**TrOCR**](https://arxiv.org/abs/2109.10282) was accepted by AAAI 2023.
- [Model Release] November, 2022: [**XDoc**](https://github.com/microsoft/unilm/tree/master/xdoc) **BASE** models for cross-format document understanding.
- [Model Release] September, 2022: [**TrOCR**](https://github.com/microsoft/unilm/tree/master/trocr) **BASE** and **LARGE** models for Scene Text Recognition (STR).
- [Model Release] September, 2022: [**BEiT v2**](https://github.com/microsoft/unilm/tree/master/beit2) code and pretrained models.
- August, 2022: [**BEiT-3**](https://arxiv.org/abs/2208.10442) - a general-purpose multimodal foundation model, which achieves state-of-the-art transfer performance on both vision and vision-language tasks
- July, 2022: [**SimLM**](https://github.com/microsoft/unilm/tree/master/simlm) - Large-scale self-supervised pre-training for similarity matching
- June, 2022: [**DiT**](https://arxiv.org/abs/2203.02378) and [**LayoutLMv3**](https://arxiv.org/abs/2204.08387) were accepted by ACM Multimedia 2022.
- June, 2022: [**MetaLM**](https://github.com/microsoft/unilm/tree/master/metalm) - Language models are general-purpose interfaces to foundation models (language/multilingual, vision, speech, and multimodal)
- June, 2022: [**VL-BEiT**](https://github.com/microsoft/unilm/tree/master/vl-beit) - bidirectional multimodal Transformer learned from scratch with one unified pretraining task, one shared backbone, and one-stage training, supporting both vision and vision-language tasks.
- [Model Release] June, 2022: [**LayoutLMv3 Chinese**](https://github.com/microsoft/unilm/tree/master/layoutlmv3#form-understanding-on-xfund) - Chinese version of LayoutLMv3
- [Code Release] May, 2022: [**Aggressive Decoding**](https://github.com/microsoft/unilm/tree/master/decoding) - Lossless Speedup for Seq2seq Generation
- April, 2022: **Transformers at Scale** = [DeepNet](https://arxiv.org/abs/2203.00555) + [X-MoE](https://arxiv.org/abs/2204.09179)
- [Model Release] April, 2022: [**LayoutLMv3**](https://github.com/microsoft/unilm/tree/master/layoutlmv3) - Pre-training for Document AI with Unified Text and Image Masking
- [Model Release] March, 2022: [**EdgeFormer**](https://github.com/microsoft/unilm/tree/master/edgelm) - Parameter-efficient Transformer for On-device Seq2seq Generation
- [Model Release] March, 2022: [**DiT**](https://github.com/microsoft/unilm/tree/master/dit) - Self-supervised Document Image Transformer. Demos: [Document Layout Analysis](https://huggingface.co/spaces/nielsr/dit-document-layout-analysis), [Document Image Classification](https://huggingface.co/spaces/microsoft/document-image-transformer)
- January, 2022: [**BEiT**](https://openreview.net/forum?id=p-BhZSz59o4) was accepted by **ICLR 2022 as Oral presentation** (54 out of 3391).
- [Model Release] December 16th, 2021: [**TrOCR**](https://github.com/microsoft/unilm/tree/master/trocr) **small** models for handwritten and printed texts, with 3x inference speedup.
- November 24th, 2021: [**VLMo**](https://github.com/microsoft/unilm/tree/master/vlmo) as the new SOTA on the [VQA Challenge](https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278)
- November, 2021: [Multilingual translation at scale: 10000 language pairs and beyond](https://www.microsoft.com/en-us/translator/blog/2021/11/22/multilingual-translation-at-scale-10000-language-pairs-and-beyond/)
- [Model Release] November, 2021: [**MarkupLM**](https://github.com/microsoft/unilm/tree/master/markuplm) - Pre-training for text and markup language (e.g. HTML/XML)
- [Model Release] November, 2021: [**VLMo**](https://github.com/microsoft/unilm/tree/master/vlmo) - Unified vision-language pre-training w/ [**BEiT**](https://github.com/microsoft/unilm/tree/master/beit)
- October, 2021: [**WavLM**](https://github.com/microsoft/unilm/tree/master/wavlm) Large achieves state-of-the-art performance on the [SUPERB](https://superbbenchmark.org/leaderboard) benchmark
- [Model Release] October, 2021: [**WavLM**](https://github.com/microsoft/unilm/tree/master/wavlm) - Large-scale self-supervised pre-trained models for speech.
- [Model Release] October, 2021: [**TrOCR**](https://huggingface.co/transformers/master/model_doc/trocr.html) is on [HuggingFace](https://github.com/huggingface/transformers)
- September 28th, 2021: T-ULRv5 (aka XLM-E/InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
- [Model Release] September, 2021: [**LayoutLM-cased**](https://huggingface.co/microsoft/layoutlm-base-cased) models are on [HuggingFace](https://github.com/huggingface/transformers)
- [Model Release] September, 2021: [**TrOCR**](https://github.com/microsoft/unilm/tree/master/trocr) - Transformer-based OCR w/ pre-trained [**BEiT**](https://github.com/microsoft/unilm/tree/master/beit) and RoBERTa models.
- August, 2021: [**LayoutLMv2**](https://huggingface.co/transformers/master/model_doc/layoutlmv2.html) and [**LayoutXLM**](https://huggingface.co/transformers/master/model_doc/layoutxlm.html) are on [HuggingFace](https://github.com/huggingface/transformers)
- [Model Release] August, 2021: [**LayoutReader**](https://github.com/microsoft/unilm/tree/master/layoutreader) - Built with LayoutLM to improve general reading order detection.
- [Model Release] August, 2021: [**DeltaLM**](https://github.com/microsoft/unilm/tree/master/deltalm) - Encoder-decoder pre-training for language generation and translation.
- August, 2021: [**BEiT**](https://huggingface.co/transformers/master/model_doc/beit.html) is on [HuggingFace](https://github.com/huggingface/transformers)
- [Model Release] July, 2021: [**BEiT**](https://github.com/microsoft/unilm/tree/master/beit) - Towards BERT moment for CV
- [Model Release] June, 2021: [**LayoutLMv2**](https://github.com/microsoft/unilm/tree/master/layoutlmv2), [**LayoutXLM**](https://github.com/microsoft/unilm/tree/master/layoutxlm), [**MiniLMv2**](https://github.com/microsoft/unilm/tree/master/minilm), and [**AdaLM**](https://github.com/microsoft/unilm/tree/master/adalm).
- May, 2021: [LayoutLMv2](https://github.com/microsoft/unilm/tree/master/layoutlmv2), InfoXLMv2, MiniLMv2, UniLMv3, and AdaLM were accepted by ACL 2021.
- April, 2021: [LayoutXLM](https://github.com/microsoft/unilm/tree/master/layoutxlm) extends LayoutLM to multilingual support! A multilingual form understanding benchmark [XFUND](https://github.com/doc-analysis/XFUND) is also introduced, which includes forms with human-labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
- March, 2021: [InfoXLM](https://github.com/microsoft/unilm/tree/master/infoxlm) was accepted by NAACL 2021.
- December 29th, 2020: [LayoutLMv2](https://arxiv.org/abs/2012.14740) achieves the new SOTA on a wide variety of Document AI tasks, including the [DocVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1) and [SROIE](https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3) leaderboards.
- October 8th, 2020: T-ULRv2 (aka [InfoXLM](https://arxiv.org/abs/2007.07834)) as the SOTA on the [XTREME](https://sites.research.google/xtreme) leaderboard. // [Blog](https://www.microsoft.com/en-us/research/blog/microsoft-turing-universal-language-representation-model-t-ulrv2-tops-xtreme-leaderboard/)
- September, 2020: [MiniLM](https://github.com/microsoft/unilm/tree/master/minilm) was accepted by NeurIPS 2020.
- July 16, 2020: [**InfoXLM** (Multilingual UniLM)](https://github.com/microsoft/unilm/tree/master/infoxlm) [arXiv](https://arxiv.org/pdf/2007.07834.pdf)
- June, 2020: [UniLMv2](https://github.com/microsoft/unilm/tree/master/unilm) was accepted by ICML 2020; [LayoutLM](https://github.com/microsoft/unilm/tree/master/layoutlm) was accepted by KDD 2020.
- April 5, 2020: [**Multilingual MiniLM**](https://github.com/microsoft/unilm/tree/master/minilm) released!
- September, 2019: [UniLMv1](https://github.com/microsoft/unilm/tree/master/unilm-v1) was accepted by NeurIPS 2019.

## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) project.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

### Contact Information

For help or issues using the pre-trained models, please submit a GitHub issue.

For other communications, please contact [Furu Wei](https://thegenerality.com) (`[email protected]`).