https://github.com/zai-org/GLM-V
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- Host: GitHub
- URL: https://github.com/zai-org/GLM-V
- Owner: zai-org
- License: apache-2.0
- Created: 2025-06-28T08:44:06.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-08-27T09:25:32.000Z (about 2 months ago)
- Last Synced: 2025-08-27T18:48:00.039Z (about 2 months ago)
- Topics: image2text, reasoning, video-understanding, vlm
- Language: Python
- Homepage:
- Size: 28 MB
- Stars: 1,537
- Watchers: 13
- Forks: 78
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- ai-game-devtools - GLM-V: GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. [arXiv](https://arxiv.org/abs/2507.01006) (VLM (Visual) / LLM (LLM & Tool))
- StarryDivineSky - zai-org/GLM-V - The zai-org/GLM-V project is a multimodal reasoning system built on the GLM-4.5V and GLM-4.1V-Thinking models, aiming to improve joint processing of visual and language information through scalable reinforcement learning. Its core feature is a bidirectional interaction mechanism between the vision and language components, allowing the system to parse image content and perform logical reasoning at the same time, for example identifying object relationships in complex scenes or generating image-grounded text descriptions. It is built on a reinforcement learning framework: training on large amounts of multimodal data establishes a dynamic feedback loop between visual perception and language generation, strengthening cross-modal association and understanding, with particularly strong results on tasks that require multi-step reasoning. The scalable training architecture lets users adjust reinforcement learning parameters and reward mechanisms to match the complexity of different application scenarios. The system also supports joint modeling of visual attributes (such as color, shape, and position) with textual semantics, handling tasks including image caption generation, visual question answering, and cross-modal retrieval. The project documentation highlights its stability and reasoning depth on complex vision-language association tasks; in scenarios that require multi-step inference over contextual information, the model outperforms traditional single-modality systems. Its effectiveness on multimodal reasoning tasks has been validated on public datasets, and the project provides pretrained models and a customizable reinforcement learning module so developers can extend and optimize it for their own needs. (Multimodal large models / Resource transfer and download)
README
# GLM-V
[Read this in Chinese.](./README_zh.md)
👋 Join our WeChat and Discord communities.
📖 Check out the paper.
📍 Try online or use the API.

## Introduction
Vision-language models (VLMs) have become a key cornerstone of intelligent systems. As real-world AI tasks grow increasingly complex, VLMs urgently need to enhance reasoning capabilities beyond basic multimodal perception — improving accuracy, comprehensiveness, and intelligence — to enable complex problem solving, long-context understanding, and multimodal agents.
Through our open-source work, we aim to explore the technological frontier together with the community while empowering more developers to create exciting and innovative applications.
**This open-source repository contains our `GLM-4.5V` and `GLM-4.1V` series models.** For performance and details, see [Model Overview](#model-overview). For known issues, see [Fixed and Remaining Issues](#fixed-and-remaining-issues).
## Project Updates
- 🔥 **News**: `2025/08/11`: We released **GLM-4.5V** with significant improvements across multiple benchmarks. We also open-sourced our handcrafted **desktop assistant app** for debugging. Once connected to GLM-4.5V, it can capture visual information from your PC screen via screenshots or screen recordings. Feel free to try it out or customize it into your own multimodal assistant. Click [here](https://huggingface.co/spaces/zai-org/GLM-4.5V-Demo-App) to download the installer or [build from source](examples/vllm-chat-helper/README.md)!
- **News**: `2025/07/16`: We have open-sourced the **VLM Reward System** used to train GLM-4.1V-Thinking. View the [code repository](glmv_reward) and run it locally: `python examples/reward_system_demo.py`.
- **News**: `2025/07/01`: We released **GLM-4.1V-9B-Thinking** and its [technical report](https://arxiv.org/abs/2507.01006).

## Model Implementation Code
- GLM-4.5V model algorithm: see the full implementation in [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4v_moe).
- GLM-4.1V-9B-Thinking model algorithm: see the full implementation in [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4v).
- Both models share identical multimodal preprocessing but use different conversation templates; please take care to distinguish between them.

## Model Downloads
| Model                | Download Links                                                                                                                                      | Type             |
|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|
| GLM-4.5V             | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5V) / [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5V)                             | Hybrid Reasoning |
| GLM-4.5V-FP8         | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5V-FP8) / [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5V-FP8)                     | Hybrid Reasoning |
| GLM-4.1V-9B-Thinking | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking) / [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.1V-9B-Thinking)     | Reasoning        |
| GLM-4.1V-9B-Base     | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.1V-9B-Base) / [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.1V-9B-Base)             | Base             |

## Examples
### Grounding
GLM-4.5V provides precise grounding capabilities. Given a prompt that asks for the location of a specific object, GLM-4.5V can reason step by step and identify the bounding boxes of the target object. The query prompt supports complex descriptions of the target object as well as specified output formats, for example:
>
> - Help me to locate <expr> in the image and give me its bounding boxes.
> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description.

Here, `<expr>` is the description of the target object. The output bounding box is a quadruple $[x_1,y_1,x_2,y_2]$ composed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image width (for x) or height (for y) and scaled by 1000.
In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` are used to mark the image bounding box in the answer. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: to enclose the coordinates of the box.
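For post-processing, the snippet below is a minimal sketch (the helper name and regular expressions are our own, not part of this repository) showing how a boxed span can be extracted from a response and the 0–1000 normalized coordinates converted back to pixel values:

```python
import re

BOX_PATTERN = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.S)
QUAD_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\D+(\d+(?:\.\d+)?)\D+(\d+(?:\.\d+)?)\D+(\d+(?:\.\d+)?)")

def parse_boxes(response: str, image_width: int, image_height: int):
    """Extract boxes between the special tokens and map the 0-1000
    normalized coordinates back to pixel coordinates."""
    boxes = []
    for span in BOX_PATTERN.findall(response):
        # Bracket style may vary ([], [[]], (), <>), so only the four numbers are matched.
        for x1, y1, x2, y2 in QUAD_PATTERN.findall(span):
            boxes.append((
                round(float(x1) * image_width / 1000, 1),
                round(float(y1) * image_height / 1000, 1),
                round(float(x2) * image_width / 1000, 1),
                round(float(y2) * image_height / 1000, 1),
            ))
    return boxes

print(parse_boxes(
    "The cat is at <|begin_of_box|>[[120,85,540,690]]<|end_of_box|>.",
    image_width=1280, image_height=960,
))  # -> [(153.6, 81.6, 691.2, 662.4)]
```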
### GUI Agent
- `examples/gui-agent`: Demonstrates prompt construction and output handling for GUI Agents, including strategies for mobile, PC, and web. Prompt templates differ between GLM-4.1V and GLM-4.5V.
### Quick Demo
- `examples/vlm-helper`: A desktop assistant for GLM multimodal models (mainly GLM-4.5V, compatible with GLM-4.1V), supporting text, images, videos, PDFs, PPTs, and more. Connects to the GLM multimodal API for intelligent services across scenarios. Download the [installer](https://huggingface.co/spaces/zai-org/GLM-4.5V-Demo-App) or [build from source](examples/vlm-helper/README.md).
## Quick Start
### Environment Installation
```bash
pip install -r requirements.txt
```

+ vLLM and SGLang dependencies may conflict, so it is recommended to install only one of them in each environment.
### transformers
- `trans_infer_cli.py`: CLI for continuous conversations using `transformers` backend.
- `trans_infer_gradio.py`: Gradio web interface with multimodal input (images, videos, PDFs, PPTs) using `transformers` backend.
- `trans_infer_bench`: Academic reproduction script for `GLM-4.1V-9B-Thinking`. It forces reasoning truncation at length `8192` and requests direct answers afterward. Includes a video input example; modify it for other cases.
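As a reference for the scripts above, a minimal single-turn call through the `transformers` backend might look like the sketch below (assuming a recent `transformers` release; the image URL is a placeholder, and the generic `AutoModelForImageTextToText` entry point is used rather than the scripts' exact code):

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "zai-org/GLM-4.1V-9B-Thinking"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One user turn with an image and a question; the chat template takes care of
# the model-specific prompt format.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/demo.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, keeping the reasoning markup visible.
print(processor.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```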
### vLLM

```bash
vllm serve zai-org/GLM-4.5V \
--tensor-parallel-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.5v \
--allowed-local-media-path / \
--media-io-kwargs '{"video": {"num_frames": -1}}'
```

### SGLang
```shell
python3 -m sglang.launch_server --model-path zai-org/GLM-4.5V \
--tp-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--served-model-name glm-4.5v \
--port 8000 \
--host 0.0.0.0
```

Notes:
- We recommend using the `FA3` attention backend in SGLang for higher inference performance and lower memory usage:
  `--attention-backend fa3 --mm-attention-backend fa3 --enable-torch-compile`
  Without `FA3`, large video inference may cause out-of-memory (OOM) errors.
  We also recommend increasing `SGLANG_VLM_CACHE_SIZE_MB` (e.g., `1024`) to provide sufficient cache space for video understanding.
- When using `vLLM` or `SGLang`, thinking mode is enabled by default. To disable thinking, add:
  `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`
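
For example, a request with one image and the thinking switch disabled might look like the sketch below (assuming the server started above is reachable at `http://localhost:8000/v1`; the image URL is a placeholder):

```python
from openai import OpenAI

# Both vLLM and SGLang expose an OpenAI-compatible endpoint; the API key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.5v",  # must match --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},  # placeholder
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    # Thinking mode is on by default; omit extra_body to keep it enabled.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```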
## Model Fine-tuning

[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) already supports fine-tuning of the GLM-4.5V and GLM-4.1V-9B-Thinking models. Below is an example of dataset construction using two images. Organize your dataset into `finetune.json` in the following format; this example targets fine-tuning GLM-4.1V-9B.
```json
[
  {
    "messages": [
      {
        "content": "<image>Who are they?",
        "role": "user"
      },
      {
        "content": "<think>\nUser asked me to observe the image and find the answer. I know they are Kane and Goretzka from Bayern Munich.</think>\n<answer>They're Kane and Goretzka from Bayern Munich.</answer>",
        "role": "assistant"
      },
      {
        "content": "<image>What are they doing?",
        "role": "user"
      },
      {
        "content": "<think>\nI need to observe what these people are doing. Oh, they are celebrating on the soccer field.</think>\n<answer>They are celebrating on the soccer field.</answer>",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/1.jpg",
      "mllm_demo_data/2.jpg"
    ]
  }
]
```

1. The content inside `<think> ... </think>` will **not** be stored as conversation history or in the fine-tuning data.
2. The `<image>` tag will be replaced with the corresponding image information.
3. For the GLM-4.5V model, the `<answer>` and `</answer>` tags should be removed.
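
As a quick pre-training sanity check of `finetune.json`, the hypothetical helper below (not part of LLaMA-Factory or this repository) verifies that each sample contains as many `<image>` tags as image paths:

```python
import json

def check_image_tags(path: str = "finetune.json") -> None:
    """Warn when a sample's <image> tag count does not match its image list."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    for i, sample in enumerate(samples):
        tags = sum(msg["content"].count("<image>") for msg in sample["messages"])
        images = len(sample.get("images", []))
        if tags != images:
            print(f"sample {i}: {tags} <image> tag(s) but {images} image path(s)")

if __name__ == "__main__":
    check_image_tags()
```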
Then, you can fine-tune following the standard LLaMA-Factory procedure.

## Model Overview
### GLM-4.5V
GLM-4.5V is based on ZhipuAI’s next-generation flagship text foundation model GLM-4.5-Air (106B parameters, 12B active).
It continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance among models of the same scale on 42 public vision-language benchmarks.
It covers common tasks such as image, video, and document understanding, as well as GUI agent operations.
Beyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle diverse types of visual content, enabling full-spectrum vision reasoning, including:
- **Image reasoning** (scene understanding, complex multi-image analysis, spatial recognition)
- **Video understanding** (long video segmentation and event recognition)
- **GUI tasks** (screen reading, icon recognition, desktop operation assistance)
- **Complex chart & long document parsing** (research report analysis, information extraction)
- **Grounding** (precise visual element localization)

The model also introduces a **Thinking Mode** switch, allowing users to balance between quick responses and deep reasoning. This switch works the same way as in the `GLM-4.5` language model.
### GLM-4.1V-9B
Built on the [GLM-4-9B-0414](https://github.com/zai-org/GLM-4) foundation model, the **GLM-4.1V-9B-Thinking** model introduces a reasoning paradigm and uses RLCS (Reinforcement Learning with Curriculum Sampling) to comprehensively enhance model capabilities.
It achieves the strongest performance among 10B-level VLMs and matches or surpasses the much larger Qwen-2.5-VL-72B on 18 benchmark tasks.

We also open-sourced the base model **GLM-4.1V-9B-Base** to support researchers in exploring the limits of vision-language model capabilities.
Compared with the previous generation CogVLM2 and GLM-4V series, **GLM-4.1V-Thinking** brings:
1. The series’ first reasoning-focused model, excelling in multiple domains beyond mathematics.
2. **64k** context length support.
3. Support for **any aspect ratio** and up to **4k** image resolution.
4. A bilingual (Chinese/English) open-source version.

GLM-4.1V-9B-Thinking integrates the **Chain-of-Thought** reasoning mechanism, improving accuracy, richness, and interpretability.
It leads on 23 out of 28 benchmark tasks at the 10B parameter scale, and outperforms Qwen-2.5-VL-72B on 18 tasks despite its smaller size.
## Fixed and Remaining Issues
Since the release of GLM-4.1V, we have addressed many community-reported issues. In GLM-4.5V, common issues such as repetitive thinking and incorrect output formatting are alleviated.
However, some limitations remain:

1. In frontend code reproduction cases, the model may output raw HTML without proper markdown wrapping. There may also be character escaping issues, potentially causing rendering errors. We provide a [patch](inference/html_detector.py) to fix most cases.
2. Pure text Q&A capabilities still have room for improvement, as this release focused primarily on multimodal scenarios.
3. In some cases, the model may overthink or repeat content, especially for complex prompts.
4. Occasionally, the model may restate the answer at the end.
5. There are some perception issues, with room for improvement on tasks such as counting and identifying specific individuals.

We welcome feedback in the issues section and will address problems as quickly as possible.
## Citation
If you use this model, please cite the following paper:
```bibtex
@misc{vteam2025glm45vglm41vthinkingversatilemultimodal,
title={GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning},
author={V Team and Wenyi Hong and Wenmeng Yu and Xiaotao Gu and Guo Wang and Guobing Gan and Haomiao Tang and Jiale Cheng and Ji Qi and Junhui Ji and Lihang Pan and Shuaiqi Duan and Weihan Wang and Yan Wang and Yean Cheng and Zehai He and Zhe Su and Zhen Yang and Ziyang Pan and Aohan Zeng and Baoxu Wang and Bin Chen and Boyan Shi and Changyu Pang and Chenhui Zhang and Da Yin and Fan Yang and Guoqing Chen and Jiazheng Xu and Jiale Zhu and Jiali Chen and Jing Chen and Jinhao Chen and Jinghao Lin and Jinjiang Wang and Junjie Chen and Leqi Lei and Letian Gong and Leyi Pan and Mingdao Liu and Mingde Xu and Mingzhi Zhang and Qinkai Zheng and Sheng Yang and Shi Zhong and Shiyu Huang and Shuyuan Zhao and Siyan Xue and Shangqin Tu and Shengbiao Meng and Tianshu Zhang and Tianwei Luo and Tianxiang Hao and Tianyu Tong and Wenkai Li and Wei Jia and Xiao Liu and Xiaohan Zhang and Xin Lyu and Xinyue Fan and Xuancheng Huang and Yanling Wang and Yadong Xue and Yanfeng Wang and Yanzi Wang and Yifan An and Yifan Du and Yiming Shi and Yiheng Huang and Yilin Niu and Yuan Wang and Yuanchang Yue and Yuchen Li and Yutao Zhang and Yuting Wang and Yu Wang and Yuxuan Zhang and Zhao Xue and Zhenyu Hou and Zhengxiao Du and Zihan Wang and Peng Zhang and Debing Liu and Bin Xu and Juanzi Li and Minlie Huang and Yuxiao Dong and Jie Tang},
year={2025},
eprint={2507.01006},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.01006},
}
```