{"id":31756562,"url":"https://github.com/yjyddq/eoser-ass-rl","last_synced_at":"2025-10-09T19:19:23.190Z","repository":{"id":316941517,"uuid":"1065409749","full_name":"yjyddq/EOSER-ASS-RL","owner":"yjyddq","description":"Official Repository of \"Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step\"","archived":false,"fork":false,"pushed_at":"2025-10-07T20:32:18.000Z","size":15369,"stargazers_count":14,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-07T22:26:29.951Z","etag":null,"topics":["foundation-model","masked-diffusion-large-language-model","reinforcement-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yjyddq.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-27T17:07:11.000Z","updated_at":"2025-10-07T20:32:21.000Z","dependencies_parsed_at":"2025-09-27T19:36:34.959Z","dependency_job_id":null,"html_url":"https://github.com/yjyddq/EOSER-ASS-RL","commit_stats":null,"previous_names":["yjyddq/eoser-ass-rl"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/yjyddq/EOSER-ASS-RL","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yjyddq%2FEOSER-ASS-RL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yjyddq%2FEOSER-ASS-RL/tags","releases_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/yjyddq%2FEOSER-ASS-RL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yjyddq%2FEOSER-ASS-RL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yjyddq","download_url":"https://codeload.github.com/yjyddq/EOSER-ASS-RL/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yjyddq%2FEOSER-ASS-RL/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279001981,"owners_count":26083243,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["foundation-model","masked-diffusion-large-language-model","reinforcement-learning"],"created_at":"2025-10-09T19:19:20.483Z","updated_at":"2025-10-09T19:19:23.177Z","avatar_url":"https://github.com/yjyddq.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv  align=\"center\"\u003e\n    \u003ch1\u003e\u003ca href=\"https://arxiv.org/pdf/2509.23924\" target=\"_blank\"\u003eTaming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step\u003c/h1\u003e\n\n  \u003cspan style=\"color:red\"\u003e📢 \u003cstrong\u003e\u003ci\u003eIf you also engaged in the research of MDLMs or RL, we welcome your 
suggestions. Feel free to open an issue if you have any questions about the code.\n  If you are interested in our work, please star ⭐ our repository, thanks 💕.\u003c/i\u003e\u003c/strong\u003e\u003c/span\u003e\n\n  \u003ch4\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Version-1.0.0-blue.svg\" alt=\"Version\"\u003e \n    \u003cimg src=\"https://img.shields.io/badge/License-Apache_2.0-green.svg\" alt=\"License\"\u003e\n    \u003cimg src=\"https://visitor-badge.laobi.icu/badge?page_id=yjyddq.EOSER-ASS-RL\" /\u003e\n    \u003cimg src=\"https://img.shields.io/github/stars/yjyddq/EOSER-ASS-RL?style=flat-square\u0026logo=github\" alt=\"Stars\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/issues/yjyddq/EOSER-ASS-RL?color=red\" alt=\"Issues\"\u003e\n  \u003c/h4\u003e\n\u003c/div\u003e\n\n\u003cp\u003eWe propose EOS Early Rejection (EOSER) decoding and the Ascending Step-Size (ASS) scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between the rollout trajectory and the optimization trajectory. 
The experimental results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs.\u003c/p\u003e\n\n![Motivation](media/Motivation.jpg)\n\n\u003cdiv align=\"center\"\u003e\n  \u003chr width=\"100%\"\u003e\n\u003c/div\u003e\n\n## 📢 Updates\n\n* 09-30-2025: We released our [paper](https://arxiv.org/pdf/2509.23924).\n* 09-28-2025: We released the code for Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step.\n\n\u003cdiv align=\"center\"\u003e\n  \u003chr width=\"100%\"\u003e\n\u003c/div\u003e\n\n\n## ⚙️ Environment Setup\n\nTo set up the environment, run:\n```\ngit clone https://github.com/yjyddq/EOSER-ASS-RL.git\ncd EOSER-ASS-RL\n\nconda env create -f env.yml\nconda activate EOSER-ASS-RL\n```\n\n## 🚀 SFT\n\nThe code for Supervised Fine-Tuning (SFT) is sourced from [d1](https://github.com/dllm-reasoning/d1/tree/main/SFT). The SFT results can be reproduced with the following command:\n```bash\n# First go to the sft directory\ncd sft\n\nCUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file ddp_config.yaml --main_process_port 29500 --num_processes 2 sft_train.py --grad_accum_steps 4 --batch_size 1 --num_epochs 20 \n# This gives an effective batch size of 8 = 1 * 2 * 4, where 2 is the number of GPUs.\n```\n\n## 🚀 EOSER Decoding and ASS Scheduler \n\nThe code for EOSER and ASS is in `eval/generate.py`.\n\n![Decoding](media/Decoding.jpg)\n\n```\n# EOSER Decoding\ndef sampling(logits, x0, remasking, eos_id, dtype, step_idx=None, total_steps=None, eos_min_gamma=0.4, eos_max_gamma=1.0):\n        assert remasking == \"low_confidence\", f\"Expected remasking == low_confidence in sampling\"\n        p = F.softmax(logits.to(dtype), dim=-1)\n        x0_p = torch.squeeze(\n            torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1\n        )\n        # Soft, step-dependent EOS suppression\n        min_fac = 
eos_min_gamma \n        max_fac = eos_max_gamma\n        if step_idx is not None and total_steps is not None and total_steps \u003e 1:\n            t = min(max(step_idx, 0), total_steps - 1) / (total_steps - 1)\n            fac = min_fac + (max_fac - min_fac) * t  # linear schedule: strong suppression early -\u003e none late\n        else:\n            fac = max_fac\n        x0_p = torch.where(x0 == eos_id, x0_p * fac, x0_p)\n        return x0_p\n```\n```\n# ASS Scheduler\ndef get_num_transfer_tokens_ass(prompt, steps, block_sizes):\n    \"\"\"\n    Generate exponential token transfer schedule where step i transfers 2^i tokens.\n    Each block size is 2^s - 1, and total generation length is 2^L - 1.\n    \n    Args:\n        prompt: Prompt token tensor; only its batch size and device are used\n        steps: Total number of diffusion steps per block\n        block_sizes: List of block sizes; only its length and last entry are used\n    \"\"\"\n    batch_size = prompt.shape[0]\n    \n    # Calculate number of tokens to transfer at each step: 2^i for step i\n    num_transfer_tokens = torch.zeros((batch_size, steps), device=prompt.device, dtype=torch.int64)\n    \n    for i in range(steps):\n        if i == steps - 1:\n            if len(block_sizes) == 1:\n                num_transfer_tokens[:, i] = block_sizes[-1] - (2 ** i - 1)\n            elif block_sizes[-1] \u003e 2 ** i:\n                num_transfer_tokens[:, i] = block_sizes[-1]\n        else:\n            tokens_to_transfer = 2 ** i\n            num_transfer_tokens[:, i] = tokens_to_transfer\n    \n    return num_transfer_tokens\n\n# Largest integer S such that 2^S does not exceed gen_length\nS = int(torch.log2(torch.tensor(gen_length, dtype=torch.float32)).item())\n\n# Calculate dynamic block structure based on block_num\nblock_sizes = []\nblock_steps = []\nreminder = gen_length - 2 ** S\nremaining_length = gen_length\n        \nif block_num == 1:\n    block_size = remaining_length\n    remaining_length -= remaining_length\n    
block_step = [s for s in range(S)]\n    block_sizes.insert(0, block_size)\n    block_steps.insert(0, block_step)\n    # Calculate the transfer tokens based on steps and block sizes\n    num_transfer_tokens = get_num_transfer_tokens_ass(prompt, steps, block_sizes)\nelse:\n    # Start from the largest possible power\n    current_power = S - 1  # Since gen_length = 2^S, largest single block is 2^(S-1)\n    for block_idx in range(block_num - 1, -1, -1):\n        if block_idx == block_num - 1:\n            block_size = 2 ** current_power + 1 + reminder\n            remaining_length -= block_size\n            block_step = [current_power]\n            current_power -= 1\n        elif block_idx == 0:\n            # Try to distribute the sum of remaining steps into one block\n            block_size = remaining_length\n            block_step = [s for s in range(current_power + 1)]  # Use all remaining steps\n            remaining_length -= block_size\n        else:\n            block_size = 2 ** current_power\n            remaining_length -= block_size\n            block_step = [current_power]\n            current_power -= 1           \n        block_sizes.insert(0, block_size)\n        block_steps.insert(0, block_step)\n    # Calculate the transfer tokens based on steps and block sizes\n    num_transfer_tokens = get_num_transfer_tokens_ass(prompt, steps, block_sizes)\nassert remaining_length == 0, \"remaining_length must be 0\"\n```\n```\n# EOSER combined with ASS\ndef ass_sampling(logits, x0, remasking, eos_id, dtype, step_idx=None, total_steps=None, eos_min_gamma=0.01, eos_max_gamma=1.0):\n        assert remasking == \"low_confidence\", f\"Expected remasking == low_confidence in sampling\"\n        p = F.softmax(logits.to(dtype), dim=-1)\n        x0_p = torch.squeeze(\n            torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1\n        )\n        # Soft, step-dependent EOS suppression\n        min_fac = eos_min_gamma \n        max_fac = eos_max_gamma\n        
if step_idx is not None and total_steps is not None and total_steps \u003e 1:\n            t = (2 ** (step_idx+1) - 1) / 2 ** total_steps\n            fac = min_fac + (max_fac - min_fac) * t  # power-of-2 schedule: strong suppression early -\u003e none late\n        else:\n            fac = max_fac\n        x0_p = torch.where(x0 == eos_id, x0_p * fac, x0_p)\n        return x0_p\n```\n\n## 🚀 Consistency or InConsistency Trajectory GRPO\n\nThe code is inside the `cj-grpo` directory:\n\n- `cj-grpo/cj_grpo_trainer_xxx.py` contains the algorithm code of CJ-GRPO combined with various decoding strategies\n- `cj-grpo/slurm_scripts` contains the slurm scripts we used to run the CJ-GRPO experiments. Example bash script for running the experiment:\n  ```bash\n  cd cj-grpo\n  \n  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run.sh\n  ```\n\nThe code for one-step optimization starting from the partially masked prompt, i.e., d1 (from x'\u003csub\u003eS\u003c/sub\u003e to x\u003csub\u003eS\u003c/sub\u003e), is inside the `diffu-grpo` directory:\n\n- `diffu-grpo/slurm_scripts` contains the algorithm code of diffu-grpo\n- `diffu-grpo/slurm_scripts` contains the slurm scripts we used to run the one-step optimization experiments that start from the partially masked prompt. Example bash script for running the experiment:\n  ```bash\n  cd diffu-grpo\n  \n  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run.sh\n  ```\n\nThe code for one-step optimization starting from the fully masked response (from x\u003csub\u003e0\u003c/sub\u003e to x\u003csub\u003eS\u003c/sub\u003e) is inside the `one-step-grpo` directory:\n\n- `one-step-grpo/one_step_grpo_trainer.py` contains the algorithm code of one-step-grpo\n- `one-step-grpo/slurm_scripts` contains the slurm scripts we used to run the one-step optimization experiments that start from the fully masked response. 
Example bash script for running the experiment:\n  ```bash\n  cd one-step-grpo\n  \n  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run.sh\n  ```\n\n**The difference between MDLMs and AR LLMs during RL optimization (e.g., GRPO):**\n\n![CJ-GRPO](media/CJ-GRPO.jpg)\n\n**Algorithm Optimization Pipeline:**\n\n![Algorithm](media/Algorithm.jpg)\n\n\n## 🚀 Evaluation\n\nThe evaluation code is inside the `eval` directory:\n\n- Run with `bash run_eval.sh`\n- The evaluation run only saves the generations; use `python parse_and_get_acc.py` to print the accuracy.\n\n\n## 🔗 Citation\n\nIf this paper or code is useful to you, please consider citing our paper:\n\n```bibtex\n@misc{yang2025tamingmaskeddiffusionlanguage,\n      title={Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step}, \n      author={Jingyi Yang and Guanxu Chen and Xuhao Hu and Jing Shao},\n      year={2025},\n      eprint={2509.23924},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2509.23924}, \n}\n```\n\n## 🙏 Acknowledgements\n\nParts of the code are borrowed from [d1](https://github.com/dllm-reasoning/d1). Sincere thanks to the authors for their wonderful work.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyjyddq%2Feoser-ass-rl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyjyddq%2Feoser-ass-rl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyjyddq%2Feoser-ass-rl/lists"}