{"id":25692708,"url":"https://github.com/janelu9/easyllm","last_synced_at":"2026-05-16T06:22:43.887Z","repository":{"id":239892404,"uuid":"663811954","full_name":"janelu9/EasyLLM","owner":"janelu9","description":"Running Large Language Model easily.","archived":false,"fork":false,"pushed_at":"2025-04-15T05:54:26.000Z","size":230566,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-24T05:49:40.849Z","etag":null,"topics":["deepseek","deepspeed","fine-tuning","llama","megatron-lm","npu","pretrain","qwen","qwen-vl"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/janelu9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-07-08T06:51:08.000Z","updated_at":"2025-04-15T05:54:30.000Z","dependencies_parsed_at":"2024-08-03T07:49:19.046Z","dependency_job_id":"4c510bf9-56d0-4ed9-a242-388c2d248fd8","html_url":"https://github.com/janelu9/EasyLLM","commit_stats":{"total_commits":350,"total_committers":3,"mean_commits":"116.66666666666667","dds":0.4485714285714286,"last_synced_commit":"bbdd0a3f2ca47a7b9b933f4760cc6154303d01e4"},"previous_names":["janelu9/easyllm","janelu9/flash-finetuning"],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/janelu9%2FEasyLLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/janelu9%2FEasyLLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/janelu9%2FEasyLLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/janelu9%2FEasyLLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/janelu9","download_url":"https://codeload.github.com/janelu9/EasyLLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250573345,"owners_count":21452345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deepseek","deepspeed","fine-tuning","llama","megatron-lm","npu","pretrain","qwen","qwen-vl"],"created_at":"2025-02-24T23:28:18.083Z","updated_at":"2026-02-11T03:13:38.530Z","avatar_url":"https://github.com/janelu9.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# *EasyLLM*\r\n\r\nTraining Large Language Model faster, easily and low-cost. 
\r\n\r\n✦ Both GPU and NPU are supported.\r\n\r\n✦ Pretrains directly on the full corpus of token ids converted by PySpark.\r\n\r\n✦ Fast fine-tuning with no redundant computation.\r\n\r\n✦ Makes PCIe as fast as NVLink for models at the 20-billion-parameter level and below.\r\n\r\n✦ Minimalist implementation of Sequence Parallelism (4D Parallelism for extra-long context).\r\n\r\n✦ High-performance full-parameter fine-tuning of Vision Language Models.\r\n\r\n✦ Low communication cost and dynamic expert balancing in Mixture of Experts training.\r\n\r\n✦ Fast Reinforcement Learning, benefiting from optimizations such as asynchronous inference and training.\r\n\r\n## Installation\r\n\r\n```shell\r\ngit clone --depth 1 https://github.com/janelu9/EasyLLM.git\r\ncd EasyLLM\r\npip wheel -e . --no-deps \u0026\u0026 pip install jllm-*-py3-none-any.whl\r\n```\r\n\r\n## Quick Start\r\n\r\n### Data Conversion\r\n\r\nConvert the raw data to token ids stored in parquet files.\r\n\r\n```shell\r\npython -m jllm.raw2ids \\\r\n    --tokenizer DeepSeek-R1 \\\r\n    -i dataset0.jsonl \\\r\n    -o dataset0_DeepSeek-R1 \\\r\n    --max_len 8193 -C\r\n```\r\n\r\n- **Pre-train** samples should be separated by *`'\\n\\n'`* in text files or be the value of the key *`'text'`* in jsonl files.\r\n- **Fine-tune** samples should have the format *`[{'system':content},{'user':content},{'assistant':content},...] `* in each row of jsonl files; the key *`'system'`* is optional. \r\n- **RLHF** samples have the format *`[index,{'user':content}] `*, where *`index`* is an integer ID.\r\n\r\n**For Vision Language Model:**\r\n\r\n```shell\r\npython -m jllm.raw2ids \\\r\n    --tokenizer Qwen2.5-VL-7B-Instruct \\\r\n    -i dataset_vl.jsonl \\\r\n    --image_path images \\\r\n    --max_len 32769\r\n```\r\n\r\nThe folder *`images`* stores all the image data.  
Format of  *`dataset_vl.jsonl`* is like:\r\n\r\n*`[{'user':['Give a description of these pictures please.\\n \u003cimage\u003e....','image0.jpg',...]},{'assistant':'This is ....'}]`*\r\n\r\n### Model Training\r\n\r\n#### Large Language Model:\r\n\r\n```shell\r\nDISTRIBUTED_ARGS=(\r\n    --nproc_per_node $GPUS_PER_NODE \r\n    --nnodes $NUM_NODES \r\n    --master_addr $MASTER_ADDR \r\n    --master_port $MASTER_PORT\r\n)\r\n\r\ntorchrun ${DISTRIBUTED_ARGS[@]} \\\r\n    -m jllm.train_pipe \\\r\n    --model DeepSeek-R1 \\\r\n    --num_train_epochs 3 \\\r\n    --train_data dataset0_DeepSeek-R1 \\\r\n    --pipe_parallel_size 16 \\\r\n    --tensor_parallel_size 8 \\\r\n    --expert_parallel_size 2 \\\r\n    --micro_batch_size 1 \\\r\n    --global_batch_size 256 \\\r\n    --partition_method 9,5 \\\r\n    --only_ckpt_model \\\r\n    --max_num_checkpoints 2 \\\r\n    --learning_rate 1e-5 \\\r\n    --checkpoint checkpoint\r\n```\r\n\r\n#### **Vision Language Model**:\r\n\r\n```shell\r\ntorchrun ${DISTRIBUTED_ARGS[@]} \\\r\n    -m jllm.train_pipe \\\r\n    --model Qwen2.5-VL-7B-Instruct \\\r\n    --num_train_epochs 3 \\\r\n    --train_data dataset_vl_Qwen2.5-VL-7B-Instruct \\\r\n    --pipe_parallel_size 4 \\\r\n    --tensor_parallel_size 4 \\\r\n    --encoder_pipe_parallel_size 2 \\\r\n    --micro_batch_size 1 \\\r\n    --global_batch_size 64 \\\r\n    --only_ckpt_model \\\r\n    --max_num_checkpoints 2 \\\r\n    --partition_method fast \\\r\n    --no_pin_memory \\\r\n    --checkpoint_grad_interval 1 \\\r\n    --checkpoint checkpoint\r\n```\r\n\r\nYou can also submit a training task via deepspeed mpi:\r\n\r\n```shell\r\n# Write an MPI-style hostfile, then pass its path to deepspeed.\r\nHOSTFILE=hostfile\r\ncat \u003e $HOSTFILE \u003c\u003cEOF\r\n10.0.0.0 slots=$GPUS_PER_NODE\r\n10.0.0.1 slots=$GPUS_PER_NODE\r\nEOF\r\ndeepspeed -H ${HOSTFILE} \\\r\n    --module jllm.train_pipe \\\r\n    ...\r\n```\r\n\r\nIf you are using shared storage, model weights from HuggingFace will be converted automatically. 
You can also do this manually when each node has its own independent storage:\r\n\r\n```shell\r\npython -m jllm.hf2ds -p 16 -t 8 -e 4 --partition_method 8,6 -m DeepSeek-R1 -o trained_model\r\n```\r\n\r\n`--partition_method 8,6` means there are 8 sub-layers in the first pipeline stage and 6 sub-layers in the last pipeline stage. In this codebase, one decoder layer contains two sub-layers (one attention layer and one MLP or MoE layer).\r\n\r\n***Note**: The arguments `train_data` and `eval_data` also accept `jsonl` files. Run `python -m jllm.train_pipe -h ` for more arguments.* \r\n\r\nGenerally, every GPU process reads its own copy of the data, which means one node with 8 GPUs needs to allocate a total of 8x CPU memory for data.  With the optimizations in this project, they need just 1x if these GPUs belong to one pipeline. **I strongly recommend training your model with the faster, lower-cost Pipeline Parallelism** rather than ZeRO. The pipeline engine can directly load and save the model's weights in HuggingFace's format. It can also load weights from a checkpoint. If you want to resume an interrupted run, do not modify any training-related configs. \r\n\r\nBy default, the engine saves checkpoints in a background process to leave more time for training. **Don't save checkpoints too frequently**, or CPU memory may run out, unless you disable background checkpointing via the argument '`--background_executor none`'.\r\n\r\nSetting `--partition_method` to `fast` will always give faster training when GPU memory is sufficient.\r\n\r\n#### **Reinforcement Learning** (GRPO):\r\n\r\n1. 
Define a reward function in a Python file, which should include a `reward_func`:\r\n\r\n```python\r\n# reward.py\r\nimport numpy as np\r\n\r\nwith open('truth.txt','r') as f:\r\n    truth = f.read().splitlines()\r\n\r\ndef reward_func(index, text=None, token_ids=None):\r\n    '''\r\n    Args:\r\n        index: int\r\n            Unique index of the training prompt.\r\n        text: List[ str ] (group_size,)\r\n            One group of responses generated by the trained actor. \r\n        token_ids: List[ List[ int ] ] (group_size,)\r\n            One group of token ids corresponding to the responses.\r\n    return:\r\n        scores: ndarray[ float16|float32 ] (group_size,)\r\n            The reward scores of this group.\r\n    '''\r\n    ## For example ##:\r\n    print('responses:', text[0])\r\n    print('truth:', truth[index])\r\n    scores = np.random.rand(len(text))\r\n    return scores\r\n```\r\n\r\n2. Start inference engines and the GRPO training task according to node ranks.\r\n\r\n```shell\r\nNUM_NODES=5\r\nGPUS_PER_NODE=8\r\nMASTER_ADDR='ip of first node'\r\nMASTER_PORT=6000\r\nRAY_ADDR='ip of last node'\r\nINFER_NODES=1\r\nINFER_START_RANK=$((NUM_NODES - INFER_NODES))\r\nINFER_GPUS=$((INFER_NODES * GPUS_PER_NODE))\r\nVLLM_TP=4\r\n\r\nif [[ $NODE_RANK -eq $INFER_START_RANK ]]; then\r\n    echo \"Starting inference node (Rank $NODE_RANK)\"\r\n    ray start --head --port 6380\r\n    python -m jllm.sync_ray $INFER_NODES # waiting for ray's workers.\r\n    python -m jllm.vllm --model Qwen3-32B \\\r\n        --max_model_len 4096 \\\r\n        --max_num_seqs 256 \\\r\n        --vllm_tp $VLLM_TP \\\r\n        --ray_gpus $INFER_GPUS \\\r\n        --vllm_mem 0.8\r\nelif [[ $NODE_RANK -gt $INFER_START_RANK ]]; then\r\n    python -m jllm.wait_port $RAY_ADDR 6380 # waiting for ray's master.\r\n    ray start --address=\"$RAY_ADDR:6380\"\r\nelse\r\n    export HCCL_IF_BASE_PORT=$((NODE_RANK * 16 + 20000)) # avoid ray's port range.\r\n    echo \"Starting training node (Rank 
$NODE_RANK)\"\r\n    echo \"Waiting for inference node to start...\"\r\n    python -m jllm.wait_port $RAY_ADDR 8000 # waiting for vllm to start.\r\n    \r\n    ray start --address=\"$RAY_ADDR:6380\" \\\r\n              --num-gpus=0 \\\r\n              --num-cpus=1 \\\r\n              --memory=$((1 * 1024**3)) \\\r\n              --object-store-memory=$((4 * 1024**3)) \\\r\n              --resources='{\"NPU\":0}'\r\n\r\n    TRAIN_NODES=$((NUM_NODES - INFER_NODES))\r\n    WORLD_SIZE=$((GPUS_PER_NODE * TRAIN_NODES))\r\n    DISTRIBUTED_ARGS=(\r\n        --nproc_per_node $GPUS_PER_NODE\r\n        --nnodes $TRAIN_NODES\r\n        --node_rank $NODE_RANK\r\n        --master_addr $MASTER_ADDR\r\n        --master_port $MASTER_PORT\r\n    )\r\n\r\n    echo \"Starting training with $TRAIN_NODES nodes\"\r\n    torchrun \"${DISTRIBUTED_ARGS[@]}\" \\\r\n        -m jllm.train_pipe \\\r\n        --model Qwen3-32B \\\r\n        --num_train_epochs 2 \\\r\n        --train_data rlhf_Qwen3-32B \\\r\n        --pipe_parallel_size 4 \\\r\n        --tensor_parallel_size 8 \\\r\n        --micro_batch_size 2 \\\r\n        --global_batch_size 2048 \\\r\n        --partition_method mem \\\r\n        --only_ckpt_model \\\r\n        --max_num_checkpoints 2 \\\r\n        --learning_rate 1e-5 \\\r\n        --checkpoint checkpoint \\\r\n        --checkpoint_grad_interval 4 \\\r\n        --rlhf \\\r\n        --num_generations 32 \\\r\n        --max_model_len 4096 \\\r\n        --vllm_sync_stage 1 \\\r\n        --ray_ip $RAY_ADDR \\\r\n        --reward_func reward.py \\\r\n        --num_vllm_engines $((INFER_GPUS / VLLM_TP))\r\n    if [[ $NODE_RANK -eq 0 ]]; then\r\n        python -c \"import requests;requests.post('http://\"$RAY_ADDR\":8000/shutdown')\"\r\n    fi\r\nfi\r\n```\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n  \u003cimg width=\"733\" height=\"500\" alt=\"image\" src=\"https://github.com/janelu9/EasyLLM/blob/main/periodic_async.png\" /\u003e\r\n  \u003cbr\u003e\r\n  \u003cem\u003eFigure 1. 
Comparison of training steps in synchronous and asynchronous systems.\u003c/em\u003e\r\n\u003c/div\u003e\r\n\r\n### Checkpoint Conversion\r\n\r\nIf the argument `--only_ckpt_model` is enabled, the engine checkpoints only the model's weights, directly in HF's format.\r\n\r\nYou can also convert the model's weights from a deepspeed checkpoint to HF's format with `jllm.train_pipe`, for example:\r\n\r\n```shell\r\nDISTRIBUTED_ARGS=(\r\n    --nproc_per_node 8 \r\n    --nnodes 32\r\n    --master_addr $MASTER_ADDR \r\n    --master_port $MASTER_PORT\r\n)\r\n\r\ntorchrun ${DISTRIBUTED_ARGS[@]} \\\r\n    --module jllm.train_pipe \\\r\n    --model DeepSeek-R1 \\\r\n    --train_data dataset0_DeepSeek-R1 \\\r\n    --pipe_parallel_size 16 \\\r\n    --tensor_parallel_size 8 \\\r\n    --expert_parallel_size 2 \\\r\n    --partition_method 9,5 \\\r\n    --num_train_epochs 0 \\\r\n    --from_ckpt checkpoint --tag 1000 \\\r\n    --output_dir output_path\r\n```\r\n\r\nIt is enough to provide the number of devices needed to cover one data-parallel replica.\r\n\r\n### Weight Merging\r\n\r\nTo concatenate the weights when `tensor_parallel_size\u003e1`:\r\n\r\n```shell\r\npython -m jllm.cat2hf \\\r\n       -C checkpoint_model \\\r\n       -H huggingface_model\r\n```\r\n\r\n## Supported Models\r\n\r\n|                       Model                        | Training Speed (tokens/s) |\r\n| :------------------------------------------------: | :-----------------------: |\r\n|                  qwen3/qwen3-moe                   |             -             |\r\n| deepseek-v3-685b (includes multi-token prediction) |             -             |\r\n|                     qwen2.5-vl                     |             -             |\r\n|                      qwen2-vl                      |             -             |\r\n|                     internvl2                      |             -             |\r\n|                     internlm2                      |             -             |\r\n|                  qwen2/qwen2-moe              
     |             -             |\r\n|                    ~~qwen-14b~~                    |     ~~80749.57(old)~~     |\r\n|                  ~~baichuan-13b~~                  |     ~~79765.50(old)~~     |\r\n|                     llama-13b                      |       92749.82(old)       |\r\n\r\n***Note**: The training speed of each model was measured on 64 NVIDIA A100-PCIE-40GB GPUs linked by 100Gb/s InfiniBand, with the bfloat16 data type and a batch size of 2048\\*2048 tokens (batch_size\\*sequence_length,  batch_size = micro_batch_size \\* gradient_accumulation_steps).*\r\n\r\n|  Model   | Training Speed (tokens/s) |\r\n| :------: | :-----------------------: |\r\n| llama-7b |         26335.232         |\r\n\r\n*8 NVIDIA A100-PCIE-40GB GPUs,  bfloat16, 2304\\*2048 tokens/batch.*\r\n\r\n|    Model    | Training Speed (tokens/s) |\r\n| :---------: | :-----------------------: |\r\n| Qwen2.5-72b |         125327.23         |\r\n\r\n*512 air-cooled **Ascend-910B-64GB NPUs**, bfloat16, 4096\\*4096 tokens/batch.*\r\n\r\n## Advanced Tutorial For Data Processing\r\n\r\nThis step is recommended especially when your data is too big to be loaded into CPU memory at once, such as during pretraining. Here are two methods.\r\n\r\n### Python\r\n\r\n#### Conversion \r\n\r\n```shell\r\npython -m jllm.raw2ids \\\r\n    --tokenizer DeepSeek-R1 \\\r\n    -i dataset0.jsonl \\\r\n    -o dataset0_DeepSeek-R1 \\\r\n    --max_len 4097 \\\r\n    --type pretain \\\r\n    -n 32768 \\\r\n    --stack\r\n```\r\n\r\n#### Shuffle\r\n\r\nIf you have multiple datasets, you shouldn't skip this step. It shuffles all the datasets globally by row, as Spark does. \r\n\r\nFirst, move all the datasets stored in parquet folders into one directory, 
such as `datasets`:\r\n\r\n```shell\r\ndatasets\r\n├── dataset0_DeepSeek-R1\r\n│   ├── dataset0-00000-00000.gzip.parquet\r\n│   ├── dataset0-00000-00001.gzip.parquet\r\n│   ├── dataset0-00001-00000.gzip.parquet\r\n│   ├── dataset0-00001-00001.gzip.parquet\r\n│   └── dataset0_info.json\r\n└── dataset1_DeepSeek-R1\r\n    ├── dataset1-00000-00000.gzip.parquet\r\n    ├── dataset1-00000-00001.gzip.parquet\r\n    ├── dataset1-00001-00000.gzip.parquet\r\n    ├── dataset1-00001-00001.gzip.parquet\r\n    └── dataset1_info.json\r\n```\r\n\r\nThen run the following command to shuffle the rows within each dataset and distribute them to new blocks.\r\n\r\n```shell\r\npython -m jllm.shuffle_datasets -d datasets -o shuffled_datasets -n 4\r\n```\r\n\r\nEvery dataset is shuffled and placed in `shuffled_datasets` as a multiple of `num_block` parquet files:\r\n\r\n```shell\r\nshuffled_datasets/\r\n├── dataset0_DeepSeek-R1-00000-00000.gzip.parquet\r\n├── dataset0_DeepSeek-R1-00000-00001.gzip.parquet\r\n├── dataset0_DeepSeek-R1-00000-00002.gzip.parquet\r\n├── dataset0_DeepSeek-R1-00000-00003.gzip.parquet\r\n├── dataset1_DeepSeek-R1-00000-00000.gzip.parquet\r\n├── dataset1_DeepSeek-R1-00000-00001.gzip.parquet\r\n├── dataset1_DeepSeek-R1-00000-00002.gzip.parquet\r\n├── dataset1_DeepSeek-R1-00000-00003.gzip.parquet\r\n├── dataset0..._info.json\r\n└── dataset1..._info.json\r\n```\r\n\r\n### PySpark\r\n\r\nYou can also use **PySpark** to do these steps. 
jllm can directly read token ids from the parquet files written by **[Spark](https://spark.apache.org)**.\r\n\r\nShuffle and convert raw `jsonl` data to token ids in `parquet` with pyspark:\r\n\r\n```shell\r\ntokenizer=\"DeepSeek-R1\"\r\nspark-submit \\\r\n    --master yarn \\\r\n    --deploy-mode cluster \\\r\n    --queue default \\\r\n    --archives hdfs://tokenizer.tgz#python_env \\\r\n    --num-executors 32 \\\r\n    --executor-memory 32G \\\r\n    --executor-cores 32 \\\r\n    --driver-memory 8G \\\r\n    --name 'raw2ids' \\\r\n    --conf spark.yarn.executor.memoryOverhead=128 \\\r\n    --conf spark.driver.maxResultSize=4G \\\r\n    --conf spark.memory.storageFraction=0.8 \\\r\n    --conf spark.sql.metadataCacheTTLSeconds=86400 \\\r\n    --conf spark.yarn.priority=100 \\\r\n    --conf spark.speculation=true \\\r\n    --conf spark.hadoop.hive.exec.dynamic.partition=true \\\r\n    --conf spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict \\\r\n    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python_env/tokenizer/bin/python \\\r\n    --files hdfs://${tokenizer}.tgz \\\r\n    --py-files hdfs://pyspark.zip \\\r\n    jllm.raw2ids_spark \\\r\n    --num_partitions 500 \\\r\n    --tokenizer ${tokenizer} \\\r\n    --max_seq_length 4097 \\\r\n    --input_path hdfs://localhost:9000/jsonl \\\r\n    --output_path hdfs://localhost:9000/parquet\r\n```\r\n\r\nThen transfer the parquet files to your training cluster's storage. 
The training data should look like:\r\n\r\n```shell\r\ntrain_data/\r\n├── part-00000-xxx.snappy.parquet\r\n├── part-00100-xxx.snappy.parquet\r\n│   ...\r\n└── data_info.json\r\n```\r\n\r\n`data_info.json` is a required file in this folder that you should create manually:\r\n\r\n```shell\r\n{\r\n  \"num_samples\": ${num_samples},\r\n  \"max_len\":  ${max_seq_length},\r\n  \"max_num_blocks\": ${max_num_blocks},\r\n  \"fields\": [\r\n    \"input_ids\",\r\n    \"cu_seqlens\"\r\n  ]\r\n}\r\n```\r\n\r\nThe values of `num_samples` and `max_num_blocks` are printed at the end of yarn's logs once the spark tasks complete successfully.\r\n\r\n## Citation\r\n\r\nIf you find EasyLLM useful or use EasyLLM's code in your research, please cite it in your publications.\r\n\r\n```bibtex\r\n@misc{EasyLLM,\r\n  author       = {Jian Lu},\r\n  title        = {EasyLLM: Training Large Language Model faster, easily and low-cost.},\r\n  year         = {2023},\r\n  publisher    = {GitHub},\r\n  journal      = {GitHub repository},\r\n  howpublished = {\\url{https://github.com/janelu9/EasyLLM.git}},\r\n}\r\n@misc{lu2025periodicasynchronyeffectivemethod,\r\n      title={Periodic Asynchrony: An Effective Method for Accelerating On-Policy Reinforcement Learning}, \r\n      author={Jian Lu},\r\n      year={2025},\r\n      eprint={2511.18871},\r\n      archivePrefix={arXiv},\r\n      primaryClass={cs.LG},\r\n      url={https://arxiv.org/abs/2511.18871}, \r\n}\r\n```\r\n## Acknowledgment\r\n\r\nThis repository benefits from [DeepSpeed](https://github.com/microsoft/DeepSpeed), [Flash-Attention](https://github.com/Dao-AILab/flash-attention.git), [vLLM](https://github.com/vllm-project/vllm), and 
[megatron_core](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/tensor_parallel).\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjanelu9%2Feasyllm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjanelu9%2Feasyllm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjanelu9%2Feasyllm/lists"}