{"id":31551297,"url":"https://github.com/epfml/llm-optimizer-benchmark","last_synced_at":"2026-02-14T03:37:52.584Z","repository":{"id":312803502,"uuid":"1025808379","full_name":"epfml/llm-optimizer-benchmark","owner":"epfml","description":"Benchmarking Optimizers for LLM Pretraining","archived":false,"fork":false,"pushed_at":"2025-12-21T11:20:06.000Z","size":501,"stargazers_count":45,"open_issues_count":0,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-23T03:05:49.518Z","etag":null,"topics":["benchmarking","llm","optimizers"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2509.01440","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epfml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-24T20:47:51.000Z","updated_at":"2025-12-21T11:20:10.000Z","dependencies_parsed_at":"2025-09-02T05:38:14.214Z","dependency_job_id":"f21cbd90-2bf5-45f6-8ee0-5b146e439604","html_url":"https://github.com/epfml/llm-optimizer-benchmark","commit_stats":null,"previous_names":["epfml/llm-optimizer-benchmark"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/epfml/llm-optimizer-benchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fllm-optimizer-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fllm-optimizer-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fllm-optimizer-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fllm-optimizer-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/epfml","download_url":"https://codeload.github.com/epfml/llm-optimizer-benchmark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fllm-optimizer-benchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29434408,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-14T03:34:37.767Z","status":"ssl_error","status_checked_at":"2026-02-14T03:34:09.092Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarking","llm","optimizers"],"created_at":"2025-10-04T18:05:25.753Z","updated_at":"2026-02-14T03:37:52.575Z","avatar_url":"https://github.com/epfml.png","language":"Python","readme":"# Codebase: \"Benchmarking Optimizers for Large Language Model Pretraining\"\n[![arXiv](https://img.shields.io/badge/arXiv-2401.06766-b31b1b.svg)](https://arxiv.org/abs/2509.01440)\n[![BibTeX](https://img.shields.io/badge/BibTeX-Citation-green)](#contact--reference)\n\nThe code is largely based on our framework [llm-baselines](https://github.com/epfml/llm-baselines) to do research on training LLMs as an extension of [nanoGPT](https://github.com/karpathy/nanogpt).\nSee the updates regarding our codebase and repo [here](#news-).\n\nThis code comes jointly with reference:\n\n\u003e Andrei Semenov, Matteo Pagliardini, Martin Jaggi.\n\nDate: September 2025\n\n**Abstract:**\n\u003e The recent development of Large Language Models (LLMs) has been accompanied by an effervescence of novel ideas and methods to better optimize the loss of deep learning models. Claims from those methods are myriad: from faster convergence to removing reliance on certain hyperparameters. However, the diverse experimental protocols used to validate these claims make direct comparisons between methods challenging. This study presents a comprehensive evaluation of recent optimization techniques across standardized LLM pretraining scenarios, systematically varying model size, batch size, and training duration. Through careful tuning of each method, we provide guidance to practitioners on which optimizer is best suited for each scenario. For researchers, our work highlights promising directions for future optimization research. Finally, by releasing our code and making all experiments fully reproducible, we hope our efforts can help the development and rigorous benchmarking of future methods.\n\n## News 🔔\n\n* **12/2025:** The EurIPS 2025 poster is available [here](https://andron00e.github.io/uploads/llm-optimizer-benchmark-eurips25.pdf).\n* **11/2025:** Added muP (for both GPT and Llama configurations), uniform and exponential weight averaging, special initialization of the MoE router. More informative logging during training, including RMS and angular updates of different layers. Added an option to train models with untied embeds. Added benchmarks for evaluating downstream performance, e.g., hellaswag, arc_challenge, gsm8k...\n* **10/2025:** [@Andron00e](https://github.com/Andron00e) will present this work at [EurIPS 2025](https://eurips.cc/) and at the workshop on [Benchmarking in AI](https://sites.google.com/view/benchmarking-and-evaluating-ai) in Copenhagen.\n\n## Quickstart \n\nCreate a conda environment and install dependencies:\n\n```\nconda create -n env python=3.10\nconda activate env\npip install -r requirements.txt\n```\n\nRun a simple training on the SlimPajama 6B dataset:\n\n```sh\npython ./src/main.py --config_format base --model llama\n```\n\nThe above command trains a 123.59M parameters model with the Llama-style architecture. 
## Reproducibility

We [provide](https://github.com/epfml/llm-optimizer-benchmark/tree/dev/scripts) scripts for reproducing our benchmarking results for 124M, 210M, and 720M dense Llama-based models, and for 520M MoEs.
Set up [wandb logging](#using-wandb) and run those scripts to obtain the results shown below.

<p align="center">
  <img src="assets/720m_losses_1.png" alt="SF, Signum, Lion, Sophia" width="30%" style="display:inline-block; margin: 5px;"/>
  <img src="assets/720m_losses_2.png" alt="Prod, ADOPT, SOAP, AdamW" width="30%" style="display:inline-block; margin: 5px;"/>
  <img src="assets/720m_losses_3.png" alt="Top 3" width="30%" style="display:inline-block; margin: 5px;"/>
</p>

**Figure:** results for 720M Llama-style models trained with a batch size of 1M tokens.

<p align="center">
  <img src="assets/moe_losses_1.png" alt="Sophia, SF, Signum, MARS" width="30%" style="display:inline-block; margin: 5px;"/>
  <img src="assets/moe_losses_2.png" alt="Lion, Prod, AdamW, ADOPT" width="30%" style="display:inline-block; margin: 5px;"/>
  <img src="assets/moe_losses_3.png" alt="Top 3 MoE" width="30%" style="display:inline-block; margin: 5px;"/>
</p>

**Figure:** results for 520M MoE models trained with a batch size of 130k tokens.

## Less quick start

Here are the possible parameters you can use (copypasted from `config/base.py`):

```python
# General training params
parser.add_argument('--batch_size', default=32, type=int)
parser.add_argument('--acc_steps', default=4, type=int)
parser.add_argument('--seed', default=0, type=int) # random seed for the parameters
parser.add_argument('--data_seed', default=1337, type=int) # random seed defining the data ordering
parser.add_argument('--eval_interval', default=200, type=int)
parser.add_argument('--full_eval_at', nargs="+", type=int)
parser.add_argument('--eval_batches', default=64, type=int)
parser.add_argument('--device', default='cuda:0', type=str) # see below to run on multiple GPUs
parser.add_argument('--iterations', default=25000, type=int) # total number of training iterations
parser.add_argument('--warmup_steps', default=300, type=int)
parser.add_argument('--lr', default=1e-3, type=float)
parser.add_argument('--wsd_final_lr_scale', default=0.0, type=float) # wsd scheduler
parser.add_argument('--wsd_fract_decay', default=0.1, type=float) # wsd scheduler
parser.add_argument('--decay_type', default='linear', choices=['linear', 'cosine', 'exp', 'miror_cosine', 'square', 'sqrt'])
parser.add_argument('--weight_decay', default=0.1, type=float) # I recommend you keep this value, otherwise instabilities might arise
parser.add_argument('--beta1', default=0.9, type=float) # adam parameter
parser.add_argument('--beta2', default=0.95, type=float) # adam parameter
parser.add_argument('--scheduler', default='cos', choices=['linear', 'cos', 'wsd', 'cos_inf', 'none'])
parser.add_argument('--final_div_factor', default=1, type=float) # cosine and linear schedulers
parser.add_argument('--cos_inf_steps', default=0, type=int) # cos_inf scheduler
parser.add_argument('--opt', default='adamw', choices=['adamw', 'sgd', 'muon', 'soap', 'ademamix', 'lion', 'sf-adamw', 'sf-sgd', 'signsgd', 'signum', 'prodigy', 'sophiag', 'adopt', 'mars', 'adafactor', 'lamb', 'scion', 'scion-light', 'd-muon', 'muon-pytorch'])
parser.add_argument('--eval_freq', default=200, type=int) # in iterations
parser.add_argument('--results_base_folder', default="./exps", type=str) # where the checkpoints will be saved
parser.add_argument('--grad_clip', default=0.0, type=float) # default value is 1.0 in nanoGPT
parser.add_argument('--momentum', default=0.9, type=float)
parser.add_argument('--shampoo_beta', default=-1.0, type=float)
parser.add_argument('--precondition_frequency', default=10, type=int) # for SOAP and Sophia
parser.add_argument('--max_precond_dim', default=10000, type=int)
parser.add_argument('--merge_dims', default=False, type=bool) # merge dimensions till the product of the dimensions is less than or equal to max_precond_dim
parser.add_argument('--precondition_1d', default=False, type=bool)
parser.add_argument('--normalize_grads', default=False, type=bool)
parser.add_argument('--soap_data_format', default='channels_first', type=str)
parser.add_argument('--correct_bias', default=True, type=bool)
parser.add_argument('--nesterov', default=False, type=bool) # whether to use Nesterov-style momentum
parser.add_argument('--muon_ns_steps', default=5, type=int) # the number of steps to use in the Newton-Schulz iteration, if it is iterative
parser.add_argument('--muon_lr_factor', default=0.02, type=float) # a factor by which to reduce the lr for muon
parser.add_argument('--adema_beta3', default=0.9, type=float) # beta3 in AdEMAMix
parser.add_argument('--adema_alpha', default=2.0, type=float) # alpha in AdEMAMix
parser.add_argument('--adema_beta3_warmup', default=None, type=int) # AdEMAMix hyperparameter
parser.add_argument('--adema_alpha_warmup', default=None, type=int) # AdEMAMix hyperparameter
parser.add_argument('--schedulefree_r', default=0.0, type=float) # schedulefree hyperparameter
parser.add_argument('--weight_lr_power', default=2.0, type=float) # schedulefree hyperparameter
parser.add_argument('--log_interval', default=50, type=int)
parser.add_argument('--dampening', default=0.0, type=float)
parser.add_argument('--prodigy_beta3', default=None, type=float) # coefficients for computing the Prodigy stepsize using running averages
parser.add_argument('--prodigy_decouple', default=True, type=bool) # use AdamW-style decoupled weight decay
parser.add_argument('--prodigy_use_bias_correction', default=False, type=bool)
parser.add_argument('--prodigy_safeguard_warmup', default=False, type=bool) # remove lr from the denominator of the D estimate to avoid issues during the warm-up stage; off by default
parser.add_argument('--prodigy_fsdp_in_use', default=False, type=bool)
parser.add_argument('--sophia_rho', default=0.04, type=float)
parser.add_argument('--mars_type', default='mars-adamw', choices=['mars-adamw', 'mars-lion', 'mars-shampoo'],)
parser.add_argument('--mars_vr_gamma', default=0.025, type=float)
parser.add_argument('--mars_is_approx', default=True, type=float)
parser.add_argument('--mars_lr', default=3e-3, type=float)
parser.add_argument('--mars_beta1', default=0.95, type=float)
parser.add_argument('--mars_beta2', default=0.99, type=float)
parser.add_argument('--adafactor_decay_rate', default=-0.8, type=float)
parser.add_argument('--lamb_use_bias_correction', default=False, type=bool)
parser.add_argument('--adopt_decouple', default=True, type=bool)
parser.add_argument('--adopt_eps', default=1e-6, type=float)
parser.add_argument('--scion_lmh_scale', default=10.0, type=float)
parser.add_argument('--scion_emb_scale', default=1.0, type=float)
parser.add_argument('--scion_tr_scale', default=3.0, type=float)
parser.add_argument('--weight_average', action='store_true') # uniform weight averaging (or SWA)
parser.add_argument('--wa_interval', default=5, type=int, help='How often to take the average (every k steps). Must divide wa-horizon.')
parser.add_argument('--wa_horizon', default=500, type=int, help='How frequently we save uniform model averages. Should divide latest-ckpt-interval, otherwise some points may not be saved correctly.')
parser.add_argument('--wa_dtype', default='float32', type=str, choices=['float32', 'float64'])
parser.add_argument('--wa_use_temp_dir', action='store_true')
parser.add_argument('--wa_sweep_horizon', action='store_true')
parser.add_argument('--max_num_wa_sweeps', default=5, type=int)
parser.add_argument('--exponential_weight_average', action='store_true') # EMA of weights
parser.add_argument('--ewa_interval', default=10, type=int, help='How often to take the EWA average (every k steps).')
parser.add_argument('--ewa_decay', default=0.95, type=float, help='EWA decay parameter (between 0.9 and 1).')
parser.add_argument('--ewa_after_warmup', action='store_true', help='Start EWA after warmup steps.')
# Dataset params
parser.add_argument('--dataset', default='slimpajama', choices=['slimpajama', 'wikitext', 'shakespeare-char', 'arxiv', 'arxiv2000', 'arxiv+wiki', 'openwebtext2', 'redpajama', 'redpajamav2', 'fineweb', 'finewebedu', 'c4', 'arc_easy', 'arc_challenge', 'hellaswag', 'logiqa', 'piqa', 'sciq', 'humaneval', 'gsm8k', 'kodcode', 'mathqa', 'medqa'])
parser.add_argument('--tokenizer', default='gpt2', type=str, choices=['gpt2', 'mistral'])
parser.add_argument('--vocab_size', default=50304, type=int)
parser.add_argument('--data_in_ram', action='store_true') # force the data to RAM; you most likely do not need this
# Model params
parser.add_argument('--model', default='base', choices=['base', 'llama', 'mup_gpt', 'mup_llama',])
parser.add_argument('--parallel_block', action='store_true')
parser.add_argument('--use_pretrained', default='none', type=str) # 'none', 'gpt2' or a path to the pretrained model
parser.add_argument('--from_dense', action='store_true')
parser.add_argument('--init_std', default=0.02, type=float)
parser.add_argument('--dropout', default=0.0, type=float) # keep at 0 unless in a low-data regime (e.g. wikitext)
parser.add_argument('--n_head', default=12, type=int)
parser.add_argument('--n_layer', default=12, type=int) # depth in (att + ff) blocks
parser.add_argument('--n_embd', default=768, type=int) # hidden size ...
parser.add_argument('--sequence_length', default=512, type=int)
parser.add_argument('--dtype', default='bfloat16', type=str, choices=['float32', 'float16', 'bfloat16'],)
parser.add_argument('--bias', default=False, type=bool)
parser.add_argument('--compile', action='store_true') # if true then the model is compiled
parser.add_argument('--untied_embeds', action='store_true') # disables weight tying between lm_head.weight and wte.weight
parser.add_argument('--rmsnorm_eps', default=1e-5, type=float) # used by the llama model
parser.add_argument('--multiple_of', default=256, type=int) # used by the llama model to make the SwiGLU hidden layer size a multiple of a large power of 2
parser.add_argument('--moe', action='store_true')
parser.add_argument('--moe_routing', default='standard_gating', type=str, choices=['standard_gating', 'expert_choice'],)
parser.add_argument('--moe_num_experts', default=8, type=int)
parser.add_argument('--capacity_factor', default=2.0, type=float) # only used for expert choice routing
parser.add_argument('--moe_num_shared_experts', default=0, type=int) # deepseek routing, experts that are always active
parser.add_argument('--moe_router_loss', default='load_balancing_z_loss', type=str, choices=['entropy', 'load_balancing_only', 'load_balancing_z_loss'],)
parser.add_argument('--moe_num_experts_per_tok', default=2, type=int)
parser.add_argument('--moe_entropy_loss_factor', default=0.01, type=float)
parser.add_argument('--moe_aux_loss_factor', default=0.1, type=float)
parser.add_argument('--moe_z_loss_factor', default=0.01, type=float)
parser.add_argument('--moe_softmax_order', type=str, default='topk_softmax', choices=['softmax_topk', 'topk_softmax'],)
parser.add_argument('--plot_router_logits', action='store_true')
parser.add_argument('--scale_emb', default=10, type=int) # mup arguments --- the base model width that mup has been configured on
parser.add_argument('--scale_base_model', default=256, type=int)
parser.add_argument('--scale_depth', default=1.4, type=float)
# Checkpointing
parser.add_argument('--results_base_folder', default='./exps', type=str)
parser.add_argument('--permanent_ckpt_interval', default=0, type=int)
parser.add_argument('--latest_ckpt_interval', default=0, type=int)
parser.add_argument('--resume_from', default=None, type=str)
parser.add_argument('--resume_from_swa', default=None, type=str)
parser.add_argument('--auto_resume', default=True)
# Logging params (WandB)
parser.add_argument('--wandb', action='store_true') # whether to use wandb or not
parser.add_argument('--wandb_project', default='my-project', type=str)
parser.add_argument('--wandb_entity', default=None, type=none_or_str) # for team projects
parser.add_argument('--wandb_run_prefix', default='none', type=str) # prepended to the autogenerated experiment name
parser.add_argument('--eval_seq_prefix', default="Once upon a time", type=str) # prefix used to generate sequences
parser.add_argument('--log_dynamics', action='store_true')
parser.add_argument('--dynamics_logger_cfg', default='./src/logger/rotational_logger.yaml', type=str)
parser.add_argument('--log_parameter_norms', action='store_true') # logs the L2 norm of the parameters
parser.add_argument('--norm_order', default=2) # order of the model norm to log
# Distributed args
parser.add_argument('--distributed_backend', default=None, type=str, required=False,
                    choices=distributed.registered_backends())  # distributed backend type (e.g. nccl)
```
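To make the scheduler flags concrete, here is a minimal sketch of how a warmup-stable-decay (`wsd`) schedule with a linear decay tail could be computed from `--warmup_steps`, `--iterations`, `--wsd_fract_decay`, and `--wsd_final_lr_scale`. This is an illustration under those assumptions, not the repository's implementation, which may differ in details (e.g. the tail shape chosen via `--decay_type`):

```python
def wsd_lr_factor(step, iterations=25000, warmup_steps=300,
                  wsd_fract_decay=0.1, wsd_final_lr_scale=0.0):
    """Sketch of a warmup-stable-decay multiplier applied to the base --lr.

    Linear warmup, constant plateau, then a linear decay over the final
    wsd_fract_decay fraction of training down to wsd_final_lr_scale.
    """
    decay_steps = int(iterations * wsd_fract_decay)
    plateau_end = iterations - decay_steps
    if step < warmup_steps:
        return step / max(1, warmup_steps)           # linear warmup
    if step < plateau_end:
        return 1.0                                   # stable phase
    frac = (step - plateau_end) / max(1, decay_steps)
    return 1.0 - (1.0 - wsd_final_lr_scale) * frac   # linear decay tail

# With the defaults, the lr stays at its peak until step 22500,
# then decays linearly to 0 by step 25000.
print(wsd_lr_factor(0), wsd_lr_factor(10000), wsd_lr_factor(23750))  # 0.0 1.0 0.5
```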
## Using WandB

You need to provide your wandb authorization key in order to send data to your wandb account. If you start jobs on a server without access to an interactive prompt, you can set the `WANDB_API_KEY` variable within your script:

```bash
# this is a script that could be executed on a server
pip install -r requirements.txt # install requirements
export WANDB_API_KEY="put your authorization key here; to find it: https://wandb.ai/authorize"
python ./src/main.py --config_format base --wandb --wandb_project "my awesome project" --n_layer 7 --model llama --seed 123
```
## How to add your own transformer architecture?

The structure of the project is the following:

```sh
src/
    main.py         # pick the right data, model, optimizer, and training function
    config/
        __init__.py # contains CONFIG_FORMAT_TO_MODULE_MAP mapping the name given to the --config_format flag to a python conf file
        base.py     # config for the base model
    data/
        utils.py    # contains the get_dataset function
        fineweb.py  # load/process fineweb
        fineweb_edu.py # load/process fineweb edu
        shakespeare.py # load/process the Shakespeare dataset
        benchmarks.py  # load/process benchmarks, e.g., hellaswag, gsm8k, arc_challenge
        c4.py          # load/process the c4 dataset
        slimpajama.py
        ...
    models/
        utils.py    # contains the get_model function
        base.py     # contains the standard transformer base architecture
        llama.py    # llama architecture
        mup.py      # implementation of muP
        mup_llama.py # muP-styled llama architecture
    optim/
        utils.py    # contains eval and get_batch functions
        base.py     # training function for the base and llama models
        ...
    distributed/
        # code to enable simple distributed training
```

Given the above structure, to add your own model, fork the `./src/models/base.py` file and make your modifications; then, if you need a custom training loop or evaluation, fork `./src/optim/base.py` as well. You also need to fork the `./src/config/base.py` file to add your own parameters, which implies adding your new config to the mapping `CONFIG_FORMAT_TO_MODULE_MAP` in `./src/config/__init__.py`, as sketched below. To add a new dataset, create a new file in the `data` folder; check `wikitext.py` for the expected format.

**Note:** we use [black](https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html) and [isort](https://pycqa.github.io/isort/) for all pull requests. Before committing your code, simply run `black . && isort .` and you will be fine.
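To make the registration step concrete, here is a hedged sketch of the mapping edit. Only `CONFIG_FORMAT_TO_MODULE_MAP` and `--config_format` are named in this README; the module name `my_model` and the exact value type stored in the mapping are assumptions, so mirror the real `./src/config/__init__.py`:

```python
# Hypothetical ./src/config/__init__.py after registering a new config.
# "my_model" is a placeholder for your fork of base.py; whether the mapping
# stores modules or something else is an assumption -- follow the real file.
from . import base
from . import my_model  # your fork of ./src/config/base.py

CONFIG_FORMAT_TO_MODULE_MAP = {
    "base": base,
    "my_model": my_model,  # then select it with --config_format my_model
}
```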
## Multi-GPU training

Given a multi-GPU machine with e.g. 4 GPUs, one can distribute the training using data parallelism:

```sh
torchrun --nproc_per_node=4 ./src/main.py --config_format base --distributed_backend nccl --dataset slimpajama --model base
```

When using multiple GPUs, the data is distributed among the GPUs by dividing the number of accumulation steps by the number of GPUs. For instance, if we train with a batch size of 32 and 4 accumulation steps, then each GPU will process batches of 32 elements and perform 1 accumulation step. For this reason, we require `acc_steps` to be a multiple of the number of GPUs.
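The bookkeeping works out as in this small illustrative sketch (the variable names are ours, not the repository's):

```python
# Illustrative arithmetic for data-parallel gradient accumulation.
# With --batch_size 32, --acc_steps 4, and 4 GPUs, each GPU performs
# acc_steps / world_size = 1 accumulation step of 32 sequences.
batch_size, acc_steps, world_size, sequence_length = 32, 4, 4, 512

assert acc_steps % world_size == 0, "acc_steps must be a multiple of the number of GPUs"
local_acc_steps = acc_steps // world_size

# Global tokens consumed per optimizer step, across all GPUs.
tokens_per_step = batch_size * acc_steps * sequence_length
print(local_acc_steps, tokens_per_step)  # -> 1 65536
```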
## Experimenting locally on your device with CPU

If you do not have access to a GPU or just want to try the code locally on your device, you can try the Shakespeare dataset with character-level tokens:

```sh
python ./src/main.py --n_layer=2 --n_head=4 --n_embd=128 --sequence_length=256 --dataset=shakespeare-char --device=cpu --vocab_size=96
```

**We believe the details provided are clear enough to reproduce the main findings of our paper.**
## Contact & Reference

Please do not hesitate to reach out to us if you have questions, and feel free to open an [issue](https://github.com/epfml/llm-optimizer-benchmark/issues).

```bib
@article{semenov2025benchmarking,
  title={Benchmarking {O}ptimizers for {L}arge {L}anguage {M}odel {P}retraining},
  author={Semenov, Andrei and Pagliardini, Matteo and Jaggi, Martin},
  journal={arXiv preprint arXiv:2509.01440},
  url={https://arxiv.org/abs/2509.01440},
  year={2025}
}
```