{"id":19066170,"url":"https://github.com/epfml/llm-baselines","last_synced_at":"2025-04-28T12:25:02.411Z","repository":{"id":145511200,"uuid":"613863162","full_name":"epfml/llm-baselines","owner":"epfml","description":"nanoGPT-like codebase for LLM training","archived":false,"fork":false,"pushed_at":"2025-04-02T09:05:09.000Z","size":661,"stargazers_count":93,"open_issues_count":10,"forks_count":28,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-18T16:15:45.479Z","etag":null,"topics":["llms","pretraining"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epfml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-03-14T12:26:27.000Z","updated_at":"2025-04-14T06:33:11.000Z","dependencies_parsed_at":null,"dependency_job_id":"39968c59-9204-4f7f-ab0c-54204ee22163","html_url":"https://github.com/epfml/llm-baselines","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fllm-baselines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fllm-baselines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fllm-baselines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fllm-baselines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/epfml","download_url":"https://codeload.github.com/epfml/llm-baselines/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251312260,"owners_count":21569200,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llms","pretraining"],"created_at":"2024-11-09T00:55:00.920Z","updated_at":"2025-04-28T12:25:02.387Z","avatar_url":"https://github.com/epfml.png","language":"Python","readme":"# LLM-baselines\n\nA modular codebase to experiment with transformers, inspired by NanoGPT. \n\n## Quickstart \n\nInstall dependencies: \n\n```\npip install -r requirements.txt\n```\n\nRun a simple training on the Slimpajama dataset ([6B subset](https://huggingface.co/datasets/DKYoon/SlimPajama-6B), 24GBs decompressed, takes a few minutes to download):\n\n```sh\npython ./src/main.py --config_format base\n```\n\nThe above command trains a 123.59M parameters model. It trains for 25k iterations with a batch size of 128=32x4 (4 gradient accumulation steps), using a cosine schedule with a maximum learning rate of 1e-3 that is reduced to 1e-4 at the end of training. The model is saved in the `./exps` folder.\n\nThis training takes roughly ~3h on a single A100 (80GB) GPU. 
The plot of the training and validation loss should look roughly like this:

<img src="./assets/loss_slimpajama.png" alt="Loss on SlimPajama" width="500"/>
<img src="./assets/pplx_slimpajama.png" alt="Perplexity on SlimPajama" width="500"/>

You can check out the wandb run for yourself [here](https://wandb.ai/haeggee/llm-lauzhack/runs/lm2obqy9?nw=nwuserhaeggee).
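The second plot shows perplexity; for reference, perplexity is simply the exponential of the (natural-log) cross-entropy loss, so the loss and perplexity curves carry the same information:

```python
import math

def perplexity(cross_entropy_loss_nats: float) -> float:
    """Standard relation between language-model loss and perplexity."""
    return math.exp(cross_entropy_loss_nats)

print(perplexity(3.0))  # a loss of 3.0 nats corresponds to a perplexity of about 20
```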
## Less quick start

Here are the possible parameters you can use (copy-pasted from `config/base.py`):

```python
# General training params
parser.add_argument('--batch_size', default=32, type=int)
parser.add_argument('--acc_steps', default=4, type=int)
parser.add_argument('--seed', default=0, type=int) # random seed for the parameters
parser.add_argument('--data_seed', default=1337, type=int) # random seed defining the data ordering
parser.add_argument('--device', default='cuda:0', type=str) # see below to run on multiple GPUs
parser.add_argument('--iterations', default=25000, type=int) # total number of training iterations
parser.add_argument('--lr', default=1e-3, type=float)
parser.add_argument('--warmup_percent', default=0.05, type=float) # the total number of warmup steps is iterations * warmup_percent
parser.add_argument('--weight_decay', default=0.1, type=float) # I recommend you keep this value, else instabilities might arise
parser.add_argument('--beta1', default=0.9, type=float) # adam parameter
parser.add_argument('--beta2', default=0.95, type=float) # adam parameter
parser.add_argument('--scheduler', default='cos', choices=['linear', 'cos', 'none'])
parser.add_argument('--opt', default='adamw', choices=['adamw', 'sgd'])
parser.add_argument('--eval_freq', default=200, type=int) # in iterations
parser.add_argument('--results_base_folder', default="./exps", type=str) # where the checkpoints will be saved
parser.add_argument('--grad_clip', default=0.0, type=float) # default value is 1.0 in NanoGPT
# Dataset params
parser.add_argument('--dataset', default='slimpajama', choices=['slimpajama', 'wikitext', "shakespeare-char", 'arxiv', "arxiv2000", "arxiv+wiki", 'openwebtext2'])
parser.add_argument('--vocab_size', default=50304, type=int)
parser.add_argument('--data_in_ram', action='store_true') # force the data to RAM, you most likely do not need this
# Model params
parser.add_argument('--model', default='base', choices=['base', 'llama2'])
parser.add_argument('--use_pretrained', default="none", type=str) # 'none', 'gpt-2' or a path to the pretrained model
parser.add_argument('--dropout', default=0.0, type=float) # keep at 0 unless in a low-data regime (e.g. wikitext)
parser.add_argument('--n_head', default=12, type=int)
parser.add_argument('--n_layer', default=12, type=int) # depth in (att + ff) blocks
parser.add_argument('--n_embd', default=768, type=int) # hidden size
parser.add_argument('--sequence_length', default=512, type=int)
parser.add_argument('--dtype', default=torch.bfloat16, type=torch.dtype)
parser.add_argument('--bias', default=False, type=bool)
parser.add_argument('--compile', action='store_true') # if true then the model is compiled
parser.add_argument('--rmsnorm_eps', default=1e-5, type=float) # used by the llama model
parser.add_argument('--multiple_of', default=256, type=int) # used by the llama model to make the SwiGLU hidden layer size a multiple of a large power of 2
# Logging params (WandB)
parser.add_argument('--wandb', action='store_true') # whether to use wandb or not
parser.add_argument('--wandb_project', default="my-project", type=str)
parser.add_argument('--wandb_run_prefix', default="none", type=str) # is added before the autogenerated experiment name
parser.add_argument('--eval_seq_prefix', default="Once upon a time", type=str) # prefix used to generate sequences
# Distributed args
parser.add_argument('--distributed_backend', default=None, type=str, required=False,
                    choices=distributed.registered_backends())  # distributed backend type (e.g. nccl)
parser.add_argument('--save_checkpoint_freq', default=None, type=int, required=False)
```

## Using WandB

You need to provide your wandb authorization key in order to send the data to your wandb account. If you start jobs on a server without access to a prompt, you can set the `WANDB_API_KEY` variable within your script:

```bash
# this is a script that could be executed on a server
pip install -r requirements.txt # install requirements
export WANDB_API_KEY="put your authorization key here; to find it: https://wandb.ai/authorize"
python ./src/main.py --config_format base --wandb --wandb_project "my awesome project" --n_layer 7 --model base --seed 123
```

## How to add your own transformer architecture?

The structure of the project is the following:

```sh
src/
    main.py         # pick the right data, model, and training function
    config/
        __init__.py # contains CONFIG_FORMAT_TO_MODULE_MAP mapping the name given to the --config_format flag to a python config file
        base.py     # config for the base model
    data/
        utils.py    # contains the get_dataset function
        wikitext.py # load/process wikitext
        arxiv.py    # load/process arxiv
        shakespeare.py # load/process the Shakespeare dataset
        slimpajama.py
        ...
    models/
        utils.py    # contains the get_model function
        base.py     # contains the standard transformer base architecture
        llama.py    # llama architecture
    optim/
        utils.py    # contains eval and get_batch functions
        base.py     # training function for the base and llama models
    distributed/
        # code to enable simple distributed training
```

Given the above structure, to add your own model you can simply fork the `./src/models/base.py` file and make your modifications, then, if necessary, fork `./src/optim/base.py` in case you need a custom training loop or evaluation. You also need to fork the `./src/config/base.py` file to add your own parameters, which implies adding your new config to the mapping `CONFIG_FORMAT_TO_MODULE_MAP` in `./src/config/__init__.py`. To add a new dataset, create a new file in the `data` folder; check `wikitext.py` for the expected format.
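For example, if your forked config lives in a hypothetical `./src/config/my_config.py`, registering it could look roughly like this (a minimal sketch; the exact shape of `CONFIG_FORMAT_TO_MODULE_MAP` is an assumption, so check `./src/config/__init__.py` for the actual convention):

```python
# ./src/config/__init__.py (hypothetical excerpt)
from . import base, my_config  # my_config.py would be your forked copy of base.py

# maps the value passed to --config_format to the config module defining its arguments
CONFIG_FORMAT_TO_MODULE_MAP = {
    "base": base,
    "my_model": my_config,  # hypothetical new entry for your architecture
}
```

You would then launch training on your architecture with `python ./src/main.py --config_format my_model ...`.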
## Multi-GPU training

Given a multi-GPU machine with e.g. 4 GPUs, one can distribute the training using data parallelism:

```sh
torchrun --nproc_per_node=4 ./src/main.py --config_format base --distributed_backend nccl --dataset slimpajama --model base
```

When using multiple GPUs, the data is distributed among the GPUs by dividing the number of accumulation steps by the number of GPUs. For instance, if we train with a batch size of 32 and 4 accumulation steps on 4 GPUs, then each GPU processes batches of 32 elements and does 1 accumulation step. For this reason we require `acc_steps` to be a multiple of the number of GPUs (see the sketch at the end of this README).

## Experimenting locally on your device with CPU

If you do not have access to a GPU or just want to try the code locally on your device, you can try the Shakespeare dataset with character-level tokens:

```sh
python ./src/main.py --n_layer=2 --n_head=4 --n_embd=128 --sequence_length=256 --dataset=shakespeare-char --device=cpu --vocab_size=96
```
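To make the accumulation-step arithmetic from the multi-GPU section concrete, here is a minimal illustrative sketch (the actual splitting is handled by the code in `./src/distributed/`; the helper below is hypothetical):

```python
def per_gpu_accumulation(batch_size: int, acc_steps: int, num_gpus: int) -> tuple[int, int]:
    """Split gradient accumulation steps across GPUs under data parallelism.

    The global effective batch stays batch_size * acc_steps; each GPU simply
    performs acc_steps // num_gpus accumulation steps of batch_size sequences.
    """
    assert acc_steps % num_gpus == 0, "acc_steps must be a multiple of the number of GPUs"
    return batch_size, acc_steps // num_gpus

# Example from the section above: batch_size=32, acc_steps=4 on 4 GPUs
# -> each GPU processes batches of 32 and does 1 accumulation step,
#    for a global effective batch of 32 * 4 = 128 sequences per iteration.
print(per_gpu_accumulation(32, 4, 4))  # (32, 1)
```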