{"id":29884643,"url":"https://github.com/kempnerinstitute/systems-scaling","last_synced_at":"2026-03-04T18:03:49.681Z","repository":{"id":264404586,"uuid":"892896107","full_name":"KempnerInstitute/systems-scaling","owner":"KempnerInstitute","description":null,"archived":false,"fork":false,"pushed_at":"2025-09-15T20:21:29.000Z","size":57164,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-14T08:49:57.874Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KempnerInstitute.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-23T02:02:53.000Z","updated_at":"2025-10-08T01:02:04.000Z","dependencies_parsed_at":"2024-11-24T03:19:38.800Z","dependency_job_id":"21aeef8d-1cb0-4495-bc76-1be4c80b30bc","html_url":"https://github.com/KempnerInstitute/systems-scaling","commit_stats":null,"previous_names":["hither1/greedy-sharding","hither1/systems-scaling","kempnerinstitute/systems-scaling"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/KempnerInstitute/systems-scaling","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KempnerInstitute%2Fsystems-scaling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KempnerInstitute%2Fsystems-scaling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KempnerInstitute%2Fsystems-scaling/releases","manifests_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KempnerInstitute%2Fsystems-scaling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KempnerInstitute","download_url":"https://codeload.github.com/KempnerInstitute/systems-scaling/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KempnerInstitute%2Fsystems-scaling/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30088343,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T15:40:14.053Z","status":"ssl_error","status_checked_at":"2026-03-04T15:40:13.655Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-31T15:05:34.208Z","updated_at":"2026-03-04T18:03:49.661Z","avatar_url":"https://github.com/KempnerInstitute.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# [Characterization and Mitigation of Training Instabilities in Microscaling Formats](https://arxiv.org/abs/2506.20752)\n\n[Chloe Su](https://x.com/Huangyu58589918)*, [Mujin Kwun](https://x.com/MJK12341234), Stephanie Gil, [Sham Kakade](https://x.com/ShamKakade6), [Nikhil Anand](https://x.com/nikhil_anand91)\\*\n\n![License: 
MIT](https://img.shields.io/badge/License-MIT-yellow.svg)\n[![arXiv](https://img.shields.io/badge/arXiv-2506.20752-B31B1B.svg)](https://arxiv.org/abs/2506.20752)\n[![Blog Post](https://img.shields.io/badge/arXiv-2506.20752-B31B1B.svg)]()\n\n\n### Setup\n```\nmodule load cuda/12.4.1-fasrc01\nexport CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:${HOME}/cuda-12.0/targets/x86_64-linux/include\nmodule load gcc/12.2.0-fasrc01\n```\nCreate a Conda environment and install dependencies:\n```\ngit clone git@github.com:Hither1/systems-scaling.git\ncd systems-scaling\nconda create -n scaling python=3.10\nconda activate scaling\npip install -e .[all]\n```\n### Usage\n#### Language model training\nIn the `olmo/` folder, run\n```\nsbatch scripts/launch_sweep_scale.sh configs/base.yaml configs/sweeps/scale.yaml\n```\n\n#### Synthetic training\n\nCode for the student-teacher model experiments on synthetic data can be found under `olmo/synthetic`.  For example, you can run\n\n```\npython synthetic/student_teacher.py --depth 4 --width 512 --batch 2048 --lr_max 6e-4 --wandb_project \u003cYOUR PROJECT NAME\u003e --store_full_gradients --log_weight_clipping --val_every 100 --steps 9000 --save_checkpoints --checkpoint_window_center 14100 --checkpoint_window_size 200 --checkpoint_every 20\n```\n\nwhich will run the synthetic experiment and save checkpoints every 20 steps, starting at global step 13900.  Note that `steps` above refers to how many steps both the FP32 and MX training loops take, so 9000 `steps` equates to a total of 18000 global steps.  
If you plan on using deterministic torch algorithms (on by default), you may need to set the cuBLAS workspace configuration with `export CUBLAS_WORKSPACE_CONFIG=:16:8` or `:4096:8`.\n\nTo run an intervention experiment, for example, returning to full precision at the intervention point, you can run\n\n```\npython synthetic/student_teacher.py --run_intervention --intervention_checkpoint \u003cPATH TO YOUR CHECKPOINT\u003e --depth 4 --width 512 --batch 2048 --lr_max 6e-4 --steps_total 500 --store_full_gradients --log_weight_clipping --val_every 100 --wandb_project \u003cYOUR PROJECT NAME\u003e --dont_inject_mx_ops\n```\n\n#### Small edits in microxcaling\n* Minor changes were made to the MX Pytorch Simulation Library in order to experiment with selectively quantizing different parts of the network, e.g., turning off quantization for LayerNorm affine parameters.  By default, all of these modifications are turned off, so the library behaves as expected.\n\n\n### Contents\n```\nsystems-scaling/             \n│── olmo/\n│   │── mx/             # microxcaling library (with our modifications)\n│   │    ├── activations.py                      \n│   │    ├── layernorm.py\n│   │    ├── mx_mapping.py\n│   │    ├── mx_ops.py\n│   │    ├── norm_utils.py\n│   │    ├── specs.py              \n│   │    └── vector_ops.py\n│   │            \n│   │── synthetic\n│   │   └── student_teacher.py\n│   │\n│   └── olmo/               # OLMo (model) code\n│        ├── main.py               # Configuration files\n│        ├── configs/              #  functions\n│        ├── scripts/              # \n│        └── __init__.py           \n│── plot/                # Jupyter notebooks/scripts for plots and analysis\n│    ├── accuracy_relationships.ipynb\n│    └── curve_fitting_and_instability.ipynb    # Scaling law plots\n│   \n│── nanoGPT/                  # nanoGPT code (not used in the final paper)\n│   ├── main.py               # train script\n│   ├── models/               \n│   └── 
__init__.py\n│        \n│── requirements.txt          # Dependencies\n│── README.md                 # Project documentation\n│── .gitignore            \n```\n\n\n### Citation\n\nIf you use this code in your research, please cite the following papers and repositories:\n\nOur paper:\n```\n@misc{su2025characterizationmitigationtraininginstabilities,\n      title={Characterization and Mitigation of Training Instabilities in Microscaling Formats}, \n      author={Huangyuan Su and Mujin Kwun and Stephanie Gil and Sham Kakade and Nikhil Anand},\n      year={2025},\n      eprint={2506.20752},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https://arxiv.org/abs/2506.20752}, \n}\n```\n\nOLMo and the Pytorch MX Simulator\n```\n@article{groeneveld2024olmo,\n  title={Olmo: Accelerating the science of language models},\n  author={Groeneveld, Dirk and Beltagy, Iz and Walsh, Pete and Bhagia, Akshita and Kinney, Rodney and Tafjord, Oyvind and Jha, Ananya Harsh and Ivison, Hamish and Magnusson, Ian and Wang, Yizhong and others},\n  journal={arXiv preprint arXiv:2402.00838},\n  year={2024}\n}\n\n@misc{mx_library,\n  author = {{Microsoft}},\n  title = {MX Pytorch Emulation Library},\n  year = {2024},\n  url = {https://github.com/microsoft/microxcaling/tree/main},\n  urldate = {2025-05-15}\n}\n```\n\n(Optional) Downstream evaluation\n```python\nfrom olmo.eval.downstream import *\ntokenizer = Tokenizer.from_file(\"olmo/tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json\")\nfor x in label_to_task_map.values():\n    print(x)\n    kwargs = {}\n    if isinstance(x, tuple):\n        x, kwargs = x\n    x(tokenizer=tokenizer, **kwargs)\n
```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkempnerinstitute%2Fsystems-scaling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkempnerinstitute%2Fsystems-scaling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkempnerinstitute%2Fsystems-scaling/lists"}