{"id":23906398,"url":"https://github.com/exo-explore/gym","last_synced_at":"2025-07-11T10:07:15.750Z","repository":{"id":271039943,"uuid":"908004346","full_name":"exo-explore/gym","owner":"exo-explore","description":"EXO Gym is an open-source Python toolkit that facilitates distributed AI research. ","archived":false,"fork":false,"pushed_at":"2025-07-10T18:27:50.000Z","size":2557,"stargazers_count":31,"open_issues_count":3,"forks_count":10,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-07-10T23:04:14.555Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://blog.exolabs.net/day-9/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/exo-explore.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-12-24T20:56:51.000Z","updated_at":"2025-06-27T11:02:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"3ed90934-b7a5-47a5-b968-1298eee92c12","html_url":"https://github.com/exo-explore/gym","commit_stats":null,"previous_names":["exo-explore/gym"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/exo-explore/gym","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exo-explore%2Fgym","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exo-explore%2Fgym/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exo-explore%2Fgym/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exo-explore%2Fgym/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/exo-explore","download_url":"https://codeload.github.com/exo-explore/gym/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exo-explore%2Fgym/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264781087,"owners_count":23662790,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-05T02:01:06.157Z","updated_at":"2025-07-11T10:07:15.743Z","avatar_url":"https://github.com/exo-explore.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# EXO Gym\n\nOpen source framework for simulated distributed training methods.\nInstead of training with multiple ranks, we simulate the distributed training process by running multiple nodes on a single machine.\n\n## Supported Devices\n\n- CPU\n- CUDA\n- MPS (CPU-bound for copy operations, see [here](https://github.com/pytorch/pytorch/issues/141287))\n\n## Supported Methods\n\n- AllReduce (Equivalent to PyTorch [DDP](https://arxiv.org/abs/2006.15704))\n- [FedAvg](https://arxiv.org/abs/2311.08105)\n- [DiLoCo](https://arxiv.org/abs/2311.08105)\n- [SPARTA](https://openreview.net/forum?id=stFPf3gzq1)\n- [DeMo](https://arxiv.org/abs/2411.19870)\n\n\n## Installation\n\n### Basic Installation\nInstall with core dependencies only:\n```bash\npip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ exogym\n```\n\n### Installation with Optional Features\n\nOptional feature flags allowed 
are:\n\n```bash\nwandb,gpt,demo,examples,all,dev\n```\n\nFor example, `pip install exogym[demo]`\n\n### Development Installation\n\nTo install for development:\n```bash\ngit clone https://github.com/exo-explore/gym.git exogym\ncd exogym\npip install -e \".[dev]\"\n```\n\n## Usage\n\n### Example Scripts\n\nMNIST comparison of DDP, DiLoCo, and SPARTA:\n\n```bash\npython run/mnist.py\n```\n\nNanoGPT Shakespeare DiLoCo:\n\n```bash\npython run/nanogpt_diloco.py --dataset shakespeare\n```\n\n### Custom Training\n\n```python\nfrom exogym import LocalTrainer\nfrom exogym.strategy import DiLoCoStrategy\n\ntrain_dataset, val_dataset = ...\nmodel = ... # model.forward() expects a batch, and returns a scalar loss\n\ntrainer = LocalTrainer(model, train_dataset, val_dataset)\n\n# Strategy for optimization \u0026 communication\nstrategy = DiLoCoStrategy(\n  inner_optim='adam',\n  H=100\n)\n\ntrainer.fit(\n  strategy=strategy,\n  num_nodes=4,\n  device='mps'\n)\n```\n\n## Codebase Structure\n\n- `Trainer`: Builds the simulation environment. `Trainer` spawns multiple `TrainNode` instances, connects them together, and starts the training run.\n- `TrainNode`: A single node (rank) running its own training loop. At each train step, instead of calling `optim.step()`, it calls `strategy.step()`.\n- `Strategy`: Abstract class for an optimization strategy, which defines both **how the nodes communicate** with each other and **how model weights are updated**. Typically, a gradient strategy will include an optimizer as well as a communication step. Sometimes (e.g. DeMo), the optimizer step is commingled with the communication.\n\n## Technical Details\n\nEXO Gym uses PyTorch multiprocessing to spawn one subprocess per node; the subprocesses communicate with each other using standard collective operations such as `all_reduce`.\n\n### Model\n\nThe model is expected to take a `batch` (in the same format the `dataset` outputs) and return a scalar loss over the entire batch. 
This ensures the model is agnostic to the format of the data (e.g. masked LM training doesn't have a clear `x`/`y` split).\n\n### Dataset\n\nRecall that when we call `trainer.fit()`, $K$ subprocesses are spawned, one per virtual worker. There are two options for providing a dataset:\n\n#### PyTorch `Dataset`\n\nInstantiate a single `Dataset`. The `dataset` object is passed to every subprocess, and a `DistributedSampler` is used to select which datapoints each node samples (to ensure each datapoint is only used once by each node). If the dataset is loaded entirely into memory, that memory is duplicated per node, so be careful not to run out of memory! If the dataset is too large for that, it should be lazily loaded.\n\n#### `dataset_factory` function\n\nIn place of the dataset object, pass a function with the following signature:\n\n```python\ndef dataset_factory(rank: int, num_nodes: int, train_dataset: bool) -\u003e torch.utils.data.Dataset\n```\n\nThis function is called within each rank to build the dataset. Instead of each node storing the whole dataset and subsampling datapoints, each node loads only the datapoints it needs.\n\n\n\u003c!-- For further information, see individual pages on:\n\n- [Dataset](./docs/dataset.md) --\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexo-explore%2Fgym","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexo-explore%2Fgym","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexo-explore%2Fgym/lists"}