{"id":24811857,"url":"https://github.com/instadeepai/sebulba","last_synced_at":"2026-03-14T14:42:11.339Z","repository":{"id":201439498,"uuid":"686949517","full_name":"instadeepai/sebulba","owner":"instadeepai","description":"🪐 The Sebulba architecture to scale reinforcement learning on Cloud TPUs in JAX","archived":false,"fork":false,"pushed_at":"2023-10-23T10:39:28.000Z","size":1789,"stargazers_count":57,"open_issues_count":0,"forks_count":5,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-06T11:24:27.417Z","etag":null,"topics":["ai","deep-learning","hpc","jax","machine-learning","podracer","ppo","reinforcement-learning","sebulba","tpu"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/instadeepai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-09-04T09:38:44.000Z","updated_at":"2025-03-09T16:42:02.000Z","dependencies_parsed_at":"2023-10-31T23:00:21.796Z","dependency_job_id":null,"html_url":"https://github.com/instadeepai/sebulba","commit_stats":{"total_commits":2,"total_committers":1,"mean_commits":2.0,"dds":0.0,"last_synced_commit":"abc60f972755d8592e8a77155f36687d30f294ad"},"previous_names":["instadeepai/sebulba"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/instadeepai/sebulba","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instadeepai%2Fsebulba","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instadeepai%2Fsebulba/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instadeepai%2Fsebulba/releases"
,"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instadeepai%2Fsebulba/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/instadeepai","download_url":"https://codeload.github.com/instadeepai/sebulba/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/instadeepai%2Fsebulba/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279015312,"owners_count":26085684,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","deep-learning","hpc","jax","machine-learning","podracer","ppo","reinforcement-learning","sebulba","tpu"],"created_at":"2025-01-30T13:16:34.101Z","updated_at":"2025-10-13T13:31:44.549Z","avatar_url":"https://github.com/instadeepai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n    \u003cp\u003e👾 Sebulba\u003c/p\u003e\n\u003c/h1\u003e\n\n\n\u003ch2 align=\"center\"\u003e\n    \u003cp\u003eAn Implementation of the Sebulba Distributed RL Architecture\u003c/p\u003e\n\u003c/h2\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca  href=\"\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/python-3.9-blue\" alt=\"python version 3.9\"/\u003e\n    \u003c/a\u003e\n    \u003ca 
href=\"https://github.com/psf/black\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/code%20style-black-000000.svg\" alt=\"format using black\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/instadeepai/sebulba/actions\"\u003e\n       \u003cimg src=\"https://github.com/instadeepai/sebulba/actions/workflows/tests_and_linters.yaml/badge.svg\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"http://mypy-lang.org/\"\u003e\n        \u003cimg src=\"http://www.mypy-lang.org/static/mypy_badge.svg\"/\u003e\n    \u003c/a\u003e\n    \u003ca  href=\"https://opensource.org/license/apache-2-0/\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/License-Apache%202.0-orange.svg\" alt=\"apache 2.0 license\"/\u003e\n    \u003c/a\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ch3\u003e\n      \u003ca href=\"https://cloud.google.com/blog/products/compute/instadeep-performs-reinforcement-learning-on-cloud-tpus\"\u003eBlog Post\u003c/a\u003e |\n      \u003ca href=\"#quickstart-\"\u003eQuickstart\u003c/a\u003e |\n      \u003ca href=\"#architecture-\"\u003eArchitecture\u003c/a\u003e |\n      \u003ca href=\"#benchmarks-\"\u003eBenchmarks\u003c/a\u003e |\n      \u003ca href=\"#acknowledgements-\"\u003eAcknowledgements\u003c/a\u003e |\n      \u003ca href=\"#citation-%EF%B8%8F\"\u003eCitation\u003c/a\u003e\n    \u003c/h3\u003e\n\u003c/div\u003e\n\nWe provide an implementation of Sebulba, introduced in Google DeepMind's [Podracer](https://arxiv.org/pdf/2104.06272.pdf) paper.\nSebulba uses an Actor-Learner decomposition to generate and learn from experience.\nIt supports arbitrary environments and co-locates acting and learning on a single TPU machine. Our\nimplementation of Sebulba uses the [PPO](https://arxiv.org/pdf/1707.06347.pdf) algorithm, but Sebulba can be augmented to use many popular\nRL algorithms. 
This repo is intended to be a starting point for researchers to experiment with and\nuse to begin scaling their own RL agents. To get started, fork the repo, edit as\nneeded and run your experiments using the scripts and Makefile provided.\nFeel free to star ⭐ the repo to help support the project!\n\nThis repo provides the following key features:\n- 🏅 High quality implementation of Sebulba with modular components.\n- 🧩 Highly configurable implementation, allowing you to adapt the system for your experiments\n- 📊 Benchmarks illustrating Sebulba's performance and ability to scale.\n- 📜 Logging support for both Tensorflow and Neptune.\n- ☁️ Containerised to easily get up and running both locally and on GCP.\n\n## Quickstart 🚀\n\n### Local Quickstart\nYou can run Sebulba locally on your machine using docker and setting XLA flags to fake multiple devices.\nThis allows us to test and experiment with Sebulba on a single machine.\n```bash\nmake docker_build_cpu\n\n# fake 8 XLA devices\nmake docker_run DOCKER_VARS_TO_PASS=\"-e XLA_FLAGS='--xla_force_host_platform_device_count=8'\" command=\"python experiments/sebulba_ppo_atari.py +experiment=ppo-pong +accelerators=local_8_device\"\n```\n\nThen you can check your metrics on tensorboard\n```\ntensorboard --logdir=./logs\n```\n\n### GPU Quickstart\nWe use a similar setup for our runs on GPU VMs. 
Note that we have only carried out runs\non a single machine with A100 GPUs.\n\n```bash\nmake docker_build_gpu\n\nmake docker_run command=\"python experiments/sebulba_ppo_atari.py +experiment=ppo-pong +accelerators=gpu_8_a100\"\n```\n\nThen you can check your metrics on tensorboard\n```\ntensorboard --logdir=./logs\n```\n\n### TPU Quickstart\nHere we outline our setup for running Sebulba on TPU VMs on GCP.\n\n\u003e We assume you have [gcloud](https://cloud.google.com/sdk/docs/install) installed and configured.\n\nSet these environment variables with the desired TPU configuration.\n```bash\nPROJECT=my-gcp-project\nZONE=us-central1-f\nACCELERATOR_TYPE=v2-8\nRUNTIME_VERSION=v2-alpha\nNAME=my-sebulba-vm\n```\n\nAnd then use the following commands:\n\n```bash\n# create the TPU POD\nmake create_vm\n# clone repo on the TPU POD and build the image\nmake setup\n# kill existing container, pull and run\nmake kill_pull_run\n# start tensorboard\nmake start_tensorboard\n# port forward tensorboard to your machine\nmake port_forward_tensorboard\n# delete your vm\nmake delete\n```\n\n## Architecture 🏗\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"./static/architecture.gif\"\u003e\n        \u003cimg src=\"./static/architecture.gif\" alt=\"Gif of Sebulba Architecture\" width=\"30%\"/\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    Fig1: Animation of the Sebulba architecture showing how data is transferred across the main\n    components, which are the Environments (Green) placed on the CPUs, the Actors (Red) placed on\n    a subset of the TPUs and the Learners (Yellow) placed on the remaining TPU cores.\n\u003c/p\u003e\n\nSebulba splits the available 8 TPU cores into two disjoint sets: 𝐴 cores are used exclusively to\nact, and the remaining 8 − 𝐴 cores are used to learn. 
Each Python thread steps an entire batch of\nenvironments in parallel and feeds the resulting batch of observations to a TPU core, to\nperform inference of that batch of observations and select the next batch of actions. The batches\nof observations, actions and rewards are accumulated on the TPU cores and are then\npassed to the Learners.\n\nThe learner thread executes the same update function on all the TPU cores dedicated to learning\nusing JAX’s pmap primitive, and parameter updates can be averaged across all participating\nlearner cores using JAX’s pmean/psum primitives. This computation can be scaled up via\nreplication across multiple TPU VMs of a POD.\n\n### Stoppable Component\nStoppableComponent represents a component running on its own thread which can be stopped.\nIt is designed to be subclassed, and the _run method should be overridden in the subclass to\ndefine the specific behavior of the component. This is used for many of Sebulba's components\nsuch as the Actor, Learner, Parameter Source and Logger.\n\n### Actor\n\nThe Actor is a component that runs on its own thread and samples trajectories from a vectorized\nenvironment. It uses an act function to generate actions, collects trajectory information,\nand puts the trajectories into a pipeline for further processing by the Learner.\n\n### Learner\n\nThe Learner is a component that runs on its own thread and performs\nlearning iterations using a pipeline of trajectories. 
It takes trajectory batches from the pipeline,\napplies a step function across multiple devices, updates the learner's state, and logs metrics.\n\n### Pipeline\n\nThe Pipeline component shards trajectories across a list of learner devices.\nTrajectories are put into a queue by sharding them across the learner devices, and they are\nretrieved from the queue in their original form.\n\n### Params Source\n\nThe ParamsSource class is a component that runs on its own thread and serves as a means of passing\nparameters between Learner and Actor components. It ensures that the parameters given to the\nActor are ready for use. The class initializes with an initial parameter value and a JAX device,\nand it provides methods to update and retrieve the current parameter value.\nThe update method allows for setting new parameter values, and the get method returns\nthe current parameter value.\n\n## Benchmarks 📈\n\n### Breakout Convergence\n\n\u003cp align=\"center\"\u003e\n    \u003ca  href=\"./static/breakout_convergence_scaling_cores_episode_return.svg\" align=\"center\"\u003e\n        \u003cimg src=\"./static/breakout_convergence_scaling_cores_episode_return.svg\" alt=\"Breakout convergence plot\"/\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    Fig2: Learning curves of Sebulba PPO on Atari Breakout scaling TPU cores [8-128].\n\u003c/p\u003e\nIn Figure 2 we can see the effect of scaling the number of TPU cores on the convergence of our PPO\nagent on the Atari Breakout environment. 
Not only does our agent converge faster with more cores, but,\ndue to the increase in the effective batch size, our agent's training also becomes more stable and\ncontinues to push past the learning plateau that we see with fewer cores.\n\n\u003cp align=\"center\"\u003e\n    \u003ca  href=\"./static/breakout_convergence_scaling_cores_fps.svg\" align=\"center\"\u003e\n        \u003cimg src=\"./static/breakout_convergence_scaling_cores_fps.svg\" alt=\"Breakout scaling by TPU cores\"/\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    Fig3: Effect of scaling TPU cores on the Frames Per Second of Sebulba PPO for Atari Breakout.\n\u003c/p\u003e\nFigure 3 shows the impressive linear scaling of the Sebulba architecture as we increase the number of\nTPU cores.\n\n### Scaling batch size\n\n\u003cp align=\"center\"\u003e\n    \u003ca  href=\"./static/breakout_performance_scaling_batch_size_fps.svg\" align=\"center\"\u003e\n        \u003cimg src=\"./static/breakout_performance_scaling_batch_size_fps.svg\" alt=\"Breakout scaling by batch size\"/\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    Fig4: Effect of scaling batch size on the Frames Per Second of Sebulba PPO for Atari Breakout across different hardware.\n\u003c/p\u003e\nFigure 4 shows how Sebulba PPO also scales with batch size across different hardware. However, we can\nsee that scaling batch size can only go so far: throughput begins to plateau, or the batch simply grows too large to fit\nin the memory of specific hardware. 
At this point it becomes more viable to scale the number of TPU replicas\nor increase model capacity for further improvements.\n\n### Maximising Throughput\n\n\u003cp align=\"center\"\u003e\n    \u003ca  href=\"./static/breakout_performance_scaling_cores_fps.svg\" align=\"center\"\u003e\n        \u003cimg src=\"./static/breakout_performance_scaling_cores_fps.svg\" alt=\"Scaling TPU while maximising performance\"/\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    Fig5: Effect of scaling batch size on the Frames Per Second of Sebulba PPO for Atari Breakout\nwhen we optimise system parameters for throughput instead of balancing for convergence.\n\u003c/p\u003e\nIn addition to our results that optimise for the wall-clock time of model convergence, we found that we could\nfurther increase the max throughput of Sebulba by tuning system parameters for throughput instead\nof convergence. Doing this, we can increase our max throughput by almost 33%, going from 3.8M FPS to 4.8M FPS!\n\n### Hardware Comparison\n\n\u003cp align=\"center\"\u003e\n    \u003ca  href=\"\" align=\"center\"\u003e\n        \u003cimg src=\"./static/compare_hardware.svg\" alt=\"Comparison of Sebulba across hardware.\"/\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    Fig6: Comparison of max FPS achieved on each hardware platform. For A100x8 we used an a2-highgpu-8g instance on GCP.\n\u003c/p\u003e\n\nIn Figure 6 we compare the max FPS achieved on each hardware platform. These results show the max FPS achieved\non each platform with the 8-device configuration. 
For A100x8 we used an a2-highgpu-8g instance on GCP.\nFor TPU v2, v3 and v4 we use only 8 cores for a fairer comparison.\n\n### Reproduction\n\nTo reproduce our benchmarks, follow the TPU quickstart and set the `EXPERIMENT_NAME` environment variable to one of the experiments in `experiments/config/experiment`.\n\n```bash\nEXPERIMENT_NAME=tpu-v4-max-convergence\nmake kill_pull_run\n```\n\n\u003e Our benchmarks are performed on a single NUMA node for stability.\n\u003e You can edit the USE_ONLY_NUMA_NODE0 variable in the Makefile to use all the CPUs available.\n\n## Acknowledgements 🙏\n\nWe thank [Google's TPU Research Cloud Program](https://sites.research.google/trc/about/) for supporting this work and providing access\nto TPU hardware 🌩️. We would also like to thank [CleanRL](https://github.com/vwxyzjn/cleanrl/tree/master) 🧽\nfor providing a fantastic implementation of [PPO](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool.py)\nwhich was used as a reference and adapted for our implementation of PPO.\n\n## Citation ✏️\n\n```bibtex\n@misc{picard2023,\n  author = {Armand Picard and Donal Byrne and Alexandre Laterre},\n  title = {Sebulba: Scaling reinforcement learning on cloud TPUs in JAX},\n  year = {2023},\n  publisher = {GitHub},\n  url = {https://github.com/instadeepai/sebulba}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finstadeepai%2Fsebulba","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finstadeepai%2Fsebulba","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finstadeepai%2Fsebulba/lists"}