{"id":16603696,"url":"https://github.com/thejj/ceph-balancer","last_synced_at":"2025-04-05T14:07:35.383Z","repository":{"id":51164335,"uuid":"296763747","full_name":"TheJJ/ceph-balancer","owner":"TheJJ","description":"Efficient Ceph placement optimization, aiming for maximum storage capacity through equal OSD utilization.","archived":false,"fork":false,"pushed_at":"2025-03-12T15:31:01.000Z","size":692,"stargazers_count":119,"open_issues_count":4,"forks_count":33,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-03-29T13:10:00.290Z","etag":null,"topics":["ceph","ceph-balancer","cluster-analysis","optimization-problem","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TheJJ.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-09-19T01:33:16.000Z","updated_at":"2025-03-12T15:31:05.000Z","dependencies_parsed_at":"2023-10-30T13:33:34.028Z","dependency_job_id":"ca4a77d7-2701-4a01-8a98-9c72239a0e54","html_url":"https://github.com/TheJJ/ceph-balancer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheJJ%2Fceph-balancer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheJJ%2Fceph-balancer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheJJ%2Fceph-balancer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheJJ%2Fceph-balancer/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TheJJ","download_url":"https://codeload.github.com/TheJJ/ceph-balancer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247345853,"owners_count":20924102,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ceph","ceph-balancer","cluster-analysis","optimization-problem","python"],"created_at":"2024-10-12T00:52:18.920Z","updated_at":"2025-04-05T14:07:35.365Z","avatar_url":"https://github.com/TheJJ.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"JJ's Ceph Balancer\n==================\n\n[Ceph](https://ceph.io)'s \"main\" design issue is equal data placement.\nOne mitigation is the [mgr balancer](https://docs.ceph.com/en/latest/rados/operations/balancer/).\n\nThis is an alternative Ceph balancer implementation.\n* The upstream balancer optimizes for (weighted) equal number of PGs on each OSD for each pool.\n* This balancer optimizes for equal OSD storage utilization and PG placement across all pools.\n\nFor most clusters, the `mgr balancer` works well.\nFor heterogeneous clusters with a lot of device and server capacity variance _and_ many pools, placement may be very bad - the reason this balancer was created.\n\n\n## How?\n\n* [Ceph Overview](https://docs.ceph.com/en/latest/start/intro/)\n* [Ceph Cheatsheet](https://github.com/TheJJ/ceph-cheatsheet)\n\n\n### Why balancing?\n\nA [data pool](https://docs.ceph.com/en/latest/rados/operations/pools/) is split into placement groups (PGs).\nThe number of PGs 
is configured via the `pg_num` pool attribute.\nHence, one PG roughly has size `pg_size = pool_size/pg_num`.\n\nPGs are placed on OSDs (each usually one disk), using constraints defined via [CRUSH](https://docs.ceph.com/en/latest/rados/operations/crush-map/).\nUsually PGs are spread across servers, so that when one server goes down, enough disks remain in other servers to keep the data available.\nThe number of OSDs for one PG is configured via the pool's `size` property.\n\nThere are two kinds of pools: replica and [erasure coded](https://en.wikipedia.org/wiki/Erasure_code).\n\nThe utilized space on one OSD is the sum of all PG shards that are stored on it.\nA PG shard is the data of one of the participating OSDs in one PG.\nFor replica-pools, the shard size equals `pg_size` - one full copy.\nFor EC-pools, the shard size equals `pg_size / k`, where `k` is the number of data chunks in the EC profile.\n\nCRUSH organizes all the datacenters, racks, servers, OSDs in a tree structure.\nEach subtree usually has the weight of all the OSDs below it, and PGs are then distributed evenly, weighted by the sub-tree size at each tree level.\nThat way big servers/racks/datacenters/... get more data than small ones, but the relative amount is the same.\n\nIn theory, each OSD should thus be filled to exactly the same relative level, all are e.g. 
30% full.\nIn practice, not so much:\n\nThe cluster, which was the motivation to create this balancer, had devices (same device class, weighted at 1.0) ranging from 55% to 80% size utilization.\n\nThe reason is this: The cluster has many pools of different sizes (at time of writing 46), OSD sizes vary from 1T to 14T, 4 to 40 OSDs per server.\nAnd the `mgr balancer` can't handle this.\n\n\n### `mgr balancer`\n\nCeph's included balancer optimizes by PG count on devices.\nIt does so by analyzing each pool independently, and then tries to move each pool's PGs so that each participating device has equal normalized PG counts.\nNormalized means placing double the PGs on a double-sized OSD.\n\nExample: Placing 600 PGs on a 2T and 4T OSD means each 1T gets `600PGs/(4T+2T) = 100PGs/T`, thus the 2T OSD gets 200PGs, the 4T OSD 400.\n\nPG counts are powers of two, so distributing them really equally will almost never work.\n\nBecause of this, the best possible solution is some OSDs having an offset of 1 PG to the ideal count.\nAs a PG-distribution-optimization is done per pool, without checking other pools' distribution at all, some devices will be the `+1` more often than others.\nAt worst one OSD is the `+1` **for each** pool in the cluster.\n\nThis OSD can then end up, for example, being 80% full, since it is the `+1` for 20 pools.\nThe one that never is the `+1` is 50% full.\nThat's bad and not balanced at all.\n\nAdditionally, the shard sizes of PGs are not equal - they shrink and grow with pool usage. The mgr-balancer only looks at the PG count, which remains exactly the same, so it does nothing.\n\nTo make things worse, if there's a huge server in the cluster which is so big that CRUSH can't place data often enough on it to fill it to the same level as any other server, the balancer will fail to move PGs across servers that actually would have space.\nThis happens since it sees only this server's OSDs as \"underfull\", but each PG has one shard on that server already, so no 
data can be moved to it. Even though there are likely other OSDs in the cluster (which may have variations from 90% to 50% full, and determine the available space because one of them is the fullest overall), these other OSDs are not balanced - the only considered balancing target is the big empty server.\n\nSo we have two main issues:\n* If you have multiple bigger pools, the `+1`-PG placement does not consider the global view (i.e. where other pools place the +1 PGs)\n* If there are too-empty buckets (which can't be filled more because of crush constraints), other buckets are no longer balanced\n\n\n### jj-balancer\n\nTo solve this, the main optimization goal is equal OSD utilization:\n\nGenerate candidate PG movements and validate them against the crush constraints, PG counts, utilization estimations, ...\nTo get PG candidates, order all OSDs by utilization (optionally only for one crush root).\nUtilization is estimated from all PGs where the OSD is in the `up` set (due to ongoing partial PG transfers).\nFrom the fullest OSD, try to move a \"suitable\" PG shard on it to the least-utilized OSD.\nIf this violates constraints, try the next least-utilized OSD, and so on, or try a different PG.\n\nOnce a suitable OSD is found, check if the new placement decreases the cluster utilization variance.\nIf this is the case, record the PG movement and try to move another PG with the same approach.\n\nThat way the balancer generates new \"upmap items\", i.e. 
movement instructions for a PG from some OSDs to better ones, which you can apply if you're satisfied with the results.\n\nIf this is repeated long enough, either all OSDs end up with very little utilization variance, or CRUSH constraints prevent further PG movements.\n\nSimplified pseudo code:\n\n```python\n# given an OSD we want to empty, which pg do we select?\ndef get_pg_move_candidates(sourceosd):\n    def compare(pg, otherpg):\n        return (pg.remapped \u003e otherpg.remapped or\n                pg.upmap_item_count \u003c otherpg.upmap_item_count or\n                pg.size \u003e otherpg.size)\n    return sort(pgs_on_osd(sourceosd), compare)\n\ntry_limit = 1\nmax_movements = 10\nmovements = []\nwhile True:\n    next_move:\n    if len(movements) \u003e= max_movements:\n        stop()\n\n    # try to empty the fullest OSDs first (descending utilization)\n    for i, from_osd in enumerate(osds_by_utilization_desc(crushroot)):\n        if i \u003e try_limit:\n            finish(f'could not empty {try_limit} fullest devices, stopping')\n            stop()\n\n        # try a suitable PG to move it away from a full osd\n        for pg in get_pg_move_candidates(from_osd):\n            # move it to the emptiest candidate osd first (ascending utilization)\n            for to_osd in osds_by_utilization_asc(candidate_osds_for(pg)):\n\n                # only move if constraints allow it\n                if (is_crush_move_valid(pg, from_osd, to_osd) and\n                    respects_pool_balance(pg, from_osd, to_osd) and\n                    cluster_utilization_variance_is_better(pg, from_osd, to_osd)):\n\n                    movements.append((pg, from_osd, to_osd))\n                    goto next_move\n\n# resulting movements\nfor movement in movements:\n    print(f\"ceph osd pg-upmap-items {generate_upmap(movement)}\")\n```\n\nRuntime:\n* Worst-case (the fullest OSD can't be emptied more): `O(OSDs * PGs)`\n* If, after that, we tried the second-fullest, third, and all others, it would be: `O(OSDs²*PGs)`\n\nLikely this can be optimized further.\n\n\n## 
Usage\n\n### Balancing\n\n```\n# to generate max 10 pg movements:\n./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps\n\n# -\u003e if you're satisfied, run: $ bash /tmp/balance-upmaps\n\n# but it can do more than balance!\n# there are some examples below.\n./placementoptimizer.py --help\n```\n\n### Cluster Information Display\n\nYou can use the balancer's size predictions to view how much space your cluster actually has free.\nCeph's integrated \"free space\" number is rather off: it doesn't consider the real pg placement and resulting space limits at all.\n\nPool utilization:\n```\n./placementoptimizer.py show                # pool free space for the current cluster state (in the acting-state)\n./placementoptimizer.py show --pgstate up   # when all movements are done (i.e. the up-state is reached)\n```\n\nOSD utilization:\n```\n./placementoptimizer.py show --osds --sort-utilization\n./placementoptimizer.py show --osds --per-pool-count\n./placementoptimizer.py show --osds --per-pool-count --sort-utilization --only-crushclass hdd\n```\n\nOngoing movement status:\n```\n./placementoptimizer.py showremapped\n./placementoptimizer.py showremapped --by-osd\n./placementoptimizer.py showremapped --by-osd --osds 13337,4242\n```\n\n### Dumping and Importing Cluster State\n\nFor debugging (or archiving), you can store all needed cluster state in a `.xz` file.\n```\n./placementoptimizer.py -v gather /tmp/lolfile.xz\n```\n\nTo use this file instead of the \"live\" cluster state, use `--state` in all the usual commands:\n\n```\n./placementoptimizer.py -v show --state /tmp/lolfile.xz\n./placementoptimizer.py showremapped --state /tmp/lolfile.xz\n./placementoptimizer.py balance --state /tmp/lolfile.xz\n```\n\n\n## Contributions\n\nThe script is not the prettiest (yet), but produces balancing-improvement movements.\n\nIdeally, with some further improvements and tuning, it could be integrated in upstream-Ceph as an alternative balancer implementation.\n\nSo 
if you have any ideas or suggestions on how to improve things, please submit issues and [pull requests](https://github.com/TheJJ/ceph-balancer/pulls).\n\n\n## Contact\n\nIf you have questions or suggestions, or encounter any problem,\nplease join our chat and ask!\n\n* Matrix Chat: [`#sfttech:matrix.org`](https://matrix.to/#/#sfttech:matrix.org)\n\nOf course, create [issues](https://github.com/TheJJ/ceph-balancer/issues)\nand [pull requests](https://github.com/TheJJ/ceph-balancer/pulls).\n\n\n### License\n\nReleased under the **GNU General Public License** version 3 or later,\nsee [COPYING](COPYING) and [LICENSE](LICENSE) for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthejj%2Fceph-balancer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthejj%2Fceph-balancer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthejj%2Fceph-balancer/lists"}