{"id":36688542,"url":"https://github.com/converged-computing/fluxnetes","last_synced_at":"2026-01-12T11:17:12.440Z","repository":{"id":247892336,"uuid":"826598123","full_name":"converged-computing/fluxnetes","owner":"converged-computing","description":"Fluxion in-tree scheduler for Kubernetes (under development)","archived":false,"fork":false,"pushed_at":"2024-12-08T02:28:57.000Z","size":12637,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-10T05:36:56.758Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/converged-computing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":"COPYRIGHT","agents":null,"dco":null,"cla":null}},"created_at":"2024-07-10T02:52:32.000Z","updated_at":"2024-10-11T01:16:58.000Z","dependencies_parsed_at":"2024-08-08T03:26:44.890Z","dependency_job_id":"2fd6156e-8946-4ebc-81c3-55e1f7e4ce4e","html_url":"https://github.com/converged-computing/fluxnetes","commit_stats":null,"previous_names":["converged-computing/fluxnetes"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/converged-computing/fluxnetes","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Ffluxnetes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Ffluxnetes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/converged-computing%2Ffluxnetes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Ffluxnetes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/converged-computing","download_url":"https://codeload.github.com/converged-computing/fluxnetes/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Ffluxnetes/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28338970,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-12T10:58:46.209Z","status":"ssl_error","status_checked_at":"2026-01-12T10:58:42.742Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-12T11:17:11.672Z","updated_at":"2026-01-12T11:17:12.430Z","avatar_url":"https://github.com/converged-computing.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fluxnetes\n\n![docs/images/fluxnetes.png](docs/images/fluxnetes.png)\n\nFluxnetes is a combination of Kubernetes and [Fluence](https://github.com/flux-framework/flux-k8s), both of which use the HPC-grade [Fluxion scheduler](https://github.com/flux-framework/flux-sched) to schedule pod groups to nodes. 
For our queue, we use [river](https://riverqueue.com/docs) backed by a Postgres database. The database is deployed alongside fluence and could be customized to use an operator instead.\n\n**Important** This is an experiment, and is under development. I will change this design a million times - it's how I tend to learn and work. I'll share updates when there is something to share. It deploys but does not work yet!\nSee the [docs](docs) for some detail on design choices.\n\n## Design\n\nFluxnetes builds three primary containers:\n\n - `ghcr.io/converged-computing/fluxnetes`: contains a custom kube-scheduler build with flux as the primary scheduler.\n - `ghcr.io/converged-computing/fluxnetes-sidecar`: provides the fluxion service, queue for pods and groups, and a second service that will expose a kubectl command for inspection of state.\n - `ghcr.io/converged-computing/fluxnetes-postgres`: holds the worker queue and provisional queue tables.\n\nThe overall design is an experiment to blow up the internal \"single pod\" queue and replace it with Fluxion (the Flux Framework scheduler). For this prototype, we will implement a queue service alongside Fluxion, and the main `schedule_one.go` logic interacts with this setup to assemble groups and submit them to the queue manager, ultimately to be run on the Kubernetes cluster. \n\n## Deploy\n\nCreate a kind cluster. 
You need more than a control plane.\n\n```bash\nkind create cluster --config ./examples/kind-config.yaml\n```\n\nInstall the certificate manager:\n\n```bash\nkubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.1/cert-manager.yaml\n```\n\nThen you can deploy as follows:\n\n```bash\n./hack/quick-build-kind.sh\n```\nYou'll then have the fluxnetes service running, a postgres database (for the job queue), along with the scheduler plugins controller, which we\ncurrently need in order to use PodGroup.\n\n```bash\n$ kubectl get pods\nNAME                                            READY   STATUS    RESTARTS   AGE\nfluxnetes-6954cdcf64-gv7s7                      2/2     Running   0          87s\npostgres-c8d55999c-t6dtt                        1/1     Running   0          87s\nscheduler-plugins-controller-8676df7769-jvtwp   1/1     Running   0          87s\n```\n\nYou can then create a job:\n\n```bash\nkubectl apply -f examples/job.yaml\n```\n\nwhich will create both a PodGroup and the associated job, which will run:\n\n```bash\n$ kubectl logs job-n8sfg \n```\n```console\npotato\n```\n\nAnd complete.\n\n```bash\n$ kubectl get pods\nNAME                                            READY   STATUS      RESTARTS      AGE\nfluxnetes-7bbb588944-4bq6w                      1/2     Running     2 (20m ago)   21m\njob-8xsqr                                       0/1     Completed   0             19m\npostgres-597db46977-srnln                       1/1     Running     0             21m\nscheduler-plugins-controller-8676df7769-wg5xl   1/1     Running     0             21m\n```\n\nYou can also try submitting a batch of jobs:\n\n```bash\nkubectl apply -f examples/batch/\n```\n```console\nkubectl get pods\nNAME                                            READY   STATUS      RESTARTS      AGE\nfluxnetes-7bbb588944-49skf                      1/2     Running     3 (58s ago)   81s\njob1-k4v22                                      0/1     Completed   0             
17s\njob2-8bmf7                                      0/1     Completed   0             17s\njob2-fmhrb                                      0/1     Completed   0             17s\njob3-7gp4n                                      0/1     Completed   0             17s\njob3-kfrt2                                      0/1     Completed   0             17s\npostgres-597db46977-sl4cm                       1/1     Running     0             81s\nscheduler-plugins-controller-8676df7769-zgzsj   1/1     Running     0             81s\n```\n\nAnd that's it! This is fully working, but this only means that we are going to next work on the new design.\nSee [docs](docs) for notes on that.\n\n## Development\n\n### Viewing Logs\n\nYou can view the scheduler logs as follows:\n\n```bash\n$ kubectl logs fluxnetes-7bbb588944-ss4jn -c scheduler\n```\n\n\u003cdetails\u003e\n\n\u003csummary\u003eFluxnetes Logs\u003c/summary\u003e\n\n```console\nI0730 01:51:17.791122       1 serving.go:386] Generated self-signed cert in-memory\nW0730 01:51:17.795420       1 client_config.go:659] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  
This might not work.\nI0730 01:51:19.965133       1 server.go:154] \"Starting Kubernetes Scheduler\" version=\"v0.0.0-master+$Format:%H$\"\nI0730 01:51:19.965205       1 server.go:156] \"Golang settings\" GOGC=\"\" GOMAXPROCS=\"\" GOTRACEBACK=\"\"\nI0730 01:51:19.973277       1 secure_serving.go:213] Serving securely on [::]:10259\nI0730 01:51:19.973402       1 requestheader_controller.go:172] Starting RequestHeaderAuthRequestController\nI0730 01:51:19.973485       1 shared_informer.go:313] Waiting for caches to sync for RequestHeaderAuthRequestController\nI0730 01:51:19.973702       1 tlsconfig.go:243] \"Starting DynamicServingCertificateController\"\nI0730 01:51:19.975425       1 configmap_cafile_content.go:205] \"Starting controller\" name=\"client-ca::kube-system::extension-apiserver-authentication::client-ca-file\"\nI0730 01:51:19.975539       1 shared_informer.go:313] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file\nI0730 01:51:19.975596       1 configmap_cafile_content.go:205] \"Starting controller\" name=\"client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file\"\nI0730 01:51:19.975628       1 shared_informer.go:313] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file\nI0730 01:51:20.073842       1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController\nI0730 01:51:20.073943       1 scheduler.go:464] \"[FLUXNETES]\" Starting=\"queue\"\nI0730 01:51:20.075687       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file\nI0730 01:51:20.076874       1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file\nI0730 01:51:20.183696       1 client.go:773] \"River client started\" id=\"Fluxnetes\" 
client_id=\"fluxnetes-7bbb588944-4bq6w_2024_07_30T01_51_20_074419\"\nI0730 01:51:25.184220       1 producer.go:541] \"producer: Heartbeat\" id=\"Fluxnetes\" num_completed_jobs=0 num_jobs_running=0 queue=\"default\"\nI0730 01:51:30.184836       1 producer.go:541] \"producer: Heartbeat\" id=\"Fluxnetes\" num_completed_jobs=0 num_jobs_running=0 queue=\"default\"\n...\nI0730 01:52:10.183816       1 producer.go:541] \"producer: Heartbeat\" id=\"Fluxnetes\" num_completed_jobs=0 num_jobs_running=0 queue=\"default\"\nI0730 01:52:13.277200       1 queue.go:131] Pod job-8xsqr has Group job (1) created at 2024-07-30 01:52:13 +0000 UTC\nE0730 01:52:13.277836       1 provisional.go:58] Did not find pod job-8xsqr in group \u0026{job %!s(int32=1) 2024-07-30 01:52:13 +0000 UTC} in table\nI0730 01:52:13.286840       1 provisional.go:89] GROUP NAMES [job]\nI0730 01:52:13.286890       1 provisional.go:112] GET select group_name, group_size, podspec from pods_provisional where group_name in ('job');\nI0730 01:52:13.288115       1 provisional.go:98] DELETE delete from pods_provisional where group_name in ('job');\nI0730 01:52:13.298948       1 queue.go:156] [Fluxnetes] Schedule inserted 1 jobs\nI0730 01:52:13.334900       1 workers.go:54] [WORKER] JobStatus Running for group job\nI0730 01:52:13.335322       1 resources.go:79] [Jobspec] Pod spec: CPU 1, memory 0, GPU 0, storage 0\nI0730 01:52:13.335354       1 workers.go:67] Prepared pod jobspec id:\"job\"  container:\"job\"  cpu:1\nI0730 01:52:13.349266       1 workers.go:96] Fluxion response %spodID:\"job\"  nodelist:{nodeID:\"kind-worker\"  tasks:1}  jobID:1\nI0730 01:52:13.382286       1 workers.go:128] [Fluxnetes] nodes allocated kind-worker for flux job id 0\nI0730 01:52:13.470426       1 scheduler.go:501] Got job with state completed and nodes: [kind-worker]\nI0730 01:52:13.471628       1 scheduler.go:548] Pod {{ } {job-8xsqr job- default  79cb0091-beb0-4093-b97d-f473f4729efa 16335\n...\nI0730 01:52:15.184586       1 
producer.go:541] \"producer: Heartbeat\" id=\"Fluxnetes\" num_completed_jobs=1 num_jobs_running=0 queue=\"default\"\n```\n\n\u003c/details\u003e\n\n### Debugging Postgres\n\nIt is often helpful to shell into the postgres container to see the database directly:\n\n```bash\nkubectl exec -it postgres-597db46977-9lb25 bash\npsql -U postgres\n\n# Connect to database \n\\c\n\n# list databases\n\\l\n\n# show tables\n\\dt\n\n# test a query\nSELECT group_name, group_size from pods_provisional;\n```\n\n### TODO\n\n- [ ] kubectl plugin to get fluxion state?\n- [ ] Figure out how In-tree registry plugins (that are related to resources) should be run to inform fluxion\n   - we likely want to move assume pod outside of that schedule function, or ensure pod passed matches.\n- [ ] Optimize queries.\n- [ ] Restarting with postgres shouldn't have crashloopbackoff when the database isn't ready yet\n- [ ] The queue should inherit (and return) the start time (when the pod was first seen) \"start\" in scheduler.go\n- Testing:\n  - [ ] need to test duration / completion time works (run job with short duration, should be cancelled/cleaned up)\n  - [ ] spam submission and test reservations (and cancel)\n- [ ] implement other queue strategies (fcfs and backfill with \u003e 1 reservation depth)\n  - fcfs can work by only adding one job (first in provisional) to the worker queue at once, only when it's empty! lol.\n- [ ] In cleanup do we need to handle [BlockOwnerDeletion](https://github.com/kubernetes/kubernetes/blob/dbc2b0a5c7acc349ea71a14e49913661eaf708d2/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L319). 
I don't yet understand the cases under which this is used, but likely we want to delete the child object and allow the owner to do whatever is the default (create another pod, etc.)\n\nThinking:\n\n- We can allow trying to schedule jobs in the future, although I'm not sure about that use case (add label to do this)\n- What should we do if a pod is updated, and the group is removed?\n- fluxion is deriving the nodes on its own, but we might get updated nodes from the scheduler. It might be good to think about how to use the fluxion-service container instead.\n- more efficient to retrieve podspec from kubernetes instead of putting into database?\n\nTODO:\n\n- test job that has too many resources and won't pass (it should not make it to provisional or pending_queue)\n  - can we do a satisfies first?\n  - we probably need a unique on the insert...\n- when that works, a pod that is completed / done needs to be removed from pending\n\n## License\n\nHPCIC DevTools is distributed under the terms of the MIT license.\nAll new contributions must be made under this license.\n\nSee [LICENSE](https://github.com/converged-computing/cloud-select/blob/main/LICENSE),\n[COPYRIGHT](https://github.com/converged-computing/cloud-select/blob/main/COPYRIGHT), and\n[NOTICE](https://github.com/converged-computing/cloud-select/blob/main/NOTICE) for details.\n\nSPDX-License-Identifier: (MIT)\n\nLLNL-CODE-842614\n\n### Fluence\n\nThe original fluence code (part of which is included here) is covered under [LICENSE](.github/LICENSE.fluence):\n\nSPDX-License-Identifier: Apache-2.0\n\nLLNL-CODE-764420\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconverged-computing%2Ffluxnetes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconverged-computing%2Ffluxnetes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconverged-computing%2Ffluxnetes/lists"}