https://github.com/converged-computing/fluxnetes

Fluxion in-tree scheduler for Kubernetes (under development)
https://github.com/converged-computing/fluxnetes

Last synced: 5 months ago
JSON representation

Fluxion in-tree scheduler for Kubernetes (under development)

Host: GitHub
URL: https://github.com/converged-computing/fluxnetes
Owner: converged-computing
License: mit
Created: 2024-07-10T02:52:32.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-12-08T02:28:57.000Z (over 1 year ago)
Last Synced: 2025-09-10T05:36:56.758Z (9 months ago)
Language: Go
Size: 12.1 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE
- Copyright: COPYRIGHT

Awesome Lists containing this project

README

# Fluxnetes

![docs/images/fluxnetes.png](docs/images/fluxnetes.png)

Fluxnetes enables is a combination of Kubernetes and [Fluence](https://github.com/flux-framework/flux-k8s), both of which use the HPC-grade pod scheduling [Fluxion scheduler](https://github.com/flux-framework/flux-sched) to schedule pod groups to nodes. For our queue, we use [river](https://riverqueue.com/docs) backed by a Postgres database. The database is deployed alongside fluence and could be customized to use an operator instead.

**Important** This is an experiment, and is under development. I will change this design a million times - it's how I tend to learn and work. I'll share updates when there is something to share. It deploys but does not work yet!
See the [docs](docs) for some detail on design choices.

## Design

Fluxnetes builds three primary containers:

- `ghcr.io/converged-computing/fluxnetes`: contains a custom kube-scheduler build with flux as the primary scheduler.
- `ghcr.io/converged-computing/fluxnetes-sidecar`: provides the fluxion service, queue for pods and groups, and a second service that will expose a kubectl command for inspection of state.
- `ghcr.io/converged-computing/fluxnetes-postgres`: holds the worker queue and provisional queue tables

The overall design is an experiment to blow up the internal "single pod" queue, and replace with using the fluxion (Flux Framework scheduler) instead. For this prototype, we will implement a queue service alongside Fluxion, and the main `schedule_one.go` logic interacts with this setup to assemble groups and submit them to the queue manager, ultimately to be run on the Kubernetes cluster.

## Deploy

Create a kind cluster. You need more than a control plane.

```bash
kind create cluster --config ./examples/kind-config.yaml
```

Install the certificate manager:

```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.1/cert-manager.yaml
```

Then you can deploy as follows:

```bash
./hack/quick-build-kind.sh
```
You'll then have the fluxnetes service running, a postgres database (for the job queue), along with the scheduler plugins controller, which we
currently have to use PodGroup.

```bash
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
fluxnetes-6954cdcf64-gv7s7 2/2 Running 0 87s
postgres-c8d55999c-t6dtt 1/1 Running 0 87s
scheduler-plugins-controller-8676df7769-jvtwp 1/1 Running 0 87s
```

You can then create a job:

```bash
kubectl apply -f examples/job.yaml
```

which will create each of a PodGroup and the associated job, which will run:

```bash
$ kubectl logs job-n8sfg
```
```console
potato
```

And complete.

```bash
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
fluxnetes-7bbb588944-4bq6w 1/2 Running 2 (20m ago) 21m
job-8xsqr 0/1 Completed 0 19m
postgres-597db46977-srnln 1/1 Running 0 21m
scheduler-plugins-controller-8676df7769-wg5xl 1/1 Running 0 21m
```

You can also try submitting a bath of jobs:

```bash
kubectl apply -f examples/batch/
```
```console
kubectl get pods
NAME READY STATUS RESTARTS AGE
fluxnetes-7bbb588944-49skf 1/2 Running 3 (58s ago) 81s
job1-k4v22 0/1 Completed 0 17s
job2-8bmf7 0/1 Completed 0 17s
job2-fmhrb 0/1 Completed 0 17s
job3-7gp4n 0/1 Completed 0 17s
job3-kfrt2 0/1 Completed 0 17s
postgres-597db46977-sl4cm 1/1 Running 0 81s
scheduler-plugins-controller-8676df7769-zgzsj 1/1 Running 0 81s
```

And that's it! This is fully working, but this only means that we are going to next work on the new design.
See [docs](docs) for notes on that.

## Development

### Viewing Logs

You can view the scheduler logs as follows:

```bash
$ kubectl logs fluxnetes-7bbb588944-ss4jn -c scheduler
```

Fluxnetes Logs

```console
I0730 01:51:17.791122
W0730 01:51:17.795420
I0730 01:51:19.965133
I0730 01:51:19.965205
I0730 01:51:19.973277
I0730 01:51:19.973402
I0730 01:51:19.973485
I0730 01:51:19.973702
I0730 01:51:19.975425
I0730 01:51:19.975539
I0730 01:51:19.975596
I0730 01:51:19.975628
I0730 01:51:20.073842
I0730 01:51:20.073943
I0730 01:51:20.075687
I0730 01:51:20.076874
I0730 01:51:20.183696
I0730 01:51:25.184220
I0730 01:51:30.184836
...
I0730 01:52:10.183816
I0730 01:52:13.277200
E0730 01:52:13.277836
I0730 01:52:13.286840
I0730 01:52:13.286890
I0730 01:52:13.288115
I0730 01:52:13.298948
I0730 01:52:13.334900
I0730 01:52:13.335322
I0730 01:52:13.335354
I0730 01:52:13.349266
I0730 01:52:13.382286
I0730 01:52:13.470426
I0730 01:52:13.471628
...
I0730 01:52:15.184586
``` 1 serving.go:386] Generated self-signed cert in-memory 1 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. 1 server.go:154] "Starting Kubernetes Scheduler" version="v0.0.0-master+$Format:%H$" 1 server.go:156] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK="" 1 secure_serving.go:213] Serving securely on [::]:10259 1 requestheader_controller.go:172] Starting RequestHeaderAuthRequestController 1 shared_informer.go:313] Waiting for caches to sync for RequestHeaderAuthRequestController 1 tlsconfig.go:243] "Starting DynamicServingCertificateController" 1 configmap_cafile_content.go:205] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file" 1 shared_informer.go:313] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 1 configmap_cafile_content.go:205] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" 1 shared_informer.go:313] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 1 shared_informer.go:320] Caches are synced for RequestHeaderAuthRequestController 1 scheduler.go:464] "[FLUXNETES]" Starting="queue" 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 1 shared_informer.go:320] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 1 client.go:773] "River client started" id="Fluxnetes" client_id="fluxnetes-7bbb588944-4bq6w_2024_07_30T01_51_20_074419" 1 producer.go:541] "producer: Heartbeat" id="Fluxnetes" num_completed_jobs=0 num_jobs_running=0 queue="default" 1 producer.go:541] "producer: Heartbeat" id="Fluxnetes" num_completed_jobs=0 num_jobs_running=0 queue="default" 1 producer.go:541] "producer: Heartbeat" id="Fluxnetes" num_completed_jobs=0 num_jobs_running=0 queue="default" 1 queue.go:131] Pod job-8xsqr has Group job (1) created at 2024-07-30 01:52:13 +0000 UTC 1 provisional.go:58] Did not find pod job-8xsqr in group &{job %!s(int32=1) 2024-07-30 01:52:13 +0000 UTC} in table 1 provisional.go:89] GROUP NAMES [job] 1 provisional.go:112] GET select group_name, group_size, podspec from pods_provisional where group_name in ('job'); 1 provisional.go:98] DELETE delete from pods_provisional where group_name in ('job'); 1 queue.go:156] [Fluxnetes] Schedule inserted 1 jobs 1 workers.go:54] [WORKER] JobStatus Running for group job 1 resources.go:79] [Jobspec] Pod spec: CPU 1, memory 0, GPU 0, storage 0 1 workers.go:67] Prepared pod jobspec id:"job" container:"job" cpu:1 1 workers.go:96] Fluxion response %spodID:"job" nodelist:{nodeID:"kind-worker" tasks:1} jobID:1 1 workers.go:128] [Fluxnetes] nodes allocated kind-worker for flux job id 0 1 scheduler.go:501] Got job with state completed and nodes: [kind-worker] 1 scheduler.go:548] Pod {{ } {job-8xsqr job- default 79cb0091-beb0-4093-b97d-f473f4729efa 16335 1 producer.go:541] "producer: Heartbeat" id="Fluxnetes" num_completed_jobs=1 num_jobs_running=0 queue="default"

### Debugging Postgres

It is often helpful to shell into the postgres container to see the database directly:

```bash
kubectl exec -it postgres-597db46977-9lb25 bash
psql -U postgres

# Connect to database
\c

# list databases
\l

# show tables
\dt

# test a query
SELECT group_name, group_size from pods_provisional;
```

### TODO

- [ ] kubectl plugin to get fluxion state?
- [ ] Figure out how In-tree registry plugins (that are related to resources) should be run to inform fluxion
- we likely want to move assume pod outside of that schedule function, or ensure pod passed matches.
- [ ] Optimize queries.
- [ ] Restarting with postgres shouldn't have crashloopbackoff when the database isn't ready yet
- [ ] The queue should inherit (and return) the start time (when the pod was first seen) "start" in scheduler.go
- Testing:
- [ ] need to test duration / completion time works (run job with short duration, should be cancelled/cleaned up)
- [ ] spam submission and test reservations (and cancel)
- [ ] implement other queue strategies (fcfs and backfill with > 1 reservation depth)
- fcfs can work by only adding one job (first in provisional) to the worker queue at once, only when it's empty! lol.
- [ ] In cleanup do we need to handle [BlockOwnerDeletion](https://github.com/kubernetes/kubernetes/blob/dbc2b0a5c7acc349ea71a14e49913661eaf708d2/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L319). I don't yet understand the cases under which this is used, but likely we want to delete the child object and allow the owner to do whatever is the default (create another pod, etc.)

Thinking:

- We can allow trying to schedule jobs in the future, although I'm not sure about that use case (add label to do this)
- What should we do if a pod is updated, and the group is removed?
- fluxion is deriving the nodes on its own, but we might get updated nodes from the scheduler. It might be good to think about how to use the fluxion-service container instead.
- more efficient to retrieve podspec from kubernetes instead of putting into database?

TODO:

- test job that has too many resources and won't pass (it should not make it to provisional or pending_queue)
- can we do a satisfies first?
- we probably need a unique on the insert...
- when that works, a pod that is completed / done needs to be removed from pending

## License

HPCIC DevTools is distributed under the terms of the MIT license.
All new contributions must be made under this license.

See [LICENSE](https://github.com/converged-computing/cloud-select/blob/main/LICENSE),
[COPYRIGHT](https://github.com/converged-computing/cloud-select/blob/main/COPYRIGHT), and
[NOTICE](https://github.com/converged-computing/cloud-select/blob/main/NOTICE) for details.

SPDX-License-Identifier: (MIT)

LLNL-CODE- 842614

### Fluence

The original fluence code (for which some partial is here) is covered under [LICENSE](.github/LICENSE.fluence):

SPDX-License-Identifier: Apache-2.0

LLNL-CODE-764420

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/converged-computing/fluxnetes

Awesome Lists containing this project

README