Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/oxidecomputer/omicron

Omicron: Oxide control plane
https://github.com/oxidecomputer/omicron

api

Last synced: 2 months ago
JSON representation

Omicron: Oxide control plane

Awesome Lists containing this project

README

        

:showtitle:
:toc: left
:icons: font

= Oxide Control Plane

This repo houses the work-in-progress Oxide Rack control plane.

image::https://github.com/oxidecomputer/omicron/workflows/Rust/badge.svg[]

Omicron is open-source. But we're pretty focused on our own goals for the foreseeable future and not able to help external contributors. Please see xref:CONTRIBUTING.md[] for more information.

== Documentation

https://docs.oxide.computer/api[Docs are automatically generated for the public (externally-facing) API] based on the OpenAPI spec that itself is automatically generated from the server implementation. You can generate your own docs for either the public API or any of the internal APIs by feeding the corresponding OpenAPI specs (in link:./openapi[]) into an OpenAPI doc generator.

There are some internal design docs in the link:./docs[] directory. You might start with link:./docs/control-plane-architecture.adoc[].

For more design documentation and internal Rust API docs, see the https://rust.docs.corp.oxide.computer/omicron/[generated Rust documentation]. You can generate this yourself with:

[source,text]
----
$ cargo doc --document-private-items
----

Note that `--document-private-items` is configured by default, so you can actually just use `cargo doc`.

Folks with access to Oxide RFDs may find RFD 48 ("Control Plane Requirements") and other control plane RFDs relevant. These are not currently publicly available.

== Build and run

Omicron has two modes of operation: "simulated" and "non-simulated".

The simulated version of Omicron allows the high-level control plane logic to run without
actually managing any sled-local resources. This version can be executed on Linux, Mac, and illumos.
This mode of operation is provided for development and testing only.

To build and run the simulated version of Omicron, see: xref:docs/how-to-run-simulated.adoc[].

The non-simulated version of Omicron actually manages sled-local resources, and may only
be executed on hosts running Helios.
This mode of operation will be used in production.

To build and run the non-simulated version of Omicron, see: xref:docs/how-to-run.adoc[].

=== Run tests with nextest

The supported way to run tests is via https://nexte.st/[cargo-nextest].

NOTE: `cargo test` will not work for many of our tests, since they rely on nextest-specific features.

If you don't already have nextest installed, get started by https://nexte.st/book/pre-built-binaries[downloading a pre-built binary] or installing nextest via your package manager. Nextest has pre-built binaries for Linux, macOS and illumos.

Then, run tests with:

[source,text]
----
$ cargo nextest run
----

Nextest https://github.com/nextest-rs/nextest/issues/16[does not support doctests]. Run doctests separately with `cargo test --doc`.

Similarly, you can run tests inside a https://github.com/oxidecomputer/falcon[Falcon] based VM. This is described in the `test-utils` https://github.com/oxidecomputer/omicron/tree/main/test-utils[README].

There's also a xref:./live-tests/README.adoc[`live-tests`] test suite that can be run by hand in a _deployed_ Omicron system.

=== rustfmt and clippy

You can **format the code** using `cargo fmt`. Make sure to run this before pushing changes. The CI checks that the code is correctly formatted.

You can **run the https://github.com/rust-lang/rust-clippy[Clippy linter]** using `cargo xtask clippy`. CI checks that code is clippy-clean.

== Working in Omicron

Omicron is a pretty large repo containing a bunch of related components. (Why? See xref:docs/repo.adoc[].) If you just build the whole thing with `cargo build` or `cargo nextest run`, it can take a while, even for incremental builds. Since most people are only working on a few of these components at a time, it's helpful to be know about Cargo's tools for working with individual packages in a workspace.

NOTE: This section assumes you're already familiar with the prerequisites and environment setup needed to do _any_ work on Omicron. See xref:docs/how-to-run-simulated.adoc[] or xref:docs/how-to-run.adoc[] for more on that.

=== Key tips

* Use `cargo check` when you just want to know if your code compiles. It's _much_ faster than `cargo build` or `cargo nextest run`.
* When using Cargo's check/build/test/clippy commands, you can use the `-p PACKAGE` flag to only operate on a specific package. This often saves a lot of time for incremental builds.
* When using Cargo's check/build/clippy commands, use `--all-targets` to make sure you're checking or building the test code, too.

These are explained a bit more below, along with some common pitfalls.

Here's an example workflow. Suppose you're working on some changes to the Nexus database model (`nexus-db-model` package, located at `nexus/db-model` from the root). While you're actively writing and checking code, you might run:

```
cargo check --all-targets
```

_without_ any `-p` flag. Running this incrementally is pretty fast even on the whole workspace. This also uncovers places where your changes have broken code that uses this package. (If you're making big changes, you might not want that right away. In that case, you might choose to use `-p nexus-db-model` here.)

When you're ready to test the changes you've made, start with building and running tests for the most specific package you've changed:

```
cargo nextest run -p nexus-db-model
```

Once that works, check the tests for the next package up:

```
cargo nextest run -p omicron-nexus
```

When you're happy with things and want to make sure you haven't missed something, test everything:

```
cargo nextest run
```

=== Adding a new system library dependency

We check that certain system library dependencies are not leaked outside of their intended binaries via `cargo xtask verify-libraries` in CI. If you are adding a new dependency on a illumos/helios library it is recommended that you update xref:.cargo/xtask.toml[] with an allow list of where you expect the dependency to show up. For example some libraries such as `libnvme.so.1` are only available in the global zone and therefore will not be present in any other zone. This check is here to help us catch any leakage before we go to deploy on a rack. You can inspect a compiled binary in the target directory for what it requires by using `elfedit` - for example `elfedit -r -e 'dyn:tag NEEDED' /path/to/omicron/target/debug/sled-agent`.

=== Checking feature flag combinations

To ensure that varying combinations of features compile, run `cargo xtask check-features`, which executes the https://github.com/taiki-e/cargo-hack[`cargo hack`] subcommand under the hood.

This `xtask` is run in CI using the `--ci` parameter , which automatically exludes certain `image-*` features that purposefully cause compiler errors if set and uses a pre-built binary.

If `cargo hack` is not already installed in omicron's `out/` directory, a pre-built binary will be installed automatically depending on your operating system and architecture.

To limit the max number of simultaneous feature flags combined for checking, run the `xtask` with the `--depth ` flag:

[source,text]
----
$ cargo xtask check-features --depth 2
----

=== Rust packages in Omicron

NOTE: The term "package" is overloaded: most programming languages and operating systems have their own definitions of a package. On top of that, Omicron bundles up components into our own kind of "package" that gets delivered via the install and update systems. These are described in the `package-manifest.toml` file in the root of the repo. In this section, we're just concerned with Rust packages.

NOTE: There's also confusion in the Rust world about the terms https://doc.rust-lang.org/book/ch07-01-packages-and-crates.html["packages" and "crates"]. _Packages_ are the things that have a Cargo.toml file. (Workspaces like Omicron itself have Cargo.toml files, too.) Packages are also the things that you publish to crates.io (confusingly). One package might have a library, a standalone executable binary, several examples, integration tests, etc. that are all compiled individually and produce separate artifacts. These are what Rust calls _crates_. We're generally just concerned with packages here, not crates.

Here are some of the big components in the control plane that live in this repo:

[cols="1,1,4",options="header"]
|===
|Main rust package
|Component
|Description

|omicron-nexus
|Nexus
|Service responsible for handling external API requests and orchestrating the rest of the control plane.

|omicron-sled-agent
|Sled Agent
|Service that runs on each compute sled (server) to manage resources on that Sled

|dns-server
|Internal DNS server, External DNS server
|DNS server component used for both internal service discovery and external DNS

|omicron-gateway
|Management Gateway Service
|Connects Nexus (and other control plane services) to services on the rack management network (e.g., service processors)

|oximeter/oximeter
|Oximeter
|Collects telemetry from other services and stores it into Clickhouse

|wicket/wicketd
|Wicket
|CLI interface made available to operators on the rack technician port for rack setup and recovery

|===

For those with access to Oxide RFDs, RFD 61 discusses the organization principles and key components in more detail.

Many of these components themselves are made up of other packages (e.g., `nexus-db-model` is under `omicron-nexus`). There are also many more top-level packages than what's mentioned above. These are used for common code, clients, tools, etc. For more, see the Rustdoc for each module. (Where docs are missing or incomplete, please contribute!)

Use Cargo's `-p PACKAGE` to check/build/test only the package you're working on. Since people are usually only working on one or two components at a time, you can usually iterate faster this way.

=== Workspace management

Omicron uses `cargo-hakari` to ensure that all workspace dependencies enable the same set of features. This dramatically improves compilation time when switching between different subsets of packages (e.g. `-p wicket` or `-p nexus-db-model`), because the sets of enabled features remain consistent.

`cargo hakari` status is checked in CI; if the CI check fails, then update the configuration locally with

```
cargo install cargo-hakari --locked # only needed on the first run
cargo hakari generate
cargo hakari manage-deps
```

=== Why am I getting compile errors after I thought I'd already built everything?

Say you're iterating on code, running `cargo build -p nexus-db-model` to build just that package. You work through lots of compiler errors until finally it works. Now you run tests: `cargo nextest run -p nexus-db-model`. Now you see a bunch of compiler errors again! What gives?

By default, Cargo does not operate on the tests. Cargo's check/build/clippy commands ignore them. This is another reason we suggest using `--all-targets` most of the time.

=== Generated Service Clients and Updating

Each service is a Dropshot server that presents an HTTP API. The description of
that API is serialized as an
https://github.com/OAI/OpenAPI-Specification[OpenAPI] document which we store
in link:./openapi[`omicron/openapi`] and check in to this repo. Checking in
these generated files allows us:

. To catch accidental changes as test failures.
. To explicitly observe API changes during code review (and in the git history).

We also use these OpenAPI documents as the source for the clients we generate
using https://github.com/oxidecomputer/progenitor[Progenitor]. Clients are
automatically updated when the coresponding OpenAPI document is modified.

OpenAPI documents are tracked by the `cargo xtask openapi` command.

* To regenerate all OpenAPI documents, run `cargo xtask openapi generate`.
* To check whether all OpenAPI documents are up-to-date, run `cargo xtask
openapi check`.

For more information, see the documentation in
link:./dev-tools/openapi-manager[`dev-tools/openapi-manager`].

Note that Omicron contains a nominally circular dependency:

* Nexus depends on the Sled Agent client
* The Sled Agent client is derived from the OpenAPI document emitted by Sled Agent
* Sled Agent depends on the Nexus client
* The Nexus client is derived from the OpenAPI document emitted by Nexus

We effectively "break" this circular dependency by virtue of the OpenAPI
documents being checked in.

=== Resolving merge conflicts in Cargo.lock

When pulling in new changes from upstream "main", you may find conflicts in Cargo.lock. The easiest way to deal with these is usually to take the upstream changes as-is, then trigger any Cargo operation that updates the lockfile. `cargo metadata` is a quick one. Here's an example:

```
# Pull in changes from upstream "main"
$ git fetch
$ git merge origin/main

# Oh no! We've got conflicts in Cargo.lock. First, let's just take what's upstream:
$ git show origin/main:Cargo.lock > Cargo.lock

# Now, run any command that causes Cargo to update the lock file as needed.
$ cargo metadata > /dev/null
```

When you do this, Cargo makes only changes to Cargo.lock that are necessary based on the various Cargo.toml files in the workspace and dependencies.

Here are things you _don't_ want to do to resolve this conflict:

* Run `cargo generate-lockfile` to generate a new lock file from scratch.
* Remove `Cargo.lock` and let Cargo regenerate it from scratch.

Both of these will cause Cargo to make many more changes (relative to "main") than necessary because it's choosing the latest version of all dependencies in the whole tree. You'll be inadvertently updating all of Omicron's transitive dependencies. (You might conceivably want that. But usually we update dependencies either as-needed for a particular change or via individual PRs via dependabot, not all at once because someone had to merge Cargo.lock.)

You can also resolve conflicts by hand. It's tedious and error-prone.

=== Configuring ClickHouse

The ClickHouse binary uses several sources for its configuration. The binary expects an XML
config file, usually named `config.xml` to be available, or one may be specified with the
`-C` command-line flag. The binary also includes a minimal configuration _embedded_ within
it, which will be used if no configuration file is given or present in the current directory.
The server also accepts command-line flags for overriding the values of the configuration
parameters.

The packages downloaded by `cargo xtask download clickhouse` include a `config.xml` file with them.
You should probably run ClickHouse via the `ch-dev` tool, but if you decide to run it
manually, you can start the server with:

[source,text]
$ /path/to/clickhouse server --config-file /path/to/config.xml

The configuration file contains a large number of parameters, but most of them are described
with comments in the included `config.xml`, or you may learn more about them
https://clickhouse.tech/docs/en/operations/server-configuration-parameters/settings/[here]
and https://clickhouse.tech/docs/en/operations/settings/[here]. Parameters may be updated
in the `config.xml`, and the server will automatically reload them. You may also specify
many of them on the command-line with:

[source,text]
$ /path/to/clickhouse server --config-file /path/to/config.xml -- --param_name param_value ...