Block multiple sled reservations with the same gen by jmpesp · Pull Request #10479 · oxidecomputer/omicron

jmpesp · 2026-05-21T16:56:39Z

If multiple instance-start sagas are concurrently attempting to allocate for the same instance, this temporarily results in multiple rows in sled_resource_vmm with different propolis ids for the same instance id. One of the instance-start sagas will succeed, where the other(s) will unwind (due to an "instance changed state before it could be started" error from sis_move_to_starting), and remove the sled_resource_vmm record that they added by matching on that saga's propolis id.

There's never been a uniqueness constraint for instance id in the sled_resource_vmm table, because there can't be, otherwise we'd never be able to migrate an instance (which makes a new record on a different sled for the same instance).

For an instance start that performs any new local storage allocation, this is a problem: the latent assumption in inserting / updating local storage related records is that this type of duplication could not occur, that if the insert succeeded then it means the allocation will only be performed once. Because this is not true the CTE will happily stomp all over the local storage allocation related records and that leads to the orphaning seen in the linked issue.

The fix is to add a uniqueness constraint to sled_resource_vmm that ensures only one record for a given instance id plus the instance state generation number exists. This will not affect migration because the instance state generation is bumped in that case.

This commit also changes the local storage related unit tests to clearly specify the ncpus and memory for the fake instances, as inspecting the sled_resource_vmm records produced by the test showed the resources didn't match the instance specification.

Fixes oxidecomputer/customer-support#1184.

If multiple instance-start sagas are concurrently attempting to allocate for the same instance, this temporarily results in multiple rows in `sled_resource_vmm` with different propolis ids for the same instance id. One of the instance-start sagas will succeed, where the other(s) will unwind (due to an "instance changed state before it could be started" error from `sis_move_to_starting`), and remove the `sled_resource_vmm` record that they added by matching on that saga's propolis id. There's never been a uniqueness constraint for instance id in the `sled_resource_vmm` table, because there can't be, otherwise we'd never be able to migrate an instance (which makes a new record on a different sled for the same instance). For an instance start that performs any new local storage allocation, this is a problem: the latent assumption in inserting / updating local storage related records is that this type of duplication could not occur, that if the insert succeeded then it means the allocation will only be performed once. Because this is not true the CTE will happily stomp all over the local storage allocation related records and that leads to the orphaning seen in the linked issue. The fix is to add a uniqueness constraint to `sled_resource_vmm` that ensures only one record for a given instance id plus the instance state generation number exists. This will not affect migration because the instance state generation is bumped in that case. This commit also changes the local storage related unit tests to clearly specify the ncpus and memory for the fake instances, as inspecting the `sled_resource_vmm` records produced by the test showed the resources didn't match the instance specification. Fixes oxidecomputer/customer-support#1184.

hawkw · 2026-05-22T17:37:55Z

 }

+#[derive(Clone, Debug)]
+pub enum SledResourceVmmInstanceStateGeneration {


super annoying obnoxious nitpick: man this is a long name...I suppose including the SledResourceVmm prefix is necessary because this type is currently re-exported by a pub use sled_resource_vmm::*; so we can't just expect callers to refer to it as sled_resource_vmm::InstanceStateGeneration. which...I dunno if it's worth trying to fix that. I guess this is fine, it just makes me feel a certain type of way!

hawkw · 2026-05-22T17:44:25Z

            .sled_reservation_create(
                &opctx,
                instance_id,
+                nexus_db_model::Generation::new(),


nit, may not be important: I wonder if rather than always inserting the reservation at generation 1, we ought to change this function to take an &Instance rather than an InstanceUuid, and use the generation from the instance record. Clearly, there aren't currently any tests which are calling this helper multiple times for the same InstanceUuid, or else they would have already broken, but it seems like it could be potentially annoying if someone were to start adding a new test that does so and was surprised to discover this doesn't actually use the generation from the instance record.

on the other hand, maybe updating all the test code that uses this to pass an &Instance is going to be too painful.

hawkw · 2026-05-22T17:55:45Z

        instance_id: InstanceUuid,
+        instance_state_generation: db::model::Generation,


hmm, so, as written, this allows the caller to attempt to create the reservation at any instance state generation, regardless of what they believe the instance's current generation to be (and, in fact, it seems like we currently have a bunch of tests which are always providing generation 1 no matter what). I wonder if it might be a bit more misuse-resistant to change this function to instead take an &Instance, and always use the state generation from the instance model. That way, we're sure that these come from the same snapshot of the instance state.

Of course, this will require changing the callsites, and it may not be worth the effort, but I felt like it was worth mentioning...

hawkw · 2026-05-22T21:53:25Z

+                    Err(diesel::result::Error::DatabaseError(
+                        diesel::result::DatabaseErrorKind::UniqueViolation,
+                        error_info,
+                    )) if error_info.constraint_name()
+                        == Some(SINGLE_RESERVATION_CONSTRAINT) =>


nit, take it or leave it: i think it might be nice if this logic was stuffed into a function, so you could just say

Suggested change

Err(diesel::result::Error::DatabaseError(

diesel::result::DatabaseErrorKind::UniqueViolation,

error_info,

)) if error_info.constraint_name()

== Some(SINGLE_RESERVATION_CONSTRAINT) =>

Err(e) if is_single_reservation_constraint_violation(e) =>

or something, here and later on?

hawkw · 2026-05-22T21:57:34Z

         group membership."
    )]
    RequiredAffinitySledNotValid,
+    #[error("Instance reservation already made for generation {generation}")]


the rest of the errors here are intended to be user facing (which is why they're kinda big blocks of multi-sentence text). I don't think "Instance reservation already made for generation 69" is something that's particularly meaningful to a user. If this is going to bubble up, could it say something a little less obscure? Maybe "This instance is already running or starting on another sled."?

hawkw · 2026-05-22T21:57:53Z

    // Finally, perform the INSERT if it's still valid.
    query.sql("
-        INSERT INTO sled_resource_vmm (id, sled_id, hardware_threads, rss_ram, reservoir_ram, instance_id)
+        INSERT INTO sled_resource_vmm (id, sled_id, hardware_threads, rss_ram, reservoir_ram, instance_id, instance_state_generation)


can we line wrap this?

smklein · 2026-05-27T18:32:41Z

 );

+-- Allow a single VMM reservation per instance state generation
+CREATE UNIQUE INDEX IF NOT EXISTS single_vmm_reservation_per_generation ON omicron.public.sled_resource_vmm (


This fix seems reasonable to me. And arguably it's protective against other types of bugs, even unrelated to local storage.

That being said:

Are we 100% certain we don't currently have deployed systems with "more than one sled_resource_vmm at instance_id = x?

If such a system exists, it'll get an "instance_state_generation" value of NULL, and then this constraint would fail to add, which would block the migration.

Okay, looking into this more: https://www.cockroachlabs.com/docs/stable/null-handling.html

I think I'm wrong here. Apparently if we insert multiple values of:

instance_id = x + instance_state_generation = NULL

That would be allowed by this UNIQUE constraint, because:

NULL values are not considered unique. Therefore, if a table has a UNIQUE constraint on one or more columns that are optional (nullable), it is possible to insert multiple rows with NULL values in those columns

smklein · 2026-05-27T18:38:29Z

+    instance_id UUID,
+
+    -- The instance's state generation number at the moment of reservation
+    instance_state_generation INT


Have we considered back-filling this and dropping the nullability here? Not necessarily asking for this, but it's always odd to have a nullable field for values that should be known in all future cases

hmm, how are you imagining we would backfill this?

We mostly care about this as a concurrency-control mechanism for competing sagas, right?

So, while it might not be the generation at the moment of generation:

UPDATE omicron.public.sled_resource_vmm AS srv SET instance_state_generation = i.state_generation FROM omicron.public.instance AS i WHERE srv.instance_id = i.id AND srv.instance_state_generation IS NULL;

Could probably work, if we assume migrations aren't in-flight during upgrade time? And I guess this also relies on instance_id not being NULL either.

Alternately, we could just insert a default value of like "1" or something? Though I suppose that also relies on the constraint that "no migrations are in-progress".

(talking through this, it might be more complicated than it's worth? Just might be a long time that this columns stays nullable)

hm, I suppose you're right that it doesn't really have to actually be backfilled with the state generation that it actually was when the instance got added to the table, we can just use the current one, provided that there aren't two entries for that instance (in which case we have to figure this out somehow)

smklein · 2026-05-27T18:42:27Z

-        if !old_resource.is_empty() {
-            info!(&log, "sled reservation already occurred, returning");
-            return Ok(old_resource[0].clone());
+        if let Some(existing_resource) = existing_resource {


Is it important that we confirm the generation is the same as the value we supplied?

What if it was different?

jmpesp requested a review from hawkw May 21, 2026 16:56

hawkw requested a review from smklein May 22, 2026 17:11

hawkw reviewed May 22, 2026

View reviewed changes

smklein reviewed May 27, 2026

View reviewed changes

		instance_id: InstanceUuid,
		instance_state_generation: db::model::Generation,

Conversation

jmpesp commented May 21, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants