Block multiple sled reservations with the same gen#10479
Conversation
If multiple instance-start sagas are concurrently attempting to allocate for the same instance, this temporarily results in multiple rows in `sled_resource_vmm` with different propolis ids for the same instance id. One of the instance-start sagas will succeed, where the other(s) will unwind (due to an "instance changed state before it could be started" error from `sis_move_to_starting`), and remove the `sled_resource_vmm` record that they added by matching on that saga's propolis id. There's never been a uniqueness constraint for instance id in the `sled_resource_vmm` table, because there can't be, otherwise we'd never be able to migrate an instance (which makes a new record on a different sled for the same instance). For an instance start that performs any new local storage allocation, this is a problem: the latent assumption in inserting / updating local storage related records is that this type of duplication could not occur, that if the insert succeeded then it means the allocation will only be performed once. Because this is not true the CTE will happily stomp all over the local storage allocation related records and that leads to the orphaning seen in the linked issue. The fix is to add a uniqueness constraint to `sled_resource_vmm` that ensures only one record for a given instance id plus the instance state generation number exists. This will not affect migration because the instance state generation is bumped in that case. This commit also changes the local storage related unit tests to clearly specify the ncpus and memory for the fake instances, as inspecting the `sled_resource_vmm` records produced by the test showed the resources didn't match the instance specification. Fixes oxidecomputer/customer-support#1184.
| } | ||
|
|
||
| #[derive(Clone, Debug)] | ||
| pub enum SledResourceVmmInstanceStateGeneration { |
There was a problem hiding this comment.
super annoying obnoxious nitpick: man this is a long name...I suppose including the SledResourceVmm prefix is necessary because this type is currently re-exported by a pub use sled_resource_vmm::*; so we can't just expect callers to refer to it as sled_resource_vmm::InstanceStateGeneration. which...I dunno if it's worth trying to fix that. I guess this is fine, it just makes me feel a certain type of way!
| .sled_reservation_create( | ||
| &opctx, | ||
| instance_id, | ||
| nexus_db_model::Generation::new(), |
There was a problem hiding this comment.
nit, may not be important: I wonder if rather than always inserting the reservation at generation 1, we ought to change this function to take an &Instance rather than an InstanceUuid, and use the generation from the instance record. Clearly, there aren't currently any tests which are calling this helper multiple times for the same InstanceUuid, or else they would have already broken, but it seems like it could be potentially annoying if someone were to start adding a new test that does so and was surprised to discover this doesn't actually use the generation from the instance record.
on the other hand, maybe updating all the test code that uses this to pass an &Instance is going to be too painful.
| instance_id: InstanceUuid, | ||
| instance_state_generation: db::model::Generation, |
There was a problem hiding this comment.
hmm, so, as written, this allows the caller to attempt to create the reservation at any instance state generation, regardless of what they believe the instance's current generation to be (and, in fact, it seems like we currently have a bunch of tests which are always providing generation 1 no matter what). I wonder if it might be a bit more misuse-resistant to change this function to instead take an &Instance, and always use the state generation from the instance model. That way, we're sure that these come from the same snapshot of the instance state.
Of course, this will require changing the callsites, and it may not be worth the effort, but I felt like it was worth mentioning...
| Err(diesel::result::Error::DatabaseError( | ||
| diesel::result::DatabaseErrorKind::UniqueViolation, | ||
| error_info, | ||
| )) if error_info.constraint_name() | ||
| == Some(SINGLE_RESERVATION_CONSTRAINT) => |
There was a problem hiding this comment.
nit, take it or leave it: i think it might be nice if this logic was stuffed into a function, so you could just say
| Err(diesel::result::Error::DatabaseError( | |
| diesel::result::DatabaseErrorKind::UniqueViolation, | |
| error_info, | |
| )) if error_info.constraint_name() | |
| == Some(SINGLE_RESERVATION_CONSTRAINT) => | |
| Err(e) if is_single_reservation_constraint_violation(e) => |
or something, here and later on?
| group membership." | ||
| )] | ||
| RequiredAffinitySledNotValid, | ||
| #[error("Instance reservation already made for generation {generation}")] |
There was a problem hiding this comment.
the rest of the errors here are intended to be user facing (which is why they're kinda big blocks of multi-sentence text). I don't think "Instance reservation already made for generation 69" is something that's particularly meaningful to a user. If this is going to bubble up, could it say something a little less obscure? Maybe "This instance is already running or starting on another sled."?
| // Finally, perform the INSERT if it's still valid. | ||
| query.sql(" | ||
| INSERT INTO sled_resource_vmm (id, sled_id, hardware_threads, rss_ram, reservoir_ram, instance_id) | ||
| INSERT INTO sled_resource_vmm (id, sled_id, hardware_threads, rss_ram, reservoir_ram, instance_id, instance_state_generation) |
| ); | ||
|
|
||
| -- Allow a single VMM reservation per instance state generation | ||
| CREATE UNIQUE INDEX IF NOT EXISTS single_vmm_reservation_per_generation ON omicron.public.sled_resource_vmm ( |
There was a problem hiding this comment.
This fix seems reasonable to me. And arguably it's protective against other types of bugs, even unrelated to local storage.
That being said:
- Are we 100% certain we don't currently have deployed systems with "more than one
sled_resource_vmmatinstance_id = x?
If such a system exists, it'll get an "instance_state_generation" value of NULL, and then this constraint would fail to add, which would block the migration.
There was a problem hiding this comment.
Okay, looking into this more: https://www.cockroachlabs.com/docs/stable/null-handling.html
I think I'm wrong here. Apparently if we insert multiple values of:
instance_id = x + instance_state_generation = NULL
That would be allowed by this UNIQUE constraint, because:
NULL values are not considered unique. Therefore, if a table has a UNIQUE constraint on one or more columns that are optional (nullable), it is possible to insert multiple rows with NULL values in those columns
| instance_id UUID, | ||
|
|
||
| -- The instance's state generation number at the moment of reservation | ||
| instance_state_generation INT |
There was a problem hiding this comment.
Have we considered back-filling this and dropping the nullability here? Not necessarily asking for this, but it's always odd to have a nullable field for values that should be known in all future cases
There was a problem hiding this comment.
hmm, how are you imagining we would backfill this?
There was a problem hiding this comment.
We mostly care about this as a concurrency-control mechanism for competing sagas, right?
So, while it might not be the generation at the moment of generation:
UPDATE omicron.public.sled_resource_vmm AS srv
SET instance_state_generation = i.state_generation
FROM omicron.public.instance AS i
WHERE srv.instance_id = i.id
AND srv.instance_state_generation IS NULL;Could probably work, if we assume migrations aren't in-flight during upgrade time? And I guess this also relies on instance_id not being NULL either.
Alternately, we could just insert a default value of like "1" or something? Though I suppose that also relies on the constraint that "no migrations are in-progress".
(talking through this, it might be more complicated than it's worth? Just might be a long time that this columns stays nullable)
There was a problem hiding this comment.
hm, I suppose you're right that it doesn't really have to actually be backfilled with the state generation that it actually was when the instance got added to the table, we can just use the current one, provided that there aren't two entries for that instance (in which case we have to figure this out somehow)
| if !old_resource.is_empty() { | ||
| info!(&log, "sled reservation already occurred, returning"); | ||
| return Ok(old_resource[0].clone()); | ||
| if let Some(existing_resource) = existing_resource { |
There was a problem hiding this comment.
Is it important that we confirm the generation is the same as the value we supplied?
What if it was different?
If multiple instance-start sagas are concurrently attempting to allocate for the same instance, this temporarily results in multiple rows in
sled_resource_vmmwith different propolis ids for the same instance id. One of the instance-start sagas will succeed, where the other(s) will unwind (due to an "instance changed state before it could be started" error fromsis_move_to_starting), and remove thesled_resource_vmmrecord that they added by matching on that saga's propolis id.There's never been a uniqueness constraint for instance id in the
sled_resource_vmmtable, because there can't be, otherwise we'd never be able to migrate an instance (which makes a new record on a different sled for the same instance).For an instance start that performs any new local storage allocation, this is a problem: the latent assumption in inserting / updating local storage related records is that this type of duplication could not occur, that if the insert succeeded then it means the allocation will only be performed once. Because this is not true the CTE will happily stomp all over the local storage allocation related records and that leads to the orphaning seen in the linked issue.
The fix is to add a uniqueness constraint to
sled_resource_vmmthat ensures only one record for a given instance id plus the instance state generation number exists. This will not affect migration because the instance state generation is bumped in that case.This commit also changes the local storage related unit tests to clearly specify the ncpus and memory for the fake instances, as inspecting the
sled_resource_vmmrecords produced by the test showed the resources didn't match the instance specification.Fixes oxidecomputer/customer-support#1184.