Skip to content

Commit e25e56e

Browse files
committed
nvme: CQ db_buf EventIdx should slightly lag CQ tail
Doorbell buffers want the CQ event index to roughly track with the tail, because most CQ doorbells are not interesting. If there are ten CQ entries, space for 54 more, and the guest has acknowledged one, we don't really care anything happened. If there are 64 CQ entries, 0 available, and one SQ entry that we're notified for, we'll try to reserve space in a full CQ, see it's at capacity, and the SQ will idle with the CQ aware that that SQ is corked. Without doorbell buffers, at some point the guest will process a CQE, ring a CQ doorbell, uncork the CQ, and we'll nudge the parked SQ worker thread into operation again. With doorbell buffers, if the event index matches the CQ tail we'll never get a doorbell when the CQ has been drained. If the guest never submits another I/O on the corked SQ, the pending I/O may never get processed. A more careful treatment would only have the event index lag if we are at risk of having a worker thread defer an I/O because this CQ would be full. This would look something like not letting EventIdx advance so far forward that there is less than one entry available for each SQ pointed at this CQ. Eating the unnecessary doorbell cost at least gets doorbell buffers correct in this case and we can turn it back on. The more careful treatment of EventIdx can come later.
1 parent ffcc27b commit e25e56e

1 file changed

Lines changed: 28 additions & 10 deletions

File tree

lib/propolis/src/hw/nvme/queue.rs

Lines changed: 28 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -343,18 +343,36 @@ impl QueueGuard<'_, CompQueueState> {
343343
/// Write update to the EventIdx in Doorbell Buffer page, if possible
344344
fn db_buf_write(&mut self, devq_id: u64, mem: &MemCtx) {
345345
if let Some(db_buf) = self.state.db_buf {
346-
probes::nvme_cq_dbbuf_write!(|| (devq_id, self.state.tail));
347-
// Keep EventIdx populated with the position of the CQ tail. We are
348-
// not especially concerned with receiving timely (doorbell) updates
349-
// from the guest about where the head pointer sits. We keep our
350-
// own tally of how many entries are in the CQ are available for
351-
// completions to land.
346+
// Update EventIdx as far as we're willing to forego doorbell
347+
// updates about the CQ head. We are not especially concerned with
348+
// receiving timely doorbell updates from the guest about the head.
349+
// We keep our own tally of how many entries in the CQ are available
350+
// for completions to land. In the typical case, we will read the
351+
// shadow doorbell from db_buf JIT to update the available CQ space.
352352
//
353-
// When checking for available space before issuing a Permit, we can
354-
// perform our own JIT read from the db_buf to stay updated on the
355-
// true space available.
353+
// However, simply keeping EventIdx matching the queue tail means we
354+
// opt out of *any* notification that a completion is posted. Then,
355+
// if an I/O was submitted, not processed immediately due to
356+
// insufficient CQ space, but the guest submits no further I/Os on
357+
// that queue, we would never notice that there is space in the CQ.
358+
// The I/O would go unfulfilled forever.
359+
//
360+
// For now, leave EventIdx one before the actual CQ tail, so that
361+
// making the CQ empty requires a doorbell notify. This is
362+
// excessive; there are many cases where we don't care if the CQ is
363+
// to be emptied.
364+
if self.avail_occupied() <= 1 {
365+
// The queue was empty and just became non-empty. So `tail` has
366+
// not advanced far enough that we can actually advance
367+
// EventIdx.
368+
return;
369+
}
370+
371+
let next_evtidx = self.idx_sub(self.state.tail, 1);
372+
373+
probes::nvme_cq_dbbuf_write!(|| (devq_id, next_evtidx));
356374
fence(Ordering::Release);
357-
mem.write(db_buf.eventidx, &self.state.tail);
375+
mem.write(db_buf.eventidx, &next_evtidx);
358376
}
359377
}
360378

0 commit comments

Comments
 (0)