fix: add retry with || true for post-snapshot invoke in test-04 #736

Open
Slambot01 wants to merge 5 commits into hyperledger-labs:main from Slambot01:fix/e2e-snapshot-ccaas-retry

Conversation

@Slambot01
Contributor

Fixes #734

The first expectInvokeRest after snapshot restore intermittently fails with DEADLINE_EXCEEDED. waitForContainer confirms that the CCaaS gRPC server is listening, but the peer's gRPC client reconnects asynchronously, so the sleep 2 added in #648 is not always long enough under variable CI load.

The previous retry attempt in #648 used && break without || true. That could not work: expect-invoke-rest.sh calls exit 1 on failure and the test runs under set -e, so the script aborts before the loop can retry.

Fix

Added expectInvokeRestWithRetry, using the same || true pattern already used in test-01-v3-simple.sh (expectQueryWithRetry, lines 48–61). It is applied only to the first invoke after snapshot restore; the second invoke runs normally because the connection is established by then.
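
For readers unfamiliar with the pattern, here is a minimal sketch of a retry wrapper that survives set -e. It is not the code from this PR: the function name retry_until_success and its arguments are hypothetical. Running each attempt in a subshell is an extra precaution assumed here, so that an exit 1 inside the wrapped command cannot end the calling script.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper (not the PR's actual code): run a command up to
# max_attempts times, sleeping delay_seconds between failed attempts.
retry_until_success() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1 succeeded
  while [ "$attempt" -le "$max_attempts" ]; do
    succeeded=false
    # The subshell contains any `exit 1` issued by the command, and
    # `|| true` keeps `set -e` from aborting the script on a failed attempt.
    ( "$@" ) && succeeded=true || true
    if [ "$succeeded" = true ]; then
      return 0
    fi
    echo "attempt $attempt/$max_attempts failed" >&2
    attempt=$((attempt + 1))
    # Only sleep when another attempt will follow.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
  done
  return 1
}
```

A call would look something like retry_until_success 10 3 expectInvokeRest followed by the usual invoke arguments; the attempt count and delay are placeholders, not values taken from the test.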

Signed-off-by: Ritesh Pandit <riteshpandit1708@gmail.com>
dzikowski
dzikowski previously approved these changes May 11, 2026
@dzikowski dzikowski dismissed their stale review May 11, 2026 16:52

Sorry, I missed that it didn't resolve the underlying issue.

@dzikowski
Contributor

I initially approved, but then I realized it does not solve the underlying issue. The test still fails, probably because the error is in a different place: not in the test and in waiting until the container is ready, but somewhere in the network boot/restore process. There is probably a race condition when the chaincode container starts. Maybe as a workaround we should try restarting it in the test to verify whether that's the issue? And then fix it properly by eliminating the root cause.

@Slambot01
Contributor Author

Makes sense. I’ll look into the chaincode container restart angle. I’ll check if restarting the CCaaS container after snapshot restore fixes the failure, and if it does, I’ll trace it back to the actual issue in the restore flow.
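
A rough sketch of what such a restart step could look like in a test script is shown below. The helper name restart_matching_containers and the name filter are hypothetical; the actual CCaaS container names in the test network are not stated in this thread, so the filter value would need to be adapted.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper: restart every running container whose name matches
# a filter. The filter value is a placeholder, not Fablo's naming scheme.
restart_matching_containers() {
  local name_filter="$1"
  local names name
  # List the names of running containers that match the filter.
  names=$(docker ps --filter "name=$name_filter" --format '{{.Names}}')
  if [ -z "$names" ]; then
    echo "no running containers match '$name_filter'" >&2
    return 1
  fi
  for name in $names; do
    echo "restarting $name"
    docker restart "$name" >/dev/null
  done
}
```

After the restart, the test would still need to wait for the chaincode server to come up again before the first invoke.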

…RPC connections

Signed-off-by: Ritesh Pandit <riteshpandit1708@gmail.com>
@Slambot01
Contributor Author

The CCaaS restart approach worked locally, but CI is still failing. All 10 retries end up hitting DEADLINE_EXCEEDED.
The CCaaS containers are bootstrapping properly after restart, and the peer gRPC state shows READY, so the issue doesn't look like it is in the CCaaS ↔ peer connection itself. At this point the problem seems to be somewhere deeper in the restore flow. I need to dig into the state the peer or fablo-rest ends up in after restore, because something there is preventing the endorsement flow from completing. I'll need to investigate this further.

Development

Successfully merging this pull request may close these issues.

test-04-v3-snapshot-ccaas still flaky - sleep insufficient, needs retry with || true pattern