Skip to content

feat: replace tmate with native SSH debug step for --ssh-after-step#456

Open
robstolarz wants to merge 3 commits intomainfrom
rob/dep-3862-replace-tmate-action-in-ssh-after-step-with-our-own-solution
Open

feat: replace tmate with native SSH debug step for --ssh-after-step#456
robstolarz wants to merge 3 commits intomainfrom
rob/dep-3862-replace-tmate-action-in-ssh-after-step-with-our-own-solution

Conversation

@robstolarz
Copy link
Copy Markdown
Contributor

@robstolarz robstolarz commented Mar 19, 2026

Summary

  • Replace mxschmitt/action-tmate@v3 with a shell script that pauses via touch /tmp/depot-continue
  • After starting the run, CLI waits for sandbox provisioning, polls logs for ::depot-ssh-ready:: marker, then connects via PTY
  • Falls back to printing SSH connection info when no TTY is available
  • Fix job key matching for inline workflows (_inline_0.yaml:e2ee2e)
  • Improve printSSHInfo to show depot ci ssh command

Stacked on #454 and needs https://github.com/depot/api/pull/3233. Closes DEP-3862.

Screen Recording 2026-03-18 at 8 25 39 PM

Test plan

  • Full end-to-end test: --ssh-after-step 2 on docker-integration.yml
  • Debug step injected correctly, workflow pauses
  • Log marker detected, PTY session connects after step N completes
  • touch /tmp/depot-continue resumes workflow
  • No-TTY fallback prints SSH connection info
  • go build ./... and go test ./... pass

🤖 Generated with Claude Code


Note

Medium Risk
Changes the CI run/SSH connection flow by injecting a blocking debug step and polling logs before attaching, which could affect workflow execution timing and CLI attach reliability.

Overview
depot ci run --ssh-after-step now injects a native "Depot SSH Debug" shell step that pauses the job until /tmp/depot-continue is touched, replacing the previous action-tmate approach.

After starting a run, the CLI can wait for sandbox provisioning and (for --ssh-after-step) poll job attempt logs for a ::depot-ssh-ready:: marker before connecting; in non-TTY contexts it prints improved SSH info (now includes run_id and a depot ci ssh <run-id> hint).

Job lookup for depot ci ssh is tightened to prefer exact job-key matches and only fall back to matching the short name portion of inline workflow keys.

Written by Cursor Bugbot for commit 4fbb90f. This will update automatically on new commits. Configure here.

@linear
Copy link
Copy Markdown

linear bot commented Mar 19, 2026

@robstolarz robstolarz marked this pull request as ready for review March 19, 2026 03:32
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

  • ✅ Fixed: waitForLogMarker ignores terminal job/attempt status
    • Added a switch statement to check for terminal attempt statuses (finished, failed, cancelled) and return an error immediately, matching the pattern used in waitForSandbox.
  • ✅ Fixed: Duplicate "Connecting to sandbox" message printed
    • Removed the duplicate fmt.Fprintf call from run.go since waitForSandbox already prints this message before returning.
  • ✅ Fixed: printSSHInfo suggests invalid depot ci ssh command
    • Added a runOrJobID parameter to printSSHInfo and only display the 'depot ci ssh' suggestion when a valid run/job ID is available, passing empty string when called from 'depot ci run'.

Create PR

Or push these changes by commenting:

@cursor push 0ccc9c9234
Preview (0ccc9c9234)
diff --git a/pkg/cmd/ci/run.go b/pkg/cmd/ci/run.go
--- a/pkg/cmd/ci/run.go
+++ b/pkg/cmd/ci/run.go
@@ -237,13 +237,12 @@
 					}
 				}
 
-				if sshAfterStep > 0 {
-					fmt.Fprintf(os.Stderr, "Run 'touch /tmp/depot-continue' to resume the workflow. (Your session will not end.)\n")
-				}
-				fmt.Fprintf(os.Stderr, "Connecting to sandbox %s...\n", sandboxID)
-				if !helpers.IsTerminal() {
-					return printSSHInfo(sandboxID, sessionID, "")
-				}
+			if sshAfterStep > 0 {
+				fmt.Fprintf(os.Stderr, "Run 'touch /tmp/depot-continue' to resume the workflow. (Your session will not end.)\n")
+			}
+			if !helpers.IsTerminal() {
+				return printSSHInfo(sandboxID, sessionID, "", "")
+			}
 				return pty.Run(ctx, pty.SessionOptions{
 					Token:     tokenVal,
 					SandboxID: sandboxID,
@@ -549,6 +548,11 @@
 			continue
 		}
 
+		switch attempt.Status {
+		case "finished", "failed", "cancelled":
+			return fmt.Errorf("job has already completed (status: %s) before marker appeared", attempt.Status)
+		}
+
 		lines, err := api.CIGetJobAttemptLogs(ctx, token, orgID, attempt.AttemptId)
 		if err == nil {
 			for _, line := range lines {

diff --git a/pkg/cmd/ci/ssh.go b/pkg/cmd/ci/ssh.go
--- a/pkg/cmd/ci/ssh.go
+++ b/pkg/cmd/ci/ssh.go
@@ -72,9 +72,9 @@
 				return err
 			}
 
-			if info || !helpers.IsTerminal() {
-				return printSSHInfo(sandboxID, sessionID, output)
-			}
+		if info || !helpers.IsTerminal() {
+			return printSSHInfo(sandboxID, sessionID, output, runID)
+		}
 
 			return pty.Run(ctx, pty.SessionOptions{
 				Token:     tokenVal,
@@ -254,7 +254,7 @@
 	return latest
 }
 
-func printSSHInfo(sandboxID, sessionID, output string) error {
+func printSSHInfo(sandboxID, sessionID, output, runOrJobID string) error {
 	if output == "json" {
 		enc := json.NewEncoder(os.Stdout)
 		enc.SetIndent("", "  ")
@@ -270,10 +270,14 @@
 	fmt.Printf("User:     %s\n", sandboxID)
 	fmt.Printf("Password: Your Depot API token ($DEPOT_TOKEN)\n")
 	fmt.Println()
-	fmt.Printf("Connect interactively:\n")
-	fmt.Printf("  depot ci ssh %s\n", sandboxID)
-	fmt.Println()
-	fmt.Printf("Or via SSH directly:\n")
+	if runOrJobID != "" {
+		fmt.Printf("Connect interactively:\n")
+		fmt.Printf("  depot ci ssh %s\n", runOrJobID)
+		fmt.Println()
+		fmt.Printf("Or via SSH directly:\n")
+	} else {
+		fmt.Printf("Connect via SSH:\n")
+	}
 	fmt.Printf("  ssh %s@api.depot.dev\n", sandboxID)
 	return nil
 }

Base automatically changed from rob/dep-3852-cli-depot-ci-ssh-and-interactive-terminal-for-ci-jobs to main March 19, 2026 11:41
robstolarz and others added 2 commits March 19, 2026 11:54
Replace the `mxschmitt/action-tmate@v3` GitHub Action with a simple
shell script that pauses the workflow by polling for a magic file.
After starting the run, the CLI waits for the sandbox, polls logs for
a `::depot-ssh-ready::` marker, then connects via the existing PTY
infrastructure.

- Inject a shell-based debug step instead of tmate action
- Wait for sandbox + log marker before connecting
- Fall back to printing SSH info when no TTY is available
- Fix job key matching for inline workflows (e.g. `_inline_0.yaml:e2e`)
- Improve `printSSHInfo` to show `depot ci ssh` command

Closes DEP-3862

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Early-exit waitForLogMarker when attempt reaches terminal status
  instead of polling forever if a step before the debug step fails
- Remove duplicate "Connecting to sandbox" message (already in waitForSandbox)
- Fix printSSHInfo to use run ID for `depot ci ssh` suggestion
- Remove arbitrary 10min timeout on log marker polling (rely on
  terminal status check and context cancellation instead)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@robstolarz robstolarz force-pushed the rob/dep-3862-replace-tmate-action-in-ssh-after-step-with-our-own-solution branch from 9408c41 to 3707ba5 Compare March 19, 2026 19:03
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Exact job key match priority lost in findJob
    • Split the combined matching loop into two separate passes: first checking only exact matches, then checking short-name matches, ensuring exact matches always take priority regardless of job order.
  • ✅ Fixed: Redundant fallback suffix-match loop in findJob
    • Removed the redundant suffix-match fallback loop as part of the refactoring that separated exact and short-name matching into distinct passes.

Create PR

Or push these changes by commenting:

@cursor push 810e322841
Preview (810e322841)
diff --git a/pkg/cmd/ci/ssh.go b/pkg/cmd/ci/ssh.go
--- a/pkg/cmd/ci/ssh.go
+++ b/pkg/cmd/ci/ssh.go
@@ -208,24 +208,22 @@
 		return nil, &retryableJobError{msg: fmt.Sprintf("run %s has no jobs yet", resp.RunId)}
 	}
 
-	// Match by job key (--job flag): exact match or short name (after colon).
+	// Match by job key (--job flag): exact match first, then short name (after colon).
 	// Job keys from inline workflows look like "_inline_0.yaml:e2e" — the
 	// user passes just "e2e", so match on the suffix after the colon too.
 	if jobKey != "" {
+		// First pass: exact match takes priority.
 		for _, j := range allJobs {
-			short := j.JobKey
-			if i := strings.IndexByte(short, ':'); i >= 0 {
-				short = short[i+1:]
-			}
-			if j.JobKey == jobKey || short == jobKey {
+			if j.JobKey == jobKey {
 				return j, nil
 			}
 		}
-		// Inline workflows get prefixed keys (e.g. "_inline_0.yaml:lint_typecheck"),
-		// so fall back to a suffix match when the user passes just the job name.
+		// Second pass: match on short name (part after colon).
 		for _, j := range allJobs {
-			if strings.HasSuffix(j.JobKey, ":"+jobKey) {
-				return j, nil
+			if i := strings.IndexByte(j.JobKey, ':'); i >= 0 {
+				if j.JobKey[i+1:] == jobKey {
+					return j, nil
+				}
 			}
 		}
 		// Job might not exist yet if workflows are still being expanded.

Split the single-pass loop into exact-first then short-name fallback
so a bare "build" job key is preferred over "_inline_0.yaml:build".
Also removes the redundant HasSuffix fallback loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Missing timeout causes indefinite hang in log polling
    • Added a 5-minute timeout to waitForLogMarker matching the pattern used in waitForSandbox, so the function now returns an error instead of hanging indefinitely when the marker never appears.

Create PR

Or push these changes by commenting:

@cursor push 0aad5a83f3
Preview (0aad5a83f3)
diff --git a/pkg/cmd/ci/run.go b/pkg/cmd/ci/run.go
--- a/pkg/cmd/ci/run.go
+++ b/pkg/cmd/ci/run.go
@@ -512,8 +512,14 @@
 // appears. This is used to detect when the injected debug step is running.
 func waitForLogMarker(ctx context.Context, token, orgID, runID, jobKey, marker string) error {
 	const pollInterval = 3 * time.Second
+	const timeout = 5 * time.Minute
 
+	deadline := time.Now().Add(timeout)
+
 	for {
+		if time.Now().After(deadline) {
+			return fmt.Errorf("timed out waiting for debug step marker (waited %s)", timeout)
+		}
 
 		// Resolve the latest attempt ID for the job.
 		resp, err := api.CIGetRunStatus(ctx, token, orgID, runID)

case <-time.After(pollInterval):
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing timeout causes indefinite hang in log polling

Medium Severity

waitForLogMarker has no timeout, unlike waitForSandbox which uses a 5-minute deadline. If the marker never appears and the job stays running (e.g., a preceding step hangs, the debug step isn't reached, or the API returns persistent errors that are silently retried), the CLI blocks indefinitely. The sandbox is already provisioned at this point, and the caller on line 234 treats errors as non-fatal warnings to "connect anyway" — but it never gets the chance because waitForLogMarker never returns. API errors are also swallowed and retried forever (unlike waitForSandbox which fails immediately on API errors), compounding the risk.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants