Initial changes for stream logs command#7490

Open

saanikaguptamicrosoft wants to merge 1 commit intoAzure:foundry-training-devfrom

saanikaguptamicrosoft:saanika/stream

Collaborator

saanikaguptamicrosoft commented Apr 3, 2026

No description provided.


          Initial changes for stream logs command

c5577a8

saanikaguptamicrosoft requested review from JeffreyCA, achauhan-scc, hemarina, kingernupur, rabollin, rajeshkamal5050, tg-msft, trangevi, vhvb1989, wbreza and weikanglim as code owners

April 3, 2026 11:09

microsoft-github-policy-service bot assigned saanikaguptamicrosoft

jongio requested changes

View reviewed changes

Member

jongio left a comment

This adds a solid polling-based log streamer that follows the extension's established patterns for command setup and credential management. A few issues to address - one bug in the terminal-job detection, some Go modernization gaps, and structural duplication.

Summary: 2 CRITICAL, 2 HIGH, 4 MEDIUM, 2 LOW

CRITICAL:

Probe mechanism for detecting terminal jobs with no logs can't work - GetArtifactContent returns (nil, nil) on 404, so job status is never read
Discovery loop has no retry limit - wrong job name = infinite loop

HIGH:

sort.Strings instead of slices.Sort - doesn't match codebase (job_get.go uses slices.Sorted)
consecErrors counter doesn't reset on successful list, only on successful file reads

MEDIUM:

math.Min with float64 casts - use built-in min() instead
Read-and-print pattern duplicated 3 times - extract helper
Final flush loop duplicates main poll logic (~35 lines)
IsStreamableStatus doc comment says "StreamableStatuses returns"

LOW:

interface{} should be any (go fix catches this)
PR description is empty

cli/azd/extensions/azure.ai.customtraining/internal/cmd/job_stream.go

+              					fmt.Println("Use 'azd ai training job download' to download job artifacts.")
+              					return nil
+              				}

Member

jongio Apr 3, 2026

Bug: the probe mechanism can't work. GetArtifactContent returns (nil, nil) on HTTP 404 - the client closes the body and returns nil before this code can read X-VW-Job-Status. So probeResp is always nil here, lastJobStatus is never set, and the terminal-state check on line 126 never fires.

If a job finishes with no log files, this loop runs forever (only killed by Ctrl+C).

Fix: use apiClient.GetJob(ctx, name) to check status directly - same pattern as job_download.go:82:

job, jobErr := apiClient.GetJob(ctx, name)
if jobErr == nil && models.TerminalStatuses[job.Properties.Status] {
    fmt.Printf("\nJob '%s' is in terminal state '%s' with no log files.\n", name, job.Properties.Status)
    fmt.Println("Use 'azd ai training job download' to download job artifacts.")
    return nil
}

cli/azd/extensions/azure.ai.customtraining/internal/cmd/job_stream.go

+              				fmt.Println("(Discovering log files...)")
+              				time.Sleep(discoveryRetryDelay)
+              			}

Member

jongio Apr 3, 2026

No retry limit on discovery. If the job name is misspelled or logs never appear, this loops forever with no upper bound. Every other retry path in this file has maxConsecErrors - discovery should too.

Add a max discovery attempts constant and check it:

const maxDiscoveryAttempts = 30 // 30 * 5s = 2.5 min max wait
discoveryAttempts := 0
// inside loop:
discoveryAttempts++
if discoveryAttempts >= maxDiscoveryAttempts {
    return fmt.Errorf("no log files found after %d attempts - verify job name '%s' is correct", maxDiscoveryAttempts, name)
}

cli/azd/extensions/azure.ai.customtraining/internal/cmd/job_stream.go

+              				time.Sleep(discoveryRetryDelay)
+              			}
+              			sort.Strings(logFiles)

Member

jongio Apr 3, 2026

sort.Strings doesn't match the codebase. job_get.go uses slices.Sorted() consistently. Per AGENTS.md, prefer slices over sort.

Suggested change

      
            			sort.Strings(logFiles)
          
            slices.Sort(logFiles)

Also update the import to replace "sort" with "slices" (and the other sort.Strings on line 219).

cli/azd/extensions/azure.ai.customtraining/internal/cmd/job_stream.go

+              					}
+              					pollInterval = backoff(pollInterval)
+              					continue
+              				}

Member

jongio Apr 3, 2026

consecErrors only resets on successful file reads (line 262), not after a successful ListArtifactsInPath. A sequence like: list fails, list succeeds, read fails, list succeeds, read fails hits the limit even though recent operations succeeded. Reset after the successful list call too:

// After the artifactList loop (line 222), add:
consecErrors = 0

cli/azd/extensions/azure.ai.customtraining/internal/cmd/job_stream.go

+              // backoff doubles the interval up to maxPollInterval.
+              func backoff(current time.Duration) time.Duration {
+              	next := time.Duration(math.Min(float64(current*2), float64(maxPollInterval)))

Member

jongio Apr 3, 2026

math.Min with float64 casts is unnecessary - Go has built-in min() since 1.21. Also avoids potential precision loss with large time.Duration values.

Suggested change

      
            	next := time.Duration(math.Min(float64(current*2), float64(maxPollInterval)))
          
            func backoff(current time.Duration) time.Duration {
          
            return min(current*2, maxPollInterval)
          
            }

This also removes the "math" import.

cli/azd/extensions/azure.ai.customtraining/internal/cmd/job_stream.go

+              					if !strings.HasSuffix(string(data), "\n") {
+              						fmt.Println()
+              					}
+              				}

Member

jongio Apr 3, 2026

This "print data, ensure trailing newline" block appears 3 times (here, line 269, line 341). Extract it:

func printContent(data []byte) {
    fmt.Print(string(data))
    if !strings.HasSuffix(string(data), "\n") {
        fmt.Println()
    }
}

cli/azd/extensions/azure.ai.customtraining/internal/cmd/job_stream.go

+              						if contentLen > fs.offset {
+              							fs.offset = contentLen
+              						}
+              					}

Member

jongio Apr 3, 2026

The final flush loop (lines 291-324) duplicates the file-reading logic from the main poll loop. If the read logic changes (e.g., error handling, offset tracking), both copies need updating. Consider extracting a pollFile(ctx, apiClient, jobName, path, fs) helper that both the main loop and final flush can call.

cli/azd/extensions/azure.ai.customtraining/pkg/models/job.go

+              	"Running":      true,
+              }
+              // StreamableStatuses returns true if the job status indicates streaming is appropriate.

Member

jongio Apr 3, 2026

Doc comment says "StreamableStatuses returns" but the function is IsStreamableStatus. Per Go conventions, doc comments should start with the function name.

Suggested change

      
            // StreamableStatuses returns true if the job status indicates streaming is appropriate.
          
            // IsStreamableStatus returns true if the job status indicates streaming is appropriate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

jongio jongio requested changes

wbreza Awaiting requested review from wbreza wbreza is a code owner

vhvb1989 Awaiting requested review from vhvb1989 vhvb1989 is a code owner

hemarina Awaiting requested review from hemarina hemarina is a code owner

weikanglim Awaiting requested review from weikanglim weikanglim is a code owner

JeffreyCA Awaiting requested review from JeffreyCA JeffreyCA is a code owner

tg-msft Awaiting requested review from tg-msft tg-msft is a code owner

rajeshkamal5050 Awaiting requested review from rajeshkamal5050 rajeshkamal5050 is a code owner

trangevi Awaiting requested review from trangevi trangevi is a code owner

achauhan-scc Awaiting requested review from achauhan-scc achauhan-scc is a code owner

kingernupur Awaiting requested review from kingernupur kingernupur is a code owner

rabollin Awaiting requested review from rabollin rabollin is a code owner

Labels

None yet