【WIP】feat: add ocr scan for full-file code review by css521 · Pull Request #93 · alibaba/open-code-review

css521 · 2026-06-10T07:43:15Z

Description

Introduce a new top-level subcommand ocr scan (alias s) that reviews whole files instead of git diffs. Use cases include reviewing unfamiliar codebases, pre-migration audits, and ad-hoc per-directory reviews.

The change also splits scan and diff review at the package level so the two pipelines can evolve independently. Shared LLM tool-use loop, memory compression, and per-call bookkeeping move into a new internal/llmloop package; both internal/agent (diff review) and the new internal/scan (full-file review) delegate to llmloop.Runner and never import each other.

New / changed packages

Package	Role
`internal/scan/` (new)	File enumeration via `git ls-files`, full-scan agent, `FULL_SCAN_TASK` rendering, preview
`internal/llmloop/` (new)	Shared LLM tool-use loop, three-zone memory compression, `CommentWorkerPool`, `AgentWarning`
`internal/agent/`	Slimmed: LLM loop / compression / token aggregation moved to `llmloop`; review-only orchestration remains
`internal/model/`	New `ScanItem` (full-file payload) plus `Preview` / `PreviewEntry` / `ExcludeReason` shared by both modes
`internal/diff/`	New `gitignore.go` exposing gitignore helpers reused by scan
`cmd/opencodereview/`	New `scan_cmd.go`; `shared.go` consolidates startup (`loadCommonContext` / `loadLLMRuntime`), output (`emitRunResult`, `ResultProvider`), and stdout silencing (`quietHandle`); `review_cmd.go` follows the same shape

Template additions

FULL_SCAN_TASK: dedicated prompt with Tool-call discipline guidance (don't re-read the current file, ≤ 2–3 context calls per finding, batch code_comment, call task_done early) to reduce gratuitous tool calls.
FULL_SCAN_MAX_TOOL_REQUEST_TIMES (default 60): scan-only per-file budget, raised over diff's 30 to fit multi-finding files. --max-tools still composes (only raises, never lowers).

Other behavioral notes

In scan mode, file_read_diff is filtered out of MainToolDefs — its semantics don't apply without a diff, and exposing it just burns tool-call rounds.
ocr review behavior is unchanged.
internal/agent and internal/scan have zero mutual imports (go list -deps validated).

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Refactoring (no functional changes)
Documentation update
CI / Build / Tooling

How Has This Been Tested?

make test passes locally
Manual testing (described below)

Unit tests cover:

Provider enumeration against a temp git repo (internal/scan/provider_test.go): .gitignore honored, binary placeholder emitted, oversized files skipped, exact-file vs. directory-prefix path filtering.
Template rendering: all FULL_SCAN_TASK placeholders substituted, change_files literal injected, no {{...}} leakage (internal/scan/agent_test.go).
Token-budget filter and injectScanContentMap adapter (internal/scan/agent_test.go).
Flag parsing & validation (cmd/opencodereview/scan_cmd_test.go): --all / --path mutual presence, --audience / --max-tools / --max-git-procs validation, splitPaths, excludeToolDef (including absent name).
Template loading (internal/config/template/template_test.go): FULL_SCAN_MAX_TOOL_REQUEST_TIMES correctly deserialized; ApplyLanguage injects directives into FULL_SCAN_TASK.

Manual smoke tests:

# Preview matrix
ocr scan --all --preview
ocr scan --path internal/scan --preview
ocr scan --path internal/scan/agent.go --preview
ocr scan --all --path internal/diff --preview

# Error paths
ocr scan                          # → must specify --all or --path
ocr scan --audience robot         # → invalid --audience

# End-to-end against a small directory; verified:
#   - .opencodereview/sessions/.../*.jsonl  →  review_mode == "full_scan"
#   - prompt contains <current_file_content>; no {{file_content}} / {{diff}} leak
#   - generated comments have populated start_line / end_line

# Regression
ocr review --preview              # output identical to pre-refactor
ocr review --commit HEAD~1        # unchanged

Static dependency assertions:

go list -deps ./internal/agent | grep open-code-review/internal/scan    # empty
go list -deps ./internal/scan  | grep open-code-review/internal/agent   # empty
go list -deps ./internal/llmloop | grep -E 'internal/(agent|scan)'      # empty

Checklist

My code follows the project's coding style (go fmt, go vet)
I have performed a self-review of my code
I have added tests that prove my fix is effective or my feature works
New and existing unit tests pass locally with my changes
I have updated the documentation accordingly (if applicable)
I have signed the CLA

Related Issues

CLAassistant · 2026-06-10T08:03:27Z

All committers have signed the CLA.

Introduce a new top-level subcommand `ocr scan` (alias `s`) that reviews whole files instead of git diffs. Use cases include reviewing unfamiliar codebases, pre-migration audits, and ad-hoc per-directory reviews. Architecture splits scan and diff review at the package level so the two pipelines can evolve independently: - internal/scan/ new package: file enumeration via `git ls-files`, full-scan agent, FULL_SCAN_TASK rendering, preview - internal/llmloop/ new package: shared LLM tool-use loop, three-zone memory compression, CommentWorkerPool, AgentWarning. Both internal/agent and internal/scan delegate to llmloop.Runner; agent and scan never import each other - internal/agent/ slimmed: LLM loop / compression / token aggregation moved to llmloop; review-only orchestration remains - internal/model/ new ScanItem (full-file payload) + Preview / PreviewEntry / ExcludeReason shared by both modes - internal/diff/ new gitignore.go exporting helpers reused by scan - cmd/opencodereview/ new scan_cmd.go; shared.go consolidates startup (loadCommonContext / loadLLMRuntime), output (emitRunResult, ResultProvider) and stdout silencing (quietHandle); review_cmd.go follows the same shape Template additions: - FULL_SCAN_TASK: dedicated prompt with Tool-call discipline guidance to reduce gratuitous tool calls per file - FULL_SCAN_MAX_TOOL_REQUEST_TIMES (default 60): scan-only per-file budget, raised over diff's 30 to fit multi-finding files; --max-tools still composes (only raise, never lower) In scan mode, file_read_diff is filtered out of MainToolDefs since it has no useful semantics without a diff. Tests cover provider enumeration (with temp git repo), template rendering, filter passes, dependency budget, flag validation, and excludeToolDef.

github-actions

🔍 OpenCodeReview found 12 issue(s) in this PR.

✅ 12 posted as inline comment(s)
📝 0 posted as summary (missing line info)

github-actions · 2026-06-10T09:16:27Z

+func IsPathExcluded(repoDir, relPath string, patterns []string) bool {
+	stub := &Provider{repoDir: repoDir}
+	return stub.isPathExcluded(relPath, patterns)
+}


The repoDir parameter is accepted but never used — isPathExcluded only reads the package-level providerDirIgnoreDirs and the passed-in patterns; it does not access p.repoDir. This is misleading to callers (e.g., scan/provider.go passes p.repoDir expecting it to matter). Either remove the parameter or document why it's reserved for future use. If kept for forward-compatibility, consider adding a comment explaining it's currently unused.

github-actions · 2026-06-10T09:16:27Z

+		// parseReviewFlags already wraps with "parse flags: %w" — return as-is.
+		return err


Bug: Incorrect comment / inconsistent error wrapping. The comment claims parseReviewFlags already wraps all errors with "parse flags: %w", but that's only true for the a.Parse(args) error (flags.go:136). Validation errors returned later in parseReviewFlags (e.g., "only one review mode allowed", "--to is required when --from is specified", "invalid --audience value") are returned without the "parse flags:" prefix.

The old code wrapped all errors from parseReviewFlags uniformly with fmt.Errorf("parse flags: %w", err), providing consistent context. Removing this wrapper means those validation errors now lose the "parse flags:" prefix, creating inconsistent error messages.

Either restore the wrapper here (it's harmless to double-wrap the parse error), or add the prefix to every error return in parseReviewFlags.

github-actions · 2026-06-10T09:16:27Z

+
+	go func() {
+		defer cancel()
+		rebuilt, _ := r.runCompression(asyncCtx, msgSnapshot, filePath)


Bug: Silent data loss on async compression failure. The error from runCompression is discarded here (rebuilt, _). When runCompression fails, it returns msgs[:part.frozenEnd] — a truncated message list containing only the first 2 messages. This truncated result is then stored in job.rebuilt and later applied by tryApplyPendingCompression as if it were a valid compression result, silently discarding all active-zone messages.

Suggestion: Either skip setting job.rebuilt when the error is non-nil, or store the error and handle it in tryApplyPendingCompression.

Suggestion:

Suggested change

rebuilt, _ := r.runCompression(asyncCtx, msgSnapshot, filePath)

rebuilt, err := r.runCompression(asyncCtx, msgSnapshot, filePath)

if err != nil {

// Don't apply a failed/truncated compression result.

r.compressionMu.Lock()

if r.pendingJob == job {

r.pendingJob = nil

}

r.compressionMu.Unlock()

return

}

github-actions · 2026-06-10T09:16:27Z

+	if err != nil {
+		rec.SetError(err, duration)
+		fmt.Fprintf(stdout.Writer(), "[ocr] Memory compression failed: %v\n", err)
+		return msgs[:part.frozenEnd], fmt.Errorf("memory compression: %w", err)
+	}


Bug: Data loss on synchronous compression failure. When runCompression returns an error, it also returns msgs[:part.frozenEnd] (only the first 2 messages). The caller in addNextMessage discards the error (*messages, _ = r.runCompression(...)) and replaces the full message list with this truncated version. This means a transient LLM failure during compression permanently destroys the entire conversation context beyond the frozen zone, including the active zone that was intentionally preserved.

On error, the function should return the original msgs unchanged so the caller retains the full conversation history.

Suggestion:

Suggested change

if err != nil {

rec.SetError(err, duration)

fmt.Fprintf(stdout.Writer(), "[ocr] Memory compression failed: %v\n", err)

return msgs[:part.frozenEnd], fmt.Errorf("memory compression: %w", err)

}

if err != nil {

rec.SetError(err, duration)

fmt.Fprintf(stdout.Writer(), "[ocr] Memory compression failed: %v\n", err)

return msgs, fmt.Errorf("memory compression: %w", err)

}

github-actions · 2026-06-10T09:16:27Z

+	for i, m := range msgs {
+		sb.WriteString(fmt.Sprintf("<message id=\"%d\" role=\"%s\">\n", i, m.Role))
+		sb.WriteString("    <content>\n")
+		sb.WriteString(fmt.Sprintf("      %s\n", m.ExtractText()))


Bug: XML injection / malformed XML. Message content is interpolated directly into XML without escaping special characters (<, >, &, "). In a code review context, messages frequently contain code snippets with these characters, which will produce malformed XML and potentially confuse the compression LLM prompt.

Use encoding/xml.EscapeText or html.EscapeString to escape the content before embedding it.

Suggestion:

Suggested change

sb.WriteString(fmt.Sprintf(" %s\n", m.ExtractText()))

sb.WriteString(fmt.Sprintf(" %s\n", html.EscapeString(m.ExtractText())))

github-actions · 2026-06-10T09:16:27Z

+		if len(calls) == 0 {
+			fmt.Fprintf(stdout.Writer(), "[ocr] No tool calls parsed for %s, retrying...\n", newPath)
+			messages = append(messages, llm.NewTextMessage("user", "You did not successfully call any tools. Please try again or use task_done if finished."))
+			if content != "" {
+				messages = append(messages[:len(messages)-1], llm.NewTextMessage("assistant", content), messages[len(messages)-1])
+			}
+			continue
+		}


Missing consecutiveEmptyRounds tracking for no-tool-call retries: When the LLM returns no tool calls (len(calls) == 0), the code appends a retry message and continues, but consecutiveEmptyRounds is never incremented in this path. The counter only tracks rounds where tool calls were made but returned no valid data. If the model persistently outputs text without calling any tools, the loop will consume all toolReqCount iterations (each making an expensive LLM API call) before stopping.

Consider incrementing consecutiveEmptyRounds here as well, or adding a separate counter for no-tool-call retries, so the safety break applies uniformly.

Suggestion:

Suggested change

if len(calls) == 0 {

fmt.Fprintf(stdout.Writer(), "[ocr] No tool calls parsed for %s, retrying...\n", newPath)

messages = append(messages, llm.NewTextMessage("user", "You did not successfully call any tools. Please try again or use task_done if finished."))

if content != "" {

messages = append(messages[:len(messages)-1], llm.NewTextMessage("assistant", content), messages[len(messages)-1])

}

continue

}

if len(calls) == 0 {

consecutiveEmptyRounds++

if consecutiveEmptyRounds >= maxConsecutiveEmptyRounds {

fmt.Fprintf(stdout.Writer(), "[ocr] Too many rounds with no tool calls for %s, stopping.\n", newPath)

break

}

fmt.Fprintf(stdout.Writer(), "[ocr] No tool calls parsed for %s, retrying...\n", newPath)

messages = append(messages, llm.NewTextMessage("user", "You did not successfully call any tools. Please try again or use task_done if finished."))

if content != "" {

messages = append(messages[:len(messages)-1], llm.NewTextMessage("assistant", content), messages[len(messages)-1])

}

continue

}

github-actions · 2026-06-10T09:16:27Z

+	if err != nil {
+		return nil, fmt.Errorf("enumerate files: %w", err)
+	}
+	a.items = items


Potential issue: This mutates a.items with the unfiltered item list. If the Agent is ever reused after calling Preview() (e.g., calling FilesReviewed(), Diffs(), or even Run()), those methods would operate on stale/unfiltered data. For example, FilesReviewed() would return the total enumerated count rather than the reviewable count.

Consider either:

Not mutating a.items in Preview (use a local variable instead), since Preview is documented as a read-only operation.

Or documenting clearly that Preview must not be called before Run on the same Agent instance.

Suggestion:

Suggested change

a.items = items

// Use a local variable to avoid mutating agent state in a read-only operation.

// a.items remains nil/unchanged so FilesReviewed()/Diffs() stay consistent.

github-actions · 2026-06-10T09:16:27Z

+	if p.runner != nil {
+		out, err = p.runner.Run(ctx, p.repoDir, cmdArgs...)
+	} else {
+		cmd := exec.CommandContext(ctx, "git", cmdArgs...)
+		cmd.Dir = p.repoDir
+		raw, runErr := cmd.CombinedOutput()
+		out, err = string(raw), runErr
+	}


Bug: CombinedOutput corrupts NUL-delimited parsing.

When using the -z flag, git ls-files produces NUL-delimited output on stdout. However, CombinedOutput() merges stderr into stdout. If git writes any warnings or diagnostic messages to stderr (e.g., about renamed files, encoding issues, or repository warnings), those bytes will be interleaved with the NUL-delimited file list, producing corrupted filenames.

The gitcmd.Runner already provides an Output() method that captures stdout only. For the fallback path (when p.runner is nil), use cmd.Output() instead of cmd.CombinedOutput(). Consider also using p.runner.Output() for the runner path to keep stdout/stderr separate.

Suggestion:

Suggested change

if p.runner != nil {

out, err = p.runner.Run(ctx, p.repoDir, cmdArgs...)

} else {

cmd := exec.CommandContext(ctx, "git", cmdArgs...)

cmd.Dir = p.repoDir

raw, runErr := cmd.CombinedOutput()

out, err = string(raw), runErr

}

var raw []byte

if p.runner != nil {

raw, err = p.runner.Output(ctx, p.repoDir, cmdArgs...)

} else {

cmd := exec.CommandContext(ctx, "git", cmdArgs...)

cmd.Dir = p.repoDir

raw, err = cmd.Output()

}

if err != nil {

return nil, err

}

out := string(raw)

github-actions · 2026-06-10T09:16:28Z

+
+	seen := make(map[string]struct{}, len(tracked)+len(untracked))
+	all := make([]string, 0, len(tracked)+len(untracked))
+	for _, f := range append(tracked, untracked...) {


Potential slice mutation bug: append(tracked, untracked...) may corrupt tracked.

If the tracked slice returned by gitLs has excess capacity (which is possible since gitLs builds its result with make([]string, 0, len(raw)) where raw could be larger than the filtered result), then append(tracked, untracked...) will write untracked elements into tracked's underlying array beyond its length. While tracked isn't reused after this point in the current code, this is a fragile pattern that can silently cause bugs during future refactoring.

Use a separate iteration approach instead:

Suggestion:

Suggested change

for _, f := range append(tracked, untracked...) {

for _, f := range slices.Concat(tracked, untracked) {

github-actions · 2026-06-10T09:16:28Z

+func (a *Agent) lookupDiff(path string) *model.Diff {
+	for i := range a.items {
+		if a.items[i].Path == path {
+			return a.items[i].AsDiff()
+		}
+	}
+	return nil
+}


Performance: lookupDiff performs a linear scan over a.items for every comment that needs line-number resolution. Since this is called from the LLM loop's tool execution path (potentially multiple times per file, across many concurrent files), the O(n) lookup can become a bottleneck for large scans with hundreds of files.

Consider building a map[string]*model.Diff index once during initialization (e.g., in Run() after filtering) and using O(1) map lookups here instead.

Suggestion:

Suggested change

func (a *Agent) lookupDiff(path string) *model.Diff {

for i := range a.items {

if a.items[i].Path == path {

return a.items[i].AsDiff()

}

}

return nil

}

func (a *Agent) lookupDiff(path string) *model.Diff {

if d, ok := a.diffIndex[path]; ok {

return d

}

return nil

}

css521 force-pushed the main branch from 8934451 to 64323c6 Compare June 10, 2026 08:16

github-actions Bot reviewed Jun 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

【WIP】feat: add ocr scan for full-file code review#93

【WIP】feat: add ocr scan for full-file code review#93
css521 wants to merge 1 commit into
alibaba:mainfrom
css521:main

css521 commented Jun 10, 2026 •

edited

Loading

Uh oh!

CLAassistant commented Jun 10, 2026 •

edited

Loading

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot Jun 10, 2026

Uh oh!

github-actions Bot Jun 10, 2026

Uh oh!

github-actions Bot Jun 10, 2026

Uh oh!

github-actions Bot Jun 10, 2026

Uh oh!

github-actions Bot Jun 10, 2026

Uh oh!

github-actions Bot Jun 10, 2026

Uh oh!

github-actions Bot Jun 10, 2026

Uh oh!

github-actions Bot Jun 10, 2026

Uh oh!

github-actions Bot Jun 10, 2026

Uh oh!

github-actions Bot Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		// parseReviewFlags already wraps with "parse flags: %w" — return as-is.
		return err

-		rebuilt, _ := r.runCompression(asyncCtx, msgSnapshot, filePath)
+		rebuilt, err := r.runCompression(asyncCtx, msgSnapshot, filePath)
+		if err != nil {
+			// Don't apply a failed/truncated compression result.
+			r.compressionMu.Lock()
+			if r.pendingJob == job {
+				r.pendingJob = nil
+			}
+			r.compressionMu.Unlock()
+			return
+		}

	sb.WriteString(fmt.Sprintf(" %s\n", m.ExtractText()))
	sb.WriteString(fmt.Sprintf(" %s\n", html.EscapeString(m.ExtractText())))

	a.items = items
	// Use a local variable to avoid mutating agent state in a read-only operation.
	// a.items remains nil/unchanged so FilesReviewed()/Diffs() stay consistent.

	for _, f := range append(tracked, untracked...) {
	for _, f := range slices.Concat(tracked, untracked) {

Conversation

css521 commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

New / changed packages

Template additions

Other behavioral notes

Type of Change

How Has This Been Tested?

Checklist

Related Issues

Uh oh!

CLAassistant commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

css521 commented Jun 10, 2026 •

edited

Loading

CLAassistant commented Jun 10, 2026 •

edited

Loading