【WIP】feat: add ocr scan for full-file code review #93

Open
css521 wants to merge 1 commit into alibaba:main from css521:main
@css521 css521 commented Jun 10, 2026

Description

Introduce a new top-level subcommand ocr scan (alias s) that reviews whole files instead of git diffs. Use cases include reviewing unfamiliar codebases, pre-migration audits, and ad-hoc per-directory reviews.

The change also splits scan and diff review at the package level so the two pipelines can evolve independently. Shared LLM tool-use loop, memory compression, and per-call bookkeeping move into a new internal/llmloop package; both internal/agent (diff review) and the new internal/scan (full-file review) delegate to llmloop.Runner and never import each other.

New / changed packages

| Package | Role |
| --- | --- |
| internal/scan/ (new) | File enumeration via git ls-files, full-scan agent, FULL_SCAN_TASK rendering, preview |
| internal/llmloop/ (new) | Shared LLM tool-use loop, three-zone memory compression, CommentWorkerPool, AgentWarning |
| internal/agent/ | Slimmed: LLM loop / compression / token aggregation moved to llmloop; review-only orchestration remains |
| internal/model/ | New ScanItem (full-file payload) plus Preview / PreviewEntry / ExcludeReason shared by both modes |
| internal/diff/ | New gitignore.go exposing gitignore helpers reused by scan |
| cmd/opencodereview/ | New scan_cmd.go; shared.go consolidates startup (loadCommonContext / loadLLMRuntime), output (emitRunResult, ResultProvider), and stdout silencing (quietHandle); review_cmd.go follows the same shape |

Template additions

  • FULL_SCAN_TASK: dedicated prompt with Tool-call discipline guidance (don't re-read the current file, ≤ 2–3 context calls per finding, batch code_comment, call task_done early) to reduce gratuitous tool calls.
  • FULL_SCAN_MAX_TOOL_REQUEST_TIMES (default 60): scan-only per-file budget, raised over diff's 30 to fit multi-finding files. --max-tools still composes (only raises, never lowers).
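
The "only raises, never lowers" composition of --max-tools with the template default could look roughly like the sketch below. The helper name effectiveToolBudget and its signature are assumptions made for illustration; the PR does not show the actual implementation.

```go
package main

import "fmt"

// effectiveToolBudget is a hypothetical helper: it returns the per-file
// tool-call budget after applying an optional --max-tools override.
// The override can only raise the template default, never lower it.
func effectiveToolBudget(templateDefault, maxToolsFlag int) int {
	if maxToolsFlag > templateDefault {
		return maxToolsFlag
	}
	return templateDefault
}

func main() {
	const scanDefault = 60 // FULL_SCAN_MAX_TOOL_REQUEST_TIMES default stated in this PR

	fmt.Println(effectiveToolBudget(scanDefault, 0))   // flag unset: prints 60
	fmt.Println(effectiveToolBudget(scanDefault, 40))  // lower value ignored: prints 60
	fmt.Println(effectiveToolBudget(scanDefault, 100)) // higher value wins: prints 100
}
```

Under this reading, a user who passes a value below the scan default silently keeps the default, which may be worth surfacing in the flag's help text.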

Other behavioral notes

  • In scan mode, file_read_diff is filtered out of MainToolDefs — its semantics don't apply without a diff, and exposing it just burns tool-call rounds.
  • ocr review behavior is unchanged.
  • internal/agent and internal/scan have zero mutual imports (go list -deps validated).
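
For readers unfamiliar with the tool-filtering step, here is a minimal sketch of what an excludeToolDef-style filter might do. The ToolDef type is a stand-in with an assumed shape, not the project's real type, and the signature is a guess based on the name used in the tests above.

```go
package main

import "fmt"

// ToolDef is a stand-in for the project's tool definition type (assumed shape).
type ToolDef struct {
	Name string
}

// excludeToolDef returns a new slice without any definition called name.
// The input slice is not modified; an absent name yields an equal copy.
func excludeToolDef(defs []ToolDef, name string) []ToolDef {
	out := make([]ToolDef, 0, len(defs))
	for _, d := range defs {
		if d.Name != name {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	mainToolDefs := []ToolDef{
		{Name: "file_read_diff"},
		{Name: "code_comment"},
		{Name: "task_done"},
	}

	scanTools := excludeToolDef(mainToolDefs, "file_read_diff")
	for _, d := range scanTools {
		fmt.Println(d.Name) // prints code_comment, then task_done
	}
	fmt.Println(len(mainToolDefs)) // original list is untouched: prints 3
}
```

Copying rather than filtering in place keeps the shared MainToolDefs list intact for ocr review, which still needs file_read_diff.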

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Refactoring (no functional changes)
  • Documentation update
  • CI / Build / Tooling

How Has This Been Tested?

  • make test passes locally
  • Manual testing (described below)

Unit tests cover:

  • Provider enumeration against a temp git repo (internal/scan/provider_test.go): .gitignore honored, binary placeholder emitted, oversized files skipped, exact-file vs. directory-prefix path filtering.
  • Template rendering: all FULL_SCAN_TASK placeholders substituted, change_files literal injected, no {{...}} leakage (internal/scan/agent_test.go).
  • Token-budget filter and injectScanContentMap adapter (internal/scan/agent_test.go).
  • Flag parsing & validation (cmd/opencodereview/scan_cmd_test.go): --all / --path mutual presence, --audience / --max-tools / --max-git-procs validation, splitPaths, excludeToolDef (including absent name).
  • Template loading (internal/config/template/template_test.go): FULL_SCAN_MAX_TOOL_REQUEST_TIMES correctly deserialized; ApplyLanguage injects directives into FULL_SCAN_TASK.

Manual smoke tests:

# Preview matrix
ocr scan --all --preview
ocr scan --path internal/scan --preview
ocr scan --path internal/scan/agent.go --preview
ocr scan --all --path internal/diff --preview

# Error paths
ocr scan                          # → must specify --all or --path
ocr scan --audience robot         # → invalid --audience

# End-to-end against a small directory; verified:
#   - .opencodereview/sessions/.../*.jsonl  →  review_mode == "full_scan"
#   - prompt contains <current_file_content>; no {{file_content}} / {{diff}} leak
#   - generated comments have populated start_line / end_line

# Regression
ocr review --preview              # output identical to pre-refactor
ocr review --commit HEAD~1        # unchanged

Static dependency assertions:

go list -deps ./internal/agent | grep open-code-review/internal/scan    # empty
go list -deps ./internal/scan  | grep open-code-review/internal/agent   # empty
go list -deps ./internal/llmloop | grep -E 'internal/(agent|scan)'      # empty

Checklist

  • My code follows the project's coding style (go fmt, go vet)
  • I have performed a self-review of my code
  • I have added tests that prove my fix is effective or my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated the documentation accordingly (if applicable)
  • I have signed the CLA

Related Issues

CLAassistant commented Jun 10, 2026

CLA assistant check
All committers have signed the CLA.

Introduce a new top-level subcommand `ocr scan` (alias `s`) that reviews
whole files instead of git diffs. Use cases include reviewing unfamiliar
codebases, pre-migration audits, and ad-hoc per-directory reviews.

Architecture splits scan and diff review at the package level so the two
pipelines can evolve independently:

- internal/scan/      new package: file enumeration via `git ls-files`,
                      full-scan agent, FULL_SCAN_TASK rendering, preview
- internal/llmloop/   new package: shared LLM tool-use loop, three-zone
                      memory compression, CommentWorkerPool, AgentWarning.
                      Both internal/agent and internal/scan delegate to
                      llmloop.Runner; agent and scan never import each other
- internal/agent/     slimmed: LLM loop / compression / token aggregation
                      moved to llmloop; review-only orchestration remains
- internal/model/     new ScanItem (full-file payload) + Preview /
                      PreviewEntry / ExcludeReason shared by both modes
- internal/diff/      new gitignore.go exporting helpers reused by scan
- cmd/opencodereview/ new scan_cmd.go; shared.go consolidates startup
                      (loadCommonContext / loadLLMRuntime), output
                      (emitRunResult, ResultProvider) and stdout silencing
                      (quietHandle); review_cmd.go follows the same shape

Template additions:
- FULL_SCAN_TASK: dedicated prompt with Tool-call discipline guidance to
  reduce gratuitous tool calls per file
- FULL_SCAN_MAX_TOOL_REQUEST_TIMES (default 60): scan-only per-file budget,
  raised over diff's 30 to fit multi-finding files; --max-tools still
  composes (only raise, never lower)

In scan mode, file_read_diff is filtered out of MainToolDefs since it has
no useful semantics without a diff.

Tests cover provider enumeration (with temp git repo), template rendering,
filter passes, dependency budget, flag validation, and excludeToolDef.

@github-actions github-actions Bot left a comment

🔍 OpenCodeReview found 12 issue(s) in this PR.

  • ✅ 12 posted as inline comment(s)
  • 📝 0 posted as summary (missing line info)

Comment on lines +22 to +25
func IsPathExcluded(repoDir, relPath string, patterns []string) bool {
	stub := &Provider{repoDir: repoDir}
	return stub.isPathExcluded(relPath, patterns)
}

The repoDir parameter is accepted but never used — isPathExcluded only reads the package-level providerDirIgnoreDirs and the passed-in patterns; it does not access p.repoDir. This is misleading to callers (e.g., scan/provider.go passes p.repoDir expecting it to matter). Either remove the parameter or document why it's reserved for future use. If kept for forward-compatibility, consider adding a comment explaining it's currently unused.

Comment on lines +18 to +19
// parseReviewFlags already wraps with "parse flags: %w" — return as-is.
return err

Bug: Incorrect comment / inconsistent error wrapping. The comment claims parseReviewFlags already wraps all errors with "parse flags: %w", but that's only true for the a.Parse(args) error (flags.go:136). Validation errors returned later in parseReviewFlags (e.g., "only one review mode allowed", "--to is required when --from is specified", "invalid --audience value") are returned without the "parse flags:" prefix.

The old code wrapped all errors from parseReviewFlags uniformly with fmt.Errorf("parse flags: %w", err), providing consistent context. Removing this wrapper means those validation errors now lose the "parse flags:" prefix, creating inconsistent error messages.

Either restore the wrapper here (it's harmless to double-wrap the parse error), or add the prefix to every error return in parseReviewFlags.
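
A minimal sketch of the first option, wrapping once at the call site. parseFlags and run are simplified stand-ins for parseReviewFlags and its caller, and the sentinel error is an assumption made for the example. It also shows that errors.Is still matches through the wrapper, so callers that inspect the cause are unaffected.

```go
package main

import (
	"errors"
	"fmt"
)

// errModeConflict stands in for one of the validation errors that
// parseReviewFlags returns without a prefix.
var errModeConflict = errors.New("only one review mode allowed")

// parseFlags is a simplified stand-in for parseReviewFlags.
func parseFlags(modes int) error {
	if modes > 1 {
		return errModeConflict
	}
	return nil
}

// run wraps every error from parseFlags uniformly, so both parse
// errors and validation errors carry the same "parse flags:" context.
func run(modes int) error {
	if err := parseFlags(modes); err != nil {
		return fmt.Errorf("parse flags: %w", err)
	}
	return nil
}

func main() {
	err := run(2)
	fmt.Println(err)                             // parse flags: only one review mode allowed
	fmt.Println(errors.Is(err, errModeConflict)) // true
}
```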


go func() {
	defer cancel()
	rebuilt, _ := r.runCompression(asyncCtx, msgSnapshot, filePath)

Bug: Silent data loss on async compression failure. The error from runCompression is discarded here (rebuilt, _). When runCompression fails, it returns msgs[:part.frozenEnd] — a truncated message list containing only the first 2 messages. This truncated result is then stored in job.rebuilt and later applied by tryApplyPendingCompression as if it were a valid compression result, silently discarding all active-zone messages.

Suggestion: Either skip setting job.rebuilt when the error is non-nil, or store the error and handle it in tryApplyPendingCompression.

Suggestion:

Suggested change
- rebuilt, _ := r.runCompression(asyncCtx, msgSnapshot, filePath)
+ rebuilt, err := r.runCompression(asyncCtx, msgSnapshot, filePath)
+ if err != nil {
+ 	// Don't apply a failed/truncated compression result.
+ 	r.compressionMu.Lock()
+ 	if r.pendingJob == job {
+ 		r.pendingJob = nil
+ 	}
+ 	r.compressionMu.Unlock()
+ 	return
+ }

Comment on lines +217 to +221
if err != nil {
	rec.SetError(err, duration)
	fmt.Fprintf(stdout.Writer(), "[ocr] Memory compression failed: %v\n", err)
	return msgs[:part.frozenEnd], fmt.Errorf("memory compression: %w", err)
}

Bug: Data loss on synchronous compression failure. When runCompression returns an error, it also returns msgs[:part.frozenEnd] (only the first 2 messages). The caller in addNextMessage discards the error (*messages, _ = r.runCompression(...)) and replaces the full message list with this truncated version. This means a transient LLM failure during compression permanently destroys the entire conversation context beyond the frozen zone, including the active zone that was intentionally preserved.

On error, the function should return the original msgs unchanged so the caller retains the full conversation history.

Suggestion:

Suggested change
- if err != nil {
- 	rec.SetError(err, duration)
- 	fmt.Fprintf(stdout.Writer(), "[ocr] Memory compression failed: %v\n", err)
- 	return msgs[:part.frozenEnd], fmt.Errorf("memory compression: %w", err)
- }
+ if err != nil {
+ 	rec.SetError(err, duration)
+ 	fmt.Fprintf(stdout.Writer(), "[ocr] Memory compression failed: %v\n", err)
+ 	return msgs, fmt.Errorf("memory compression: %w", err)
+ }
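
To make the failure contract concrete, here is a simplified, self-contained sketch. compress is a hypothetical stand-in for runCompression that collapses everything after the frozen zone into one summary message; the real three-zone logic is more involved, and Message and the summarizer functions are assumptions made for the example. The point is only that the error path hands back the input untouched, so a caller that ignores the error still keeps the full history.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a stand-in for the loop's message type (assumed shape).
type Message struct {
	Role, Text string
}

// compress replaces msgs[frozenEnd:] with a single summary message.
// On failure it returns msgs unchanged together with the error.
func compress(msgs []Message, frozenEnd int, summarize func([]Message) (Message, error)) ([]Message, error) {
	summary, err := summarize(msgs[frozenEnd:])
	if err != nil {
		return msgs, fmt.Errorf("memory compression: %w", err)
	}
	rebuilt := make([]Message, 0, frozenEnd+1)
	rebuilt = append(rebuilt, msgs[:frozenEnd]...)
	return append(rebuilt, summary), nil
}

// failingSummarizer simulates a transient LLM failure.
func failingSummarizer([]Message) (Message, error) {
	return Message{}, errors.New("llm unavailable")
}

// okSummarizer simulates a successful summary call.
func okSummarizer(tail []Message) (Message, error) {
	return Message{Role: "user", Text: fmt.Sprintf("summary of %d messages", len(tail))}, nil
}

func main() {
	msgs := []Message{
		{Role: "system", Text: "s"}, {Role: "user", Text: "task"},
		{Role: "assistant", Text: "a1"}, {Role: "user", Text: "t1"},
	}

	kept, err := compress(msgs, 2, failingSummarizer)
	fmt.Println(len(kept), err != nil) // 4 true: history preserved on failure

	rebuilt, err := compress(msgs, 2, okSummarizer)
	fmt.Println(len(rebuilt), err == nil) // 3 true: frozen zone plus summary
}
```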

for i, m := range msgs {
	sb.WriteString(fmt.Sprintf("<message id=\"%d\" role=\"%s\">\n", i, m.Role))
	sb.WriteString(" <content>\n")
	sb.WriteString(fmt.Sprintf(" %s\n", m.ExtractText()))

Bug: XML injection / malformed XML. Message content is interpolated directly into XML without escaping special characters (<, >, &, "). In a code review context, messages frequently contain code snippets with these characters, which will produce malformed XML and potentially confuse the compression LLM prompt.

Use encoding/xml.EscapeText or html.EscapeString to escape the content before embedding it.

Suggestion:

Suggested change
- sb.WriteString(fmt.Sprintf(" %s\n", m.ExtractText()))
+ sb.WriteString(fmt.Sprintf(" %s\n", html.EscapeString(m.ExtractText())))
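
A standalone sketch of the escaped rendering, for illustration. renderMessage is a hypothetical helper, not a function from this PR; it uses html.EscapeString from the standard library, which replaces <, >, &, ' and " with entities.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// renderMessage is a hypothetical helper that emits one <message> element
// with both the role attribute and the text content escaped.
func renderMessage(id int, role, text string) string {
	var sb strings.Builder
	fmt.Fprintf(&sb, "<message id=\"%d\" role=\"%s\">\n", id, html.EscapeString(role))
	sb.WriteString("  <content>\n")
	fmt.Fprintf(&sb, "    %s\n", html.EscapeString(text))
	sb.WriteString("  </content>\n</message>\n")
	return sb.String()
}

func main() {
	// Code snippets routinely contain characters that are special in XML.
	fmt.Print(renderMessage(0, "user", "if a < b && c {"))
}
```

The content line of the printed element reads `if a &lt; b &amp;&amp; c {`, so the surrounding tags stay unambiguous even when the message quotes source code.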

Comment thread internal/llmloop/loop.go
Comment on lines +170 to +177
if len(calls) == 0 {
	fmt.Fprintf(stdout.Writer(), "[ocr] No tool calls parsed for %s, retrying...\n", newPath)
	messages = append(messages, llm.NewTextMessage("user", "You did not successfully call any tools. Please try again or use task_done if finished."))
	if content != "" {
		messages = append(messages[:len(messages)-1], llm.NewTextMessage("assistant", content), messages[len(messages)-1])
	}
	continue
}

Missing consecutiveEmptyRounds tracking for no-tool-call retries: When the LLM returns no tool calls (len(calls) == 0), the code appends a retry message and continues, but consecutiveEmptyRounds is never incremented in this path. The counter only tracks rounds where tool calls were made but returned no valid data. If the model persistently outputs text without calling any tools, the loop will consume all toolReqCount iterations (each making an expensive LLM API call) before stopping.

Consider incrementing consecutiveEmptyRounds here as well, or adding a separate counter for no-tool-call retries, so the safety break applies uniformly.

Suggestion:

Suggested change
- if len(calls) == 0 {
- 	fmt.Fprintf(stdout.Writer(), "[ocr] No tool calls parsed for %s, retrying...\n", newPath)
- 	messages = append(messages, llm.NewTextMessage("user", "You did not successfully call any tools. Please try again or use task_done if finished."))
- 	if content != "" {
- 		messages = append(messages[:len(messages)-1], llm.NewTextMessage("assistant", content), messages[len(messages)-1])
- 	}
- 	continue
- }
+ if len(calls) == 0 {
+ 	consecutiveEmptyRounds++
+ 	if consecutiveEmptyRounds >= maxConsecutiveEmptyRounds {
+ 		fmt.Fprintf(stdout.Writer(), "[ocr] Too many rounds with no tool calls for %s, stopping.\n", newPath)
+ 		break
+ 	}
+ 	fmt.Fprintf(stdout.Writer(), "[ocr] No tool calls parsed for %s, retrying...\n", newPath)
+ 	messages = append(messages, llm.NewTextMessage("user", "You did not successfully call any tools. Please try again or use task_done if finished."))
+ 	if content != "" {
+ 		messages = append(messages[:len(messages)-1], llm.NewTextMessage("assistant", content), messages[len(messages)-1])
+ 	}
+ 	continue
+ }

Comment thread internal/scan/preview.go
if err != nil {
	return nil, fmt.Errorf("enumerate files: %w", err)
}
a.items = items

Potential issue: This mutates a.items with the unfiltered item list. If the Agent is ever reused after calling Preview() (e.g., calling FilesReviewed(), Diffs(), or even Run()), those methods would operate on stale/unfiltered data. For example, FilesReviewed() would return the total enumerated count rather than the reviewable count.

Consider either:

  1. Not mutating a.items in Preview (use a local variable instead), since Preview is documented as a read-only operation.
  2. Or documenting clearly that Preview must not be called before Run on the same Agent instance.

Suggestion:

Suggested change
- a.items = items
+ // Use a local variable to avoid mutating agent state in a read-only operation.
+ // a.items remains nil/unchanged so FilesReviewed()/Diffs() stay consistent.

Comment thread internal/scan/provider.go
Comment on lines +162 to +169
if p.runner != nil {
	out, err = p.runner.Run(ctx, p.repoDir, cmdArgs...)
} else {
	cmd := exec.CommandContext(ctx, "git", cmdArgs...)
	cmd.Dir = p.repoDir
	raw, runErr := cmd.CombinedOutput()
	out, err = string(raw), runErr
}

Bug: CombinedOutput corrupts NUL-delimited parsing.

When using the -z flag, git ls-files produces NUL-delimited output on stdout. However, CombinedOutput() merges stderr into stdout. If git writes any warnings or diagnostic messages to stderr (e.g., about renamed files, encoding issues, or repository warnings), those bytes will be interleaved with the NUL-delimited file list, producing corrupted filenames.

The gitcmd.Runner already provides an Output() method that captures stdout only. For the fallback path (when p.runner is nil), use cmd.Output() instead of cmd.CombinedOutput(). Consider also using p.runner.Output() for the runner path to keep stdout/stderr separate.

Suggestion:

Suggested change
- if p.runner != nil {
- 	out, err = p.runner.Run(ctx, p.repoDir, cmdArgs...)
- } else {
- 	cmd := exec.CommandContext(ctx, "git", cmdArgs...)
- 	cmd.Dir = p.repoDir
- 	raw, runErr := cmd.CombinedOutput()
- 	out, err = string(raw), runErr
- }
+ var raw []byte
+ if p.runner != nil {
+ 	raw, err = p.runner.Output(ctx, p.repoDir, cmdArgs...)
+ } else {
+ 	cmd := exec.CommandContext(ctx, "git", cmdArgs...)
+ 	cmd.Dir = p.repoDir
+ 	raw, err = cmd.Output()
+ }
+ if err != nil {
+ 	return nil, err
+ }
+ out := string(raw)
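
The corruption is easy to reproduce without running git. The sketch below parses NUL-delimited output the way a `git ls-files -z` consumer typically would; splitNUL is a hypothetical helper and the warning text is made up for the demonstration.

```go
package main

import (
	"fmt"
	"strings"
)

// splitNUL parses NUL-delimited path output (as produced by -z),
// dropping the empty element left by the trailing NUL.
func splitNUL(out string) []string {
	parts := strings.Split(out, "\x00")
	paths := make([]string, 0, len(parts))
	for _, p := range parts {
		if p != "" {
			paths = append(paths, p)
		}
	}
	return paths
}

func main() {
	stdout := "a.go\x00dir/b.go\x00"
	stderr := "warning: something\n" // invented diagnostic, for illustration only

	// stdout only: clean path list.
	fmt.Printf("%q\n", splitNUL(stdout)) // ["a.go" "dir/b.go"]

	// stderr merged ahead of stdout: the warning fuses with the first path.
	fmt.Printf("%q\n", splitNUL(stderr+stdout)) // ["warning: something\na.go" "dir/b.go"]
}
```

Because the parser has no way to tell diagnostic bytes from path bytes, the fused entry would later fail to open or, worse, be reported as a real file name.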

Comment thread internal/scan/provider.go

seen := make(map[string]struct{}, len(tracked)+len(untracked))
all := make([]string, 0, len(tracked)+len(untracked))
for _, f := range append(tracked, untracked...) {

Potential slice mutation bug: append(tracked, untracked...) may corrupt tracked.

If the tracked slice returned by gitLs has excess capacity (which is possible since gitLs builds its result with make([]string, 0, len(raw)) where raw could be larger than the filtered result), then append(tracked, untracked...) will write untracked elements into tracked's underlying array beyond its length. While tracked isn't reused after this point in the current code, this is a fragile pattern that can silently cause bugs during future refactoring.

Use a separate iteration approach instead:

Suggestion:

Suggested change
- for _, f := range append(tracked, untracked...) {
+ for _, f := range slices.Concat(tracked, untracked) {
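
The aliasing hazard can be shown in isolation. In the sketch below the slice contents are invented; the two helper functions exist only for the demonstration. slices.Concat requires Go 1.22 or newer and always allocates a fresh backing array.

```go
package main

import (
	"fmt"
	"slices"
)

// appendDemo appends twice to a slice that has spare capacity and returns
// what the first result holds afterwards.
func appendDemo() string {
	tracked := make([]string, 2, 4) // len 2, cap 4: spare capacity
	tracked[0], tracked[1] = "a.go", "b.go"

	first := append(tracked, "x.go") // reuses tracked's backing array
	_ = append(tracked, "u.go")      // overwrites the same slot
	return first[2]
}

// concatDemo does the same with slices.Concat, which copies into new storage.
func concatDemo() string {
	tracked := make([]string, 2, 4)
	tracked[0], tracked[1] = "a.go", "b.go"

	first := slices.Concat(tracked, []string{"x.go"})
	_ = slices.Concat(tracked, []string{"u.go"})
	return first[2]
}

func main() {
	fmt.Println(appendDemo()) // u.go: the first result was silently changed
	fmt.Println(concatDemo()) // x.go: results are independent
}
```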

Comment thread internal/scan/agent.go
Comment on lines +184 to +191
func (a *Agent) lookupDiff(path string) *model.Diff {
	for i := range a.items {
		if a.items[i].Path == path {
			return a.items[i].AsDiff()
		}
	}
	return nil
}

Performance: lookupDiff performs a linear scan over a.items for every comment that needs line-number resolution. Since this is called from the LLM loop's tool execution path (potentially multiple times per file, across many concurrent files), the O(n) lookup can become a bottleneck for large scans with hundreds of files.

Consider building a map[string]*model.Diff index once during initialization (e.g., in Run() after filtering) and using O(1) map lookups here instead.

Suggestion:

Suggested change
- func (a *Agent) lookupDiff(path string) *model.Diff {
- 	for i := range a.items {
- 		if a.items[i].Path == path {
- 			return a.items[i].AsDiff()
- 		}
- 	}
- 	return nil
- }
+ func (a *Agent) lookupDiff(path string) *model.Diff {
+ 	if d, ok := a.diffIndex[path]; ok {
+ 		return d
+ 	}
+ 	return nil
+ }
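
The suggestion above relies on a diffIndex field that does not exist yet. One way to populate it is sketched below; Diff, ScanItem, and buildDiffIndex are simplified stand-ins or hypothetical names, not the project's actual definitions.

```go
package main

import "fmt"

// Diff and ScanItem are stand-ins for model.Diff and model.ScanItem.
type Diff struct{ Path string }

type ScanItem struct{ Path string }

// AsDiff mirrors the adapter method referenced in the review comment.
func (s ScanItem) AsDiff() *Diff { return &Diff{Path: s.Path} }

type Agent struct {
	items     []ScanItem
	diffIndex map[string]*Diff
}

// buildDiffIndex is a hypothetical helper, called once after filtering
// and before any worker starts, so later reads need no locking.
func (a *Agent) buildDiffIndex() {
	a.diffIndex = make(map[string]*Diff, len(a.items))
	for i := range a.items {
		a.diffIndex[a.items[i].Path] = a.items[i].AsDiff()
	}
}

// lookupDiff is an O(1) map read; a missing path (or a nil map) yields nil.
func (a *Agent) lookupDiff(path string) *Diff {
	return a.diffIndex[path]
}

func main() {
	a := &Agent{items: []ScanItem{{Path: "internal/scan/agent.go"}, {Path: "internal/scan/provider.go"}}}
	a.buildDiffIndex()

	fmt.Println(a.lookupDiff("internal/scan/agent.go").Path) // internal/scan/agent.go
	fmt.Println(a.lookupDiff("missing.go") == nil)           // true
}
```

One behavioral difference to keep in mind: the original loop returns a fresh AsDiff() value per call, while the index hands out the same pointer every time, so this only holds if callers treat the result as read-only.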

2 participants