Scope: 8 files, ~10,000 lines — tracing message → response for both local and cloud paths
Files: agenticChat.js, llmEngine.js, agenticChatHelpers.js, mcpToolParser.js, mcpToolServer.js, modelProfiles.js, modelDetection.js, constants.js
Focus: native function calling vs legacy text parsing, context budget calculations, progressive tool disclosure state machine, evaluateResponse/classifyResponseFailure cycles, grammar enabled/disabled transitions
The pipeline is well-architected with layered defenses (transactional rollback, progressive disclosure, context compaction, fabrication detection). However, the audit found 3 bugs, 20 risks, and 2 dead-code items across the 8 files.
The most critical finding is a rollback counter exhaustion bug that silently disables the transactional rollback system for the remainder of a session after the first retry budget is spent.
BUG-1: Rollback counter never resets after exhaustion
Classification: BUG — Silent logic failure
Severity: HIGH
Affects: Local agentic loop
The rollbackRetries counter resets to 0 only when evaluateResponse() returns COMMIT (line 1785). When the budget is exhausted (rollbackRetries >= maxRollbackRetries) and a ROLLBACK verdict is returned:
- The ROLLBACK branch doesn't execute (condition fails)
- The COMMIT reset doesn't execute (verdict is ROLLBACK, not COMMIT)
- The bad response falls through and gets appended to fullResponseText
- rollbackRetries remains at maxRollbackRetries permanently
Consequence: After the first exhaustion event, ALL subsequent ROLLBACK verdicts for every remaining iteration are silently accepted. The transactional safety system is permanently disabled for the session.
Fix: Reset rollbackRetries = 0 at the start of each iteration (after iteration++), or add an else branch after the ROLLBACK if-block:

    } else if (responseVerdict.verdict === 'ROLLBACK') {
      // Budget exhausted — accept response but reset for next iteration
      console.log(`[AI Chat] ROLLBACK budget exhausted — accepting response`);
      rollbackRetries = 0;
      consecutiveEmptyGrammarRetries = 0;
    }

BUG-2: Temperature permanently lowered by rollback
Classification: BUG — State mutation leak
Severity: MEDIUM
Affects: Local agentic loop
On the first rollback retry:

    if (context?.params) context.params.temperature = Math.max((context.params.temperature || 0.7) - 0.2, 0.1);

This mutates context.params.temperature in place. Since context is the IPC event payload shared across the entire handler, the lowered temperature persists for all subsequent iterations, not just the retry. After several rollback events, the temperature can be driven down to the 0.1 floor and stay there.
Fix: Store the original temperature before the loop and restore it after rollback retries:

    // Capture once, before the agentic loop starts
    const originalTemp = context?.params?.temperature;
    // ... after COMMIT ...
    if (originalTemp !== undefined && context?.params) context.params.temperature = originalTemp;

RISK-1: Duplicate task type detection diverges
Classification: RISK — Behavioral inconsistency
Severity: MEDIUM
Cloud cloudTaskType (line 350) and local detectTaskType() (line 880) have different patterns:
| Aspect | Cloud | Local |
|---|---|---|
| Greeting includes help | Yes | No |
| Short message threshold | < 20 chars | < 15 chars |
| Casual question pattern | Broad (can\|do\|does\|is\|are\|will\|would\|should\|could + you\|weather\|time\|...) | Narrow (what\|who\|how + is\|are\|do\|does + your\|you + name\|favorite\|...) |
| Action word escalation | None (falls to general) | Extensive (actionWords regex with 30+ terms) |
| Code task detection | Via general fallback only | Dedicated codeWords pattern |
The same message can be classified differently on cloud vs local, resulting in different tool sets being provided.
Fix: Extract a single detectTaskType(message) function and use it in both paths.
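A minimal sketch of what a shared classifier could look like, assuming it lives in a module imported by both the cloud and local paths. The regexes, the 15-character threshold, and the returned labels are placeholders chosen for illustration; the real patterns from lines 350 and 880 are not reproduced in this audit and would need to be merged deliberately.

```javascript
// Hypothetical shared classifier. Every pattern below is a placeholder,
// not the actual cloud or local regex from agenticChat.js.
const GREETING = /^(hi|hello|hey)\b/i;
const CODE_WORDS = /\b(function|class|bug|refactor|compile)\b/i;
const ACTION_WORDS = /\b(create|write|open|run|search|fix)\b/i;
const SHORT_MESSAGE_LIMIT = 15; // assumed; cloud currently uses 20, local 15

function detectTaskType(message) {
  const text = String(message ?? '').trim();
  if (CODE_WORDS.test(text)) return 'code';       // dedicated code detection
  if (ACTION_WORDS.test(text)) return 'general';  // action words escalate past chat
  if (GREETING.test(text) || text.length < SHORT_MESSAGE_LIMIT) return 'chat';
  return 'general';
}

// Both paths would call the same function, so one message gets one label.
const cloudTaskType = detectTaskType('open the browser and search for flights');
const localTaskType = detectTaskType('open the browser and search for flights');
```

With a single function, any change to a threshold or pattern applies to both paths at once, which is the property the divergence table shows is currently missing.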
RISK-2: evaluateResponse vs classifyResponseFailure — inconsistent refusal thresholds (Lines 1727 + helpers:760)
Classification: RISK — Conflicting evaluation
Severity: MEDIUM
- evaluateResponse() rolls back refusals up to iteration <= 5
- classifyResponseFailure() nudges refusals up to iteration < 10
At iterations 6–9: a refusal is COMMITTED (accepted into response text), then classifyResponseFailure detects it as a failure and tries to nudge. The committed refusal text remains in fullResponseText and chatHistory, potentially reinforcing the refusal pattern via attention.
Recommendation: Align thresholds — either both use 5 or both use 10.
RISK-3: Redundant nonContextRetries guard
Classification: RISK — Code smell (not a bug)
Severity: LOW
nonContextRetries is declared at line 1378 with let nonContextRetries = 0;. Line 1681 if (!nonContextRetries) nonContextRetries = 0; is redundant — it guards against a problem that doesn't exist. Misleading for future maintainers.
RISK-4: Context budget overshooting
Classification: RISK — Over-allocation
Severity: LOW
buildStaticPrompt() allocates against maxPromptTokens. buildDynamicContext() allocates against Math.floor(maxPromptTokens * 0.4). These are added together in the user message, so the total prompt can be up to 1.4 × maxPromptTokens before tool feedback is added.
Context overflow is handled gracefully (rotation/compaction), so this doesn't crash. But it means early iterations routinely overshoot the budget, triggering unnecessary compaction.
Recommendation: Share a single token budget between static and dynamic builders.
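The arithmetic behind the overshoot, and one way a shared budget might be split, can be sketched as follows. splitPromptBudget is a hypothetical helper and the 4000-token figure is an arbitrary example; only the 0.4 share comes from the audit.

```javascript
// Hypothetical helper: carve the dynamic share out of one budget
// instead of granting 100% to the static builder plus 40% on top.
function splitPromptBudget(maxPromptTokens, dynamicShare = 0.4) {
  const dynamicBudget = Math.floor(maxPromptTokens * dynamicShare);
  const staticBudget = maxPromptTokens - dynamicBudget;
  return { staticBudget, dynamicBudget };
}

const maxPromptTokens = 4000; // example value only

// Current behavior as described: both builders allocate independently.
const currentWorstCase = maxPromptTokens + Math.floor(maxPromptTokens * 0.4);

// Shared budget: the two parts always sum to maxPromptTokens.
const shared = splitPromptBudget(maxPromptTokens);
```

Under these assumptions the current worst case is 1.4 × maxPromptTokens, while the shared split can never exceed maxPromptTokens before tool feedback is added.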
RISK-5: Cloud deadline never enforced
Classification: RISK — Dead guard
Severity: LOW
WALL_CLOCK_DEADLINE = Date.now() + 30 * 60 * 1000 is defined but never checked anywhere in the cloud loop body. The 500-iteration limit is the only guard.
Fix: Add deadline check at the top of the cloud loop:

    if (Date.now() > WALL_CLOCK_DEADLINE) { console.log('Cloud deadline hit'); break; }

RISK-6: fullResponseText transient overshoot
Classification: RISK — Transient memory spike
Severity: LOW
The 2MB cap on fullResponseText is applied at line ~2410, but tool feedback (browser snapshots, file contents) is appended earlier. Between append and cap-check, fullResponseText can transiently exceed 2MB.
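One way to avoid the transient spike is to enforce the cap at append time rather than later in the loop. The sketch below assumes the cap is measured in string length (UTF-16 code units); appendCapped and MAX_RESPONSE_CHARS are hypothetical names, not identifiers from agenticChat.js.

```javascript
// Assumed to mirror the 2MB cap; measured in string length, not bytes.
const MAX_RESPONSE_CHARS = 2 * 1024 * 1024;

// Hypothetical helper: never lets the buffer grow past `limit`,
// slicing the chunk before concatenation so no oversized string is built.
function appendCapped(buffer, chunk, limit = MAX_RESPONSE_CHARS) {
  const room = limit - buffer.length;
  if (room <= 0) return buffer;
  return chunk.length <= room ? buffer + chunk : buffer + chunk.slice(0, room);
}

// Usage with a tiny limit to make the behavior visible.
let fullResponseText = 'abc';
fullResponseText = appendCapped(fullResponseText, 'defgh', 5);
```

Every append site (tool feedback included) would need to go through the helper for the guarantee to hold.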
- Transactional checkpoint/restore (lines 1510–1530) — clean save/restore of chatHistory + lastEvaluation
- Progressive tool disclosure via getProgressiveTools() — reduces decision space without losing capability
- Chat-type hard gate (line ~1968) — blocks hallucinated tool calls for casual messages
- Stuck/cycle detection (lines 2480–2530) — catches repeated tool calls and tool call sequences
- Fabrication detection and auto-correction (lines 2430–2500) — validates written file content against gathered data
- Auto-snapshot injection after browser actions — model always has fresh element refs
- Unified tool pipeline for both native and text-parsed paths — no divergent execution code
RISK-7: chatHistory orphan on abort
Classification: RISK — Stale state
Severity: MEDIUM
In generateWithFunctions(), the user message is pushed to chatHistory at line 1556 before generation. On AbortError, the function returns without popping the user message. This leaves a dangling user message in chatHistory with no corresponding model response.
The agenticChat.js checkpoint system mitigates this for ROLLBACK cases, but for user-initiated cancellations (abort without rollback), the orphaned message persists.
Fix: Pop the user message in the catch block:

    if (error.name === 'AbortError') {
      // Drop the user message that was pushed before generation started
      if (this.chatHistory[this.chatHistory.length - 1]?.type === 'user') {
        this.chatHistory.pop();
      }
      return { text: fullResponse, response: fullResponse, functionCalls: collectedFunctionCalls, stopReason: 'abort' };
    }

RISK-8: Context minimum 8192 on low-RAM
Classification: RISK — OOM on low-RAM systems
Severity: LOW
Math.max(recommended, 8192) forces a minimum 8192-token context even when RAM calculations suggest less. On systems with <4GB free RAM and a large model loaded, this could push KV cache memory past available RAM.
Mitigated by: node-llama-cpp's failedCreationRemedy (6 retries with 16% auto-shrink).
RISK-9: 120s timeout tight for grammar gen
Classification: RISK — Premature abort
Severity: LOW
Grammar-constrained generation (especially with large function schemas and small models on CPU) can be significantly slower than free-text generation. 120s may be tight on slow hardware. The timeout fires cancelGeneration('timeout') which returns partial results, so it degrades gracefully.
- KV cache reuse with cooldown — avoids infinite retry loops when cache causes failures
- Context shift strategy eraseFirstResponseAndKeepFirstSystem — preserves system prompt during overflow
- Think-token budget from ModelProfile — prevents runaway thinking on reasoning models
- _compactHistory() — bounds JS-side chatHistory array independently of context window
- Empty response retry with KV cache disabled — targeted recovery
DEAD-1: getModelTier() exported, never imported
Classification: DEAD — Exported but never imported
Severity: LOW
getModelTier(paramSize) at line 554 returns {tier, maxToolsPerPrompt, grammarAlwaysOn, retryBudget, pruneAggression} with hardcoded thresholds. agenticChat.js imports many helpers but NOT this function — it uses llmEngine.getModelTier() instead, which derives values from ModelProfile.
This creates a maintenance hazard: the hardcoded thresholds can silently diverge from the ModelProfile-driven values without anyone noticing.
Fix: Remove the export and the function, or make it call through to getModelProfile().
RISK-10: evaluateResponse + classifyResponseFailure both detect hallucination but at different stages
Classification: RISK — Redundant detection
Severity: LOW
evaluateResponse() checks detectActionHallucination() at iteration ≤3 → ROLLBACK. classifyResponseFailure() checks the same function with no iteration limit → nudge. For iterations 1–3, a hallucination is caught by evaluateResponse (ROLLBACK). At iteration 4+, evaluateResponse commits it, then classifyResponseFailure catches it. The committed hallucination text remains in context.
RISK-11: Cascading compaction wastes work
Classification: RISK — Wasteful double-compaction
Severity: LOW
At 80% context usage, phases 1, 2, AND 3 all execute in sequence. Phase 3 re-compresses items Phase 1 already compacted. The _pruned guard prevents data corruption, but Phase 1's work is immediately overridden.
Recommendation: Use if/else if instead of cascading if blocks.
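The difference between the two control-flow shapes can be illustrated with a small sketch. Only the 80% figure comes from the audit; the 0.6 and 0.7 thresholds and both function names are placeholders, and whether Phase 2 does work that Phase 3 does not subsume would need to be checked against the real code before switching.

```javascript
// Placeholder thresholds: only 0.8 is taken from the audit text.

// Cascading `if` blocks: every phase whose threshold is met runs.
function cascadingPhases(usage) {
  const ran = [];
  if (usage >= 0.6) ran.push(1);
  if (usage >= 0.7) ran.push(2);
  if (usage >= 0.8) ran.push(3);
  return ran;
}

// if / else if: only the most aggressive applicable phase runs.
function exclusivePhase(usage) {
  if (usage >= 0.8) return 3;
  else if (usage >= 0.7) return 2;
  else if (usage >= 0.6) return 1;
  return 0; // below every threshold: no compaction
}

const before = cascadingPhases(0.85); // all three phases run in sequence
const after = exclusivePhase(0.85);   // a single phase is selected
```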
- evaluateResponse() — clean deterministic verdict system: COMMIT/ROLLBACK/SKIP
- classifyResponseFailure() — 7-type failure taxonomy with specific recovery strategies
- getProgressiveTools() — effective transition-based state machine
- isNearDuplicate() — Jaccard similarity with configurable threshold
- EXPANDED_REFUSAL_PATTERNS — 20+ patterns covering explicit refusals, passive refusals, deflections
RISK-12: Naive brace-matching quote detection
Classification: RISK — Parse failure edge case
Severity: LOW
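
    if (c === '"' && (i === 0 || blockContent[i - 1] !== '\\')) inStr = !inStr;

This check only looks at the single preceding character, so it fails whenever a quote is preceded by an even number of backslashes, for example a string value that ends in an escaped backslash: the closing quote is treated as escaped and the scanner stays in string mode. In practice, LLM-generated tool call JSON rarely contains this, so the risk is minimal.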
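A scanner that tracks escape state explicitly handles any number of consecutive backslashes. The following is a sketch of that technique, not the parser's actual code; findMatchingBrace is a hypothetical name.

```javascript
// Returns the index of the brace that closes the object opened at or after
// `start`, or -1 if the braces never balance. Braces inside strings are ignored.
function findMatchingBrace(text, start = 0) {
  let depth = 0;
  let inStr = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const c = text[i];
    if (inStr) {
      if (escaped) escaped = false;        // this char was consumed by the preceding backslash
      else if (c === '\\') escaped = true; // next char is escaped
      else if (c === '"') inStr = false;   // unescaped quote closes the string
      continue;
    }
    if (c === '"') inStr = true;
    else if (c === '{') depth++;
    else if (c === '}' && --depth === 0) return i;
  }
  return -1;
}

// JSON text whose first string value ends in an escaped backslash,
// followed by a string value that contains a brace.
const sample = '{"path":"C:\\\\","note":"}"}';
const closeIndex = findMatchingBrace(sample);
```

Because the escape flag is cleared after exactly one character, a quote following an even run of backslashes is correctly seen as unescaped.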
RISK-13: KNOWN_TOOLS duplicated from VALID_TOOLS
Classification: RISK — Maintenance burden
Severity: LOW
Method 3a's KNOWN_TOOLS array is hardcoded separately from the VALID_TOOLS set used by normalizeToolCall(). New tools added to one but not the other won't be recognized by the corresponding path.
Fix: const KNOWN_TOOLS = [...VALID_TOOLS, ...Object.keys(TOOL_NAME_ALIASES)];
RISK-14: Write deferral false positives
Classification: RISK — Unnecessary delay
Severity: LOW
When a batch contains both data-gathering AND write tools, ALL writes are deferred — even if the write uses pre-existing data, not the data being gathered. Tiny models are exempted, which is appropriate.
- Input cap at 200KB — O(n²) protection
- 5 progressive parse methods — fenced blocks → raw JSON → function syntax → plain JSON → XML tags → OpenAI wrapper
- 50+ tool name aliases — handles common model misspellings
- web_search → run_command remap — catches shell commands misrouted as web searches
- repairToolCalls() — recovers empty write_file content from code blocks
BUG-3: Custom tool sandbox escapable
Classification: BUG — Security vulnerability
Severity: MEDIUM (mitigated by local-only context)
_useCustomTool() uses new Function('sandbox', 'with(sandbox) { ... }'). The with(sandbox) pattern only shadows names in the sandbox — it does NOT prevent:
- this.constructor.constructor('return process')() → Node.js process access
- String concatenation to bypass blocklist: const p = 'pro' + 'cess'
- arguments.callee.constructor chain to Function constructor
Mitigated by: Local desktop app (user already has system access). Risk is prompt injection causing the model to create a malicious tool.
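Fix: Use the Node.js vm module with vm.createContext() so tool code no longer shares the host global scope. The Node.js documentation states that vm is not a security mechanism, so code that must be treated as untrusted should additionally run in a separate process with restricted permissions.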
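The sketch below contrasts the with(sandbox) pattern with a vm context. It assumes CommonJS; runToolCode and its api parameter are hypothetical, and the real _useCustomTool() signature is not shown in this audit. Node.js documents vm as not being a security mechanism, and any host object or function placed in api exposes the host Function constructor again through its constructor chain, so this should be read as scope isolation rather than a hardened sandbox.

```javascript
const vm = require('node:vm');

// The pattern the audit describes: `with` shadows names but `this` is still
// the host global, so the host Function constructor is reachable.
const leaky = new Function(
  'sandbox',
  'with (sandbox) { return this.constructor.constructor("return typeof process")(); }'
);
const leakedType = leaky({});

// Sketch: run tool code in a separate context whose global starts empty.
// Only primitives or data deliberately copied into `api` should be exposed.
function runToolCode(source, api = {}) {
  const context = vm.createContext(Object.assign(Object.create(null), api));
  return vm.runInContext(source, context, { timeout: 1000 }); // timeout in ms
}

const isolatedType = runToolCode('typeof process');
const computed = runToolCode('a + b', { a: 2, b: 3 });
```

The timeout option bounds synchronous runaway loops; asynchronous work started by tool code is outside its reach.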
RISK-15: editFile replaceAll instead of replace
Classification: RISK — Unintended multi-replace
Severity: MEDIUM
After matching text (across 4 flexibility tiers):

    content = content.replaceAll(oldText, newText);

If oldText appears multiple times, ALL occurrences are replaced. The count is returned, but the model typically expects a single replacement.
Fix: Use content.replace(oldText, newText) as the default, which replaces only the first occurrence when oldText is a string.
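A sketch of a single-replacement helper that also reports how many matches existed, so the caller can warn the model about ambiguity. replaceSingle is a hypothetical name and the 4-tier fuzzy matching is out of scope here. The helper splices by index because both replace() and replaceAll() interpret sequences such as $& in a replacement string, which can corrupt edits that insert code containing dollar signs.

```javascript
// Hypothetical helper: replaces only the first exact occurrence of oldText
// and reports the total number of non-overlapping matches that were found.
function replaceSingle(content, oldText, newText) {
  if (!oldText) return { content, replaced: 0, matches: 0 };
  const first = content.indexOf(oldText);
  if (first === -1) return { content, replaced: 0, matches: 0 };

  let matches = 0;
  for (let i = first; i !== -1; i = content.indexOf(oldText, i + oldText.length)) {
    matches++;
  }

  // Splice by index so `$&`-style sequences in newText are inserted literally.
  const updated = content.slice(0, first) + newText + content.slice(first + oldText.length);
  return { content: updated, replaced: 1, matches };
}

const result = replaceSingle('a=1; a=1;', 'a=1', 'a=$&2');
```

When matches is greater than 1, the tool result could ask the model to supply more surrounding context instead of guessing which occurrence was intended.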
RISK-16: deleteFile creates no backup
Classification: RISK — Data loss
Severity: LOW
_writeFile and _editFile create backups; _deleteFile does not. undo_edit cannot restore deleted files.
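A minimal sketch of copy-then-delete, assuming a directory-based backup store. deleteWithBackup and the backup naming scheme are hypothetical; the actual backup structure used by undo_edit (described elsewhere as LRU-bounded) is not shown in this audit and would need to be reused instead.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: copies the file into backupDir, then deletes it.
// copyFileSync throws if the source is missing, before anything is removed.
function deleteWithBackup(filePath, backupDir) {
  fs.mkdirSync(backupDir, { recursive: true });
  const backupPath = path.join(backupDir, `${Date.now()}-${path.basename(filePath)}`);
  fs.copyFileSync(filePath, backupPath);
  fs.unlinkSync(filePath);
  return backupPath;
}

// Usage in a throwaway temp directory.
const workDir = fs.mkdtempSync(path.join(os.tmpdir(), 'delete-backup-'));
const target = path.join(workDir, 'notes.txt');
fs.writeFileSync(target, 'data');
const backupPath = deleteWithBackup(target, path.join(workDir, 'backups'));
const targetStillExists = fs.existsSync(target);
const restoredText = fs.readFileSync(backupPath, 'utf8');
```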
RISK-17: SSRF hostname-only check
Classification: RISK — DNS rebinding
Severity: LOW
Checks hostname string before DNS resolution. A hostname resolving to 127.0.0.1 bypasses the check. Low severity for a local desktop app.
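Checking the resolved addresses rather than the hostname string closes the simple case described above. The sketch below is an assumption-laden illustration: isBlockedIPv4 and assertPublicHost are hypothetical names, the range list is not exhaustive, and IPv4-mapped IPv6 addresses are not handled. Full protection against rebinding would also require connecting to the address that was checked, since a second lookup at request time can return a different result.

```javascript
const dns = require('node:dns').promises;

// Loopback, private, link-local and "this network" IPv4 ranges (not exhaustive).
function isBlockedIPv4(ip) {
  const [a, b] = ip.split('.').map(Number);
  return a === 127 || a === 10 || a === 0 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254);
}

// Resolves the hostname and rejects if any resulting address is internal.
async function assertPublicHost(hostname) {
  const records = await dns.lookup(hostname, { all: true });
  for (const { address, family } of records) {
    const blocked = family === 4
      ? isBlockedIPv4(address)
      : address === '::1' || /^f[cd]/i.test(address) || /^fe80:/i.test(address);
    if (blocked) throw new Error(`Blocked internal address ${address} for ${hostname}`);
  }
  return records.map((r) => r.address);
}
```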
RISK-18: Shell injection in grep fallback
Classification: RISK — Command injection
Severity: LOW
Sanitizer strips common injection chars, but Windows findstr has different escaping rules. Primary path (RAG engine) avoids shell entirely.
RISK-19: Code taskType excludes web_search
Classification: RISK — Tool omission
Severity: LOW
codeTools set doesn't include web_search or coreTool. Code-type prompts don't mention web tools. "Write code to call this API" → model doesn't know about web_search/fetch_webpage.
Fix:

    } else if (taskType === 'code') {
      selectedNames = new Set([...coreTool, ...codeTools]);
    }

- Path sanitization — blocks traversal and hallucinated absolute paths
- Dangerous command blocklist — fork bombs, rm -rf /, dd, pipe-to-shell
- Result truncation — 50KB cap prevents context blowup
- Undo/backup with LRU eviction — bounded memory
- Permission gate for destructive operations
RISK-20: Unknown families get untuned defaults
Classification: RISK — Suboptimal performance
Severity: LOW
Unrecognized filenames → BASE_DEFAULTS with no family overrides. Reasonable middle-ground values, but potentially wrong for specific architectures. Mitigated by tier system and user overrides.
- 15 family profiles × 5 size tiers
- deepMerge for 3-level inheritance
- Quirks system for behavioral flags
- Pattern matching for 15+ families
- Returns 0 for unrecognized → getSizeTier(0) = 'tiny' → most conservative default
DEAD-2: Hardcoded tool names in preamble
Classification: DEAD — Stale reference risk
Severity: LOW
Preamble mentions specific tool names. If tools are renamed, preamble becomes stale. Maintenance coupling only.
The grammar lifecycle has the same counter-reset gap as BUG-1:
- consecutiveEmptyGrammarRetries increments on empty grammar responses
- At ≥2, grammar is disabled (nativeFunctions = null)
- Counter resets only on COMMIT (line 1787)
If text-mode fallback also produces ROLLBACK → counter stays ≥2 → grammar stays permanently disabled for the session.
Fix: Same as BUG-1 — reset counters at the start of each iteration.
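The counter behavior shared by BUG-1 and the grammar gap can be reproduced in isolation. The simulation below models only the retry counter, with verdicts supplied by the caller; simulateLoop and resetEachIteration are hypothetical names, and the real loop in agenticChat.js does far more per iteration.

```javascript
// Each inner array lists the verdicts evaluateResponse() would return for the
// successive attempts within one iteration. Returns the iterations in which a
// ROLLBACK verdict was accepted because no retry budget was left.
function simulateLoop(verdictsPerIteration, maxRollbackRetries, resetEachIteration) {
  let rollbackRetries = 0;
  const silentlyAccepted = [];

  verdictsPerIteration.forEach((verdicts, iteration) => {
    if (resetEachIteration) rollbackRetries = 0; // proposed fix
    for (const verdict of verdicts) {
      if (verdict === 'COMMIT') {
        rollbackRetries = 0; // the only reset in the code as described
        break;
      }
      if (rollbackRetries < maxRollbackRetries) {
        rollbackRetries++; // restore checkpoint and retry
        continue;
      }
      silentlyAccepted.push(iteration); // budget spent: bad response falls through
      break;
    }
  });
  return silentlyAccepted;
}

const verdicts = [
  ['ROLLBACK', 'ROLLBACK', 'ROLLBACK'], // iteration 0 exhausts a budget of 2
  ['ROLLBACK', 'COMMIT'],               // iteration 1 would succeed on one retry
];
const withoutFix = simulateLoop(verdicts, 2, false);
const withFix = simulateLoop(verdicts, 2, true);
```

Without the reset, iteration 1 accepts its first bad response with no retry even though a single retry would have produced a COMMIT; with the per-iteration reset, only the genuinely exhausted iteration 0 falls through.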
| ID | File | Class | Sev | Description |
|---|---|---|---|---|
| BUG-1 | agenticChat.js | BUG | HIGH | Rollback counter never resets after exhaustion |
| BUG-2 | agenticChat.js | BUG | MED | Temperature permanently lowered by rollback |
| BUG-3 | mcpToolServer.js | BUG | MED | Custom tool sandbox escapable |
| RISK-1 | agenticChat.js | RISK | MED | Duplicate task type detection diverges |
| RISK-2 | agenticChat.js | RISK | MED | Refusal threshold mismatch (≤5 vs <10) |
| RISK-7 | llmEngine.js | RISK | MED | chatHistory orphan on abort |
| RISK-15 | mcpToolServer.js | RISK | MED | editFile replaceAll instead of replace |
| RISK-3 | agenticChat.js | RISK | LOW | Redundant nonContextRetries guard |
| RISK-4 | agenticChat.js | RISK | LOW | Context budget overshooting |
| RISK-5 | agenticChat.js | RISK | LOW | Cloud deadline never enforced |
| RISK-6 | agenticChat.js | RISK | LOW | fullResponseText transient overshoot |
| RISK-8 | llmEngine.js | RISK | LOW | Context minimum 8192 on low-RAM |
| RISK-9 | llmEngine.js | RISK | LOW | 120s timeout tight for grammar gen |
| RISK-10 | agenticChatHelpers.js | RISK | LOW | Redundant hallucination detection |
| RISK-11 | agenticChatHelpers.js | RISK | LOW | Cascading compaction wastes work |
| RISK-12 | mcpToolParser.js | RISK | LOW | Naive brace-matching quote detection |
| RISK-13 | mcpToolParser.js | RISK | LOW | KNOWN_TOOLS duplicated from VALID_TOOLS |
| RISK-14 | mcpToolParser.js | RISK | LOW | Write deferral false positives |
| RISK-16 | mcpToolServer.js | RISK | LOW | deleteFile creates no backup |
| RISK-17 | mcpToolServer.js | RISK | LOW | SSRF hostname-only check |
| RISK-18 | mcpToolServer.js | RISK | LOW | Shell injection in grep fallback |
| RISK-19 | mcpToolServer.js | RISK | LOW | Code taskType excludes web_search |
| RISK-20 | modelProfiles.js | RISK | LOW | Unknown families get untuned defaults |
| DEAD-1 | agenticChatHelpers.js | DEAD | LOW | getModelTier() exported, never imported |
| DEAD-2 | constants.js | DEAD | LOW | Hardcoded tool names in preamble |
- BUG-1 + Grammar gap: Reset rollbackRetries and consecutiveEmptyGrammarRetries at the start of each iteration. Single 2-line fix restores the transactional safety system.
- BUG-2: Save context.params.temperature before rollback retry and restore on COMMIT.
- RISK-1: Extract detectTaskType() into a shared function used by both cloud and local paths.
- RISK-19: Add coreTool to the code taskType in getToolPromptForTask().
- RISK-5: Add if (Date.now() > WALL_CLOCK_DEADLINE) break; to cloud loop.
- RISK-16: Add backup before delete in _deleteFile().