Skip to content

Fix COUNT with bracket notation in conditional expressions (Issue #59)#60

Open
McNultyyy wants to merge 11 commits into
mainfrom
fix/issue-59-count-bracket-notation
Open

Fix COUNT with bracket notation in conditional expressions (Issue #59)#60
McNultyyy wants to merge 11 commits into
mainfrom
fix/issue-59-count-bracket-notation

Conversation

@McNultyyy
Copy link
Copy Markdown
Collaborator

Summary

  • COUNT(expr) was stringifying the inner argument and handing it to JToken.SelectToken as a JSONPath. That failed with Newtonsoft.Json.JsonException: Unexpected character while parsing path for any non-trivial expression — including the bracket notation required when a field name is a Cosmos SQL reserved word (c.amount["value"]) combined with comparison or ternary operators (COUNT(c.amount["value"] > 0 ? 1 : undefined)).
  • COUNT(expr) now evaluates the inner expression per-item via EvaluateSqlExpression (mirroring how SUM/AVG already work) and counts rows whose result is defined and non-null — matching Cosmos DB semantics and preserving the existing CountField_InSelectValueObject_ExcludesNulls behaviour.
  • Adds UndefinedLiteralExpression so the SQL undefined keyword maps to the runtime undefined sentinel instead of the string "undefined" (previously indistinguishable from the literal 'undefined').

Closes #59. Version bumped to 4.0.17.

Test plan

  • New regression tests in Issue59_CountBracketNotationTests.cs covering SUM, COUNT with ternary, and combined COUNT+SUM on a reserved-word bracket-notation field
  • Full Unit suite: 7,701 / 7,701 pass on net8.0 and net10.0
  • Full Integration suite: 292 / 292 pass on net8.0
  • Pre-existing COUNT(c.score) null-exclusion behaviour preserved (CountField_InSelectValueObject_ExcludesNulls)

🤖 Generated with Claude Code

… - bump to v4.0.17

COUNT(expr) was stringifying the inner argument and passing it to
JToken.SelectToken as a JSONPath, which failed with JsonException for
any non-trivial expression — including the required double-quoted
bracket notation when a field name is a Cosmos SQL reserved word
(e.g. COUNT(c.amount["value"] > 0 ? 1 : undefined)).

Evaluate the inner expression per-item via EvaluateSqlExpression
(mirroring SUM/AVG) and count rows whose result is defined and
non-null. Add UndefinedLiteralExpression so the `undefined` keyword
maps to the runtime undefined sentinel and isn't conflated with the
string literal 'undefined'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@McNultyyy McNultyyy requested a review from lemonlion as a code owner May 14, 2026 10:50
The SDK-rewritten {payload: ...} aggregate query path in ProjectAggregateFields
still used the buggy stringified-JSONPath approach for COUNT(expr), so the
Issue #59 fix only landed in EvaluateAggregateExpression. CI's aggregate
plan caching exposed it; local debug runs happened to skip this code path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@McNultyyy
Copy link
Copy Markdown
Collaborator Author

CI follow-up:

  • Unit (net8.0/net10.0) shard 2 — my Issue 59 tests failed against a second buggy COUNT path in InMemoryContainer.ProjectAggregateFields (the SDK-rewritten {payload: ...} aggregate-query path) that I missed. Local debug runs didn't hit it; CI's Release/plan-cache state did. Routed it through the same CountDefinedResults helper in commit 2a492dc. Full Unit suite (7,701 tests) still passes in Release on net8.0.
  • integration (net8.0/net10.0), test-inmemory — three pre-existing failures on main (Query_CountIf_SimpleCondition, Query_CountIf_WithAlias, Query_CountIf_NoMatchesReturnsZero). These fail with RewrittenAggregateProjections was not an array for a value aggregate query because DefaultQueryPlanStrategy.ContainsAggregate doesn't include COUNTIF in its list, so the SDK aggregate pipeline isn't bypassed for SELECT VALUE COUNTIF(...) queries. Same failures on the last main CI run (run 25670957622). Out of scope for this PR.
  • windows-parity — also pre-existing on main.

McNultyyy and others added 9 commits May 14, 2026 12:24
Single-aggregate non-VALUE queries (e.g. SELECT SUM(c.x) AS total FROM c)
through InMemoryCosmos.Builder() hit a pre-existing SDK aggregate-pipeline
shape mismatch on CI: the handler returns the result without the {payload:
{alias: {item: value}}} envelope the pipeline expects. The pre-existing
SqlFunctionTests.SumAggregate_ReturnsCorrectSum sidesteps it by going
through InMemoryContainer directly. Switching the single-aggregate Issue 59
tests to SELECT VALUE form triggers isValueAggregateBypass in
DefaultQueryPlanStrategy, so the SDK pipeline is suppressed and the
reporter-scenario coverage (Combined_CountAndSum, which uses the issue's
exact query shape) keeps its multi-aggregate bypass coverage too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The query-plan ContainsAggregate whitelist (used to decide whether to bypass
the SDK aggregate pipeline) was missing COUNTIF, so SELECT VALUE COUNTIF(...)
queries returned a raw scalar via the bypass-only handler path while the SDK
still tried to apply its aggregate pipeline — resulting in
"RewrittenAggregateProjections was not an array for a value aggregate query"
on CI Release runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… budget

Both Linux Docker and Windows emulator integration jobs were timing out on
the very first POST /dbs/parity-validation/colls with 503 / 1007 "high demand
in this region". The HTTP listener accepts requests before the partition
services are ready for writes, so the cold-start retry was burning through
the default ~75s EmulatorRetry budget on every test that needed a container.

Two changes, both on the emulator path only (in-memory tests unaffected):

  - EmulatorSession.InitializeAsync now does a probe create+delete on a
    throwaway container after CreateDatabaseIfNotExists, using the generous
    maxRetries=60 / maxBackoffSeconds=15 budget. This pre-warms the partition
    services once per run so subsequent per-test creates hit warm services.

  - EmulatorTestFixture.CreateContainerCoreAsync now uses the same retry
    budget as the session warmup (maxRetries=30 / maxBackoffSeconds=15) so
    transient 503s mid-run are absorbed instead of failing tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adding COUNTIF to ContainsAggregate so the SDK's aggregate pipeline ran for
SELECT COUNTIF(...) AS x FROM c surfaced the underlying problem: the SDK
pipeline doesn't recognise COUNTIF and treats the response as a malformed
aggregate envelope ("Underlying object does not have an 'payload' field").

Add a narrow COUNTIF-only bypass alongside the existing GROUP BY /
multi-aggregate / VALUE-aggregate bypasses, so any query mentioning COUNTIF
short-circuits the SDK accumulator and the handler returns the result
directly. Leaves COUNT/SUM/MIN/MAX/AVG single-non-VALUE queries on the
existing path so the QueryPlan_*Aggregate_Detects* tests keep their
contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ansient

The 503/1007 fix from 18babd2 surfaced a deeper layer: freshly-created
containers can return 404 / 1013 ("Collection is not yet available for
read") for tens of seconds while partition routing settles. Tests calling
ReadItemAsync immediately after CreateContainerAsync would burn ~8 minutes
of SDK-internal retries per test before failing.

  - EmulatorSession.InitializeAsync warmup probe now does Create + Upsert +
    Read on the throwaway container so the data plane is genuinely warm
    (not just the control plane) before tests start.

  - EmulatorTestFixture.CreateContainerCoreAsync now does the same probe
    inside the retry loop — each test class only gets back a container
    once Upsert succeeds, so test bodies don't hit 1013 on their first op.

  - EmulatorRetry.IsTransient treats 404/1013 specifically as transient.
    Plain 404/0 ("resource doesn't exist") is still non-transient, so this
    cannot mask genuine missing-resource bugs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous probe wrote {id, __pk} with the partition key header set to
probeId — but the container's partition key path is whatever the test
specified (e.g. /partitionKey). The PK extracted from the document was
undefined while the header said probeId, so the emulator returned
400 / 1001 "PartitionKey extracted from document doesn't match the one
specified in the header" and every per-test create failed fast.

ProbeDataPlaneAsync now reads props.PartitionKeyPaths and constructs a
JObject with each path's top-level property set, plus a PartitionKeyBuilder
matching. Skips the probe for nested (deeper than one slash) paths — those
are rare and the SDK's own retry will absorb the warmup window for them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runs showed the Linux Docker emulator's read replica lags the write
replica by 6-7 minutes per freshly-created container. The per-test
ProbeDataPlaneAsync (which read back its own write to confirm read
availability) ate the full retry budget per test class — N test classes ×
~7 min = past the 45min CI job timeout, so no tests completed.

Net effect was worse than no probe at all: pre-existing Issue18 emulator
failures (13 tests on main) went away, but new failures in
FakeCosmosHandlerQueryAdvancedTests appeared because we never got to those
classes before timeout.

Reverting:
  - EmulatorTestFixture.CreateContainerCoreAsync goes back to plain
    CreateContainerIfNotExistsAsync with retry (no data-plane probe).
  - EmulatorSession session-level warmup probe keeps the CreateItem step
    (write replica is fast) but drops the ReadItemAsync (read replica is
    slow and won't be ready in time anyway).

This restores main's baseline emulator behaviour for the pre-existing
flaky tests, while keeping the COUNTIF whitelist fixes that this PR
genuinely improves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous bump (30 × 15s) caused each pre-existing flaky-on-main test
class to wait ~7m before failing instead of ~1m. Across 13 known-flaky
tests, that pushed the emulator-linux job past its 45m timeout, masking
the difference between PR and main and burning runner minutes.

Default budget (10 × 10s ≈ 75s) makes the flaky tests fail fast, same as
main. Session-level warmup (60 × 15s ≈ 15m) is kept — it's a one-time
per-run cost and absorbs the initial 503 storm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…budget

Previous commit's session-level write probe created and deleted a
throwaway container before tests started. On the CI Linux Docker
emulator (PARTITION_COUNT=10) and especially the Windows emulator
(PartitionCount=3), this consumed a slot that didn't fully free on
delete in time, causing Handler_MultiContainer_CrudIsolated (which
needs 2 concurrent containers) to fail with 503 — a new failure not
in main's baseline.

Reverted to the original behaviour: only CreateDatabaseIfNotExistsAsync
in session init. Per-test fixture retries (default 75s budget) absorb
the initial 503 storm, same as main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: [\"field\"] bracket notation fails inside conditional COUNT expressions (Pattern 5)

1 participant