Commit c78c136

DavidRajnoha and claude committed
fix(tests): prevent Chrome OOM in e2e incident lifecycle test
Split the alert polling into two phases to avoid Chrome OOM (exit code 137):

Phase 1: Poll the Thanos Querier API via cy.exec/oc to check if the alert is actually firing. This is lightweight — a single shell command per iteration with zero Chrome DOM interaction or Cypress command log growth.

Phase 2: Once the alert is confirmed firing, use the UI traversal to find the incident, but with doubled interval (2 min) and fewer max iterations. This reduces the number of heavy DOM traversals that accumulate Cypress command snapshots in Chrome memory.

Also fix poll-ci-status.py to use --repo openshift/monitoring-plugin flag so it queries the upstream repo instead of the local fork.

CI run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_monitoring-plugin/860/pull-ci-openshift-monitoring-plugin-main-e2e-incidents/2036818563108442112
Classifications: INFRA_OOM (2 consecutive runs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d5cdd03 commit c78c136
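
To make the "fewer max iterations" claim concrete, the sketch below works out the approximate upper bound on polling attempts from the interval and timeout values in the test diff. It assumes the polling helper re-checks roughly once per interval until the timeout and ignores the time each check itself takes; `max_checks` is a hypothetical helper name, not part of the commit.

```python
def max_checks(interval_ms: int, timeout_ms: int) -> int:
    """Approximate upper bound on polling attempts (ignores per-check duration)."""
    return timeout_ms // interval_ms


INTERVAL_MS = 60_000  # intervalMs in the test

# Before: one UI traversal per minute for up to 30 minutes.
before = max_checks(INTERVAL_MS, 30 * INTERVAL_MS)

# After, phase 1: lightweight cy.exec check every 30 s for up to 15 minutes.
phase1 = max_checks(30_000, 15 * 60_000)

# After, phase 2: UI traversal every 2 minutes for up to 20 minutes.
phase2 = max_checks(2 * INTERVAL_MS, 20 * INTERVAL_MS)

print(f"UI traversals before: {before}")
print(f"shell checks in phase 1: {phase1}")
print(f"UI traversals in phase 2: {phase2}")
```

Under these assumptions the heavy UI traversals drop from about 30 to about 10, while the remaining polling moves to shell commands that do not touch the browser.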

2 files changed

Lines changed: 28 additions & 7 deletions

.claude/commands/cypress/scripts/poll-ci-status.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
 def poll(pr, job_substring="e2e-incidents", max_attempts=30, interval=300):
     for attempt in range(max_attempts):
         result = subprocess.run(
-            ["gh", "pr", "checks", pr, "--json", "name,state,link"],
+            ["gh", "pr", "checks", pr, "--repo", "openshift/monitoring-plugin", "--json", "name,state,link"],
             capture_output=True,
             text=True,
         )
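
The hunk above only shows the `gh` invocation, not how its output is consumed. As a minimal sketch, assuming the command prints a JSON array of objects with the requested `name`, `state`, and `link` fields, the matching job could be picked out as follows. `find_job` is a hypothetical helper, and the sample data (including the state strings and links) is invented for illustration; the real script may handle this differently.

```python
import json
from typing import Optional


def find_job(checks_json: str, job_substring: str = "e2e-incidents") -> Optional[dict]:
    """Return the first check whose name contains job_substring, or None."""
    try:
        checks = json.loads(checks_json)
    except json.JSONDecodeError:
        # gh printed something that is not JSON (for example an error message).
        return None
    if not isinstance(checks, list):
        return None
    for check in checks:
        if job_substring in check.get("name", ""):
            return check
    return None


# Hypothetical sample shaped like `gh pr checks --json name,state,link` output.
sample_output = json.dumps([
    {"name": "ci/prow/lint", "state": "SUCCESS", "link": "https://example.invalid/lint"},
    {"name": "ci/prow/e2e-incidents", "state": "PENDING", "link": "https://example.invalid/e2e"},
])

job = find_job(sample_output)
print(job["state"] if job else "job not found")
```

Keeping the parsing separate from the `subprocess.run` call makes it possible to exercise the matching logic without network access or a GitHub token.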

web/cypress/e2e/incidents/00.coo_incidents_e2e.cy.ts

Lines changed: 27 additions & 6 deletions
@@ -41,16 +41,37 @@ describe('BVT: Incidents - e2e', { tags: ['@smoke', '@slow', '@incidents', '@e2e
     cy.log('1.1 Navigate to Incidents page and clear filters');
     incidentsPage.goTo();
     incidentsPage.clearAllFilters();
-
+
     const intervalMs = 60_000;
-    const maxMinutes = 30;
+    const maxMinutes = 30;
+
+    cy.log('1.2 Wait for alert to start firing on cluster');
+    // Phase 1: Poll the Thanos Querier API to check if the alert is actually
+    // firing. This is lightweight — a single cy.exec per iteration with no
+    // Chrome DOM interaction, preventing the OOM (exit code 137) caused by
+    // repeated heavy UI traversals accumulating Cypress command log snapshots.
+    const kubeconfigPath = Cypress.env('KUBECONFIG_PATH');
+    cy.waitUntil(
+      () => cy.exec(
+        `oc get --raw '/api/v1/namespaces/openshift-monitoring/services/thanos-querier:web/proxy/api/v1/rules?type=alert' --kubeconfig ${kubeconfigPath}`,
+        { failOnNonZeroExit: false, timeout: 20000 },
+      ).then((result) => result.code === 0 && result.stdout.includes(currentAlertName)),
+      {
+        interval: 30_000,
+        timeout: 15 * 60_000,
+        errorMsg: `Alert ${currentAlertName} not firing on cluster within 15 minutes`,
+      }
+    );
 
-    cy.log('1.2 Wait for incident with custom alert to appear');
+    cy.log('1.2.1 Wait for incident detection to pick up the firing alert');
+    // Phase 2: Alert is confirmed firing. Wait for incident detection to group
+    // it into an incident. Uses the UI traversal but with fewer iterations
+    // since incident detection typically takes 5-10 minutes after alert fires.
     cy.waitUntil(
       () => incidentsPage.findIncidentWithAlert(currentAlertName),
-      {
-        interval: intervalMs,
-        timeout: maxMinutes * intervalMs,
+      {
+        interval: 2 * intervalMs,
+        timeout: 20 * intervalMs,
       }
     );
 
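
One caveat on the Phase 1 check: a Prometheus-style `rules?type=alert` endpoint lists alerting rules in every state, so a substring match on the alert name can succeed while the rule is still `inactive` or `pending`. The sketch below shows a stricter check that parses the response and requires `state == "firing"`. It is written in Python for illustration only (the test itself is TypeScript), assumes the standard Prometheus rules response shape (`data.groups[].rules[]` with `name` and `state`), and uses a hypothetical helper name and invented sample data.

```python
import json


def alert_is_firing(rules_json: str, alert_name: str) -> bool:
    """Return True only if a rule named alert_name reports state 'firing'."""
    try:
        payload = json.loads(rules_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    # Assumed response shape: {"data": {"groups": [{"rules": [...]}]}}
    groups = (payload.get("data") or {}).get("groups") or []
    for group in groups:
        for rule in group.get("rules", []):
            if rule.get("name") == alert_name and rule.get("state") == "firing":
                return True
    return False


# Hypothetical response: the rule exists but has not started firing yet.
not_yet = json.dumps({
    "status": "success",
    "data": {"groups": [{"name": "g", "rules": [
        {"name": "CustomAlert", "state": "pending", "type": "alerting"},
    ]}]},
})
firing = not_yet.replace('"pending"', '"firing"')

print("CustomAlert" in not_yet)                   # substring match alone
print(alert_is_firing(not_yet, "CustomAlert"))    # state-aware check
print(alert_is_firing(firing, "CustomAlert"))
```

The same logic would translate to a `JSON.parse(result.stdout)` step inside the `cy.exec(...).then(...)` callback; whether the extra strictness matters depends on how quickly the custom alert moves from pending to firing on the cluster.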
