Claude merge 3#2706
Open
dimoffon wants to merge 4577 commits into
Open
Conversation
…tartup_dummy) EXPLAIN ANALYZE of a CREATE TABLE AS / SELECT INTO that uses EXECUTE of a prepared statement crashed the coordinator: #6 intorel_startup_dummy (rel == NULL) #7 standard_ExecutorRun #8 ExplainOnePlan #9 ExplainExecuteQuery #10 ExplainOneUtility GPDB creates the CTAS target relation in intorel_initplan() (called from InitPlan, execMain.c) gated on PlannedStmt->intoClause, and leaves the DestReceiver's rStartup a near-dummy that dereferences the relation created there. The freshly-planned EXPLAIN path (ExplainOneQuery) copies the IntoClause onto the plan before running it, but ExplainExecuteQuery passed the cached plan to ExplainOnePlan without setting PlannedStmt->intoClause. intorel_initplan was therefore skipped, the DR_intorel's rel stayed NULL, and intorel_startup_dummy dereferenced it during ExecutorRun -> coordinator SIGSEGV (which drops every session in the run). Fix: in ExplainExecuteQuery set the IntoClause on the plan when into != NULL, mirroring ExplainOneQuery. The PlannedStmt belongs to the shared cached plan, so set it on a copy -- otherwise a stale IntoClause would make a later plain EXECUTE of the same statement try to create a table. Verified: pre-fix, EXPLAIN ANALYZE CREATE TABLE t AS EXECUTE p dumps a core with the backtrace above; post-fix it no longer crashes. (CREATE TABLE AS ... EXECUTE then hits a separate, pre-existing OID-dispatch error -- "oids were assigned, but not dispatched to QEs" -- that also affects plain CREATE TABLE AS EXECUTE and is out of scope for this crash fix.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…("oids were assigned, but not dispatched to QEs")
All forms of CREATE TABLE AS / SELECT INTO ... EXECUTE (plain, temp, and under
EXPLAIN ANALYZE) failed on the coordinator with:
WARNING: OID assignment not dispatched: catalog 1259 ... name "x"
ERROR: oids were assigned, but not dispatched to QEs
GPDB plans a CTAS specially: the IntoClause must reach the planner so the query
is marked PARENTSTMTTYPE_CTAS and a *distributed* plan is built -- rows are
redistributed to the target table's segments, which create and populate the
relation (consuming the OIDs the QD pre-assigned). This is threaded through the
plancache via an extra IntoClause argument to GetCachedPlan() ->
RevalidateCachedQuery()/choose_custom_plan()/BuildCachedPlan() (MPP-8135).
The PG14 plancache merge adopted upstream's GetCachedPlan() signature and turned
the GPDB intoClause argument into a hardcoded local `IntoClause *intoClause =
NULL;`, even though the function's own header comment still documents the extra
parameter and the rest of plancache.c still threads it. With it forced NULL the
EXECUTE plan was the plain gather-to-coordinator SELECT: the QD created the
table and assigned OIDs, the segments never did, and the un-dispatched OIDs
tripped the end-of-xact check in AtEOXact_DispatchOids().
Fix: restore the IntoClause parameter to GetCachedPlan() (header + definition,
dropping the dead NULL local) and pass it from the two CTAS callers,
ExecuteQuery() and ExplainExecuteQuery(); the remaining non-CTAS callers (spi.c,
postgres.c) pass NULL as before.
Verified: plain, parameterized, WITH NO DATA, and EXPLAIN ANALYZE CREATE TABLE
AS EXECUTE all succeed with the rows correctly distributed across segments
(nsegs=3); a plain EXECUTE returning rows is unaffected and the cached plan is
not corrupted. Together with the prior intorel_startup_dummy fix, CTAS-EXECUTE
works end to end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
REINDEX CONCURRENTLY corrupts the catalog on a distributed cluster. Upstream's
concurrent reindex builds a fresh index (a new "_ccnew" pg_class entry, new OID),
swaps it in under the original name, and drops the old one -- i.e. it
intentionally *changes the index OID*. In GPDB only the coordinator runs the
concurrent path; the segments reindex non-concurrently and keep the original OID.
The result is an index whose OID differs between the QD and the segments
("could not open relation with OID N (segX)" on the next index scan), plus the
transient _ccnew OIDs are never dispatched ("oids were assigned, but not
dispatched to QEs").
Properly supporting concurrent reindex in MPP requires the segments to rebuild
under the QD's new OID, which collides with PreventInTransactionBlock() and the
fact that a QE cannot run a multi-transaction concurrent reindex from within a
dispatched statement -- a larger feature effort (tracked separately).
As a safe interim, fall back to a normal (non-concurrent) reindex on the
coordinator, with a NOTICE, in ReindexIndex()/ReindexTable() (covering REINDEX
TABLE/INDEX, including partitioned, which then reindexes each child
non-concurrently via ReindexPartitions) and ReindexMultipleInternal() (covering
REINDEX DATABASE/SCHEMA). Single-node utility mode is unaffected and keeps the
working concurrent path.
Verified: REINDEX (CONCURRENTLY) TABLE/INDEX and a partitioned table all succeed
with a NOTICE; the index OID is identical on the coordinator and every segment
(no divergence) and a forced index scan returns correct results; no cores, no
OID-dispatch error.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ordered child columns
A distribution-key UPDATE on a partitioned table whose leaf partitions have a
different physical column order than the parent crashed in the executor:
pg_detoast_datum <- hashtext/hash_numeric <- cdbhash <- ExecSplitUpdate
create_splitupdate_plan() derived the SplitUpdate's hash attnos and column types
from path->resultRelation's relcache entry. For a partitioned UPDATE that is a
leaf partition, whose physical column order can differ from the parent's
(e.g. update.sql's part_c_1_100 "e,d,c,b,a"). But the SplitUpdate's input tuples
are the subplan output, which is labeled with root->processed_tlist -- i.e. the
*nominal* (parent) column layout. So the distribution-key attnos taken from the
leaf policy indexed the wrong column of the parent-layout tuple: a key like "a"
at leaf attno 5 selected the parent's attno-5 column ("d", an int), which cdbhash
then fed to hashtext, dereferencing the small int as a varlena -> SIGSEGV. (The
mismatch also tripped the type Assert in the insertColIdx loop, which is compiled
out in non-cassert builds.)
Fix: take resultDesc and cdbpolicy from the nominal target relation
(root->parse->resultRelation), which is the layout the subplan tuples are in, so
insertColIdx, hashAttnos and hashFuncs all line up with the SplitUpdate input.
For a non-partitioned UPDATE this is the same relation as before.
Verified with update.sql's reordered-column partitions and assorted
text/numeric/multi-column distribution-key updates (including cross-partition row
movement): the hash now reads the correct key column and there is no crash; the
distribution key routes correctly and results are unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ld columns
Follow-on to the SplitUpdate cdbhash fix. A distribution-key UPDATE of a
partitioned table whose leaf partitions have a different physical column order
than the parent failed during the reinsert with:
ERROR: table row type and query-specified row type do not match
(ExecCheckPlanOutput, nodeModifyTable.c)
The SplitUpdate emits the new tuple in the root (nominal) target relation's
column layout (the subplan is labeled with root->processed_tlist), but the
DML_INSERT replay built the insert projection against the *source leaf*
partition's ResultRelInfo, whose descriptor can differ -- so ExecCheckPlanOutput
rejected the layout. (Before the cdbhash fix this path crashed earlier and was
never reached.)
Fix: build the split-update insert projection against the root result relation
(mtstate->rootResultRelInfo), so it matches the subplan output, and set up
partition tuple routing for split updates -- so ExecInsert() routes the new
tuple to the correct leaf, converting the layout via ri_RootToPartitionMap, and
enforces the partition constraint. For a non-partitioned target
rootResultRelInfo is the result relation itself, so this is a no-op there.
Verified: every split-update shape (heap/AO/AOCS, toasted, dropped-column,
multi-column and text/numeric distribution keys, UPDATE ... FROM, and
partitioned) inserts/redistributes correctly with no crash; the update
regression test no longer errors with a row-type mismatch (a partition-key
update that must stay within its subtree now correctly raises a partition
constraint violation -- naming the UPDATE target relation).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ResetTempNamespace() (reached from AbortTransaction -> ResetAllGangs during
primary gang-loss recovery) unconditionally cancels the temp-namespace
before_shmem_exit callback:
cancel_before_shmem_exit(RemoveTempRelationsCallback, 0);
Before PG14, cancel_before_shmem_exit() silently did nothing when the callback
was not the latest entry. Upstream commit c9ae5cb made it raise an error in
that case, and the PG14 merge adopted the strict version. But ResetTempNamespace
can legitimately reach this with the callback absent (temp namespace created but
not yet committed, so AtEOXact_Namespace never registered it) or no longer the
latest entry. Thrown from inside AbortTransaction(), that error escalates to a
coordinator PANIC -- turning any recoverable gang loss into an all-sessions-down
crash, which then cascades into hundreds of "protocol synchronization was lost"
and "transaction is aborted" failures across the suite.
Add a non-throwing cancel_before_shmem_exit_if_latest() (the pre-PG14 lenient
behavior: remove only if it is the latest entry, else return false) and use it
in ResetTempNamespace(). Leaving the callback registered is harmless --
RemoveTempRelationsCallback() no-ops once myTempNamespace is reset just below.
The strict cancel_before_shmem_exit() is kept for its correct-LIFO callers
(PG_ENSURE_ERROR_CLEANUP).
Verified: a full installcheck-good run that previously dumped a
cancel_before_shmem_exit coordinator-PANIC core now completes the same span with
zero cores.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…limit" (PG14)
A trivial COPY ... FROM stdin (or any COPY FROM) failed on every segment with
"invalid message length" -> "terminating connection because protocol
synchronization was lost", which the QD reported as "MPP detected N segment
failures, system is reconnected". Because COPY loads the data in most
regression tests, this poisoned a large swath of the suite (231
protocol-synchronization-lost FATALs in a full run). The same bug independently
broke nextval/sequences over MPP ("nextval: unable to parse nextval response
from QD", ~291 hits).
Root cause: GPDB multiplexes non-query messages over a libpq connection and
reads them with pq_getmessage(buf, 0), using a maxlen of 0 to mean "no upper
limit" -- COPY data forwarded QD->QE (copy.c) and the nextval-over-NOTIFY
response (sequence.c). GPDB's pq_getmessage encoded that as
if (len < 4 || (maxlen > 0 && len > maxlen))
The PG14 merge adopted upstream's length check verbatim
if (len < 4 || len > maxlen)
dropping the "maxlen > 0 &&" guard. With maxlen == 0 every message with len > 0
is rejected as "invalid message length", so the QE tears down the connection on
the first CopyData byte.
Diagnosis: wire-probes confirmed the QD frames CopyData correctly (type 'd',
len = 4 + nbytes, outBuffer_shared = 0) and the QE reads the correct len but
with maxlen = 0 -- so the comparison, not the data, was wrong.
Fix: restore the GPDB "maxlen > 0 &&" guard and document the maxlen == 0
convention so a future merge does not drop it again.
Verified on a live cluster: COPY FROM stdin, COPY FROM file, nextval(), and
SERIAL inserts (segment nextval) all succeed with correct row counts; no
"invalid message length" / protocol-synchronization-lost.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
COPY FROM into any partitioned (or sub-partitioned) table crashed every segment
with a NULL-pointer SIGSEGV in ExecInitPartitionInfo (reached via
CopyFrom -> ExecFindPartition), reported on the QD as "MPP detected N segment
failures, system is reconnected". This surfaced broadly once COPY data delivery
was fixed (see the pq_getmessage maxlen fix), because pg_regress loads much of
its data via COPY into partitioned tables.
Root cause: GPDB's COPY implementation lives in commands/copy.c (the upstream
commands/copyfrom.c is a stub here). copy.c builds its ResultRelInfo with
InitResultRelInfo() directly -- it never calls ExecInitResultRelation(), so
estate->es_result_relations stays NULL -- yet it initialized the
ModifyTableState with
mtstate->resultRelInfo = estate->es_result_relations; /* NULL */
and never set mt_nrels or rootResultRelInfo. The PG14 tuple-routing rework made
ExecInitPartitionInfo() dereference mtstate->resultRelInfo[0]
(.ri_RangeTableIndex / .ri_RelationDesc) and mtstate->rootResultRelInfo, so the
NULL resultRelInfo crashed at the first partition routed to. (copyfrom.c's
CopyFrom sets these three fields correctly; copy.c was not updated in the merge.)
Fix: set mt_nrels = 1 and point both resultRelInfo and rootResultRelInfo at the
COPY target's ResultRelInfo (a valid one-element array), mirroring copyfrom.c.
Verified: COPY FROM stdin into range-partitioned and range-sub-partitioned
tables inserts the correct row counts with no crash.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… "see"'
~46 internal SQL functions (col_description, obj_description, shobj_description,
and many overloaded date/time and system helpers) carry the placeholder body
prosrc => 'see system_functions.sql' in pg_proc.dat; their real bodies live in
src/backend/catalog/system_functions.sql, which initdb is supposed to run after
bootstrap. The PG merge dropped system_functions.sql from BOTH initdb.c and the
catalog Makefile's install list, so:
- the file was never installed to $(datadir), and
- initdb never executed it.
The placeholder bodies therefore survived, and every call to one of these
functions tried to execute the literal text "see system_functions.sql", failing
with: ERROR: syntax error at or near "see". Because col_description() is
invoked by psql's \d+, this poisoned a huge fraction of the regression suite
(439 "see" syntax errors plus the bulk of the downstream "current transaction is
aborted" / "relation does not exist" cascade).
Fix:
- initdb.c: declare system_functions_file, set_input/check_input it, add a
generic setup_run_file() helper, and run system_functions.sql right after
setup_auth() and before setup_depend() (so the functions are pinned),
matching upstream ordering.
- catalog/Makefile: install (and uninstall) system_functions.sql alongside
system_views.sql.
Verified: a fresh initdb succeeds; col_description()/obj_description() get real
bodies (PG14 stores them in prosqlbody) and psql \d+ shows column descriptions
with no syntax error.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CREATE VIEW (and any utility statement that dispatches a Query tree, e.g. CREATE TABLE AS) over a join failed on every segment with: ERROR: could not deserialize unrecognized node type: 3 (readfast.c) PG14 (commit 055fee7) added the JoinExpr.join_using_alias field. The merge added it to the text writer (_outJoinExpr in outfuncs.c) and the reader (_readJoinExpr in readfuncs.c, used for binary too), but NOT to the binary writer _outJoinExpr in outfast.c, which is the one used for QD->QE dispatch. So the wire layout written for a JoinExpr was one NODE field short of what the reader expected; the reader went off the rails and hit a garbage tag (3 = T_ProjectionInfo, a run-time node that is never serialized) -> deserialize error on the segment. Fix: write join_using_alias between usingClause and quals in outfast.c's _outJoinExpr, matching the reader and the text writer. Class of bug: a PG14-added node field re-grafted into the shared text/read funcs but missed in the separate binary writer (outfast.c). Found by probing readNodeBinary() to dump the QE-side tag stream, then diffing each outfast.c binary writer against its reader. Verified: CREATE VIEW over CROSS/INNER/NATURAL joins and selecting from the views all succeed, no segment crash. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…G14) PG14 added Result Cache (later renamed Memoize), gated by enable_resultcache (default on). GPDB has not integrated it with the MPP planner/executor: a generated T_ResultCache plan node is not handled by expression_tree_mutator() (nodeFuncs.c) and is also absent from the binary plan-dispatch serialization (outfast.c/readfast.c). So whenever the planner chose a Result Cache -- e.g. the information_schema.columns is_updatable computation under enable_mergejoin/enable_nestloop -- the query failed with: ERROR: unrecognized node type: 53 (53 = T_ResultCache) This hit a broad set of join-heavy tests (updatable_views, returning, rowtypes, join, indexjoin, subselect, partition_prune, gin, ...). Until Result Cache is properly supported in MPP, default enable_resultcache to off so the planner never generates the node. GPDB's expected outputs never show a Result Cache node, so this matches them; the standalone resultcache regression test is not in any schedule, and aggregates.sql already sets it off explicitly for its one relevant query. Verified: with the new default, the minimal repro (CREATE VIEW + information_ schema is_updatable under merge/nestloop) and the updatable_views / returning / rowtypes regression tests no longer raise "unrecognized node type: 53". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…heir C helpers (PG14) create_function_0 (input/create_function_0.source) defines the C helper functions binary_coercible(oid,oid), test_enc_conversion(...) and test_opclass_options_func(internal) from regress.so, which type_sanity, opr_sanity and conversion depend on. The merge left create_function_0 out of parallel_schedule entirely (only create_function_1/2/3 are listed), so those helpers were never created and the dependent tests failed with "function binary_coercible(oid, oid) does not exist" / "function test_enc_conversion(...) does not exist" (~40 hits, plus the downstream cascade in opr_sanity/type_sanity). Add create_function_0 to the schedule right after test_setup (before any test that uses the helpers). It has its own expected output (create_function_0.out). Verified (focused schedule test_setup -> create_function_0 -> opr_sanity conversion): create_function_0 and conversion now pass and the "function ... does not exist" errors are gone (opr_sanity still differs for unrelated reasons). NB create_function_0 also creates trigger functions from contrib/spi's refint.so/autoinc.so, which must be installed (make -C contrib/spi install) -- a build/install step, not part of this source change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A correlated subquery in the SELECT targetlist over a distributed table ran without its correlation filter, returning wrong results: SELECT a, (SELECT count(*) FROM t2 WHERE t2.b = t1.a) FROM t1; returned the *total* count for every row instead of the per-row correlated count, and a non-aggregate correlated subquery raised "more than one row returned by a subquery used as an expression". GPDB routes a correlated subquery's param filter (e.g. t2.b = $0) to the rel's upperrestrictinfo; bring_to_outer_query() then applies it via a Result node above the Broadcast Motion, so each segment filters the broadcast set by its local param. That Result is built by create_projection_path_with_quals() with the filter in cdb_restrict_clauses. Two code paths silently dropped the filter: 1. When the subpath is projection-capable (a Motion is), the function took the no-Result shortcut (dummypp = true) and never stored cdb_restrict_clauses, discarding the filter. Don't take the shortcut when there are restrict clauses to apply. 2. When a later projection (the scan/join target) was layered over the filter-carrying ProjectionPath, the path-collapsing code stripped the inner ProjectionPath and discarded its cdb_restrict_clauses. Carry them up into the surviving ProjectionPath instead. With both fixed, create_projection_plan() emits a Result with plan->qual = the param filter, and correlated targetlist subqueries return correct results. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…G14) A correlated EXISTS in the SELECT targetlist over a distributed table failed with: ERROR: subplan is missing Flow information SELECT a, EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a) FROM t1; For a simple EXISTS, make_subplan() additionally builds a hashed ANY variant and wraps both in an AlternativeSubPlan, leaving setrefs.c to choose. GPDB's MPP slice machinery does not support AlternativeSubPlan: the hashed plan is created without Flow/slice information (so cdbllize raises "subplan is missing Flow information"), and cdbllize cannot reason about an AlternativeSubPlan when pruning unused subplans. In dispatch (MPP) mode, skip building the hashed alternative and keep just the correlated SubPlan we already built. It is correct on its own -- its correlation filter is applied above the Motion (see the companion fix in create_projection_path_with_quals) -- so EXISTS in the targetlist now returns correct results. The hashed alternative is still considered for non-MPP (utility-mode) planning. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PG14 reworked UPDATE so ModifyTable carries updateColnosLists -- one update_colnos list per result relation, mapping each non-junk column produced by the subplan to its target-table attribute number. The executor's ExecInitUpdateProjection() does list_nth(node->updateColnosLists, whichrel) unconditionally for CMD_UPDATE. The Postgres planner sets this in grouping_planner(), but ORCA's DXL->Plan translator (TranslateDXLDml) never did, so every UPDATE planned by ORCA -- the default optimizer -- produced a ModifyTable with updateColnosLists == NIL. On the segment, list_nth(NIL, 0) dereferences NULL -> SIGSEGV, killing the QE for any distributed UPDATE. ORCA emits a full new tuple: the subplan's non-junk entries are the table columns in physical order (the Result node coerces to the exact physical table layout, including dropped columns as NULLs), so the mapping is simply each non-junk entry's resno. Build that list and attach it as the single per-result-relation entry. Verified under ORCA: simple distributed UPDATE, split UPDATE (modifying the distribution key), and append-optimized UPDATE all succeed with correct results; DELETE is unaffected. Partitioned ORCA UPDATE now gets past this crash and surfaces a separate, pre-existing ORCA partition-routing issue (tracked independently). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o (PG14)
PG14 moved aggregate deduplication into the planner: preprocess_aggrefs()
assigns each Aggref an aggno (index into the executor's per-agg result
array) and an aggtransno (index into the per-transition-state array),
and ExecInitAgg() sizes those arrays from the maxima and reads each
aggregate's value from aggvalues[aggref->aggno].
ORCA plans never pass through preprocess_aggrefs(), and the DXL->Plan
translator left both fields at their MakeNode() default of 0. Every
aggregate in an Agg node therefore shared per-agg slot 0 and transition
state 0: all aggregates in a query returned the first aggregate's
result. E.g. under optimizer=on,
SELECT min(q1), min(q2), max(q1), count(*) FROM agg_t
returned min(q1) four times. This silently corrupted any ORCA query
with more than one aggregate (and count(DISTINCT) alongside count()),
breaking dozens of regress tests via wrong results rather than errors.
Fix: in TranslateDXLAgg(), after the targetlist and qual are final,
collect all Aggrefs (including ones nested in expressions and SubPlan
args) and number them densely: aggno = aggtransno = 0..N-1. Dense
numbering matters because finalize_aggregates() walks every slot up to
the maximum. An instance referenced more than once keeps its number;
upstream's shared-transition-state optimization is not attempted.
Verified under ORCA: multi-agg min/max/count, count(DISTINCT) beside
count(), aggregates in HAVING quals, and aggregates nested in arithmetic
all return correct results; distributed UPDATE still works.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…type (PG14)
PG14 made exprType() of a SubscriptingRef read the new refrestype field
(commit c7aba7c14e5) instead of deriving the result type from
refelemtype/refcontainertype. ORCA's DXL->Scalar translator
(TranslateDXLScalarArrayRefToScalar) still filled only refcontainertype,
refelemtype, refcollid and reftypmod, leaving refrestype at 0, so any
expression above a subscript under ORCA failed with
ERROR: cache lookup failed for type 0
This broke every query using subscripting (arrays, point[i],
positions[i] over unnest(tsvector), ...) with the GPORCA optimizer
enabled: point, tstypes, arrays, geometry, insert, updatable_views,
domain, tuplesort and more regress tests.
The producer side (TranslateArrayRefToDXL) already computes the result
type with upstream semantics -- element type for a single-element fetch,
container type for slices and assignments -- and stores it in the DXL
operator; the consumer just never read it back. Set refrestype from
the DXL operator's ReturnTypeMDid.
Verified under ORCA: point[0] fetch, positions[1] over unnest(tsvector),
int[] slice v[2:3], and subscripted SET v[2]=x all work and return
correct types/values.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…G14) ORCA's DML plan for a partitioned target goes through a dynamic scan against the partition root, with ModifyTable.resultRelations naming only the root; finding the leaf a tuple belongs to relied on ModifyTable.forceTupleRouting, whose executor consumer was removed during the PG14 nodeModifyTable rework (b04e559) -- PG14 routes inherited updates via per-leaf result relations and "tableoid" junk columns instead, which ORCA's plans don't provide. An in-place UPDATE/DELETE therefore tried to modify the storage-less partition root: ERROR: could not open file "pg_tblspc/0/GPDB_8_.../0/0" (catcache, qp_dropped_cols and the wider could-not-open-file failure cluster; split updates that modify the distribution key survived only because their INSERT half goes through the partitioned-INSERT routing.) Restore the pre-#14129 guards in TranslateUpdateQueryToDXL and TranslateDeleteQueryToDXL so ORCA raises ExmiQuery2DXLUnsupportedFeature and falls back to the Postgres planner, which handles partitioned UPDATE/DELETE correctly on PG14. INSERT stays with ORCA. Revisit by porting per-tuple leaf routing onto the PG14 executor model. Verified under optimizer=on: partitioned non-key UPDATE, partition-key (cross-partition) UPDATE, partitioned DELETE all correct; UPDATE routing a NULL partition key reports the proper "no partition found" error instead of the storage error. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… dispatched" (PG14) Since PG13 (commit 5028981), CREATE TABLE (LIKE ... INCLUDING INDEXES) defers index creation: transformCreateStmt leaves a TableLikeClause in the statement list, and ProcessUtilitySlow expands it later via expandTableLikeClause() into IndexStmts marked transformed=true. The upstream T_IndexStmt path maps transformed -> is_alter_table=true ("treat it like ALTER TABLE ADD INDEX"), and GPDB's DefineIndex() suppresses its QE dispatch when is_alter_table on the assumption that an enclosing ALTER TABLE is dispatched as a whole. For the LIKE path there is no enclosing command: the CreateStmt was already dispatched with its own oids, and the cloned IndexStmt was executed only on the QD. The index oids preassigned there were never sent ("ERROR: oids were assigned, but not dispatched to QEs") and the index was missing on the segments. This broke CREATE TABLE LIKE INCLUDING INDEXES/ALL across alter_table, partition1, partition_storage, index_constraint_naming*, and bfv_index. Dispatch the transformed IndexStmt explicitly from ProcessUtilitySlow's T_IndexStmt case, mirroring DefineIndex's own dispatch (same flags, name pinned via stmt->idxname, oldNode cleared, preassigned oids attached). The QE re-executes it with is_alter_table=true and consumes the dispatched oids. Verified: CREATE TABLE (LIKE src INCLUDING ALL) succeeds, the pkey index exists on the QD and on every segment, and the unique constraint is enforced segment-side (duplicate key rejected). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…R_READY (PG14) PG14 added CAC_NOTCONSISTENT (pmState == PM_RECOVERY) to reject connections to a hot-standby that has not reached consistency. The merge placed that branch in canAcceptConnections() above GPDB's GetMirrorReadyFlag() -> CAC_MIRROR_READY check. A GPDB mirror runs with hot_standby off and stays in PM_RECOVERY for its whole life, so the mirror-ready branch became dead code: every connection to a mirror was answered with "the database system is not accepting connections / Hot standby mode is disabled." CAC_NOTCONSISTENT has no FTS exemption in ProcessStartupPacket (only CAC_STARTUP and CAC_MIRROR_READY do), so the FTS probe process could not connect to any mirror at all: probes ended in "FTS double fault detected", promotion requests never reached the mirror (catalog flipped to role=p while the segment kept running as a standby -- standby.signal in place, walreceiver streaming), and gprecoverseg failed because it could not read the version string from the CAC_MIRROR_READY error. Every mirror failover wedged the cluster unrecoverably, and the FTS regress tests (fts_error, fts_recovery_in_progress, ...) hung. Return CAC_MIRROR_READY before the CAC_NOTCONSISTENT branch when the walreceiver has been launched and hot_standby is off (a GPDB mirror). Genuine hot-standby servers that merely have not reached consistency keep the upstream fail-fast CAC_NOTCONSISTENT behavior. Verified end-to-end: direct connection to a mirror reports the version-bearing mirror-ready error; killing a primary now leads to FTS truly promoting the mirror (standby.signal removed, segment accepts queries, QD queries work across the failover); gprecoverseg -a and -ar restore and rebalance the cluster. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gprecoverseg incremental recovery runs
pg_rewind --write-recovery-conf --slot="internal_wal_replication_slot" ...
The PG14 merge took upstream pg_rewind wholesale (46f49ad), dropping
the GPDB-added --slot option (eacc688), so every incremental
recovery failed with "unrecognized option '--slot=...'" and left the
downed segment unrecovered.
Re-add -S/--slot on top of the PG14 implementation: upstream's
GenerateRecoveryConfig() (shared with pg_basebackup) already takes a
replication-slot argument and emits primary_slot_name; pass the option
through at both -R call sites and reject --slot without
--write-recovery-conf, as before.
Verified: gprecoverseg -a incremental recovery succeeds ("Segments
successfully recovered", mirror back in sync) and gprecoverseg -ar
rebalances to preferred roles.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…(PG14)
GPDB append-optimized tables cannot fetch the old tuple by TID, so an
UPDATE plan over them must emit the full new tuple:
preprocess_targetlist() expands the targetlist when the target relation
is AO, and ExecModifyTable leaves the old slot empty for AO result
relations on the strength of that contract.
With PG14 native partitioning the target of a partitioned UPDATE is the
storage-less root, for which RelationIsAppendOptimized() is always
false, so the expansion never happened when only the leaves are AO.
The per-leaf update projection then referenced old-tuple columns the AO
leaf could not provide, failing with
ERROR: getsomeattrs is not required to be called on a virtual tuple table slot
across alter_table_aocs*, expand_table_ao*, alter_ao_part_tables*,
alter_ao_part_exch* (10 regress tests).
Add rel_has_appendoptimized_partition(): for a partitioned target, scan
its inheritors and force the expansion when any of them uses an AO
access method. Split updates already expanded; pure-heap partitioned
updates keep the upstream narrow targetlist.
Verified: AO-row and AOCS partitioned UPDATEs return correct results;
heap partitioned UPDATE unaffected.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaying a 2PC DROP TABLESPACE (COMMIT_PREPARED with GPDB's
tablespace_oid_to_delete_on_commit) on a mirror could die with
FATAL: could not open directory "<location>/<dbid>": No such file or directory
CONTEXT: WAL redo at ... for Transaction/COMMIT_PREPARED ...
destroy_tablespace_directories() downgrades its own errors to LOG under
redo, but the directory_is_empty() check on the symlink target uses
ReadDir at ERROR, which the startup process escalates to FATAL -- so a
vanished/unreadable target directory took the whole mirror down over
disk space we merely failed to release. FTS then marked the mirror
down and pg_regress aborted the suite (temp_tablespaces /
alter_db_set_tablespace window).
Add directory_is_empty_ext() with a caller-chosen elevel and use it in
the redo path (LOG); an unreadable directory counts as empty and the
subsequent rmdir's LOG reports the leftover.
Verified: create tablespace -> create/insert/drop table -> drop
tablespace replays cleanly; all mirrors stay up and in sync.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…g group (PG14)
A merge artifact in ExplainOnePlan left a dangling
"if (es->summary && (planduration || bufusage))" glued onto the
query-identifier condition, plus a GPDB6-leftover second buffer-usage
block ending in an ExplainCloseGroup("Planning") with no matching open.
Whenever ANALYZE ran without BUFFERS, that stray close popped the
"Query" group early: every key after "Planning Time" (Triggers, Slice
statistics, Execution Time) was emitted outside the object, producing
structurally invalid JSON ("Expected , or ] but found :"). Text format
hid it because group closes are no-ops there; explain, explain_format,
gin and join_hash failed on it.
Restore the upstream PG14 shape (queryId, then the Planning group
wrapping only planning buffer usage, then Planning Time), keeping
GPDB's slice-table print, and drop the duplicate buffer block.
Verified: EXPLAIN (FORMAT JSON, ANALYZE) and (FORMAT JSON, ANALYZE,
BUFFERS) both parse with json.loads.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The expanded append-optimized UPDATE targetlist kept NULL placeholders
for dropped columns (resno == attno), but PG14's
ExecBuildUpdateProjection() pairs every non-junk subplan column with an
update_colnos target and rejects dropped target columns:
ERROR: table row type and query-specified row type do not match
DETAIL: Query provides a value for a dropped column at ordinal position N.
broke every AO/AOCS UPDATE on a table with a dropped column
(alter_table_gp, drop_column_update, alter_table_analyze,
alter_ao_table_col_ddl_*, uao_allalter_*).
For a plain (non-split) AO update, strip the dropped-column
placeholders after expansion and renumber the resnos; the executor sets
dropped columns of the new tuple to NULL itself and, with every live
column assigned, never reads the old-tuple slot that AO cannot
populate. A Split Update keeps the full physical row: it runs as
delete+insert and never builds the update projection.
Also includes rel_has_appendoptimized_partition() interplay: partitioned
AO targets take the same path.
Verified: AO and AOCS dropped-column UPDATEs return correct results.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaying Database/CREATE (ALTER DATABASE SET TABLESPACE, movedb) on a
mirror copies a live database directory while the checkpointer can
unlink files of dropped relations at restartpoints. copy_file()/lstat
then died with
FATAL: could not open file ".../<relfilenode>_fsm": No such file or directory
CONTEXT: WAL redo ... Database/CREATE: copy dir ...
killing the startup process and downing the mirror (alter_db_set_tablespace
aborted the whole regress suite this way). The primary's copy simply
never saw those files.
Skip ENOENT sources with a LOG during recovery (InRecovery) in both the
directory scan and the file copy; normal execution still errors.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
After an incremental (pg_rewind) recovery a mirror replays from before a CREATE TABLESPACE whose location directory has since been removed from disk (regression tests drop the tablespace and clean up the directory). create_tablespace_directories() then FATALed the startup process with "directory does not exist", leaving the mirror permanently unrecoverable short of a full rebuild. During recovery, create the missing location with pg_mkdir_p() and press on -- the same philosophy TablespaceCreateDbspace() documents for replaying into dropped tablespaces. Normal execution still errors. Also includes the directory_is_empty_ext() redo hardening in the drop path from the previous commit series. Verified: a mirror rewound to before CREATE TABLESPACE replays through create/use/drop of tablespaces and returns to sync; "creating missing directory ... during replay" appears in its log. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two problems in TranslateDXLDml on tables with dropped columns:
1. A plain (non-split) UPDATE padded the subplan target list with NULL
placeholders for dropped columns and listed their attnos in
updateColnosLists; PG14's ExecBuildUpdateProjection() rejects
assignments to dropped columns ("Query provides a value for a
dropped column"). Skip the padding for non-split updates and build
updateColnosLists from the live columns' attribute numbers; the
executor nulls dropped columns of the new tuple itself.
2. A Split Update (delete+insert, distribution key change) silently
CORRUPTED rows: the insert half wrote misaligned values (a SET a=...
update lost the other columns' values). Until the PG14 insert path
understands ORCA's padded rows, raise unsupported and fall back to
the Postgres planner, which handles it correctly.
Verified under optimizer=on: AO and heap dropped-column UPDATEs return
correct results; a distribution-key UPDATE on a dropped-column table
falls back and preserves all column values.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A mis-merged brace in heap_create_with_catalog() attached the outer
"relkind has no rowtype" else-branch to the inner GPDB
"skip array type for AO relations" if-statement. For every AO/AOCS
relation the composite type was created (pg_type row present, typrelid
correct) and then new_type_oid was reset to InvalidOid, so the pg_class
tuple was written with reltype = 0.
Fallout: RenameRelationInternal() skips RenameTypeInternal() when
reltype is invalid, so renaming an AO table left its rowtype under the
old name. ALTER TABLE EXCHANGE PARTITION decomposes into a three-way
rename and collided with the stale type ("type <partition> already
exists"), breaking 15 regress tests (partition, partition1,
distributed_transactions, alter_table_ao*, alter_ao_part_*,
column_compression, oid_consistency, portals_updatable); anything
consulting an AO relation's rowtype (whole-row Vars, "relation does not
have a composite type") misbehaved too.
Move the else to the outer if, where upstream has it. Newly created
clusters/tables get correct catalogs; existing AO tables keep reltype=0
until recreated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nodes (PG14)
PG14's CREATE FUNCTION/PROCEDURE ... BEGIN ATOMIC / RETURN stores the
body in CreateFunctionStmt.sql_body. The field was copied and compared
(copyfuncs/equalfuncs) but never serialized, so the QD dispatched the
statement without a body and QEs failed with "no function body
specified" (create_procedure, create_function_3).
Dispatching the raw body surfaced further binary-serialization gaps:
- ReturnStmt and RawStmt had no readers at all; add _readReturnStmt /
_readRawStmt and wire both node types into the outfast/readfast
switches.
- _readSelectStmt did not read the PG14 groupDistinct bool that
_outSelectStmt writes, desynchronizing the stream one byte
("could not deserialize unrecognized node type: <garbage>").
- ParamRef had a writer but no reader, breaking RETURN $1-style bodies.
Verified: BEGIN ATOMIC procedure inserts through dispatch, RETURN $1*2
function evaluates on segments, GROUP BY DISTINCT round-trips.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…n struct)
The PG15 merge backend was built --disable-orca, so the ORCA translator
(src/backend/gpopt) had never been compiled against the PG15-merged node
structures. Enabling ORCA surfaces two PG15 node-API changes:
1. Value node removed (split into Integer/Float/String/Boolean/BitString;
T_Null gone). gpdb::MakeStringValue/MakeIntegerValue now return Node*
(makeString->String*, makeInteger->Integer*); the column-name list
callers take Node*/String* (they feed LAppend or strVal()); the CTAS
storage-option sentinel uses T_Invalid instead of T_Null (only stored,
read solely under !is_null, so the value is immaterial).
2. SeqScan is now its own struct embedding Scan (was a typedef of Scan).
seq_scan->scanrelid/plan become ->scan.scanrelid/->scan.plan; the
GGDB DynamicSeqScan (embeds SeqScan) becomes ->seqscan.scan.{scanrelid,plan}.
Also fixed the stale optimizer/plan/objfiles.txt that omitted orca.o (built
only when enable_orca=yes) -> undefined reference to optimize_query at link;
regenerated. ORCA now builds, links, and produces GPORCA plans on PG15
(verified: a user-table group-by explains as "Pivotal Optimizer (GPORCA)").
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With ORCA now building/running on PG15, the optimizer=on regress matrix failed 68/202 (full parallel) — but ~25 were segment memory-pressure cascades (trivial queries OOMing under parallel ORCA load); the true set is 43 (MAX_CONNECTIONS=4). Triage (success->error gate + comparing ORCA results to the green planner base and the prior _optimizer.out): - 36 are stale _optimizer.out predating the PG15 test-suite reorg (the base .out files were regenerated for PG15 but the _optimizer.out were not, e.g. create_index missing \getenv abs_srcdir / the inline slow_emp4000 block). - The handful with extra ORCA errors are longstanding ORCA/planner divergences, not regressions or wrong results: "cannot display a value of type anycompatible" and "could not identify a hash function for type money" were already in the prior _optimizer.out; "UNIQUE and DISTRIBUTED RANDOMLY are incompatible" comes from ORCA defaulting a CTAS-without-DISTRIBUTED-BY to a NULL/random policy where the planner picks the first column (the test that hits it, with's `m` table, is new in PG15 and uses no DISTRIBUTED BY, so there is no PG14-ORCA baseline showing otherwise) — all clean errors, no crashes, no wrong data rows. Regenerated and verified green (4-of-202 in a MAX_CONNECTIONS=6 re-run, the 4 being flaky/pressure: portals = fork-memory pressure; with/select_parallel/gist = row-order / EXPLAIN-ANALYZE per-segment actual-rows flutter, same class as the planner flaky tail). Those 4 are left for separate stabilization. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tion + cost-tie) Running the optimizer=on matrix many times (for the ORCA bring-up) exposed a flaky tail whose common root cause is that ORCA gives a table created without an explicit DISTRIBUTED BY a NULL/random policy (the Postgres planner instead picks the first column). Randomly-distributed data has non-deterministic per-segment placement, which destabilizes per-segment EXPLAIN ANALYZE counts, unordered SELECT row order, and LIMIT-over-ties. Fixed at the source so both matrices are deterministic: - test_setup: CREATE TABLE tenk2 ... DISTRIBUTED BY (unique1) (matches tenk1; was random under ORCA -> select_parallel tenk2 actual-rows flutter, and a redistribute motion that vanishes once co-located). - with: ORDER BY 1 on the SELECT * FROM y / yy result-checks (y is DISTRIBUTED RANDOMLY; atmsort does not sort these GP_IGNORE blocks). - gist: add a point-coordinate tiebreaker to "order by circle(p,1) <-> point(0,0) limit 1" — point(0,0) is inside many points' unit circles so the lossy distance ties at 0, and gist_tbl has no hash-distributable column. - select_parallel: the merge-join's tenk1 input is a cost-tie between an Index Only Scan and a Seq Scan + Sort; pin enable_seqscan=off (scoped) for that one query so the plan is stable (same approach as the earlier planner flaky tail). Verified stable across 5 ORCA runs (optimizer=on) and 2 planner runs (optimizer=off), All 202 passed each. Safety-gated: no test turned a success into an error (gist's lossy-distance error becomes a deterministic result). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…drift These three carry EXPLAIN ANALYZE plans whose volatile annotations differ from the committed answer files on the rebuilt (--enable-orca) cluster: the GGDB per-operator "Executor Memory: NkB" line appears only when a Sort's branch is actually executed, and runtime partition pruning shows "(never executed)" vs "(actual rows=N)" depending on which partitions are probed. No data rows or error behavior change (success->error gate clean); the EXPLAIN VERBOSE "Settings:" line that now lists optimizer='off' is already ignored by init_file (m/^ Settings:.*/). Regenerated against the clean post-fix gpdemo and verified stable across the planner runs (P/Q/M/N) and ORCA runs (H/I/J/S/T), All 202 passed. Note: these runtime-pruning/memory annotations are sensitive to the cluster instance; if they flutter in CI they should be hardened with GP_IGNORE rather than re-pinned. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Our PG15 base (15beta2, merge target adadae4) carries a recovery WAL prefetcher bug: lrq_complete_lsn()'s readahead can advance the XLogReader past the record just returned by XLogNextRecord(), tripping Assert(record == prefetcher->reader->record) at xlogprefetcher.c:1061 and aborting the startup process (signal 6) during WAL replay. In an MPP cluster this crashes mirror segments as they replay WAL under load, degrading the cluster (cascading "could not connect to the primary"/FTS-down). It is cassert-only but reliably reproducible during the optimizer=on regress matrix. xlogprefetcher.c is otherwise upstream (GPDB only changed smgropen to the 3-arg SMGR_MD form), so this is an upstream beta2 bug fixed later on REL_15_STABLE. Until that fix is backported, default this recovery-only prefetch optimization to off; re-enable to RECOVERY_PREFETCH_TRY afterwards. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… to PG15 Two fixes to bring up isolation2 on the PG15 merge: 1. pg_regress.c: the PG15 merge dropped the convert_sourcefiles() call from initialize_environment() (PG14 had it right before load_resultmap()). That function generates sql/*.sql and expected/*.out from the input/ and output/ *.source templates at run start; without it, any .source-based test fails with "cannot open .../sql/<name>.sql: No such file". The main regress only survived because its generated files were left over from the initial build; isolation2 (run for the first time here) had none, so it aborted at the first .source test (autovacuum-analyze). Restore the call. 2. workfile_mgr_test.c (isolation2 harness): port logicaltape_test() to the PG15 LogicalTape-as-object API — LogicalTapeSetCreate is now 3-arg, tapes are created with LogicalTapeCreate(set), and Tell/Write/Freeze/Seek/Read take a LogicalTape* instead of (set, tapenum). The file never compiled before because isolation2 was built --disable-orca-era and not exercised. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…5 merge)
The PG15 merge adopted the callers' half of an upstream change but not the
matching relaxation in recordDependencyOnCurrentExtension():
- PG15 made makeOperatorDependencies() and GenerateTypeDependencies() pass
isReplace=true unconditionally when recording extension membership (the
merge took these correctly in pg_operator.c / pg_type.c).
- PG15 *also* removed the "free-standing object, so reject" branch from
recordDependencyOnCurrentExtension(): with isReplace=true, an object that
is not yet anyone's member is now absorbed into the current extension
rather than rejected. The merge kept GPDB's PG14 version, which still
rejects.
The broken combination breaks CREATE EXTENSION for any extension that
creates a shell operator (a CREATE OPERATOR referencing a not-yet-existing
COMMUTATOR/NEGATOR makes a shell, recorded with isReplace=true) or replaces
a shell type: e.g. citext fails with "(null) is not a member of extension
\"citext\" / An extension is not allowed to replace an object that it does
not own" (the (null) is a shell operator, whose getObjectDescription is
empty). The function's own header comment already documents the PG15
behavior ("the object will be absorbed into the extension ... desirable for
cases such as replacing a shell type") — the code just didn't match it.
Adopt upstream PG15's recordDependencyOnCurrentExtension. Verified:
CREATE EXTENSION citext succeeds and contrib/citext installcheck passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PG15 forbids RequestAddinShmemSpace() outside the new shmem_request_hook; orafce called it directly in _PG_init(), so CREATE EXTENSION orafce (or first use) crashed every backend with FATAL "cannot request additional shared memory outside shmem_request_hook" (ipci.c), cascading the whole orafce test suite (12 of 13 failed). Move the request into an orafce_shmem_request() assigned to shmem_request_hook (chaining any previous hook), guarded by PG_VERSION_NUM >= 150000 so the pre-15 path is unchanged. Verified: the installed orafce.so now references shmem_request_hook and the orafce test runs without the shmem crash. (Separate orafce follow-ups remain: add_months type-resolution answer-file drift, and the nlssort test.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The PG15 merge moved ReadCheckpointRecord() from xlog.c into the new
xlogrecovery.c (the xlog→xlogrecovery split) but dropped GGDB's call to
XLogProcessCheckpointRecord() and the helper's definition along the way;
only the orphaned forward declaration survived in xlog.c.
GGDB writes an *extended* checkpoint record that appends the in-doubt
distributed-committed transactions (getDtxCheckPointInfo) after the
CheckPoint struct. Recovery must extract that DTX payload from the
checkpoint it starts from and feed it to redoDtxCheckPoint() so the
second phase of 2PC can complete. Without it, a distributed transaction
that has committed on the coordinator is treated as an orphaned prepared
transaction and aborted on the segments after crash recovery — the table
then exists only on the coordinator ("could not open relation with OID
N" when a later query is dispatched to the segments).
Re-graft the helper into xlogrecovery.c next to ReadCheckpointRecord and
call it from the report==true path (the starting-checkpoint read), matching
the pre-merge GGDB behavior; remove the now-dead declaration from xlog.c.
The control file is updated to point at the new checkpoint before the
keep_log_seg panic fires, so crash recovery starts from the checkpoint
carrying the DTX info and this single call site is sufficient.
Fixes isolation2 test checkpoint_dtx_info (issue #12977 coverage).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… drift The PG15 test version sets the imported snapshot via psql -Atc "BEGIN TRANSACTION ...; SET TRANSACTION SNAPSHOT '...';" Since the PG14 psql change, "psql -c" with multiple semicolon-separated commands prints each command's status tag, so the output is now "BEGIN\nSET" rather than just "SET". Deterministic, functionally correct (the snapshot is still imported); add the BEGIN tag line to the expected output. The run-varying snapshot token is masked by the existing substitution rule and left as-is. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…merge)
The PG15 merge adopted upstream's cluster_rel() which asserts the target
relkind is one of RELATION / MATVIEW / TOASTVALUE. GPDB's pre-merge
cluster_rel() had no such assert.
VACUUM FULL on an append-optimized table recurses into its heap-backed
auxiliary relations — aoseg, block directory and visimap (vacuum.c
vacuum_rel) — with VACOPT_FULL still set. Each aux relation has
is_appendoptimized == false (they are plain heaps), so it takes the
"VACUUM FULL is a variant of CLUSTER" path and is rewritten via
cluster_rel(), exactly like the TOAST table. But those relations carry
the GPDB-specific relkinds RELKIND_AOSEGMENTS / RELKIND_AOBLOCKDIR /
RELKIND_AOVISIMAP, which the upstream assert rejects → FailedAssertion at
cluster.c:514, crashing the segment ("server closed the connection
unexpectedly" on "VACUUM FULL <ao table>", triggering cluster recovery).
The rewrite itself is correct (make_new_heap uses the relation's heap
access method and the swap preserves the original relkind); only the
assert was too strict. Extend it to accept the three AO aux relkinds.
Surfaced by isolation2 uao/compaction_full_stats_{column,row}; assert-only
crash, so it bites debug/cassert builds.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… merge)
TRUNCATE on an append-optimized table assigns a fresh relfilenode to the
table and its auxiliary relations via RelationSetNewRelfilenode(). The
PG15 merge replaced GPDB's explicit relkind switch there with upstream's
form, which gates the table-AM path on RELKIND_HAS_TABLE_AM():
if (RELKIND_HAS_TABLE_AM(relkind))
table_relation_set_new_filenode(..., &freezeXid, ...);
else if (RELKIND_HAS_STORAGE(relkind))
RelationCreateStorage(...); /* leaves freezeXid invalid */
The upstream RELKIND_HAS_TABLE_AM() macro lists only RELATION / TOASTVALUE
/ MATVIEW, not GPDB's AO auxiliary relkinds (AOSEGMENTS / AOBLOCKDIR /
AOVISIMAP). Those aux relations are heaps with a heap table AM, so they
fell into the storage-only branch and TRUNCATE stored relfrozenxid = 0
(InvalidTransactionId) for them. A later plain VACUUM of the AO table
recurses into the aux relations through heap_vacuum_rel(), whose early
wraparound-failsafe precheck asserts TransactionIdIsNormal(relfrozenxid)
(vacuumlazy.c) and crashes the segment.
GPDB's pattern elsewhere (RelationBuildLocalRelation, the relcache reload
paths, plancat.c) is to keep the upstream macro and add an explicit
AO-aux relkind check alongside it; RelationSetNewRelfilenode() was the one
site that missed it. Add the same check so the aux relations get a valid
relfrozenxid on TRUNCATE.
Surfaced by isolation2 uao/vacuum_cleanup_{column,row} (TRUNCATE followed
by VACUUM with a concurrent writer); assert-only crash.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PG15 revoked the default CREATE privilege on the public schema from PUBLIC, and GreengageDB adopts that (initdb grants only USAGE on public to PUBLIC). The core regression suite compensates by running "GRANT ALL ON SCHEMA public TO public;" in src/test/regress/sql/ test_setup.sql, but the isolation2 suite's setup.sql had no equivalent. As a result, isolation2 tests that create objects in the public schema as a non-superuser role fail with "permission denied for schema public". In resource_queue_deadlock this is especially nasty: session 1 creates "t_deadlock_test" as the non-superuser role_deadlock_test, the CREATE TABLE fails, so the subsequent INSERT (and its auto-stats ANALYSE) never runs, the before_auto_stats fault never triggers, and session 0's gp_wait_until_triggered_fault() hangs forever (the whole isolation2 run stalls). Mirror test_setup.sql: GRANT ALL ON SCHEMA public TO public in the isolation2 setup (which isolation2_main.c always runs as a prerequisite), and regenerate setup.out. This fixes resource_queue_deadlock and the other isolation2 specs that create objects as non-superuser roles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pg_basebackup() PL/Python helper in setup.sql formats a failed
command's output in its error handler as:
results = str(e) + "\ncommand output: " + e.output
Under Python 3, subprocess.CalledProcessError.output is bytes, so the
concatenation raises "TypeError: can only concatenate str (not 'bytes')
to str". The error handler thus crashes instead of reporting the real
pg_basebackup failure, masking it in every test that uses the helper
(e.g. segwalrep/master_wal_switch, pg_basebackup_large_database_oid).
Decode e.output before concatenating (matching the success path, which
already .decode()s), and guard against None. Regenerate setup.out.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…G15 merge) pg_basebackup's GPDB --target-gp-dbid option must write the new segment's dbid into internal.auto.conf in the freshly-created data directory: the backup stream carries the *source* segment's internal.auto.conf (its gp_dbid), so a mirror/standby created from a primary/coordinator would otherwise inherit the source's gp_dbid. The PG15 merge rewrote pg_basebackup's receive path around bbstreamer and dropped the call to WriteInternalConfFile(); the function survived but was never invoked. As a result every pg_basebackup-created segment kept the source's gp_dbid (e.g. content-0 mirror got gp_dbid=2 instead of 5, the standby got 1 instead of 8). This is dormant in normal operation — FTS probes the primary, not the mirror directly — but breaks the moment a mirror is promoted (or the standby activated): FTS probes the new primary with its catalog dbid, the segment compares against its (wrong) configured dbid, and rejects the probe with "PROBE received dbid:N doesn't match this segments configured dbid" (ftsmessagehandler.c). The segment never finishes promotion and stays in recovery, hanging every mirror-promotion / failover / pg_rewind based test (segwalrep/mirror_promotion, recoverseg_from_file, twophase_tolerance_with_mirror_promotion, failover_with_many_records, prepared_xact_deadlock_pg_rewind, ...) and real HA failover. Re-graft the WriteInternalConfFile() call after BaseBackup() completes, for plain-format backups (tar mode adds the file manually, per the existing note). Verified: a recreated demo cluster now gives mirrors gp_dbid 5/6/7 and the standby 8. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… merge)
GPDB's pg_basebackup --force-overwrite extracts a backup into an existing
data directory (e.g. gprecoverseg full recovery in place). Before PG15
the tar-receive code tolerated pre-existing directories when
forceoverwrite was set. The PG15 merge moved extraction into the new
bbstreamer extractor (bbstreamer_file.c), whose extract_directory() only
ignores EEXIST for pg_wal / pg_xlog / archive_status — every other
existing directory makes it pg_fatal("could not create directory ...:
File exists").
As a result gprecoverseg full recovery fails as soon as the target data
directory still contains e.g. pg_serial, which blocks recovering a downed
segment after a mirror promotion (segwalrep/mirror_promotion,
recoverseg_from_file, ...).
Thread forceoverwrite into the bbstreamer extractor and tolerate EEXIST
for any directory when it is set (files are already truncated via
fopen "wb"), restoring the pre-merge behavior.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ittest (PG15) PG15 requires RequestNamedLWLockTranche() to run only during the shmem-request phase, so it was moved out of SharedSnapshotShmemSize() and CreateSharedSnapshotArray() into the shmem_request hook (CreateSharedMemoryAndSemaphores). sharedsnapshot_test.c still set up expect_string/expect_value/will_be_called for RequestNamedLWLockTranche around both calls, so cmockery failed with "Remaining item(s) declared at sharedsnapshot_test.c:29 / :41" (the expectations were never consumed), breaking `make unittest-check` in CI. Remove the two stale RequestNamedLWLockTranche expectation blocks; the test's actual purpose (shared snapshot array slot/lock/xip boundary checks) is unchanged and now passes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_GetNewTransactionId_xid_warn_limit exercises the warn-limit path, which (unlike the stop-limit path that ereport(ERROR)s first) continues into the XID-assignment code. There GetNewTransactionId() indexes ProcGlobal->xids[MyProc->pgxactoff] and ProcGlobal->subxidStates[MyProc->pgxactoff]. The test left its stack PGPROC uninitialized, so pgxactoff was garbage; it happened to be 0 (a valid index into the size-1 arrays) before, but the PG15 PGPROC layout change turned it into an out-of-bounds index, segfaulting the test and breaking `make unittest-check` in CI. Zero-initialize the stack PGPROC and PROC_HDR so pgxactoff is 0 and MyProc->subxidStatus is empty (satisfying the asserts in that path). The test logic is otherwise unchanged and now passes 5/5. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…pile fix)
PG15 moved the WAL record's main data out of XLogReaderState into the
decoded record (XLogReaderState->record, a DecodedXLogRecord); the main
data is now reached via XLogRecGetData()/XLogRecGetDataLen() which
dereference record->record->main_data[_len]. cdbappendonlyxlog_test.c
still assigned mockrecord->main_data directly, so it failed to compile
("'XLogReaderState' has no member named 'main_data'"), breaking
`make unittest-check`.
Point the mock reader at a stack DecodedXLogRecord (header + main_data +
main_data_len, max_block_id = -1) so ao_insert_replay/ao_truncate_replay
read the data through the PG15 accessors. Tests pass 2/2.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… (PG15) Back-to-back segwalrep failover/recovery tests race a just-promoted or just-recovered segment that is still transiently unavailable. Two distinct race classes were diagnosed: 1. The direct "NU: select 1" promotion-waits connect to the freshly promoted mirror before it finishes recovery and fail with "FATAL: the database system is not accepting connections". This raw connection rejection is NOT covered by gp_gang_creation_retry. Add a plpython helper wait_until_segment_accepts_connections(content_id) that polls pg_isready against the content's current primary (nudging FTS) until it is ready, and call it before the 1U/0U promotion-waits in recoverseg_from_file. 2. gprecoverseg's gang creation in mirror_promotion fails with "Segments are in reset/recovery mode" because a segment is still recovering. mirror_promotion was missing the gp_gang_creation_retry bump that twophase_tolerance_with_mirror_promotion and failover_with_many_records already use; add it (120 x 1000ms ~= 120s) via gpconfig + gpstop -u, reset at the end. The default gp_gang_creation_retry is only 5 x 2s = 10s, too short for an in-order run. Note: keep plpython helper bodies comment-free and free of any trailing ';' -- the isolation2 harness splits commands on ';' at end of line, which corrupts the function definition. mirror_promotion's second, fault-injection scenario can still flake in back-to-back in-order runs (the whole cluster goes transiently into reset/recovery, where even coordinator-only helper queries error uncatchably); that residual is environmental and tracked separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Upstream PG15 commit cc50080 ("Rearrange core regression tests to reduce cross-script dependencies") moved the shared C helper functions, including binary_coercible(), into test_setup.sql, which runs before create_function_0. The PG15 merge kept test_setup.sql's definition but did not remove create_function_0's now-duplicate one, so create_function_0 failed with: ERROR: function "binary_coercible" already exists with same argument types That failure (and the resulting missing-object cascade into downstream tests) shows up in any regression run, and dominated the JIT (jit=on jit_above_cost=0) installcheck matrix. Remove the duplicate from both the input and output .source templates; test_setup.sql's definition (which runs first) serves every later test. Verified: test_setup, create_function_0, create_function_c and opr_sanity (a binary_coercible consumer) all pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_conversion, greenplum test_setup) Two more PG15-merge dedup misses that broke `make installcheck` (installcheck-good runs parallel_schedule then greenplum_schedule in one database), surfaced while triaging the JIT installcheck matrix: * test_enc_conversion() is created by conversion.sql in PG15 (upstream commit cc50080), but create_function_0.source still defined it too, so conversion failed "function already exists". Remove it from create_function_0 (only conversion uses it, and it creates its own). * test_setup is the first test of parallel_schedule (PG15 upstream); the merge also added it to greenplum_schedule. Since installcheck-good runs parallel_schedule first in the same database, the greenplum copy's CREATE TABLEs failed "already exists" and its INSERTs double-loaded the shared read-only tables, cascading into the greenplum tests. Drop it from greenplum_schedule; parallel_schedule's run serves both. With these plus the earlier binary_coercible dedup, the core parallel_schedule passes under JIT (jit=on jit_above_cost=0) except misc; the remaining greenplum_schedule failures are pre-existing PG15 answer drift unrelated to JIT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PG15 adopted upstream commit 6867f96, which hardened pg_get_expr_worker() by walking the input node with pull_varnos() to reject expressions containing Vars. The gp_partition_template.template catalog column stores a serialized GpPartitionDefinition node tree (a GPDB-specific node with no Vars); expression_tree_walker() has no case for the GPDB partition nodes, so pull_varnos() errored with "unrecognized node type: 740". The deparse path already handles T_GpPartitionDefinition (get_rule_expr), so skip the Var-safety check for it, restoring pre-PG15 behavior. Fixes regress tests AOCO_Compression, bfv_partition, column_compression, gp_partition_template, partition (opt=on and opt=off). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PG15 adopted upstream's ATWrongRelkindError path in ATSimplePermissions, which calls alter_table_type_to_string(cmdtype) and, when it returns NULL, falls through to the internal-error elog "invalid ALTER action attempted on relation". The GPDB-specific AT_ExpandTable / AT_ExpandPartitionTablePrepare values had no case, so EXPAND TABLE on a wrong relkind (e.g. a view) raised that internal error instead of a clean message. Add the cases and regenerate expand_table.out. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The PG15 merge left two complete SELECT...FROM pg_class blocks in getTables(), both appending to the same query buffer, producing malformed SQL. The first block was upstream PG15's rewritten minimal query (no WHERE/ORDER/execute); the second is the GPDB query that the result parsing actually depends on (it reads distclause, parrelid, parlevel, relstorage, partclause, parttemplate via PQfnumber) and which carries the full FROM/JOINs/WHERE/ORDER and ExecuteSqlQuery. Drop the leftover upstream block, keeping the GPDB query. Fixes regress test pg_dump_binary_upgrade (opt=on and opt=off). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The PG15 createdb rewrite (WAL_LOG/FILE_COPY strategies) dropped two
GPDB pieces from the new strategy functions:
- ScheduleDbDirDelete(): registers the destination DB directory on the
PendingDBDelete list so a failed/aborted CREATE DATABASE removes it
(GPDB removes the dir via pending-deletes; createdb_failure_callback's
upstream remove_dbtablespaces is #if 0'd out for GPDB).
- the create_db_after_file_copy / after_xlog_create_database fault
injection points used by the createdb regress test.
Re-graft both into CreateDatabaseUsingFileCopy (and ScheduleDbDirDelete
into CreateDatabaseUsingWalLog), matching adb-6.x.
PG15 defaults to the wal_log strategy, but these faults are file-copy-path
specific, so the createdb test requests STRATEGY file_copy for every fault
case (db_with_leftover_files, db2, db3, db4). This also avoids a buffer
leak: the wal_log strategy copies relations through the shared buffer
cache, and createdb_failure_callback only drops those buffers when an error
is caught inside its PG_ENSURE_ERROR_CLEANUP block. db4's CASE 4 aborts via
an end_prepare_two_phase panic during 2PC commit -- after the block has
ended -- so the callback never runs; with wal_log, ScheduleDbDirDelete would
then unlink the directory while its dirty buffers remained, orphaning them
('could not write block ... No such file or directory'). file_copy uses
copydir (no buffer-cache load), so there are no buffers to orphan.
Fixes regress test createdb (opt=on and opt=off).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A backend that spilled to a shared work file (FileSet) crashed on exit:
pgstat_shutdown_hook() runs as a before_shmem_exit callback and tears down
the cumulative-stats state, then dsm_backend_shutdown() (an on_shmem_exit
callback, so it runs later) detaches the DSM segment, whose cleanup deletes
the FileSet's temporary files:
pgstat_report_tempfile <- asserts pgstat is up (pgstat.c:1227)
ReportTemporaryFileUsage
PathNameDeleteTemporaryFile
FileSetDeleteAll
dsm_detach
dsm_backend_shutdown
Reporting temp-file usage after the stats subsystem is shut down trips
pgstat_assert_is_up() under assertions, and touches detached stats shared
memory in any build. Because only backends that spilled hit this, and more
queries spill under load, it crashed segments intermittently during
installcheck-good and cascaded ('terminating connection because of crash of
another server process') into dozens of unrelated tests.
Skip the temp-file stats report once proc_exit is in progress; per-file
accounting is moot for a backend that is leaving, and query-time temp-file
deletions still report normally (proc_exit_inprogress is false then).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
COPY ... IGNORE EXTERNAL PARTITIONS over a partitioned table that has an
external (foreign) partition crashed the QD planner with
FailedAssertion("child_rel != NULL", planner.c:9046).
expand_partitioned_rtentry() walks live_parts and, for the GPDB
"skip foreign partitions" hack, does 'continue' for a foreign partition
without building its part_rels[] entry -- but it left that partition's index
in live_parts. PG15's apply_scanjoin_target_to_paths() now iterates
live_parts and asserts a non-NULL part_rels[] entry for every live member
(PG14 tolerated NULL slots), so the skipped external partition tripped the
assert (and would dereference a NULL RelOptInfo in a non-assert build).
Remove the skipped partition from live_parts too, keeping live_parts and
part_rels[] consistent for all downstream consumers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A QD backend could SIGSEGV in mppExecutorCleanup(): mppExecutorCleanup estate->dispatcherState (execUtils.c:1727) standard_ExecutorStart PG_CATCH (execMain.c:335) PortalStart / exec_simple_query standard_ExecutorStart() runs the resource-manager operator-memory assignment (PolicyAutoAssignOperatorMemoryKB / PolicyEagerFree...) inside a PG_TRY, and on error calls mppExecutorCleanup() from the PG_CATCH. That assignment happens before queryDesc->estate is created, so mppExecutorCleanup dereferenced a NULL estate and crashed the backend -- which on the QD takes down the whole coordinator (crash recovery) and cascades into concurrently running tests. It was hit intermittently under load by a complex query whose operator-memory needs exceeded the (contention-reduced) query_mem, e.g. the psql \d publications query (a 3-way UNION) -- 'insufficient memory reserved for statement' was thrown during executor start. Return early from mppExecutorCleanup() when queryDesc->estate is NULL: the executor state was never built, so there is nothing to tear down, and the original error then propagates as a normal ERROR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GPDB extends SQL window frames to allow non-constant (column-valued) start/end offsets, e.g. ROWS BETWEEN <expr> PRECEDING where <expr> references the current row. compute_start_end_offsets() only (re)computes an offset when its start_offset_valid / end_offset_valid flag is false, and those flags mean "valid for the current row". The PG15 merge kept upstream's advance-current-row and begin_partition logic (which resets framehead_valid/frametail_valid but not the GPDB offset-valid flags, since upstream offsets are constant), so the flags were only ever set true and never reset. The frame offset was therefore frozen at the first row's value, producing wrong window-aggregate results whenever the offset varied across rows (e.g. SUM over ROWS BETWEEN off PRECEDING gave the off-of-the-first-row frame for every row). Re-graft the per-row resets (matching adb-6.x): in begin_partition() and when advancing the current row -- unconditionally for RANGE framing, and for ROWS/GROUPS only when the offset is not var-free (non-constant). Fixes regress test qp_olap_window (and related non-constant-frame cases). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bump up to PostgreSQL15