Claude merge 3 #2706

Open
dimoffon wants to merge 4577 commits into adb-8.x from claude-merge-3

@dimoffon

Member

Bump up to PostgreSQL15

dimoffon and others added 30 commits June 3, 2026 16:47
…tartup_dummy)

EXPLAIN ANALYZE of a CREATE TABLE AS / SELECT INTO that uses EXECUTE of a
prepared statement crashed the coordinator:

  #6  intorel_startup_dummy        (rel == NULL)
  #7  standard_ExecutorRun
  #8  ExplainOnePlan
  #9  ExplainExecuteQuery
  #10 ExplainOneUtility

GPDB creates the CTAS target relation in intorel_initplan() (called from
InitPlan, execMain.c) gated on PlannedStmt->intoClause, and leaves the
DestReceiver's rStartup a near-dummy that dereferences the relation created
there.  The freshly-planned EXPLAIN path (ExplainOneQuery) copies the IntoClause
onto the plan before running it, but ExplainExecuteQuery passed the cached plan
to ExplainOnePlan without setting PlannedStmt->intoClause.  intorel_initplan was
therefore skipped, the DR_intorel's rel stayed NULL, and intorel_startup_dummy
dereferenced it during ExecutorRun -> coordinator SIGSEGV (which drops every
session in the run).

Fix: in ExplainExecuteQuery set the IntoClause on the plan when into != NULL,
mirroring ExplainOneQuery.  The PlannedStmt belongs to the shared cached plan,
so set it on a copy -- otherwise a stale IntoClause would make a later plain
EXECUTE of the same statement try to create a table.
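
A minimal sketch of the copy-before-mutate idea described above, not the actual
patch.  ToyPlannedStmt, ToyIntoClause and plan_for_ctas() are hypothetical
stand-ins; the real code operates on PostgreSQL node trees, where a flat
memcpy() would not be enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for IntoClause and PlannedStmt (not the real structs). */
typedef struct ToyIntoClause
{
    const char *relname;
} ToyIntoClause;

typedef struct ToyPlannedStmt
{
    int                  commandType;
    const ToyIntoClause *intoClause;    /* NULL for a plain EXECUTE */
} ToyPlannedStmt;

/*
 * Return a private copy of the cached statement with intoClause set.
 * The shared cached statement is never modified, so a later plain EXECUTE
 * still sees intoClause == NULL.  The caller frees the result.
 */
ToyPlannedStmt *
plan_for_ctas(const ToyPlannedStmt *cached, const ToyIntoClause *into)
{
    ToyPlannedStmt *copy;

    assert(cached != NULL);

    copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return NULL;

    memcpy(copy, cached, sizeof(*copy));
    copy->intoClause = into;
    return copy;
}
```

The point of the sketch is only the ownership rule: the EXPLAIN path gets its
own statement carrying the IntoClause, and the cached one stays as planned.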

Verified: pre-fix, EXPLAIN ANALYZE CREATE TABLE t AS EXECUTE p dumps a core with
the backtrace above; post-fix it no longer crashes.  (CREATE TABLE AS ... EXECUTE
then hits a separate, pre-existing OID-dispatch error -- "oids were assigned, but
not dispatched to QEs" -- that also affects plain CREATE TABLE AS EXECUTE and is
out of scope for this crash fix.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…("oids were assigned, but not dispatched to QEs")

All forms of CREATE TABLE AS / SELECT INTO ... EXECUTE (plain, temp, and under
EXPLAIN ANALYZE) failed on the coordinator with:

  WARNING:  OID assignment not dispatched: catalog 1259 ... name "x"
  ERROR:   oids were assigned, but not dispatched to QEs

GPDB plans a CTAS specially: the IntoClause must reach the planner so the query
is marked PARENTSTMTTYPE_CTAS and a *distributed* plan is built -- rows are
redistributed to the target table's segments, which create and populate the
relation (consuming the OIDs the QD pre-assigned).  This is threaded through the
plancache via an extra IntoClause argument to GetCachedPlan() ->
RevalidateCachedQuery()/choose_custom_plan()/BuildCachedPlan() (MPP-8135).

The PG14 plancache merge adopted upstream's GetCachedPlan() signature and turned
the GPDB intoClause argument into a hardcoded local `IntoClause *intoClause =
NULL;`, even though the function's own header comment still documents the extra
parameter and the rest of plancache.c still threads it.  With it forced NULL the
EXECUTE plan was the plain gather-to-coordinator SELECT: the QD created the
table and assigned OIDs, the segments never did, and the un-dispatched OIDs
tripped the end-of-xact check in AtEOXact_DispatchOids().

Fix: restore the IntoClause parameter to GetCachedPlan() (header + definition,
dropping the dead NULL local) and pass it from the two CTAS callers,
ExecuteQuery() and ExplainExecuteQuery(); the remaining non-CTAS callers (spi.c,
postgres.c) pass NULL as before.

Verified: plain, parameterized, WITH NO DATA, and EXPLAIN ANALYZE CREATE TABLE
AS EXECUTE all succeed with the rows correctly distributed across segments
(nsegs=3); a plain EXECUTE returning rows is unaffected and the cached plan is
not corrupted.  Together with the prior intorel_startup_dummy fix, CTAS-EXECUTE
works end to end.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
REINDEX CONCURRENTLY corrupts the catalog on a distributed cluster.  Upstream's
concurrent reindex builds a fresh index (a new "_ccnew" pg_class entry, new OID),
swaps it in under the original name, and drops the old one -- i.e. it
intentionally *changes the index OID*.  In GPDB only the coordinator runs the
concurrent path; the segments reindex non-concurrently and keep the original OID.
The result is an index whose OID differs between the QD and the segments
("could not open relation with OID N (segX)" on the next index scan), plus the
transient _ccnew OIDs are never dispatched ("oids were assigned, but not
dispatched to QEs").

Properly supporting concurrent reindex in MPP requires the segments to rebuild
under the QD's new OID, which collides with PreventInTransactionBlock() and the
fact that a QE cannot run a multi-transaction concurrent reindex from within a
dispatched statement -- a larger feature effort (tracked separately).

As a safe interim, fall back to a normal (non-concurrent) reindex on the
coordinator, with a NOTICE, in ReindexIndex()/ReindexTable() (covering REINDEX
TABLE/INDEX, including partitioned, which then reindexes each child
non-concurrently via ReindexPartitions) and ReindexMultipleInternal() (covering
REINDEX DATABASE/SCHEMA).  Single-node utility mode is unaffected and keeps the
working concurrent path.

Verified: REINDEX (CONCURRENTLY) TABLE/INDEX and a partitioned table all succeed
with a NOTICE; the index OID is identical on the coordinator and every segment
(no divergence) and a forced index scan returns correct results; no cores, no
OID-dispatch error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ordered child columns

A distribution-key UPDATE on a partitioned table whose leaf partitions have a
different physical column order than the parent crashed in the executor:

  pg_detoast_datum <- hashtext/hash_numeric <- cdbhash <- ExecSplitUpdate

create_splitupdate_plan() derived the SplitUpdate's hash attnos and column types
from path->resultRelation's relcache entry.  For a partitioned UPDATE that
relation is a leaf partition, whose physical column order can differ from the
parent's (e.g. update.sql's part_c_1_100 "e,d,c,b,a").  But the SplitUpdate's input tuples
are the subplan output, which is labeled with root->processed_tlist -- i.e. the
*nominal* (parent) column layout.  So the distribution-key attnos taken from the
leaf policy indexed the wrong column of the parent-layout tuple: a key like "a"
at leaf attno 5 selected the parent's attno-5 column ("d", an int), which cdbhash
then fed to hashtext, dereferencing the small int as a varlena -> SIGSEGV.  (The
mismatch also tripped the type Assert in the insertColIdx loop, which is compiled
out in non-cassert builds.)

Fix: take resultDesc and cdbpolicy from the nominal target relation
(root->parse->resultRelation), which is the layout the subplan tuples are in, so
insertColIdx, hashAttnos and hashFuncs all line up with the SplitUpdate input.
For a non-partitioned UPDATE this is the same relation as before.
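
The attno mismatch can be illustrated with a toy lookup.  This is a sketch
under assumptions: toy_find_attno() and the five-column layouts used with it
are hypothetical and do not reproduce update.sql's real column types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the 1-based attribute number of "name" in a column layout,
 * or 0 if the column is not present.
 */
int
toy_find_attno(const char *const *columns, int ncolumns, const char *name)
{
    assert(columns != NULL && name != NULL);

    for (int i = 0; i < ncolumns; i++)
    {
        if (strcmp(columns[i], name) == 0)
            return i + 1;
    }
    return 0;
}
```

With a parent layout of a,b,c,d,e and a leaf layout of e,d,c,b,a, looking up
the key "a" in the leaf yields attno 5, while the same column is attno 1 in the
parent.  Indexing a parent-layout tuple with the leaf's attno therefore reads a
different column, which is the failure the fix avoids by taking both the
descriptor and the policy from the nominal relation.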

Verified with update.sql's reordered-column partitions and assorted
text/numeric/multi-column distribution-key updates (including cross-partition row
movement): the hash now reads the correct key column and there is no crash; the
distribution key routes correctly and results are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ld columns

Follow-on to the SplitUpdate cdbhash fix.  A distribution-key UPDATE of a
partitioned table whose leaf partitions have a different physical column order
than the parent failed during the reinsert with:

  ERROR: table row type and query-specified row type do not match
         (ExecCheckPlanOutput, nodeModifyTable.c)

The SplitUpdate emits the new tuple in the root (nominal) target relation's
column layout (the subplan is labeled with root->processed_tlist), but the
DML_INSERT replay built the insert projection against the *source leaf*
partition's ResultRelInfo, whose descriptor can differ -- so ExecCheckPlanOutput
rejected the layout.  (Before the cdbhash fix this path crashed earlier and was
never reached.)

Fix: build the split-update insert projection against the root result relation
(mtstate->rootResultRelInfo), so it matches the subplan output, and set up
partition tuple routing for split updates -- so ExecInsert() routes the new
tuple to the correct leaf, converting the layout via ri_RootToPartitionMap, and
enforces the partition constraint.  For a non-partitioned target
rootResultRelInfo is the result relation itself, so this is a no-op there.

Verified: every split-update shape (heap/AO/AOCS, toasted, dropped-column,
multi-column and text/numeric distribution keys, UPDATE ... FROM, and
partitioned) inserts/redistributes correctly with no crash; the update
regression test no longer errors with a row-type mismatch (a partition-key
update that must stay within its subtree now correctly raises a partition
constraint violation -- naming the UPDATE target relation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ResetTempNamespace() (reached from AbortTransaction -> ResetAllGangs during
primary gang-loss recovery) unconditionally cancels the temp-namespace
before_shmem_exit callback:

    cancel_before_shmem_exit(RemoveTempRelationsCallback, 0);

Before PG14, cancel_before_shmem_exit() silently did nothing when the callback
was not the latest entry.  Upstream commit c9ae5cb made it raise an error in
that case, and the PG14 merge adopted the strict version.  But ResetTempNamespace
can legitimately reach this with the callback absent (temp namespace created but
not yet committed, so AtEOXact_Namespace never registered it) or no longer the
latest entry.  Thrown from inside AbortTransaction(), that error escalates to a
coordinator PANIC -- turning any recoverable gang loss into an all-sessions-down
crash, which then cascades into hundreds of "protocol synchronization was lost"
and "transaction is aborted" failures across the suite.

Add a non-throwing cancel_before_shmem_exit_if_latest() (the pre-PG14 lenient
behavior: remove only if it is the latest entry, else return false) and use it
in ResetTempNamespace().  Leaving the callback registered is harmless --
RemoveTempRelationsCallback() no-ops once myTempNamespace is reset just below.
The strict cancel_before_shmem_exit() is kept for its correct-LIFO callers
(PG_ENSURE_ERROR_CLEANUP).
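
A minimal sketch of the lenient cancel semantics, assuming a simple
fixed-size callback stack.  All toy_* names are hypothetical; the real list
lives in ipc.c and its callbacks take a Datum argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CALLBACKS 8

typedef void (*toy_callback) (int code, long arg);

typedef struct ToyCallbackEntry
{
    toy_callback function;
    long         arg;
} ToyCallbackEntry;

static ToyCallbackEntry toy_list[TOY_MAX_CALLBACKS];
static int toy_index = 0;

/* Two sample callbacks so the stack has distinct entries to hold. */
void toy_cleanup_a(int code, long arg) { (void) code; (void) arg; }
void toy_cleanup_b(int code, long arg) { (void) code; (void) arg; }

/* Push a callback; returns false if the stack is full. */
bool
toy_register(toy_callback fn, long arg)
{
    assert(fn != NULL);

    if (toy_index >= TOY_MAX_CALLBACKS)
        return false;
    toy_list[toy_index].function = fn;
    toy_list[toy_index].arg = arg;
    toy_index++;
    return true;
}

/*
 * Lenient cancel: pop the entry only if it is the latest one.
 * An absent or buried callback is left alone and reported via false,
 * instead of raising an error.
 */
bool
toy_cancel_if_latest(toy_callback fn, long arg)
{
    if (toy_index > 0 &&
        toy_list[toy_index - 1].function == fn &&
        toy_list[toy_index - 1].arg == arg)
    {
        toy_index--;
        return true;
    }
    return false;
}

int
toy_registered_count(void)
{
    return toy_index;
}
```

Returning false rather than erroring is what makes the call safe from inside
an abort path, where any thrown error would escalate.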

Verified: a full installcheck-good run that previously dumped a
cancel_before_shmem_exit coordinator-PANIC core now completes the same span with
zero cores.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…limit" (PG14)

A trivial COPY ... FROM stdin (or any COPY FROM) failed on every segment with
"invalid message length" -> "terminating connection because protocol
synchronization was lost", which the QD reported as "MPP detected N segment
failures, system is reconnected".  Because COPY loads the data in most
regression tests, this poisoned a large swath of the suite (231
protocol-synchronization-lost FATALs in a full run).  The same bug independently
broke nextval/sequences over MPP ("nextval: unable to parse nextval response
from QD", ~291 hits).

Root cause: GPDB multiplexes non-query messages over a libpq connection and
reads them with pq_getmessage(buf, 0), using a maxlen of 0 to mean "no upper
limit" -- COPY data forwarded QD->QE (copy.c) and the nextval-over-NOTIFY
response (sequence.c).  GPDB's pq_getmessage encoded that as

    if (len < 4 || (maxlen > 0 && len > maxlen))

The PG14 merge adopted upstream's length check verbatim

    if (len < 4 || len > maxlen)

dropping the "maxlen > 0 &&" guard.  With maxlen == 0 the test len > maxlen is
true for every well-formed message (len >= 4), so each one is rejected as
"invalid message length" and the QE tears down the connection on the first
CopyData message.

Diagnosis: wire-probes confirmed the QD frames CopyData correctly (type 'd',
len = 4 + nbytes, outBuffer_shared = 0) and the QE reads the correct len but
with maxlen = 0 -- so the comparison, not the data, was wrong.

Fix: restore the GPDB "maxlen > 0 &&" guard and document the maxlen == 0
convention so a future merge does not drop it again.
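
The two length checks quoted above can be compared side by side.  This is a
sketch only: length_ok_gpdb() and length_ok_upstream() are hypothetical helper
names that isolate the condition, not functions from pqcomm.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GPDB convention: maxlen == 0 means "no upper limit". */
bool
length_ok_gpdb(int32_t len, int32_t maxlen)
{
    assert(maxlen >= 0);

    if (len < 4 || (maxlen > 0 && len > maxlen))
        return false;
    return true;
}

/* Upstream check: maxlen is always a real bound, so 0 rejects everything. */
bool
length_ok_upstream(int32_t len, int32_t maxlen)
{
    if (len < 4 || len > maxlen)
        return false;
    return true;
}
```

For a 12-byte message and maxlen == 0, the GPDB form accepts and the upstream
form rejects; for any positive maxlen the two agree.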

Verified on a live cluster: COPY FROM stdin, COPY FROM file, nextval(), and
SERIAL inserts (segment nextval) all succeed with correct row counts; no
"invalid message length" / protocol-synchronization-lost.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
COPY FROM into any partitioned (or sub-partitioned) table crashed every segment
with a NULL-pointer SIGSEGV in ExecInitPartitionInfo (reached via
CopyFrom -> ExecFindPartition), reported on the QD as "MPP detected N segment
failures, system is reconnected".  This surfaced broadly once COPY data delivery
was fixed (see the pq_getmessage maxlen fix), because pg_regress loads much of
its data via COPY into partitioned tables.

Root cause: GPDB's COPY implementation lives in commands/copy.c (the upstream
commands/copyfrom.c is a stub here).  copy.c builds its ResultRelInfo with
InitResultRelInfo() directly -- it never calls ExecInitResultRelation(), so
estate->es_result_relations stays NULL -- yet it initialized the
ModifyTableState with

    mtstate->resultRelInfo = estate->es_result_relations;   /* NULL */

and never set mt_nrels or rootResultRelInfo.  The PG14 tuple-routing rework made
ExecInitPartitionInfo() dereference mtstate->resultRelInfo[0]
(.ri_RangeTableIndex / .ri_RelationDesc) and mtstate->rootResultRelInfo, so the
NULL resultRelInfo crashed at the first partition routed to.  (copyfrom.c's
CopyFrom sets these three fields correctly; copy.c was not updated in the merge.)

Fix: set mt_nrels = 1 and point both resultRelInfo and rootResultRelInfo at the
COPY target's ResultRelInfo (a valid one-element array), mirroring copyfrom.c.

Verified: COPY FROM stdin into range-partitioned and range-sub-partitioned
tables inserts the correct row counts with no crash.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… "see"'

~46 internal SQL functions (col_description, obj_description, shobj_description,
and many overloaded date/time and system helpers) carry the placeholder body
prosrc => 'see system_functions.sql' in pg_proc.dat; their real bodies live in
src/backend/catalog/system_functions.sql, which initdb is supposed to run after
bootstrap.  The PG14 merge dropped system_functions.sql from BOTH initdb.c and the
catalog Makefile's install list, so:

  - the file was never installed to $(datadir), and
  - initdb never executed it.

The placeholder bodies therefore survived, and every call to one of these
functions tried to execute the literal text "see system_functions.sql", failing
with:  ERROR: syntax error at or near "see".  Because col_description() is
invoked by psql's \d+, this poisoned a huge fraction of the regression suite
(439 "see" syntax errors plus the bulk of the downstream "current transaction is
aborted" / "relation does not exist" cascade).

Fix:
  - initdb.c: declare system_functions_file, set_input/check_input it, add a
    generic setup_run_file() helper, and run system_functions.sql right after
    setup_auth() and before setup_depend() (so the functions are pinned),
    matching upstream ordering.
  - catalog/Makefile: install (and uninstall) system_functions.sql alongside
    system_views.sql.

Verified: a fresh initdb succeeds; col_description()/obj_description() get real
bodies (PG14 stores them in prosqlbody) and psql \d+ shows column descriptions
with no syntax error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CREATE VIEW (and any utility statement that dispatches a Query tree, e.g.
CREATE TABLE AS) over a join failed on every segment with:

  ERROR: could not deserialize unrecognized node type: 3  (readfast.c)

PG14 (commit 055fee7) added the JoinExpr.join_using_alias field.  The
merge added it to the text writer (_outJoinExpr in outfuncs.c) and the reader
(_readJoinExpr in readfuncs.c, used for binary too), but NOT to the binary
writer _outJoinExpr in outfast.c, which is the one used for QD->QE dispatch.
So the wire layout written for a JoinExpr was one NODE field short of what the
reader expected; the reader went off the rails and hit a garbage tag (3 =
T_ProjectionInfo, a run-time node that is never serialized) -> deserialize
error on the segment.

Fix: write join_using_alias between usingClause and quals in outfast.c's
_outJoinExpr, matching the reader and the text writer.

Class of bug: a PG14-added node field re-grafted into the shared text/read
funcs but missed in the separate binary writer (outfast.c).  Found by probing
readNodeBinary() to dump the QE-side tag stream, then diffing each outfast.c
binary writer against its reader.
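
A toy model of this class of bug, assuming a flat stream of ints in place of
the real binary node format.  The toy_* names and tag values are hypothetical;
the sketch only shows how one missing field shifts everything the reader sees
afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TOY_TAG_JOIN = 100, TOY_TAG_CONST = 200 };

typedef struct ToyJoin
{
    int jointype;
    int using_clause;
    int join_using_alias;   /* the newly added field */
    int quals;
} ToyJoin;

/*
 * Append a join node to buf at pos and return the new position.
 * include_alias = false reproduces a writer that was not updated.
 */
size_t
toy_write_join(int *buf, size_t pos, const ToyJoin *j, bool include_alias)
{
    assert(buf != NULL && j != NULL);

    buf[pos++] = TOY_TAG_JOIN;
    buf[pos++] = j->jointype;
    buf[pos++] = j->using_clause;
    if (include_alias)
        buf[pos++] = j->join_using_alias;
    buf[pos++] = j->quals;
    return pos;
}

/* The reader always expects four fields after the tag. */
size_t
toy_read_join(const int *buf, size_t pos, ToyJoin *out)
{
    assert(buf != NULL && out != NULL);
    assert(buf[pos] == TOY_TAG_JOIN);

    pos++;
    out->jointype = buf[pos++];
    out->using_clause = buf[pos++];
    out->join_using_alias = buf[pos++];
    out->quals = buf[pos++];
    return pos;
}
```

When the writer skips the alias, the reader consumes the following node's tag
as quals and then interprets that node's payload as a tag, which is how a
garbage tag value can surface on the receiving side.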

Verified: CREATE VIEW over CROSS/INNER/NATURAL joins and selecting from the
views all succeed, no segment crash.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…G14)

PG14 added Result Cache (later renamed Memoize), gated by enable_resultcache
(default on).  GPDB has not integrated it with the MPP planner/executor: a
generated T_ResultCache plan node is not handled by expression_tree_mutator()
(nodeFuncs.c) and is also absent from the binary plan-dispatch serialization
(outfast.c/readfast.c).  So whenever the planner chose a Result Cache -- e.g.
the information_schema.columns is_updatable computation under
enable_mergejoin/enable_nestloop -- the query failed with:

  ERROR: unrecognized node type: 53   (53 = T_ResultCache)

This hit a broad set of join-heavy tests (updatable_views, returning, rowtypes,
join, indexjoin, subselect, partition_prune, gin, ...).

Until Result Cache is properly supported in MPP, default enable_resultcache to
off so the planner never generates the node.  GPDB's expected outputs never
show a Result Cache node, so this matches them; the standalone resultcache
regression test is not in any schedule, and aggregates.sql already sets it off
explicitly for its one relevant query.

Verified: with the new default, the minimal repro (CREATE VIEW +
information_schema is_updatable under merge/nestloop) and the updatable_views /
returning / rowtypes regression tests no longer raise "unrecognized node type: 53".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…heir C helpers (PG14)

create_function_0 (input/create_function_0.source) defines the C helper
functions binary_coercible(oid,oid), test_enc_conversion(...) and
test_opclass_options_func(internal) from regress.so, which type_sanity,
opr_sanity and conversion depend on.  The merge left create_function_0 out of
parallel_schedule entirely (only create_function_1/2/3 are listed), so those
helpers were never created and the dependent tests failed with
"function binary_coercible(oid, oid) does not exist" /
"function test_enc_conversion(...) does not exist" (~40 hits, plus the
downstream cascade in opr_sanity/type_sanity).

Add create_function_0 to the schedule right after test_setup (before any test
that uses the helpers).  It has its own expected output (create_function_0.out).

Verified (focused schedule test_setup -> create_function_0 -> opr_sanity
conversion): create_function_0 and conversion now pass and the
"function ... does not exist" errors are gone (opr_sanity still differs for
unrelated reasons).  NB create_function_0 also creates trigger functions from
contrib/spi's refint.so/autoinc.so, which must be installed (make -C
contrib/spi install) -- a build/install step, not part of this source change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A correlated subquery in the SELECT targetlist over a distributed table
ran without its correlation filter, returning wrong results:

  SELECT a, (SELECT count(*) FROM t2 WHERE t2.b = t1.a) FROM t1;

returned the *total* count for every row instead of the per-row
correlated count, and a non-aggregate correlated subquery raised
"more than one row returned by a subquery used as an expression".

GPDB routes a correlated subquery's param filter (e.g. t2.b = $0) to the
rel's upperrestrictinfo; bring_to_outer_query() then applies it via a
Result node above the Broadcast Motion, so each segment filters the
broadcast set by its local param. That Result is built by
create_projection_path_with_quals() with the filter in
cdb_restrict_clauses. Two code paths silently dropped the filter:

1. When the subpath is projection-capable (a Motion is), the function
   took the no-Result shortcut (dummypp = true) and never stored
   cdb_restrict_clauses, discarding the filter. Don't take the shortcut
   when there are restrict clauses to apply.

2. When a later projection (the scan/join target) was layered over the
   filter-carrying ProjectionPath, the path-collapsing code stripped the
   inner ProjectionPath and discarded its cdb_restrict_clauses. Carry
   them up into the surviving ProjectionPath instead.

With both fixed, create_projection_plan() emits a Result with
plan->qual = the param filter, and correlated targetlist subqueries
return correct results.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…G14)

A correlated EXISTS in the SELECT targetlist over a distributed table
failed with:

  ERROR: subplan is missing Flow information

  SELECT a, EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a) FROM t1;

For a simple EXISTS, make_subplan() additionally builds a hashed ANY
variant and wraps both in an AlternativeSubPlan, leaving setrefs.c to
choose. GPDB's MPP slice machinery does not support AlternativeSubPlan:
the hashed plan is created without Flow/slice information (so cdbllize
raises "subplan is missing Flow information"), and cdbllize cannot reason
about an AlternativeSubPlan when pruning unused subplans.

In dispatch (MPP) mode, skip building the hashed alternative and keep just
the correlated SubPlan we already built. It is correct on its own -- its
correlation filter is applied above the Motion (see the companion fix in
create_projection_path_with_quals) -- so EXISTS in the targetlist now
returns correct results. The hashed alternative is still considered for
non-MPP (utility-mode) planning.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PG14 reworked UPDATE so ModifyTable carries updateColnosLists -- one
update_colnos list per result relation, mapping each non-junk column
produced by the subplan to its target-table attribute number. The
executor's ExecInitUpdateProjection() does
list_nth(node->updateColnosLists, whichrel) unconditionally for
CMD_UPDATE.

The Postgres planner sets this in grouping_planner(), but ORCA's
DXL->Plan translator (TranslateDXLDml) never did, so every UPDATE
planned by ORCA -- the default optimizer -- produced a ModifyTable with
updateColnosLists == NIL. On the segment, list_nth(NIL, 0) dereferences
NULL -> SIGSEGV, killing the QE for any distributed UPDATE.

ORCA emits a full new tuple: the subplan's non-junk entries are the
table columns in physical order (the Result node coerces to the exact
physical table layout, including dropped columns as NULLs), so the
mapping is simply each non-junk entry's resno. Build that list and
attach it as the single per-result-relation entry.
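
The mapping described above reduces to collecting the resno of every non-junk
entry.  A minimal sketch, assuming a flat array instead of a PostgreSQL List;
ToyTargetEntry and toy_build_update_colnos() are hypothetical names, not the
translator's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for TargetEntry: only the two fields used here. */
typedef struct ToyTargetEntry
{
    int  resno;     /* 1-based position in the targetlist */
    bool resjunk;   /* true for junk columns such as row identifiers */
} ToyTargetEntry;

/*
 * Fill colnos_out with the resno of each non-junk entry, in order, and
 * return how many were written.  colnos_out must have room for ntargets.
 */
int
toy_build_update_colnos(const ToyTargetEntry *tlist, int ntargets,
                        int *colnos_out)
{
    int n = 0;

    assert(tlist != NULL && colnos_out != NULL);

    for (int i = 0; i < ntargets; i++)
    {
        if (!tlist[i].resjunk)
            colnos_out[n++] = tlist[i].resno;
    }
    return n;
}
```

This identity mapping is only valid under the stated premise that the subplan
already emits the full tuple in physical column order.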

Verified under ORCA: simple distributed UPDATE, split UPDATE (modifying
the distribution key), and append-optimized UPDATE all succeed with
correct results; DELETE is unaffected. Partitioned ORCA UPDATE now gets
past this crash and surfaces a separate, pre-existing ORCA
partition-routing issue (tracked independently).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o (PG14)

PG14 moved aggregate deduplication into the planner: preprocess_aggrefs()
assigns each Aggref an aggno (index into the executor's per-agg result
array) and an aggtransno (index into the per-transition-state array),
and ExecInitAgg() sizes those arrays from the maxima and reads each
aggregate's value from aggvalues[aggref->aggno].

ORCA plans never pass through preprocess_aggrefs(), and the DXL->Plan
translator left both fields at their MakeNode() default of 0.  Every
aggregate in an Agg node therefore shared per-agg slot 0 and transition
state 0: all aggregates in a query returned the first aggregate's
result.  E.g. under optimizer=on,

    SELECT min(q1), min(q2), max(q1), count(*) FROM agg_t

returned min(q1) four times.  This silently corrupted any ORCA query
with more than one aggregate (including count(DISTINCT) alongside count()),
breaking dozens of regress tests via wrong results rather than errors.

Fix: in TranslateDXLAgg(), after the targetlist and qual are final,
collect all Aggrefs (including ones nested in expressions and SubPlan
args) and number them densely: aggno = aggtransno = 0..N-1.  Dense
numbering matters because finalize_aggregates() walks every slot up to
the maximum.  An instance referenced more than once keeps its number;
upstream's shared-transition-state optimization is not attempted.
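
A minimal sketch of dense numbering, assuming the collected Aggrefs arrive as
an array of pointers in which the same instance may appear more than once.
ToyAggref and toy_number_aggrefs() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Aggref: only the numbering fields matter. */
typedef struct ToyAggref
{
    const char *name;
    int         aggno;
    int         aggtransno;
} ToyAggref;

/*
 * Assign aggno = aggtransno = 0..N-1 to the distinct instances in refs,
 * in first-seen order, and return N.  An instance that appears more than
 * once keeps the number it received the first time.
 */
int
toy_number_aggrefs(ToyAggref **refs, int nrefs)
{
    int next = 0;

    assert(refs != NULL || nrefs == 0);

    /* Mark everything unnumbered first. */
    for (int i = 0; i < nrefs; i++)
        refs[i]->aggno = -1;

    for (int i = 0; i < nrefs; i++)
    {
        if (refs[i]->aggno >= 0)
            continue;           /* already numbered via an earlier reference */
        refs[i]->aggno = next;
        refs[i]->aggtransno = next;
        next++;
    }
    return next;
}
```

Because the numbers have no gaps, an executor that walks every slot up to the
maximum never visits an unassigned one.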

Verified under ORCA: multi-agg min/max/count, count(DISTINCT) beside
count(), aggregates in HAVING quals, and aggregates nested in arithmetic
all return correct results; distributed UPDATE still works.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…type (PG14)

PG14 made exprType() of a SubscriptingRef read the new refrestype field
(commit c7aba7c14e5) instead of deriving the result type from
refelemtype/refcontainertype.  ORCA's DXL->Scalar translator
(TranslateDXLScalarArrayRefToScalar) still filled only refcontainertype,
refelemtype, refcollid and reftypmod, leaving refrestype at 0, so any
expression above a subscript under ORCA failed with

    ERROR:  cache lookup failed for type 0

This broke every query using subscripting (arrays, point[i],
positions[i] over unnest(tsvector), ...) with the GPORCA optimizer
enabled: point, tstypes, arrays, geometry, insert, updatable_views,
domain, tuplesort and more regress tests.

The producer side (TranslateArrayRefToDXL) already computes the result
type with upstream semantics -- element type for a single-element fetch,
container type for slices and assignments -- and stores it in the DXL
operator; the consumer just never read it back.  Set refrestype from
the DXL operator's ReturnTypeMDid.

Verified under ORCA: point[0] fetch, positions[1] over unnest(tsvector),
int[] slice v[2:3], and subscripted SET v[2]=x all work and return
correct types/values.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…G14)

ORCA's DML plan for a partitioned target goes through a dynamic scan
against the partition root, with ModifyTable.resultRelations naming only
the root; finding the leaf a tuple belongs to relied on
ModifyTable.forceTupleRouting, whose executor consumer was removed
during the PG14 nodeModifyTable rework (b04e559) -- PG14 routes
inherited updates via per-leaf result relations and "tableoid" junk
columns instead, which ORCA's plans don't provide.  An in-place
UPDATE/DELETE therefore tried to modify the storage-less partition root:

    ERROR:  could not open file "pg_tblspc/0/GPDB_8_.../0/0"

(This hit catcache, qp_dropped_cols and the wider could-not-open-file failure
cluster; split updates that modify the distribution key survived only
because their INSERT half goes through the partitioned-INSERT routing.)

Restore the pre-#14129 guards in TranslateUpdateQueryToDXL and
TranslateDeleteQueryToDXL so ORCA raises ExmiQuery2DXLUnsupportedFeature
and falls back to the Postgres planner, which handles partitioned
UPDATE/DELETE correctly on PG14.  INSERT stays with ORCA.  Revisit by
porting per-tuple leaf routing onto the PG14 executor model.

Verified under optimizer=on: partitioned non-key UPDATE, partition-key
(cross-partition) UPDATE, partitioned DELETE all correct; UPDATE routing
a NULL partition key reports the proper "no partition found" error
instead of the storage error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… dispatched" (PG14)

Since PG13 (commit 5028981), CREATE TABLE (LIKE ... INCLUDING INDEXES)
defers index creation: transformCreateStmt leaves a TableLikeClause in
the statement list, and ProcessUtilitySlow expands it later via
expandTableLikeClause() into IndexStmts marked transformed=true.  The
upstream T_IndexStmt path maps transformed -> is_alter_table=true
("treat it like ALTER TABLE ADD INDEX"), and GPDB's DefineIndex()
suppresses its QE dispatch when is_alter_table on the assumption that an
enclosing ALTER TABLE is dispatched as a whole.

For the LIKE path there is no enclosing command: the CreateStmt was
already dispatched with its own oids, and the cloned IndexStmt was
executed only on the QD.  The index oids preassigned there were never
sent ("ERROR: oids were assigned, but not dispatched to QEs") and the
index was missing on the segments.  This broke CREATE TABLE LIKE
INCLUDING INDEXES/ALL across alter_table, partition1, partition_storage,
index_constraint_naming*, and bfv_index.

Dispatch the transformed IndexStmt explicitly from ProcessUtilitySlow's
T_IndexStmt case, mirroring DefineIndex's own dispatch (same flags, name
pinned via stmt->idxname, oldNode cleared, preassigned oids attached).
The QE re-executes it with is_alter_table=true and consumes the
dispatched oids.

Verified: CREATE TABLE (LIKE src INCLUDING ALL) succeeds, the pkey index
exists on the QD and on every segment, and the unique constraint is
enforced segment-side (duplicate key rejected).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…R_READY (PG14)

PG14 added CAC_NOTCONSISTENT (pmState == PM_RECOVERY) to reject
connections to a hot-standby that has not reached consistency.  The
merge placed that branch in canAcceptConnections() above GPDB's
GetMirrorReadyFlag() -> CAC_MIRROR_READY check.  A GPDB mirror runs with
hot_standby off and stays in PM_RECOVERY for its whole life, so the
mirror-ready branch became dead code: every connection to a mirror was
answered with "the database system is not accepting connections /
Hot standby mode is disabled."

CAC_NOTCONSISTENT has no FTS exemption in ProcessStartupPacket (only
CAC_STARTUP and CAC_MIRROR_READY do), so the FTS probe process could not
connect to any mirror at all: probes ended in "FTS double fault
detected", promotion requests never reached the mirror (catalog flipped
to role=p while the segment kept running as a standby -- standby.signal
in place, walreceiver streaming), and gprecoverseg failed because it
could not read the version string from the CAC_MIRROR_READY error.
Every mirror failover wedged the cluster unrecoverably, and the FTS
regress tests (fts_error, fts_recovery_in_progress, ...) hung.

Return CAC_MIRROR_READY before the CAC_NOTCONSISTENT branch when the
walreceiver has been launched and hot_standby is off (a GPDB mirror).
Genuine hot-standby servers that merely have not reached consistency
keep the upstream fail-fast CAC_NOTCONSISTENT behavior.
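
The ordering of the two checks can be sketched as follows.  This is a toy
under assumptions: the ToyCAC enum, ToyPostmasterState and toy_can_accept()
are hypothetical, and the real canAcceptConnections() handles more states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    TOY_CAC_OK,
    TOY_CAC_MIRROR_READY,
    TOY_CAC_NOTCONSISTENT
} ToyCAC;

typedef struct ToyPostmasterState
{
    bool in_recovery;           /* stands in for pmState == PM_RECOVERY */
    bool hot_standby;           /* hot_standby setting */
    bool walreceiver_launched;  /* stands in for the mirror-ready flag */
} ToyPostmasterState;

/*
 * The mirror test comes first: a mirror stays in recovery for its whole
 * life, so testing "not consistent" first would make this branch dead code.
 */
ToyCAC
toy_can_accept(const ToyPostmasterState *s)
{
    assert(s != NULL);

    if (s->in_recovery)
    {
        if (s->walreceiver_launched && !s->hot_standby)
            return TOY_CAC_MIRROR_READY;
        return TOY_CAC_NOTCONSISTENT;
    }
    return TOY_CAC_OK;
}
```

A server with hot_standby on that has not reached consistency still gets the
fail-fast answer; only the hot_standby-off mirror case is answered as ready.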

Verified end-to-end: direct connection to a mirror reports the
version-bearing mirror-ready error; killing a primary now leads to FTS
truly promoting the mirror (standby.signal removed, segment accepts
queries, QD queries work across the failover); gprecoverseg -a and -ar
restore and rebalance the cluster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gprecoverseg incremental recovery runs

    pg_rewind --write-recovery-conf --slot="internal_wal_replication_slot" ...

The PG14 merge took upstream pg_rewind wholesale (46f49ad), dropping
the GPDB-added --slot option (eacc688), so every incremental
recovery failed with "unrecognized option '--slot=...'" and left the
downed segment unrecovered.

Re-add -S/--slot on top of the PG14 implementation: upstream's
GenerateRecoveryConfig() (shared with pg_basebackup) already takes a
replication-slot argument and emits primary_slot_name; pass the option
through at both -R call sites and reject --slot without
--write-recovery-conf, as before.

Verified: gprecoverseg -a incremental recovery succeeds ("Segments
successfully recovered", mirror back in sync) and gprecoverseg -ar
rebalances to preferred roles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…(PG14)

GPDB append-optimized tables cannot fetch the old tuple by TID, so an
UPDATE plan over them must emit the full new tuple:
preprocess_targetlist() expands the targetlist when the target relation
is AO, and ExecModifyTable leaves the old slot empty for AO result
relations on the strength of that contract.

With PG14 native partitioning the target of a partitioned UPDATE is the
storage-less root, for which RelationIsAppendOptimized() is always
false, so the expansion never happened when only the leaves are AO.
The per-leaf update projection then referenced old-tuple columns the AO
leaf could not provide, failing with

    ERROR:  getsomeattrs is not required to be called on a virtual tuple table slot

across alter_table_aocs*, expand_table_ao*, alter_ao_part_tables*,
alter_ao_part_exch* (10 regress tests).

Add rel_has_appendoptimized_partition(): for a partitioned target, scan
its inheritors and force the expansion when any of them uses an AO
access method.  Split updates already expanded; pure-heap partitioned
updates keep the upstream narrow targetlist.

Verified: AO-row and AOCS partitioned UPDATEs return correct results;
heap partitioned UPDATE unaffected.
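
A minimal Python model of the inheritor scan, for illustration only.  The
Rel class, the access-method names and needs_full_targetlist are
assumptions of this sketch, not the C implementation of
rel_has_appendoptimized_partition().

```python
from dataclasses import dataclass, field
from typing import List

# Assumed access-method names for AO-row and AOCS tables (illustrative).
AO_ACCESS_METHODS = {"ao_row", "ao_column"}


@dataclass
class Rel:
    name: str
    access_method: str = ""  # empty for a storage-less partitioned root
    children: List["Rel"] = field(default_factory=list)


def iter_inheritors(rel):
    """Yield every descendant of rel, depth first."""
    for child in rel.children:
        yield child
        yield from iter_inheritors(child)


def needs_full_targetlist(rel):
    """True when the target itself, or any inheritor, is append-optimized."""
    if rel.access_method in AO_ACCESS_METHODS:
        return True
    return any(r.access_method in AO_ACCESS_METHODS for r in iter_inheritors(rel))
```

A root with one AO leaf forces the expansion; a pure-heap hierarchy keeps
the narrow targetlist.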

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaying a 2PC DROP TABLESPACE (COMMIT_PREPARED with GPDB's
tablespace_oid_to_delete_on_commit) on a mirror could die with

    FATAL:  could not open directory "<location>/<dbid>": No such file or directory
    CONTEXT: WAL redo at ... for Transaction/COMMIT_PREPARED ...

destroy_tablespace_directories() downgrades its own errors to LOG under
redo, but the directory_is_empty() check on the symlink target uses
ReadDir at ERROR, which the startup process escalates to FATAL -- so a
vanished/unreadable target directory took the whole mirror down over
disk space we merely failed to release.  FTS then marked the mirror
down and pg_regress aborted the suite (temp_tablespaces /
alter_db_set_tablespace window).

Add directory_is_empty_ext() with a caller-chosen elevel and use it in
the redo path (LOG); an unreadable directory counts as empty and the
subsequent rmdir's LOG reports the leftover.
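
The caller-chosen error level can be sketched in Python as follows.  This
is an analogy rather than the C code: the raise_on_error flag stands in
for the elevel argument, and the logger name is made up for the example.

```python
import logging
import os

log = logging.getLogger("redo-sketch")


def directory_is_empty_ext(path, raise_on_error=True):
    """Report whether path has no entries.

    With raise_on_error=False (the redo-path analogue of LOG), an
    unreadable or missing directory is logged and counted as empty.
    """
    try:
        with os.scandir(path) as entries:
            return next(entries, None) is None
    except OSError as exc:
        if raise_on_error:
            raise
        log.info("could not open directory %r: %s", path, exc)
        return True
```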

Verified: create tablespace -> create/insert/drop table -> drop
tablespace replays cleanly; all mirrors stay up and in sync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…g group (PG14)

A merge artifact in ExplainOnePlan left a dangling
"if (es->summary && (planduration || bufusage))" glued onto the
query-identifier condition, plus a GPDB6-leftover second buffer-usage
block ending in an ExplainCloseGroup("Planning") with no matching open.
Whenever ANALYZE ran without BUFFERS, that stray close popped the
"Query" group early: every key after "Planning Time" (Triggers, Slice
statistics, Execution Time) was emitted outside the object, producing
structurally invalid JSON ("Expected , or ] but found :").  Text format
hid it because group closes are no-ops there; explain, explain_format,
gin and join_hash failed on it.

Restore the upstream PG14 shape (queryId, then the Planning group
wrapping only planning buffer usage, then Planning Time), keeping
GPDB's slice-table print, and drop the duplicate buffer block.

Verified: EXPLAIN (FORMAT JSON, ANALYZE) and (FORMAT JSON, ANALYZE,
BUFFERS) both parse with json.loads.
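
The failure mode can be reproduced with a toy group emitter in Python.
JsonEmitter and emit are hypothetical names for this sketch, and the real
EXPLAIN output is richer (it is wrapped in an array and carries the plan
tree), but the effect of a close with no matching open is the same.

```python
import json


class JsonEmitter:
    """Tiny structured emitter with explicit group open/close calls."""

    def __init__(self):
        self.parts = []
        self.first = [True]  # per nesting level: no entry emitted yet

    def _separator(self):
        if not self.first[-1]:
            self.parts.append(",")
        self.first[-1] = False

    def open_group(self):
        self._separator()
        self.parts.append("{")
        self.first.append(True)

    def close_group(self):
        self.parts.append("}")
        self.first.pop()

    def prop(self, key, value):
        self._separator()
        self.parts.append(json.dumps(key) + ":" + json.dumps(value))

    def text(self):
        return "".join(self.parts)


def emit(stray_close):
    out = JsonEmitter()
    out.open_group()                  # stands in for the "Query" object
    out.prop("Planning Time", 0.1)
    if stray_close:
        out.close_group()             # close without a matching open
    out.prop("Execution Time", 1.5)   # lands outside the object if stray
    out.close_group()
    return out.text()
```

json.loads(emit(False)) returns a dict with both keys, while
json.loads(emit(True)) raises json.JSONDecodeError.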

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The expanded append-optimized UPDATE targetlist kept NULL placeholders
for dropped columns (resno == attno), but PG14's
ExecBuildUpdateProjection() pairs every non-junk subplan column with an
update_colnos target and rejects dropped target columns:

    ERROR:  table row type and query-specified row type do not match
    DETAIL:  Query provides a value for a dropped column at ordinal position N.

This broke every AO/AOCS UPDATE on a table with a dropped column
(alter_table_gp, drop_column_update, alter_table_analyze,
alter_ao_table_col_ddl_*, uao_allalter_*).

For a plain (non-split) AO update, strip the dropped-column
placeholders after expansion and renumber the resnos; the executor sets
dropped columns of the new tuple to NULL itself and, with every live
column assigned, never reads the old-tuple slot that AO cannot
populate.  A Split Update keeps the full physical row: it runs as
delete+insert and never builds the update projection.

Also includes rel_has_appendoptimized_partition() interplay: partitioned
AO targets take the same path.

Verified: AO and AOCS dropped-column UPDATEs return correct results.
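
The strip-and-renumber step can be sketched in Python.  The data shapes
here are assumptions made for illustration: the targetlist is modelled as
(resno, expr) pairs with resno == attno after expansion, and
update_colnos as a plain list of attribute numbers.

```python
def strip_dropped_columns(targetlist, dropped_attnos):
    """Drop placeholder entries for dropped columns and renumber the rest.

    Returns (renumbered_targetlist, update_colnos): entry i of the new
    targetlist is assigned to the live attribute update_colnos[i].
    """
    live = [(attno, expr) for attno, expr in targetlist
            if attno not in dropped_attnos]
    update_colnos = [attno for attno, _ in live]
    renumbered = [(resno, expr)
                  for resno, (_, expr) in enumerate(live, start=1)]
    return renumbered, update_colnos
```

For a three-column table whose second column was dropped, the two live
entries are renumbered 1 and 2 and target attributes 1 and 3.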

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaying Database/CREATE (ALTER DATABASE SET TABLESPACE, movedb) on a
mirror copies a live database directory while the checkpointer can
unlink files of dropped relations at restartpoints.  copy_file()/lstat
then died with

    FATAL:  could not open file ".../<relfilenode>_fsm": No such file or directory
    CONTEXT: WAL redo ... Database/CREATE: copy dir ...

killing the startup process and downing the mirror (alter_db_set_tablespace
aborted the whole regress suite this way).  The primary's copy simply
never saw those files.

Skip ENOENT sources with a LOG during recovery (InRecovery) in both the
directory scan and the file copy; normal execution still errors.
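
A Python analogy of the tolerant copy, under stated assumptions:
copy_files and its in_recovery flag are names invented for this sketch,
and the list of names plays the role of a directory scan taken before
some files were unlinked.

```python
import logging
import os
import shutil

log = logging.getLogger("redo-sketch")


def copy_files(src_dir, dst_dir, names, in_recovery):
    """Copy the named files; during recovery, skip sources that vanished."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in names:
        src = os.path.join(src_dir, name)
        try:
            shutil.copyfile(src, os.path.join(dst_dir, name))
        except FileNotFoundError:
            if not in_recovery:
                raise  # normal execution still treats this as an error
            log.info("skipping vanished file %r during replay", src)
            continue
        copied.append(name)
    return copied
```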

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
After an incremental (pg_rewind) recovery a mirror replays from before
a CREATE TABLESPACE whose location directory has since been removed
from disk (regression tests drop the tablespace and clean up the
directory).  create_tablespace_directories() then FATALed the startup
process with "directory does not exist", leaving the mirror permanently
unrecoverable short of a full rebuild.

During recovery, create the missing location with pg_mkdir_p() and
press on -- the same philosophy TablespaceCreateDbspace() documents for
replaying into dropped tablespaces.  Normal execution still errors.

Also includes the directory_is_empty_ext() redo hardening in the drop
path from the previous commit series.

Verified: a mirror rewound to before CREATE TABLESPACE replays through
create/use/drop of tablespaces and returns to sync; "creating missing
directory ... during replay" appears in its log.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two problems in TranslateDXLDml on tables with dropped columns:

1. A plain (non-split) UPDATE padded the subplan target list with NULL
   placeholders for dropped columns and listed their attnos in
   updateColnosLists; PG14's ExecBuildUpdateProjection() rejects
   assignments to dropped columns ("Query provides a value for a
   dropped column").  Skip the padding for non-split updates and build
   updateColnosLists from the live columns' attribute numbers; the
   executor nulls dropped columns of the new tuple itself.

2. A Split Update (delete+insert, distribution key change) silently
   CORRUPTED rows: the insert half wrote misaligned values (a SET a=...
   update lost the other columns' values).  Until the PG14 insert path
   understands ORCA's padded rows, raise unsupported and fall back to
   the Postgres planner, which handles it correctly.

Verified under optimizer=on: AO and heap dropped-column UPDATEs return
correct results; a distribution-key UPDATE on a dropped-column table
falls back and preserves all column values.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A mis-merged brace in heap_create_with_catalog() attached the outer
"relkind has no rowtype" else-branch to the inner GPDB
"skip array type for AO relations" if-statement.  For every AO/AOCS
relation the composite type was created (pg_type row present, typrelid
correct) and then new_type_oid was reset to InvalidOid, so the pg_class
tuple was written with reltype = 0.

Fallout: RenameRelationInternal() skips RenameTypeInternal() when
reltype is invalid, so renaming an AO table left its rowtype under the
old name.  ALTER TABLE EXCHANGE PARTITION decomposes into a three-way
rename and collided with the stale type ("type <partition> already
exists"), breaking 15 regress tests (partition, partition1,
distributed_transactions, alter_table_ao*, alter_ao_part_*,
column_compression, oid_consistency, portals_updatable); anything
consulting an AO relation's rowtype (whole-row Vars, "relation does not
have a composite type") misbehaved too.

Move the else to the outer if, where upstream has it.  Newly created
clusters/tables get correct catalogs; existing AO tables keep reltype=0
until recreated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nodes (PG14)

PG14's CREATE FUNCTION/PROCEDURE ... BEGIN ATOMIC / RETURN stores the
body in CreateFunctionStmt.sql_body.  The field was copied and compared
(copyfuncs/equalfuncs) but never serialized, so the QD dispatched the
statement without a body and QEs failed with "no function body
specified" (create_procedure, create_function_3).

Dispatching the raw body surfaced further binary-serialization gaps:

- ReturnStmt and RawStmt had no readers at all; add _readReturnStmt /
  _readRawStmt and wire both node types into the outfast/readfast
  switches.
- _readSelectStmt did not read the PG14 groupDistinct bool that
  _outSelectStmt writes, desynchronizing the stream one byte
  ("could not deserialize unrecognized node type: <garbage>").
- ParamRef had a writer but no reader, breaking RETURN $1-style bodies.

Verified: BEGIN ATOMIC procedure inserts through dispatch, RETURN $1*2
function evaluates on segments, GROUP BY DISTINCT round-trips.
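
The groupDistinct desynchronization is easy to reproduce in miniature.
The sketch below uses Python's struct module with a made-up two-field
layout (a bool followed by an int); it is not the real outfast/readfast
wire format.

```python
import struct


def write_select(group_distinct, limit_count):
    # The writer emits a one-byte bool followed by a four-byte int.
    return struct.pack("<?i", group_distinct, limit_count)


def read_select_stale(buf):
    # Stale reader: unaware of the bool, so it reads the int one byte early.
    (limit_count,) = struct.unpack_from("<i", buf, 0)
    return limit_count


def read_select_fixed(buf):
    # Fixed reader: consumes exactly the fields the writer produced.
    group_distinct, limit_count = struct.unpack_from("<?i", buf, 0)
    return group_distinct, limit_count
```

With write_select(True, 7), the fixed reader returns (True, 7) while the
stale reader returns 1793, and everything after it in the stream would be
misread as well.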

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dimoffon and others added 30 commits June 20, 2026 22:05
…n struct)

The PG15 merge backend was built --disable-orca, so the ORCA translator
(src/backend/gpopt) had never been compiled against the PG15-merged node
structures. Enabling ORCA surfaces two PG15 node-API changes:

1. Value node removed (split into Integer/Float/String/Boolean/BitString;
   T_Null gone). gpdb::MakeStringValue/MakeIntegerValue now return Node*
   (makeString->String*, makeInteger->Integer*); the column-name list
   callers take Node*/String* (they feed LAppend or strVal()); the CTAS
   storage-option sentinel uses T_Invalid instead of T_Null (only stored,
   read solely under !is_null, so the value is immaterial).

2. SeqScan is now its own struct embedding Scan (was a typedef of Scan).
   seq_scan->scanrelid/plan become ->scan.scanrelid/->scan.plan; the
   GGDB DynamicSeqScan (embeds SeqScan) becomes ->seqscan.scan.{scanrelid,plan}.

Also fixed the stale optimizer/plan/objfiles.txt that omitted orca.o (built
only when enable_orca=yes) -> undefined reference to optimize_query at link;
regenerated. ORCA now builds, links, and produces GPORCA plans on PG15
(verified: a user-table group-by explains as "Pivotal Optimizer (GPORCA)").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With ORCA now building/running on PG15, the optimizer=on regress matrix
failed 68/202 (full parallel) — but ~25 were segment memory-pressure
cascades (trivial queries OOMing under parallel ORCA load); the true set is
43 (MAX_CONNECTIONS=4). Triage (success->error gate + comparing ORCA results
to the green planner base and the prior _optimizer.out):

- 36 are stale _optimizer.out predating the PG15 test-suite reorg (the base
  .out files were regenerated for PG15 but the _optimizer.out were not, e.g.
  create_index missing \getenv abs_srcdir / the inline slow_emp4000 block).
- The handful with extra ORCA errors are longstanding ORCA/planner divergences,
  not regressions or wrong results: "cannot display a value of type
  anycompatible" and "could not identify a hash function for type money" were
  already in the prior _optimizer.out; "UNIQUE and DISTRIBUTED RANDOMLY are
  incompatible" comes from ORCA defaulting a CTAS-without-DISTRIBUTED-BY to a
  NULL/random policy where the planner picks the first column (the test that
  hits it, with's `m` table, is new in PG15 and uses no DISTRIBUTED BY, so
  there is no PG14-ORCA baseline showing otherwise) — all clean errors, no
  crashes, no wrong data rows.

Regenerated and verified green except for 4 of 202 tests in a MAX_CONNECTIONS=6
re-run, all 4 flaky or pressure-related: portals = fork-memory pressure;
with/select_parallel/gist = row-order / EXPLAIN-ANALYZE per-segment actual-rows
flutter, the same class as the planner flaky tail.  Those 4 are left for separate
stabilization.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tion + cost-tie)

Running the optimizer=on matrix many times (for the ORCA bring-up) exposed a
flaky tail whose common root cause is that ORCA gives a table created without
an explicit DISTRIBUTED BY a NULL/random policy (the Postgres planner instead
picks the first column).  Randomly-distributed data has non-deterministic
per-segment placement, which destabilizes per-segment EXPLAIN ANALYZE counts,
unordered SELECT row order, and LIMIT-over-ties.  Fixed at the source so both
matrices are deterministic:

- test_setup: CREATE TABLE tenk2 ... DISTRIBUTED BY (unique1) (matches tenk1;
  was random under ORCA -> select_parallel tenk2 actual-rows flutter, and a
  redistribute motion that vanishes once co-located).
- with: ORDER BY 1 on the SELECT * FROM y / yy result-checks (y is DISTRIBUTED
  RANDOMLY; atmsort does not sort these GP_IGNORE blocks).
- gist: add a point-coordinate tiebreaker to "order by circle(p,1) <-> point(0,0)
  limit 1" — point(0,0) is inside many points' unit circles so the lossy
  distance ties at 0, and gist_tbl has no hash-distributable column.
- select_parallel: the merge-join's tenk1 input is a cost-tie between an Index
  Only Scan and a Seq Scan + Sort; pin enable_seqscan=off (scoped) for that one
  query so the plan is stable (same approach as the earlier planner flaky tail).

Verified stable across 5 ORCA runs (optimizer=on) and 2 planner runs
(optimizer=off); all 202 tests passed in each.  Safety-gated: no test turned a success
into an error (gist's lossy-distance error becomes a deterministic result).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…drift

These three carry EXPLAIN ANALYZE plans whose volatile annotations differ from
the committed answer files on the rebuilt (--enable-orca) cluster: the GGDB
per-operator "Executor Memory: NkB" line appears only when a Sort's branch is
actually executed, and runtime partition pruning shows "(never executed)" vs
"(actual rows=N)" depending on which partitions are probed.  No data rows or
error behavior change (success->error gate clean); the EXPLAIN VERBOSE
"Settings:" line that now lists optimizer='off' is already ignored by init_file
(m/^ Settings:.*/).

Regenerated against the clean post-fix gpdemo and verified stable across the
planner runs (P/Q/M/N) and ORCA runs (H/I/J/S/T); all 202 tests passed.  Note: these
runtime-pruning/memory annotations are sensitive to the cluster instance; if
they flutter in CI they should be hardened with GP_IGNORE rather than re-pinned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Our PG15 base (15beta2, merge target adadae4) carries a recovery WAL
prefetcher bug: lrq_complete_lsn()'s readahead can advance the XLogReader past
the record just returned by XLogNextRecord(), tripping
Assert(record == prefetcher->reader->record) at xlogprefetcher.c:1061 and
aborting the startup process (signal 6) during WAL replay.  In an MPP cluster
this crashes mirror segments as they replay WAL under load, degrading the
cluster (cascading "could not connect to the primary"/FTS-down).  It is
cassert-only but reliably reproducible during the optimizer=on regress matrix.

xlogprefetcher.c is otherwise upstream (GPDB only changed smgropen to the
3-arg SMGR_MD form), so this is an upstream beta2 bug fixed later on
REL_15_STABLE.  Until that fix is backported, default this recovery-only
prefetch optimization to off; re-enable to RECOVERY_PREFETCH_TRY afterwards.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… to PG15

Two fixes to bring up isolation2 on the PG15 merge:

1. pg_regress.c: the PG15 merge dropped the convert_sourcefiles() call from
   initialize_environment() (PG14 had it right before load_resultmap()).  That
   function generates sql/*.sql and expected/*.out from the input/ and output/
   *.source templates at run start; without it, any .source-based test fails
   with "cannot open .../sql/<name>.sql: No such file".  The main regress only
   survived because its generated files were left over from the initial build;
   isolation2 (run for the first time here) had none, so it aborted at the
   first .source test (autovacuum-analyze).  Restore the call.

2. workfile_mgr_test.c (isolation2 harness): port logicaltape_test() to the
   PG15 LogicalTape-as-object API — LogicalTapeSetCreate is now 3-arg, tapes
   are created with LogicalTapeCreate(set), and Tell/Write/Freeze/Seek/Read
   take a LogicalTape* instead of (set, tapenum).  The file had never been
   compiled before because isolation2 was not built or exercised during the
   --disable-orca era.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…5 merge)

The PG15 merge adopted the callers' half of an upstream change but not the
matching relaxation in recordDependencyOnCurrentExtension():

- PG15 made makeOperatorDependencies() and GenerateTypeDependencies() pass
  isReplace=true unconditionally when recording extension membership (the
  merge took these correctly in pg_operator.c / pg_type.c).
- PG15 *also* removed the "free-standing object, so reject" branch from
  recordDependencyOnCurrentExtension(): with isReplace=true, an object that
  is not yet anyone's member is now absorbed into the current extension
  rather than rejected.  The merge kept GPDB's PG14 version, which still
  rejects.

The broken combination breaks CREATE EXTENSION for any extension that
creates a shell operator (a CREATE OPERATOR referencing a not-yet-existing
COMMUTATOR/NEGATOR makes a shell, recorded with isReplace=true) or replaces
a shell type: e.g. citext fails with "(null) is not a member of extension
\"citext\" / An extension is not allowed to replace an object that it does
not own" (the (null) is a shell operator, whose getObjectDescription is
empty).  The function's own header comment already documents the PG15
behavior ("the object will be absorbed into the extension ... desirable for
cases such as replacing a shell type") — the code just didn't match it.

Adopt upstream PG15's recordDependencyOnCurrentExtension.  Verified:
CREATE EXTENSION citext succeeds and contrib/citext installcheck passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PG15 forbids RequestAddinShmemSpace() outside the new shmem_request_hook;
orafce called it directly in _PG_init(), so CREATE EXTENSION orafce (or first
use) crashed every backend with FATAL "cannot request additional shared
memory outside shmem_request_hook" (ipci.c), cascading the whole orafce test
suite (12 of 13 failed).

Move the request into an orafce_shmem_request() assigned to shmem_request_hook
(chaining any previous hook), guarded by PG_VERSION_NUM >= 150000 so the
pre-15 path is unchanged.  Verified: the installed orafce.so now references
shmem_request_hook and the orafce test runs without the shmem crash.

(Separate orafce follow-ups remain: add_months type-resolution answer-file
drift, and the nlssort test.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The PG15 merge moved ReadCheckpointRecord() from xlog.c into the new
xlogrecovery.c (the xlog→xlogrecovery split) but dropped GGDB's call to
XLogProcessCheckpointRecord() and the helper's definition along the way;
only the orphaned forward declaration survived in xlog.c.

GGDB writes an *extended* checkpoint record that appends the in-doubt
distributed-committed transactions (getDtxCheckPointInfo) after the
CheckPoint struct.  Recovery must extract that DTX payload from the
checkpoint it starts from and feed it to redoDtxCheckPoint() so the
second phase of 2PC can complete.  Without it, a distributed transaction
that has committed on the coordinator is treated as an orphaned prepared
transaction and aborted on the segments after crash recovery — the table
then exists only on the coordinator ("could not open relation with OID
N" when a later query is dispatched to the segments).

Re-graft the helper into xlogrecovery.c next to ReadCheckpointRecord and
call it from the report==true path (the starting-checkpoint read), matching
the pre-merge GGDB behavior; remove the now-dead declaration from xlog.c.
The control file is updated to point at the new checkpoint before the
keep_log_seg panic fires, so crash recovery starts from the checkpoint
carrying the DTX info and this single call site is sufficient.

Fixes isolation2 test checkpoint_dtx_info (issue #12977 coverage).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… drift

The PG15 test version sets the imported snapshot via
  psql -Atc "BEGIN TRANSACTION ...; SET TRANSACTION SNAPSHOT '...';"
Since the PG14 psql change, "psql -c" with multiple semicolon-separated
commands prints each command's status tag, so the output is now
"BEGIN\nSET" rather than just "SET".  Deterministic, functionally correct
(the snapshot is still imported); add the BEGIN tag line to the expected
output.  The run-varying snapshot token is masked by the existing
substitution rule and left as-is.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…merge)

The PG15 merge adopted upstream's cluster_rel() which asserts the target
relkind is one of RELATION / MATVIEW / TOASTVALUE.  GPDB's pre-merge
cluster_rel() had no such assert.

VACUUM FULL on an append-optimized table recurses into its heap-backed
auxiliary relations — aoseg, block directory and visimap (vacuum.c
vacuum_rel) — with VACOPT_FULL still set.  Each aux relation has
is_appendoptimized == false (they are plain heaps), so it takes the
"VACUUM FULL is a variant of CLUSTER" path and is rewritten via
cluster_rel(), exactly like the TOAST table.  But those relations carry
the GPDB-specific relkinds RELKIND_AOSEGMENTS / RELKIND_AOBLOCKDIR /
RELKIND_AOVISIMAP, which the upstream assert rejects → FailedAssertion at
cluster.c:514, crashing the segment ("server closed the connection
unexpectedly" on "VACUUM FULL <ao table>", triggering cluster recovery).

The rewrite itself is correct (make_new_heap uses the relation's heap
access method and the swap preserves the original relkind); only the
assert was too strict.  Extend it to accept the three AO aux relkinds.

Surfaced by isolation2 uao/compaction_full_stats_{column,row}; assert-only
crash, so it bites debug/cassert builds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… merge)

TRUNCATE on an append-optimized table assigns a fresh relfilenode to the
table and its auxiliary relations via RelationSetNewRelfilenode().  The
PG15 merge replaced GPDB's explicit relkind switch there with upstream's
form, which gates the table-AM path on RELKIND_HAS_TABLE_AM():

    if (RELKIND_HAS_TABLE_AM(relkind))
        table_relation_set_new_filenode(..., &freezeXid, ...);
    else if (RELKIND_HAS_STORAGE(relkind))
        RelationCreateStorage(...);          /* leaves freezeXid invalid */

The upstream RELKIND_HAS_TABLE_AM() macro lists only RELATION / TOASTVALUE
/ MATVIEW, not GPDB's AO auxiliary relkinds (AOSEGMENTS / AOBLOCKDIR /
AOVISIMAP).  Those aux relations are heaps with a heap table AM, so they
fell into the storage-only branch and TRUNCATE stored relfrozenxid = 0
(InvalidTransactionId) for them.  A later plain VACUUM of the AO table
recurses into the aux relations through heap_vacuum_rel(), whose early
wraparound-failsafe precheck asserts TransactionIdIsNormal(relfrozenxid)
(vacuumlazy.c) and crashes the segment.

GPDB's pattern elsewhere (RelationBuildLocalRelation, the relcache reload
paths, plancat.c) is to keep the upstream macro and add an explicit
AO-aux relkind check alongside it; RelationSetNewRelfilenode() was the one
site that missed it.  Add the same check so the aux relations get a valid
relfrozenxid on TRUNCATE.
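
The branch selection can be modelled in a few lines of Python.  The enum,
the sets and new_filenode_branch are illustrative stand-ins for the C
macros and relkind constants, and the storage set is a simplified
assumption.

```python
from enum import Enum, auto


class RelKind(Enum):
    RELATION = auto()
    TOASTVALUE = auto()
    MATVIEW = auto()
    INDEX = auto()
    VIEW = auto()
    AOSEGMENTS = auto()
    AOBLOCKDIR = auto()
    AOVISIMAP = auto()


# What the upstream table-AM macro covers, per the description above.
UPSTREAM_TABLE_AM = {RelKind.RELATION, RelKind.TOASTVALUE, RelKind.MATVIEW}
# GPDB's heap-backed AO auxiliary relkinds.
AO_AUX = {RelKind.AOSEGMENTS, RelKind.AOBLOCKDIR, RelKind.AOVISIMAP}
# Simplified model of "has storage" for this sketch.
HAS_STORAGE = UPSTREAM_TABLE_AM | AO_AUX | {RelKind.INDEX}


def new_filenode_branch(relkind, check_ao_aux):
    """Return which branch assigns the new relfilenode."""
    if relkind in UPSTREAM_TABLE_AM or (check_ao_aux and relkind in AO_AUX):
        return "table_am"      # AM callback also sets a valid freeze xid
    if relkind in HAS_STORAGE:
        return "storage_only"  # freeze xid is left invalid
    return "unsupported"
```

Without the extra check an AO visimap lands in "storage_only"; with it,
the relation takes the "table_am" branch like any other heap.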

Surfaced by isolation2 uao/vacuum_cleanup_{column,row} (TRUNCATE followed
by VACUUM with a concurrent writer); assert-only crash.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PG15 revoked the default CREATE privilege on the public schema from
PUBLIC, and GreengageDB adopts that (initdb grants only USAGE on public
to PUBLIC).  The core regression suite compensates by running
"GRANT ALL ON SCHEMA public TO public;" in src/test/regress/sql/
test_setup.sql, but the isolation2 suite's setup.sql had no equivalent.

As a result, isolation2 tests that create objects in the public schema
as a non-superuser role fail with "permission denied for schema public".
In resource_queue_deadlock this is especially nasty: session 1 creates
"t_deadlock_test" as the non-superuser role_deadlock_test, the CREATE
TABLE fails, so the subsequent INSERT (and its auto-stats ANALYZE) never
runs, the before_auto_stats fault never triggers, and session 0's
gp_wait_until_triggered_fault() hangs forever (the whole isolation2 run
stalls).

Mirror test_setup.sql: GRANT ALL ON SCHEMA public TO public in the
isolation2 setup (which isolation2_main.c always runs as a prerequisite),
and regenerate setup.out.  This fixes resource_queue_deadlock and the
other isolation2 specs that create objects as non-superuser roles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pg_basebackup() PL/Python helper in setup.sql formats a failed
command's output in its error handler as:

    results = str(e) + "\ncommand output: " + e.output

Under Python 3, subprocess.CalledProcessError.output is bytes, so the
concatenation raises "TypeError: can only concatenate str (not 'bytes')
to str".  The error handler thus crashes instead of reporting the real
pg_basebackup failure, masking it in every test that uses the helper
(e.g. segwalrep/master_wal_switch, pg_basebackup_large_database_oid).

Decode e.output before concatenating (matching the success path, which
already .decode()s), and guard against None.  Regenerate setup.out.
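
A standalone sketch of the decode-and-guard step, assuming UTF-8 output;
describe_failure is a hypothetical name and this is not the helper body
from setup.sql.

```python
import subprocess


def describe_failure(exc):
    """Format a CalledProcessError whose output may be bytes, str or None."""
    output = exc.output
    if output is None:
        text = ""
    elif isinstance(output, bytes):
        # errors="replace" keeps the report usable even for odd bytes.
        text = output.decode("utf-8", errors="replace")
    else:
        text = output
    return str(exc) + "\ncommand output: " + text
```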

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…G15 merge)

pg_basebackup's GPDB --target-gp-dbid option must write the new segment's
dbid into internal.auto.conf in the freshly-created data directory: the
backup stream carries the *source* segment's internal.auto.conf (its
gp_dbid), so a mirror/standby created from a primary/coordinator would
otherwise inherit the source's gp_dbid.

The PG15 merge rewrote pg_basebackup's receive path around bbstreamer and
dropped the call to WriteInternalConfFile(); the function survived but was
never invoked.  As a result every pg_basebackup-created segment kept the
source's gp_dbid (e.g. content-0 mirror got gp_dbid=2 instead of 5, the
standby got 1 instead of 8).

This is dormant in normal operation — FTS probes the primary, not the
mirror directly — but breaks the moment a mirror is promoted (or the
standby activated): FTS probes the new primary with its catalog dbid, the
segment compares against its (wrong) configured dbid, and rejects the
probe with "PROBE received dbid:N doesn't match this segments configured
dbid" (ftsmessagehandler.c).  The segment never finishes promotion and
stays in recovery, hanging every mirror-promotion / failover / pg_rewind
based test (segwalrep/mirror_promotion, recoverseg_from_file,
twophase_tolerance_with_mirror_promotion, failover_with_many_records,
prepared_xact_deadlock_pg_rewind, ...) and real HA failover.

Re-graft the WriteInternalConfFile() call after BaseBackup() completes,
for plain-format backups (tar mode adds the file manually, per the
existing note).  Verified: a recreated demo cluster now gives mirrors
gp_dbid 5/6/7 and the standby 8.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… merge)

GPDB's pg_basebackup --force-overwrite extracts a backup into an existing
data directory (e.g. gprecoverseg full recovery in place).  Before PG15
the tar-receive code tolerated pre-existing directories when
forceoverwrite was set.  The PG15 merge moved extraction into the new
bbstreamer extractor (bbstreamer_file.c), whose extract_directory() only
ignores EEXIST for pg_wal / pg_xlog / archive_status — every other
existing directory makes it pg_fatal("could not create directory ...:
File exists").

As a result gprecoverseg full recovery fails as soon as the target data
directory still contains e.g. pg_serial, which blocks recovering a downed
segment after a mirror promotion (segwalrep/mirror_promotion,
recoverseg_from_file, ...).

Thread forceoverwrite into the bbstreamer extractor and tolerate EEXIST
for any directory when it is set (files are already truncated via
fopen "wb"), restoring the pre-merge behavior.
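
The relaxed directory handling can be sketched in Python.  This is an
analogy to the C extractor, not a port: the basename comparison is a
simplification of however the extractor matches the always-tolerated
directories, and force_overwrite stands in for the threaded-through flag.

```python
import os

# Directories tolerated even without force-overwrite, per the text above.
ALWAYS_TOLERATED = ("pg_wal", "pg_xlog", "archive_status")


def extract_directory(path, force_overwrite=False):
    """Create path, tolerating an existing directory where allowed."""
    try:
        os.mkdir(path)
    except FileExistsError:
        name = os.path.basename(os.path.normpath(path))
        tolerated = force_overwrite or name in ALWAYS_TOLERATED
        # Only an existing directory is acceptable, never a plain file.
        if not tolerated or not os.path.isdir(path):
            raise
```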

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ittest (PG15)

PG15 requires RequestNamedLWLockTranche() to run only during the
shmem-request phase, so it was moved out of SharedSnapshotShmemSize() and
CreateSharedSnapshotArray() into the shmem_request hook
(CreateSharedMemoryAndSemaphores).  sharedsnapshot_test.c still set up
expect_string/expect_value/will_be_called for RequestNamedLWLockTranche
around both calls, so cmockery failed with "Remaining item(s) declared at
sharedsnapshot_test.c:29 / :41" (the expectations were never consumed),
breaking `make unittest-check` in CI.

Remove the two stale RequestNamedLWLockTranche expectation blocks; the
test's actual purpose (shared snapshot array slot/lock/xip boundary
checks) is unchanged and now passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_GetNewTransactionId_xid_warn_limit exercises the warn-limit path,
which (unlike the stop-limit path that ereport(ERROR)s first) continues
into the XID-assignment code.  There GetNewTransactionId() indexes
ProcGlobal->xids[MyProc->pgxactoff] and
ProcGlobal->subxidStates[MyProc->pgxactoff].  The test left its stack
PGPROC uninitialized, so pgxactoff was garbage; it happened to be 0 (a
valid index into the size-1 arrays) before, but the PG15 PGPROC layout
change turned it into an out-of-bounds index, segfaulting the test and
breaking `make unittest-check` in CI.

Zero-initialize the stack PGPROC and PROC_HDR so pgxactoff is 0 and
MyProc->subxidStatus is empty (satisfying the asserts in that path).  The
test logic is otherwise unchanged and now passes 5/5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…pile fix)

PG15 moved the WAL record's main data out of XLogReaderState into the
decoded record (XLogReaderState->record, a DecodedXLogRecord); the main
data is now reached via XLogRecGetData()/XLogRecGetDataLen() which
dereference record->record->main_data[_len].  cdbappendonlyxlog_test.c
still assigned mockrecord->main_data directly, so it failed to compile
("'XLogReaderState' has no member named 'main_data'"), breaking
`make unittest-check`.

Point the mock reader at a stack DecodedXLogRecord (header + main_data +
main_data_len, max_block_id = -1) so ao_insert_replay/ao_truncate_replay
read the data through the PG15 accessors.  Tests pass 2/2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… (PG15)

Back-to-back segwalrep failover/recovery tests race a just-promoted or
just-recovered segment that is still transiently unavailable.  Two
distinct race classes were diagnosed:

1. The direct "NU: select 1" promotion-waits connect to the freshly
   promoted mirror before it finishes recovery and fail with
   "FATAL: the database system is not accepting connections".  This raw
   connection rejection is NOT covered by gp_gang_creation_retry.  Add a
   plpython helper wait_until_segment_accepts_connections(content_id) that
   polls pg_isready against the content's current primary (nudging FTS)
   until it is ready, and call it before the 1U/0U promotion-waits in
   recoverseg_from_file.

2. gprecoverseg's gang creation in mirror_promotion fails with
   "Segments are in reset/recovery mode" because a segment is still
   recovering.  mirror_promotion was missing the gp_gang_creation_retry
   bump that twophase_tolerance_with_mirror_promotion and
   failover_with_many_records already use; add it (120 x 1000ms ~= 120s)
   via gpconfig + gpstop -u, reset at the end.

The default gp_gang_creation_retry is only 5 x 2s = 10s, too short for an
in-order run.  Note: keep plpython helper bodies comment-free and free of
any trailing ';' -- the isolation2 harness splits commands on ';' at end
of line, which corrupts the function definition.
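
The polling idea behind the helper can be sketched as standalone Python.
This is not the plpython function body from the test suite (which, as
noted, must stay comment-free): wait_until_ready and pg_isready_probe are
hypothetical names, and the host and port are passed in directly instead
of being looked up for the content's current primary.

```python
import subprocess
import time


def pg_isready_probe(host, port):
    """True when pg_isready exits 0, i.e. the server accepts connections."""
    result = subprocess.run(
        ["pg_isready", "-h", host, "-p", str(port)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_ready(probe, attempts=120, delay_s=1.0, sleep=time.sleep):
    """Call probe() until it succeeds; return the attempt number used."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay_s)
    raise TimeoutError("not ready after %d attempts" % attempts)
```

The defaults mirror the 120 x 1000ms budget quoted above; a caller would
pass something like lambda: pg_isready_probe(host, port) as the probe.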

mirror_promotion's second, fault-injection scenario can still flake in
back-to-back in-order runs: the whole cluster transiently enters
reset/recovery, where even coordinator-only helper queries fail with errors
that cannot be caught.  That residual is environmental and tracked separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Upstream PG15 commit cc50080 ("Rearrange core regression tests to
reduce cross-script dependencies") moved the shared C helper functions,
including binary_coercible(), into test_setup.sql, which runs before
create_function_0.  The PG15 merge kept test_setup.sql's definition but
did not remove create_function_0's now-duplicate one, so create_function_0
failed with:

    ERROR:  function "binary_coercible" already exists with same argument types

That failure (and the resulting missing-object cascade into downstream
tests) shows up in any regression run, and dominated the JIT
(jit=on jit_above_cost=0) installcheck matrix.  Remove the duplicate from
both the input and output .source templates; test_setup.sql's definition
(which runs first) serves every later test.  Verified: test_setup,
create_function_0, create_function_c and opr_sanity (a binary_coercible
consumer) all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_conversion, greenplum test_setup)

Two more PG15-merge dedup misses that broke `make installcheck`
(installcheck-good runs parallel_schedule then greenplum_schedule in one
database), surfaced while triaging the JIT installcheck matrix:

* test_enc_conversion() is created by conversion.sql in PG15 (upstream
  commit cc50080), but create_function_0.source still defined it too,
  so conversion failed with "function already exists".  Remove it from
  create_function_0 (only conversion uses it, and it creates its own).

* test_setup is the first test of parallel_schedule (PG15 upstream); the
  merge also added it to greenplum_schedule.  Since installcheck-good runs
  parallel_schedule first in the same database, the greenplum copy's CREATE
  TABLEs failed with "already exists" and its INSERTs double-loaded the shared
  read-only tables, cascading into the greenplum tests.  Drop it from
  greenplum_schedule; parallel_schedule's run serves both.

With these plus the earlier binary_coercible dedup, the core
parallel_schedule passes under JIT (jit=on jit_above_cost=0) except misc;
the remaining greenplum_schedule failures are pre-existing PG15 answer
drift unrelated to JIT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PG15 adopted upstream commit 6867f96, which hardened
pg_get_expr_worker() by walking the input node with pull_varnos() to
reject expressions containing Vars. The gp_partition_template.template
catalog column stores a serialized GpPartitionDefinition node tree (a
GPDB-specific node with no Vars); expression_tree_walker() has no case
for the GPDB partition nodes, so pull_varnos() errored with
"unrecognized node type: 740". The deparse path already handles
T_GpPartitionDefinition (get_rule_expr), so skip the Var-safety check
for it, restoring pre-PG15 behavior.
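
The shape of the guard, sketched with stand-in node tags (the real tags come
from nodes/nodes.h; needs_var_safety_check is a hypothetical name and the
actual change may be written inline in pg_get_expr_worker()):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in node tags; the values are illustrative only. */
typedef enum NodeTag
{
    T_Var,
    T_Const,
    T_OpExpr,
    T_GpPartitionDefinition
} NodeTag;

typedef struct Node
{
    NodeTag     type;
} Node;

/* Hypothetical helper: should the caller walk this tree with pull_varnos()?
 * A GpPartitionDefinition tree holds no Vars and the generic walker has no
 * case for it, so walking it would only raise "unrecognized node type". */
static bool
needs_var_safety_check(const Node *expr)
{
    if (expr == NULL)
        return false;           /* nothing to walk */
    if (expr->type == T_GpPartitionDefinition)
        return false;           /* handled directly by the deparse path */
    return true;                /* ordinary expressions keep the PG15 check */
}
```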

Fixes regress tests AOCO_Compression, bfv_partition, column_compression,
gp_partition_template, partition (opt=on and opt=off).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PG15 adopted upstream's ATWrongRelkindError path in ATSimplePermissions,
which calls alter_table_type_to_string(cmdtype) and, when it returns
NULL, falls through to the internal-error elog "invalid ALTER action
attempted on relation". The GPDB-specific AT_ExpandTable /
AT_ExpandPartitionTablePrepare values had no case, so EXPAND TABLE on a
wrong relkind (e.g. a view) raised that internal error instead of a
clean message. Add the cases and regenerate expand_table.out.
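
A sketch of the lookup and its caller, assuming a cut-down enum (the display
strings for the two GPDB values and the AT_InternalOnly member are
illustrative assumptions, not copied from the tree):

```c
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-in for AlterTableType. */
typedef enum AlterTableType
{
    AT_AddColumn,
    AT_ExpandTable,
    AT_ExpandPartitionTablePrepare,
    AT_InternalOnly             /* hypothetical subcommand with no grammar form */
} AlterTableType;

/* Returns a display string, or NULL for subcommands with no user-facing
 * name.  Before the fix the two GPDB values fell into the NULL case. */
static const char *
alter_table_type_to_string_sketch(AlterTableType cmdtype)
{
    switch (cmdtype)
    {
        case AT_AddColumn:
            return "ADD COLUMN";
        case AT_ExpandTable:
            return "EXPAND TABLE";              /* assumed wording */
        case AT_ExpandPartitionTablePrepare:
            return "EXPAND PARTITION PREPARE";  /* assumed wording */
        case AT_InternalOnly:
            return NULL;
    }
    return NULL;
}

/* Mirrors the caller's decision: a non-NULL name yields the clean
 * wrong-relkind message, NULL falls through to the internal-error elog. */
static bool
wrong_relkind_reports_cleanly(AlterTableType cmdtype)
{
    return alter_table_type_to_string_sketch(cmdtype) != NULL;
}
```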

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The PG15 merge left two complete SELECT...FROM pg_class blocks in
getTables(), both appending to the same query buffer, producing
malformed SQL. The first block was upstream PG15's rewritten minimal
query (no WHERE/ORDER/execute); the second is the GPDB query that the
result parsing actually depends on (it reads distclause, parrelid,
parlevel, relstorage, partclause, parttemplate via PQfnumber) and which
carries the full FROM/JOINs/WHERE/ORDER and ExecuteSqlQuery. Drop the
leftover upstream block, keeping the GPDB query.

Fixes regress test pg_dump_binary_upgrade (opt=on and opt=off).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The PG15 createdb rewrite (WAL_LOG/FILE_COPY strategies) dropped two
GPDB pieces from the new strategy functions:
  - ScheduleDbDirDelete(): registers the destination DB directory on the
    PendingDBDelete list so a failed/aborted CREATE DATABASE removes it
    (GPDB removes the dir via pending-deletes; createdb_failure_callback's
    upstream remove_dbtablespaces is #if 0'd out for GPDB).
  - the create_db_after_file_copy / after_xlog_create_database fault
    injection points used by the createdb regress test.
Re-graft both into CreateDatabaseUsingFileCopy (and ScheduleDbDirDelete
into CreateDatabaseUsingWalLog), matching adb-6.x.

PG15 defaults to the wal_log strategy, but these faults are file-copy-path
specific, so the createdb test requests STRATEGY file_copy for every fault
case (db_with_leftover_files, db2, db3, db4).  This also avoids a buffer
leak: the wal_log strategy copies relations through the shared buffer
cache, and createdb_failure_callback only drops those buffers when an error
is caught inside its PG_ENSURE_ERROR_CLEANUP block.  db4's CASE 4 aborts via
an end_prepare_two_phase panic during 2PC commit -- after the block has
ended -- so the callback never runs; with wal_log, ScheduleDbDirDelete would
then unlink the directory while its dirty buffers remained, orphaning them
('could not write block ... No such file or directory').  file_copy uses
copydir (no buffer-cache load), so there are no buffers to orphan.

Fixes regress test createdb (opt=on and opt=off).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A backend that spilled to a shared work file (FileSet) crashed on exit:
pgstat_shutdown_hook() runs as a before_shmem_exit callback and tears down
the cumulative-stats state, then dsm_backend_shutdown() (an on_shmem_exit
callback, so it runs later) detaches the DSM segment, whose cleanup deletes
the FileSet's temporary files:

  pgstat_report_tempfile        <- asserts pgstat is up (pgstat.c:1227)
  ReportTemporaryFileUsage
  PathNameDeleteTemporaryFile
  FileSetDeleteAll
  dsm_detach
  dsm_backend_shutdown

Reporting temp-file usage after the stats subsystem is shut down trips
pgstat_assert_is_up() under assertions, and touches detached stats shared
memory in any build.  Because only backends that spilled hit this, and more
queries spill under load, it crashed segments intermittently during
installcheck-good and cascaded ('terminating connection because of crash of
another server process') into dozens of unrelated tests.

Skip the temp-file stats report once proc_exit is in progress; per-file
accounting is moot for a backend that is leaving, and query-time temp-file
deletions still report normally (proc_exit_inprogress is false then).
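
A sketch of the guard with stand-in globals (proc_exit_inprogress here is a
local stand-in for the real flag, reported_bytes stands in for the stats
counter, and where exactly the real check sits is an assumption):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins: the real flag is set by proc_exit(); the real counter lives
 * in the cumulative-stats subsystem. */
static bool   proc_exit_inprogress = false;
static size_t reported_bytes = 0;

/* Report temp-file usage unless the backend is already exiting, in which
 * case the stats subsystem may have been shut down. */
static void
report_temp_file_usage_sketch(size_t filesize)
{
    if (proc_exit_inprogress)
        return;                 /* stats already torn down; skip the report */
    reported_bytes += filesize;
}
```

Deletions during normal query execution still reach the counter; only the
ones triggered from the exit callbacks are skipped.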

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
COPY ... IGNORE EXTERNAL PARTITIONS over a partitioned table that has an
external (foreign) partition crashed the QD planner with
FailedAssertion("child_rel != NULL", planner.c:9046).

expand_partitioned_rtentry() walks live_parts and, for the GPDB
"skip foreign partitions" hack, does 'continue' for a foreign partition
without building its part_rels[] entry -- but it left that partition's index
in live_parts.  PG15's apply_scanjoin_target_to_paths() now iterates
live_parts and asserts a non-NULL part_rels[] entry for every live member
(PG14 tolerated NULL slots), so the skipped external partition tripped the
assert (and would dereference a NULL RelOptInfo in a non-assert build).

Remove the skipped partition from live_parts too, keeping live_parts and
part_rels[] consistent for all downstream consumers.
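
The invariant can be sketched with a 64-bit mask standing in for the
live_parts Bitmapset (all type and function names below are hypothetical
stand-ins, and the sketch assumes at most 64 partitions):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARTS 64

typedef struct RelStub
{
    int         partindex;
} RelStub;

typedef struct PartitionedRel
{
    uint64_t    live_parts;             /* bit i set: partition i is live */
    RelStub    *part_rels[MAX_PARTS];   /* NULL until the child is built */
    int         nparts;
} PartitionedRel;

/* Build child entries for live partitions.  A skipped foreign partition is
 * also cleared from live_parts, so no live bit is left without an entry. */
static void
expand_partitions(PartitionedRel *rel, RelStub *storage,
                  const bool *is_foreign, bool skip_foreign)
{
    for (int i = 0; i < rel->nparts; i++)
    {
        uint64_t    bit = UINT64_C(1) << i;

        if (!(rel->live_parts & bit))
            continue;
        if (skip_foreign && is_foreign[i])
        {
            rel->live_parts &= ~bit;    /* the fix: drop it from the live set */
            continue;
        }
        storage[i].partindex = i;
        rel->part_rels[i] = &storage[i];
    }
}

/* What the PG15 consumer asserts: every live member has a part_rels entry. */
static bool
live_parts_consistent(const PartitionedRel *rel)
{
    for (int i = 0; i < rel->nparts; i++)
    {
        if ((rel->live_parts & (UINT64_C(1) << i)) && rel->part_rels[i] == NULL)
            return false;
    }
    return true;
}
```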

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A QD backend could SIGSEGV in mppExecutorCleanup():

  mppExecutorCleanup        estate->dispatcherState  (execUtils.c:1727)
  standard_ExecutorStart    PG_CATCH (execMain.c:335)
  PortalStart / exec_simple_query

standard_ExecutorStart() runs the resource-manager operator-memory
assignment (PolicyAutoAssignOperatorMemoryKB / PolicyEagerFree...) inside a
PG_TRY, and on error calls mppExecutorCleanup() from the PG_CATCH.  That
assignment happens before queryDesc->estate is created, so mppExecutorCleanup
dereferenced a NULL estate and crashed the backend -- which on the QD takes
down the whole coordinator (crash recovery) and cascades into concurrently
running tests.  It was hit intermittently under load by a complex query whose
operator-memory needs exceeded the (contention-reduced) query_mem, e.g. the
psql \d publications query (a 3-way UNION) -- 'insufficient memory reserved
for statement' was thrown during executor start.

Return early from mppExecutorCleanup() when queryDesc->estate is NULL: the
executor state was never built, so there is nothing to tear down, and the
original error then propagates as a normal ERROR.
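
The early return, sketched with stand-in types (the real structs have many
more fields, and the boolean result exists only to make the sketch
observable; it is not part of the real function):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the executor types. */
typedef struct EState
{
    void       *dispatcherState;
} EState;

typedef struct QueryDesc
{
    EState     *estate;         /* NULL until the executor state is created */
} QueryDesc;

/* Returns true if there was state to tear down, false if the executor
 * state was never built. */
static bool
mpp_executor_cleanup_sketch(QueryDesc *queryDesc)
{
    EState     *estate = queryDesc->estate;

    if (estate == NULL)
        return false;           /* error hit before estate creation */

    estate->dispatcherState = NULL;     /* stand-in for the real teardown */
    return true;
}
```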

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GPDB extends SQL window frames to allow non-constant (column-valued) start/end
offsets, e.g. ROWS BETWEEN <expr> PRECEDING where <expr> references the current
row.  compute_start_end_offsets() only (re)computes an offset when its
start_offset_valid / end_offset_valid flag is false, and those flags mean
"valid for the current row".  The PG15 merge kept upstream's advance-current-row
and begin_partition logic (which resets framehead_valid/frametail_valid but not
the GPDB offset-valid flags, since upstream offsets are constant), so the flags
were only ever set true and never reset.  The frame offset was therefore frozen
at the first row's value, producing wrong window-aggregate results whenever the
offset varied across rows (e.g. SUM over ROWS BETWEEN off PRECEDING used the
frame computed from the first row's off value for every row).

Re-graft the per-row resets (matching adb-6.x): in begin_partition() and when
advancing the current row -- unconditionally for RANGE framing, and for
ROWS/GROUPS only when the offset is not var-free (non-constant).
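
A sketch of the flag handling for the start offset only, with a stand-in
state struct (field and function names are simplified assumptions; the real
code evaluates the offset expression instead of taking it as an argument):

```c
#include <stdbool.h>

/* Stand-in for the relevant window-aggregate state. */
typedef struct WindowStateStub
{
    bool        start_offset_valid;     /* valid for the current row? */
    bool        start_offset_var_free;  /* true when the offset is constant */
    bool        is_range_frame;         /* RANGE framing vs ROWS/GROUPS */
    long        start_offset_value;
} WindowStateStub;

/* Recompute the offset only when it is not already valid for this row. */
static void
compute_start_offset(WindowStateStub *ws, long offset_for_current_row)
{
    if (!ws->start_offset_valid)
    {
        ws->start_offset_value = offset_for_current_row;
        ws->start_offset_valid = true;
    }
}

/* The re-grafted reset: invalidate on every row advance for RANGE framing,
 * and for ROWS/GROUPS only when the offset can change between rows. */
static void
advance_current_row(WindowStateStub *ws)
{
    if (ws->is_range_frame || !ws->start_offset_var_free)
        ws->start_offset_valid = false;
}
```

Without the reset in advance_current_row(), the second call to
compute_start_offset() is a no-op and the first row's offset sticks.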

Fixes regress test qp_olap_window (and related non-constant-frame cases).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>