Skip to content

pool: add firefly onStart marker for MoverProtocol-based transfers#8055

Merged
kofemann merged 1 commit intodCache:masterfrom
ShawnMcKee:fix/firefly-tpc-onstart-marker
Apr 9, 2026
Merged

pool: add firefly onStart marker for MoverProtocol-based transfers#8055
kofemann merged 1 commit intodCache:masterfrom
ShawnMcKee:fix/firefly-tpc-onstart-marker

Conversation

@ShawnMcKee
Copy link
Copy Markdown
Contributor

Problem

AbstractMoverProtocolTransferService, the base class for MoverProtocol-based transfer services (used by RemoteHttpTransferService for FTS third-party copy movers), never calls transferLifeCycle.onStart(). This means firefly flow-start UDP markers are never emitted for TPC movers, even though:

  • DefaultPostTransferService correctly sends onEnd() markers (using the deriveLocalEndpoint() fallback added in 11.2.3)
  • NettyTransferService (direct client connections via xrootd/HTTP) correctly calls onStart() in its executeMover() method

Impact

ESnet Stardust dashboards show incomplete flow information for TPC transfers. Specifically:

  • TPC onEnd markers are present with correct experiment/activity IDs from transfer tags (e.g., activityId=11, 15, 16 for ATLAS categories)
  • Zero TPC onStart markers are emitted from pool nodes for TPC transfers
  • Direct-read onStart markers (from NettyTransferService) are present but use FQAN fallback (activityId=1) since worker nodes don't send SciTag headers

Evidence from ATLAS AGLT2 pool logs:

# 194 TPC end markers with correct transfer-tag classifier
scitags event=marker state=end protocol=remotehttpsdatatransfer ... transferTag=143 experimentId=2 activityId=15 classifier=transfer-tag

# 0 TPC start markers
grep -c 'protocol=remote.*state=start' → 0

# 141 direct-read start markers, all FQAN/activityId=1
scitags event=marker state=start protocol=https ... experimentId=2 activityId=1 classifier=fqan

Fix

Add a startTransferLifeCycle() call in MoverTask.run() (before I/O begins) that:

  1. Checks _transferLifeCycle is configured and protocol is IP-based
  2. Derives the local endpoint from the remote address using NetworkUtils.getLocalAddress() — the same approach used by DefaultPostTransferService.deriveLocalEndpoint() for onEnd
  3. Calls transferLifeCycle.onStart() with the remote endpoint, derived local endpoint, protocol info, and subject
  4. Handles SocketException gracefully with debug logging

This mirrors the pattern already established in NettyTransferService.executeMover() for direct connections, extending it to all MoverProtocol-based transfers.

Testing

  • Compilation verified (mvn -pl modules/dcache -am compile -DskipTests)
  • Consistent with existing patterns in NettyTransferService and DefaultPostTransferService

Signed-off-by: Shawn McKee smckee@umich.edu

Copy link
Copy Markdown
Member

@kofemann kofemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that suggested change is wrong. It might preserve the pool IP, but (a) uses the wrong port and (b) will work incorrectly on multi-home hosts.

I would suggest updating the RemoteHttpTransferService and passing the TransferLifeCycle to the mover when it is created by the createMoverProtocol method. The RemoteHttpDataTransferProtocol#doGet then should call onStart as soon as the local endpoint is defined.

@ShawnMcKee ShawnMcKee force-pushed the fix/firefly-tpc-onstart-marker branch from 21f3ab9 to 873c491 Compare April 8, 2026 15:05
Copy link
Copy Markdown
Member

@kofemann kofemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we are almost there!

private Subject _subject;
private boolean _startMarkerSent;

public RemoteHttpDataTransferProtocol(CloseableHttpClient client) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor can be removed as all places that create a new instance of RemoteHttpDataTransferProtocol after the proposed change always provide a TransferLifeCycle

}

public RemoteHttpDataTransferProtocol(CloseableHttpClient client,
@Nullable TransferLifeCycle transferLifeCycle) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a comment above, the transferLifeCycle can't be null

public RemoteHttpDataTransferProtocol(CloseableHttpClient client,
@Nullable TransferLifeCycle transferLifeCycle) {
_client = requireNonNull(client);
_transferLifeCycle = transferLifeCycle;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thus, requireNonNull(transferLifeCycle)

public Mover<?> createMover(ReplicaDescriptor handle, PoolIoFileMessage message,
CellPath pathToDoor) throws CacheException {
Mover<?> mover = super.createMover(handle, message, pathToDoor);
if (mover instanceof MoverProtocolMover mpm
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just cast the mover to RemoteHttpDataTransferProtocol; any other return type indicates a code bug.

@ShawnMcKee ShawnMcKee force-pushed the fix/firefly-tpc-onstart-marker branch from 873c491 to 5a5ad01 Compare April 9, 2026 14:50
Move the firefly flow-start marker emission from
AbstractMoverProtocolTransferService.MoverTask into
RemoteHttpDataTransferProtocol, where the actual HTTP connection's
local socket address (correct IP and port) is available.

Previously, the start marker was emitted in MoverTask.run() before
the HTTP connection was established, using NetworkUtils.getLocalAddress()
to derive the local endpoint. This produced the wrong port (0) and
could select the wrong interface on multi-homed hosts.

Now, RemoteHttpTransferService passes the TransferLifeCycle to
RemoteHttpDataTransferProtocol at construction time and sets the
Subject via the overridden createMover(). The protocol calls
onStart() in doGet() and sendFile() immediately after capturing the
local endpoint from HttpInetConnection, which provides the real
bound address and port.

Signed-off-by: Shawn McKee <smckee@umich.edu>
@ShawnMcKee ShawnMcKee force-pushed the fix/firefly-tpc-onstart-marker branch from 5a5ad01 to 07fb840 Compare April 9, 2026 14:50
@kofemann
Copy link
Copy Markdown
Member

kofemann commented Apr 9, 2026

@ShawnMcKee, thanks for the contribution. The updated PR looks good.

@kofemann kofemann merged commit 5380ee9 into dCache:master Apr 9, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants