pool: add firefly onStart marker for MoverProtocol-based transfers #8055
kofemann
left a comment
I think the suggested change is wrong. It might preserve the pool IP, but it (a) uses the wrong port and (b) will work incorrectly on multi-homed hosts.
I would suggest updating RemoteHttpTransferService to pass the TransferLifeCycle to the mover when it is created by the createMoverProtocol method. RemoteHttpDataTransferProtocol#doGet should then call onStart as soon as the local endpoint is defined.
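The suggestion above can be sketched roughly as follows. This is a minimal illustration, not dCache code: LifeCycleSketch, HttpMoverSketch, HttpTransferServiceSketch and connectionEstablished are hypothetical stand-ins for TransferLifeCycle, RemoteHttpDataTransferProtocol, RemoteHttpTransferService and the point in doGet where the local endpoint becomes known. The real onStart takes more arguments than shown here.

```java
import java.net.InetSocketAddress;
import java.util.Objects;

// Hypothetical stand-in for dCache's TransferLifeCycle (the real interface differs).
interface LifeCycleSketch {
    void onStart(InetSocketAddress remote, InetSocketAddress local);
}

// Hypothetical stand-in for RemoteHttpDataTransferProtocol.
final class HttpMoverSketch {
    private final LifeCycleSketch lifeCycle;
    private boolean startMarkerSent;

    HttpMoverSketch(LifeCycleSketch lifeCycle) {
        this.lifeCycle = Objects.requireNonNull(lifeCycle);
    }

    // Assumed hook: invoked once the HTTP connection exists and its
    // local address and port are known.
    void connectionEstablished(InetSocketAddress remote, InetSocketAddress local) {
        if (!startMarkerSent) {
            lifeCycle.onStart(remote, local);
            startMarkerSent = true; // emit the start marker at most once
        }
    }
}

// Hypothetical stand-in for RemoteHttpTransferService.
final class HttpTransferServiceSketch {
    private final LifeCycleSketch lifeCycle;

    HttpTransferServiceSketch(LifeCycleSketch lifeCycle) {
        this.lifeCycle = Objects.requireNonNull(lifeCycle);
    }

    // The service hands its life cycle to every mover it creates.
    HttpMoverSketch createMoverProtocol() {
        return new HttpMoverSketch(lifeCycle);
    }
}
```

With this shape the service never needs to guess the local endpoint; the mover reports it from the connection it actually opened.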
21f3ab9 to 873c491
kofemann
left a comment
looks like we are almost there!
    private Subject _subject;
    private boolean _startMarkerSent;

    public RemoteHttpDataTransferProtocol(CloseableHttpClient client) {
This constructor can be removed: after the proposed change, every place that creates a new RemoteHttpDataTransferProtocol instance provides a TransferLifeCycle.
    }

    public RemoteHttpDataTransferProtocol(CloseableHttpClient client,
          @Nullable TransferLifeCycle transferLifeCycle) {
Given the comment above, transferLifeCycle can't be null.
    public RemoteHttpDataTransferProtocol(CloseableHttpClient client,
          @Nullable TransferLifeCycle transferLifeCycle) {
        _client = requireNonNull(client);
        _transferLifeCycle = transferLifeCycle;
Thus, use requireNonNull(transferLifeCycle) here.
    public Mover<?> createMover(ReplicaDescriptor handle, PoolIoFileMessage message,
          CellPath pathToDoor) throws CacheException {
        Mover<?> mover = super.createMover(handle, message, pathToDoor);
        if (mover instanceof MoverProtocolMover mpm
I would just cast the mover to RemoteHttpDataTransferProtocol; any other return type indicates a code bug.
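A minimal sketch of the difference the reviewer is pointing at, using hypothetical stand-in types rather than the real dCache classes (MoverProtocolSketch, RemoteHttpProtocolSketch, OtherProtocolSketch and CastSketch are all invented for illustration, and the subject is a plain String here). An unconditional cast turns an unexpected type into an immediate ClassCastException, whereas an instanceof guard would silently skip the setup.

```java
// Hypothetical stand-in for the MoverProtocol interface.
interface MoverProtocolSketch {
}

// Hypothetical stand-in for RemoteHttpDataTransferProtocol.
final class RemoteHttpProtocolSketch implements MoverProtocolSketch {
    private String subject;

    void setSubject(String subject) {
        this.subject = subject;
    }

    String getSubject() {
        return subject;
    }
}

// Any other protocol type; reaching this in the HTTP service would be a bug.
final class OtherProtocolSketch implements MoverProtocolSketch {
}

final class CastSketch {
    // Unconditional cast: a wrong type fails loudly instead of being ignored.
    static RemoteHttpProtocolSketch configure(MoverProtocolSketch protocol, String subject) {
        RemoteHttpProtocolSketch http = (RemoteHttpProtocolSketch) protocol;
        http.setSubject(subject);
        return http;
    }
}
```

The trade-off is deliberate: the guard-free version gives up graceful degradation in exchange for surfacing a wiring mistake at the first transfer.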
873c491 to 5a5ad01
Move the firefly flow-start marker emission from AbstractMoverProtocolTransferService.MoverTask into RemoteHttpDataTransferProtocol, where the actual HTTP connection's local socket address (correct IP and port) is available.

Previously, the start marker was emitted in MoverTask.run() before the HTTP connection was established, using NetworkUtils.getLocalAddress() to derive the local endpoint. This produced the wrong port (0) and could select the wrong interface on multi-homed hosts.

Now, RemoteHttpTransferService passes the TransferLifeCycle to RemoteHttpDataTransferProtocol at construction time and sets the Subject via the overridden createMover(). The protocol calls onStart() in doGet() and sendFile() immediately after capturing the local endpoint from HttpInetConnection, which provides the real bound address and port.

Signed-off-by: Shawn McKee <smckee@umich.edu>
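The port problem described in the commit message can be reproduced with plain java.net sockets. The sketch below is not dCache code and does not use NetworkUtils or HttpInetConnection; LocalEndpointDemo and its two methods are hypothetical names. It only shows that an endpoint assembled from an address alone carries port 0, while an endpoint read from a connected socket carries the port the operating system actually bound.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class LocalEndpointDemo {

    // Endpoint built from an address only: no socket exists yet,
    // so the port can only be the placeholder 0.
    static InetSocketAddress derivedEndpoint() {
        return new InetSocketAddress(InetAddress.getLoopbackAddress(), 0);
    }

    // Endpoint read from a connected socket: address and port are the
    // ones the operating system bound for this connection.
    static InetSocketAddress connectedEndpoint() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort())) {
            return (InetSocketAddress) client.getLocalSocketAddress();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("derived:   " + derivedEndpoint());
        System.out.println("connected: " + connectedEndpoint());
    }
}
```

On a multi-homed host the same reasoning applies to the address: only the connected socket knows which interface the route to the remote peer selected.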
5a5ad01 to 07fb840
@ShawnMcKee, thanks for the contribution. The updated PR looks good.
Problem
AbstractMoverProtocolTransferService, the base class for MoverProtocol-based transfer services (used by RemoteHttpTransferService for FTS third-party copy movers), never calls transferLifeCycle.onStart(). This means firefly flow-start UDP markers are never emitted for TPC movers, even though:
- DefaultPostTransferService correctly sends onEnd() markers (using the deriveLocalEndpoint() fallback added in 11.2.3)
- NettyTransferService (direct client connections via xrootd/HTTP) correctly calls onStart() in its executeMover() method

Impact
ESnet Stardust dashboards show incomplete flow information for TPC transfers. Specifically:
- onEnd markers are present with correct experiment/activity IDs from transfer tags (e.g., activityId=11, 15, 16 for ATLAS categories)
- No onStart markers are emitted from pool nodes for TPC transfers
- onStart markers for direct client connections (from NettyTransferService) are present but use the FQAN fallback (activityId=1), since worker nodes don't send SciTag headers

Evidence from ATLAS AGLT2 pool logs:
Fix
Add a startTransferLifeCycle() call in MoverTask.run() (before I/O begins) that:
- Checks that _transferLifeCycle is configured and the protocol is IP-based
- Derives the local endpoint via NetworkUtils.getLocalAddress(), the same approach used by DefaultPostTransferService.deriveLocalEndpoint() for onEnd
- Calls transferLifeCycle.onStart() with the remote endpoint, derived local endpoint, protocol info, and subject
- Handles SocketException gracefully with debug logging

This mirrors the pattern already established in NettyTransferService.executeMover() for direct connections, extending it to all MoverProtocol-based transfers.

Testing
- (mvn -pl modules/dcache -am compile -DskipTests)
- NettyTransferService and DefaultPostTransferService

Signed-off-by: Shawn McKee <smckee@umich.edu>