Skip to content

HPCC-36059 Enable copy using Azure API#21136

Open
jackdelv wants to merge 16 commits intohpcc-systems:candidate-10.0.xfrom
jackdelv:HPCC-36059-enable-copy-using-azure-api
Open

HPCC-36059 Enable copy using Azure API#21136
jackdelv wants to merge 16 commits intohpcc-systems:candidate-10.0.xfrom
jackdelv:HPCC-36059-enable-copy-using-azure-api

Conversation

@jackdelv
Copy link
Copy Markdown
Contributor

@jackdelv jackdelv commented Mar 24, 2026

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Cache provider, otherwise each file part will create a new instance and hit a rate limit
The copy does not automatically get the managed identities credential, so we need to pass one in if using MI
StartCopy service does not automatically inject Managed Identity credential into the URL
@jackdelv jackdelv requested a review from ghalliday March 24, 2026 13:34
@github-actions
Copy link
Copy Markdown

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-36059

Jirabot Action Result:
Workflow Transition To: Merge Pending
Updated PR

@ghalliday ghalliday requested a review from Copilot March 24, 2026 15:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Enables copying files using the Azure Storage API with Azure Managed Identity support, improving compatibility for remote-storage scenarios and reducing IMDS throttling.

Changes:

  • Add a managed-identity flag to IStorageApiInfo and implement it for storage plane config.
  • Update Azure API copy to support managed identity (token intent, credential caching, delegation SAS for blob sources) and handle non-distributed source descriptors.
  • Add developer documentation for configuring remote storage planes for Azure API copy.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
system/jlib/jplane.cpp Implements useManagedIdentity() from environment.xml storage config.
system/jlib/jfile.hpp Extends IStorageApiInfo with managed-identity flag.
devdoc/RemoteStorageUsingAzureAPI.md Adds setup/configuration guide for Azure API copy with managed identity.
dali/ft/filecopy.ipp Stores a source IFileDescriptor for non-distributed source handling.
dali/ft/filecopy.cpp Uses saved descriptor to resolve source plane/path/stripe when IDistributedFile is absent.
common/remote/hooks/azure/azurefile.cpp Sets ShareTokenIntent when using token auth for Azure Files clients.
common/remote/hooks/azure/azureapiutils.cpp Caches managed identity credential to avoid repeated IMDS token requests.
common/remote/hooks/azure/azureapi.cpp Adds managed-identity aware clients, directory creation for Azure Files, delegation SAS generation, and path normalization.

Comment thread common/remote/hooks/azure/azureapi.cpp Outdated
Comment on lines +344 to +346
// Request a delegation key valid for 1 hour
auto expiresOn = Azure::DateTime(std::chrono::system_clock::now() + std::chrono::hours(1));
auto userDelegationKey = serviceClient.GetUserDelegationKey(expiresOn).Value;
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BlobServiceClient::GetUserDelegationKey is typically the 2-parameter overload (start + expiry). Calling it with only expiresOn is likely to fail to compile (or use an unintended overload if the SDK differs). Use the overload that supplies both startsOn and expiresOn (and consider a small clock-skew buffer for startsOn) to ensure the delegation SAS generation works reliably.

Suggested change
// Request a delegation key valid for 1 hour
auto expiresOn = Azure::DateTime(std::chrono::system_clock::now() + std::chrono::hours(1));
auto userDelegationKey = serviceClient.GetUserDelegationKey(expiresOn).Value;
// Request a delegation key valid for 1 hour, with a small clock-skew buffer on the start time
auto now = std::chrono::system_clock::now();
auto startsOn = Azure::DateTime(now - std::chrono::minutes(5));
auto expiresOn = Azure::DateTime(now + std::chrono::hours(1));
auto userDelegationKey = serviceClient.GetUserDelegationKey(startsOn, expiresOn).Value;

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It uses this overload

/**
     * @brief Retrieves a key that can be used to delegate Active Directory authorization to
     * shared access signatures.
     *
     * @param expiresOn Expiration of the key's validity. The time should be specified in UTC, and
     * will be truncated to second.
     * @param options Optional parameters to execute this function.
     * @param context Context for cancelling long running operations.
     * @return A deserialized UserDelegationKey instance.
     */
    Azure::Response<Models::UserDelegationKey> GetUserDelegationKey(
        const Azure::DateTime& expiresOn,
        const GetUserDelegationKeyOptions& options = GetUserDelegationKeyOptions(),
        const Azure::Core::Context& context = Azure::Core::Context()) const;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on the clock skew? The idea is that if clocks are out of sync the token will still work as long as they are not too far out of sync,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think clock skew would be an issue unless it could be skewed by more than an hour, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What it is trying to solve is if the time on local computer is slightly ahead of the resource you are requesting:

You request a key which can be used from 15:00:00 to 16:00:00. It is immediately sent to a resource where the time is a second out - 14:59:59. Because the time doesn't fit into the range that is requested it would be refused.

Comment thread common/remote/hooks/azure/azureapi.cpp
Comment thread dali/ft/filecopy.cpp Outdated
Comment thread dali/ft/filecopy.cpp Outdated
Comment on lines +2849 to +2851
unsigned lfnHash = savedSource->queryProperties().getPropInt("@lfnHash");
unsigned numDevices = storagePlane->numDevices();
sourceStripeNum = calcStripeNumber(srcPart->queryPartIndex(), lfnHash, numDevices);
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the non-distributed source branch, stripe selection depends on @lfnHash. If @lfnHash is not present in the descriptor properties (defaulting to 0), stripe calculation will deterministically pick the wrong device for most files, causing source lookups/copies to fail or read the wrong blob. Consider requiring @lfnHash to be present (throw if missing) or deriving the hash from a reliable identifier available in the descriptor (e.g., logical name) to keep stripe mapping correct.

Suggested change
unsigned lfnHash = savedSource->queryProperties().getPropInt("@lfnHash");
unsigned numDevices = storagePlane->numDevices();
sourceStripeNum = calcStripeNumber(srcPart->queryPartIndex(), lfnHash, numDevices);
int lfnHash = savedSource->queryProperties().getPropInt("@lfnHash", -1);
if (lfnHash < 0)
{
const char *logicalName = savedSource->queryLogicalName();
if (isEmptyString(logicalName))
throw makeStringException(DFSERR_StripeSelectionFailed, "Missing @lfnHash and logical name when determining source stripe");
lfnHash = (int)hashc((const unsigned char *)logicalName, 0, strlen(logicalName));
}
unsigned numDevices = storagePlane->numDevices();
sourceStripeNum = calcStripeNumber(srcPart->queryPartIndex(), (unsigned)lfnHash, numDevices);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add fallback to build hash if missing from properties.

Comment thread common/remote/hooks/azure/azureapi.cpp
Comment thread common/remote/hooks/azure/azureapi.cpp Outdated
- The `managed="1"` attribute on `<storageapi>` tells HPCC to use Azure Managed Identity for authentication rather than storage keys or SAS tokens.
- The `<remote>` element's `service` URL must point to the DFS service of the source environment. You can find this URL by inspecting the Kubernetes service or pod YAML in the source cluster.
- The `<planes>` element within `<remote>` maps `remote="data"` (data is the plane name on the source side) to `local="plane1"` (the plane name on the target side).
- The `striped=1` is only necessary if reading from striped storage on a bare-metal cluster
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The document uses mixed forms for the striped attribute (striped="true" in the XML example vs striped=1 in the bullet), which could confuse readers copying configuration. Pick one form and keep it consistent with the expected environment.xml schema/examples used elsewhere in the repo.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change to 1 to be consistent

@github-actions
Copy link
Copy Markdown

🔄 Upmerge Test Results

Status: ✅ All branches merged successfully
PR: #21136 - HPCC-36059 Enable copy using Azure API
Base Branch: candidate-10.0.x
Test Time: 2026-03-24 16:21:44 UTC

✅ Successful Branches (2)

  • candidate-10.2.x
  • master
    This comment was automatically generated by the upmerge test workflow.

@github-actions
Copy link
Copy Markdown

🔄 Upmerge Test Results

Status: ✅ All branches merged successfully
PR: #21136 - HPCC-36059 Enable copy using Azure API
Base Branch: candidate-10.0.x
Test Time: 2026-03-24 19:19:27 UTC

✅ Successful Branches (2)

  • candidate-10.2.x
  • master
    This comment was automatically generated by the upmerge test workflow.

Copy link
Copy Markdown
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments from a separate copilot review:

  1. Critical Bug: Dropped SAS Token in ensureParentDirectories()
    When useManagedIdentity is false, Azure authentication relies on a Shared Access Signature (SAS) token embedded in the URL as a query string.
    In ensureParentDirectories(), dirUrl is built exclusively using the scheme, host, and path components:

This inadvertently drops the query string (parsed.GetQuery()) containing the SAS token from the URL. Consequently, Shares::ShareDirectoryClient(dirUrl, clientOptions) is created without credentials and without a SAS token, causing CreateIfNotExists() to always fail with a 403 Authorization error for non-managed identity flows.

Fix: Extract the query string from the parsed URL and re-append it when instantiating ShareDirectoryClient:

  1. Inefficiency: Re-fetching the User Delegation Key
    While the TokenCredential is properly cached to prevent token endpoint rate limiting, generateUserDelegationSas() introduces a new rate-limit risk:

GetUserDelegationKey causes a synchronous blocking network request to the Azure Storage REST API. Because this is called from getCopyClient for every file part, spraying a file with 400 parts will result in 400 redundant, serial HTTP network requests. This will significantly slow down the start of the copy and risk hitting Azure Blob Storage rate limitations.

Fix: A UserDelegationKey is inherently designed to create multiple SAS tokens (valid for up to an hour). It should be cached to reuse across all parts, at minimum scoped to the CAzureApiCopyClient lifetime, or globally cached based on the accountName.

Please review the comments.

// Cache of Azure Files directory URLs already created/verified, to avoid
// redundant CreateIfNotExists metadata calls across multi-part copies.
// Scoped to the owning CAzureApiCopyClient so it is freed when the copy session ends.
struct CreatedDirsCache
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be common to jlib. I have a related idea to only try and create the parent directory if a call to create fails. 90% of the time it will already exist.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only creating parent if create fails would be a good idea.

// redundant CreateIfNotExists metadata calls across multi-part copies.
// Scoped to the owning CAzureApiCopyClient so it is freed when the copy session ends.
struct CreatedDirsCache
{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith thoughts?
This could probably be a global variable within this module if the associated entries timed out. Something to revisit.

Comment thread common/remote/hooks/azure/azureapi.cpp Outdated
Comment on lines +344 to +346
// Request a delegation key valid for 1 hour
auto expiresOn = Azure::DateTime(std::chrono::system_clock::now() + std::chrono::hours(1));
auto userDelegationKey = serviceClient.GetUserDelegationKey(expiresOn).Value;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on the clock skew? The idea is that if clocks are out of sync the token will still work as long as they are not too far out of sync,

Comment thread dali/ft/filecopy.cpp Outdated
@jackdelv jackdelv requested a review from ghalliday March 25, 2026 20:09
@github-actions
Copy link
Copy Markdown

🔄 Upmerge Test Results

Status: ✅ All branches merged successfully
PR: #21136 - HPCC-36059 Enable copy using Azure API
Base Branch: candidate-10.0.x
Test Time: 2026-03-25 20:55:29 UTC

✅ Successful Branches (2)

  • candidate-10.2.x
  • master
    This comment was automatically generated by the upmerge test workflow.

Comment thread common/remote/hooks/azure/azureapi.cpp Outdated
{
dirUrl += "/" + components[i];
{
CriticalBlock block(createdDirsCache.lock);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code structure: Whenever you have code which references the same object multiple times it is worth wondering if it should be converted into a method of the object. In this case:

class CreatedDirsCache
{
    bool contains(const char * path) const
    {
        CriticalBlock block(lock);
        return dirs.count(path) != 0;
   }

It is better encapsulated, and allows for cleaner reuse. Same goes for insert logic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added contains and insert methods

Comment thread common/remote/hooks/azure/azureapi.cpp Outdated

// Return a cached UserDelegationKey for the given account, refreshing if expired or absent.
// The key is valid for 1 hour; we refresh with 5 minutes of margin to avoid using a nearly-expired key.
std::pair<Blobs::Models::UserDelegationKey, Azure::DateTime> getCachedDelegationKey(const char *accountName) const
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How big is a Blobs::Models::UserDelegationKey? As it stands this will clone the object each time it is returned. Should this use a shared pointer?

Do you need to return a pair, or can you use the SignedExpiresOn member of the delegation key?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not large (5 std::strings and 2 DateTime objects), but could use a shared pointer to avoid copying every time. Also, it doesn't need to return a pair as it can use the SignedExpiresOn member. Will add shared_ptr and remove std::pair.

Comment thread dali/ft/filecopy.cpp Outdated
sourcePath.remove(0, strlen(storagePlane->queryPrefix()));
const char *planePrefix = storagePlane->queryPrefix();
size_t prefixLen = strlen(planePrefix);
if (prefixLen > sourcePath.length() || strncmp(sourcePath.str(), planePrefix, prefixLen) != 0)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use

if (!startsWith(sourcePath, planePrefix))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

Comment thread dali/ft/filecopy.cpp
// Calculate which stripe (i.e., which storage account/device) this part
// lives on, using the logical file name hash and the number of devices
unsigned lfnHash = savedSource->queryProperties().getPropInt("@lfnHash");
unsigned lfnHash = savedSource->queryProperties().getPropInt("@lfnHash", -1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should really be a member function IFileDescriptor::getFilenameHash(). Possibly better in a separate PR though to avoid holding this up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree and could use same logic if lfnHash is missing from properties and build it from the trace name.

@jackdelv jackdelv requested a review from ghalliday March 31, 2026 19:17
@github-actions
Copy link
Copy Markdown

🔄 Upmerge Test Results

Status: ✅ All branches merged successfully
PR: #21136 - HPCC-36059 Enable copy using Azure API
Base Branch: candidate-10.0.x
Test Time: 2026-03-31 19:17:35 UTC

✅ Successful Branches (2)

  • candidate-10.2.x
  • master
    This comment was automatically generated by the upmerge test workflow.

Copy link
Copy Markdown
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackdelv looks very close.
One comment about a thrown exception.
The other question is what did you think about copilot's comment on clock jitter. I think its comment was based on the idea that the clocks could be out of sync, so if it was requested from a node that had a clock that was ahead, and it was then used somewhere where it was behind it could be tagged as invalid.
I don't know if that is an issue or not.

Comment thread common/remote/hooks/azure/azureapi.cpp Outdated
catch (const Azure::Core::RequestFailedException& e)
{
IERRLOG("Failed to create directory '%s': %s (code %d)", dirUrl.c_str(), e.ReasonPhrase.c_str(), static_cast<int>(e.StatusCode));
throw;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this throw makeStringExceptionV() instead - so that other items will catch it cleanly?

@github-actions
Copy link
Copy Markdown

🔄 Upmerge Test Results

Status: ✅ All branches merged successfully
PR: #21136 - HPCC-36059 Enable copy using Azure API
Base Branch: candidate-10.0.x
Test Time: 2026-04-13 09:38:07 UTC

✅ Successful Branches (2)

  • candidate-10.2.x
  • master
    This comment was automatically generated by the upmerge test workflow.

@jackdelv
Copy link
Copy Markdown
Contributor Author

@ghalliday I do think the clocks being slightly out of sync could cause issues. I added the StartOn option to back date the key that is created by 5 mins. Also, I added a throw makeStringExceptionV and left the IERRLOG. Was that what you meant?

@jackdelv jackdelv requested a review from ghalliday April 13, 2026 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants