HPCC-36059 Enable copy using Azure API#21136
HPCC-36059 Enable copy using Azure API#21136jackdelv wants to merge 16 commits intohpcc-systems:candidate-10.0.xfrom
Conversation
Cache provider, otherwise each file part will create a new instance and hit a rate limit
The copy does not automatically get the managed identities credential, so we need to pass one in if using MI
…fore file can be writting
StartCopy service does not automatically inject Managed Identity credential into the URL
|
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-36059 Jirabot Action Result: |
There was a problem hiding this comment.
Pull request overview
Enables copying files using the Azure Storage API with Azure Managed Identity support, improving compatibility for remote-storage scenarios and reducing IMDS throttling.
Changes:
- Add a managed-identity flag to
IStorageApiInfoand implement it for storage plane config. - Update Azure API copy to support managed identity (token intent, credential caching, delegation SAS for blob sources) and handle non-distributed source descriptors.
- Add developer documentation for configuring remote storage planes for Azure API copy.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| system/jlib/jplane.cpp | Implements useManagedIdentity() from environment.xml storage config. |
| system/jlib/jfile.hpp | Extends IStorageApiInfo with managed-identity flag. |
| devdoc/RemoteStorageUsingAzureAPI.md | Adds setup/configuration guide for Azure API copy with managed identity. |
| dali/ft/filecopy.ipp | Stores a source IFileDescriptor for non-distributed source handling. |
| dali/ft/filecopy.cpp | Uses saved descriptor to resolve source plane/path/stripe when IDistributedFile is absent. |
| common/remote/hooks/azure/azurefile.cpp | Sets ShareTokenIntent when using token auth for Azure Files clients. |
| common/remote/hooks/azure/azureapiutils.cpp | Caches managed identity credential to avoid repeated IMDS token requests. |
| common/remote/hooks/azure/azureapi.cpp | Adds managed-identity aware clients, directory creation for Azure Files, delegation SAS generation, and path normalization. |
| // Request a delegation key valid for 1 hour | ||
| auto expiresOn = Azure::DateTime(std::chrono::system_clock::now() + std::chrono::hours(1)); | ||
| auto userDelegationKey = serviceClient.GetUserDelegationKey(expiresOn).Value; |
There was a problem hiding this comment.
BlobServiceClient::GetUserDelegationKey is typically the 2-parameter overload (start + expiry). Calling it with only expiresOn is likely to fail to compile (or use an unintended overload if the SDK differs). Use the overload that supplies both startsOn and expiresOn (and consider a small clock-skew buffer for startsOn) to ensure the delegation SAS generation works reliably.
| // Request a delegation key valid for 1 hour | |
| auto expiresOn = Azure::DateTime(std::chrono::system_clock::now() + std::chrono::hours(1)); | |
| auto userDelegationKey = serviceClient.GetUserDelegationKey(expiresOn).Value; | |
| // Request a delegation key valid for 1 hour, with a small clock-skew buffer on the start time | |
| auto now = std::chrono::system_clock::now(); | |
| auto startsOn = Azure::DateTime(now - std::chrono::minutes(5)); | |
| auto expiresOn = Azure::DateTime(now + std::chrono::hours(1)); | |
| auto userDelegationKey = serviceClient.GetUserDelegationKey(startsOn, expiresOn).Value; |
There was a problem hiding this comment.
It uses this overload
/**
* @brief Retrieves a key that can be used to delegate Active Directory authorization to
* shared access signatures.
*
* @param expiresOn Expiration of the key's validity. The time should be specified in UTC, and
* will be truncated to second.
* @param options Optional parameters to execute this function.
* @param context Context for cancelling long running operations.
* @return A deserialized UserDelegationKey instance.
*/
Azure::Response<Models::UserDelegationKey> GetUserDelegationKey(
const Azure::DateTime& expiresOn,
const GetUserDelegationKeyOptions& options = GetUserDelegationKeyOptions(),
const Azure::Core::Context& context = Azure::Core::Context()) const;
There was a problem hiding this comment.
Thoughts on the clock skew? The idea is that if clocks are out of sync the token will still work as long as they are not too far out of sync,
There was a problem hiding this comment.
I don't think clock skew would be an issue unless it could be skewed by more than an hour, right?
There was a problem hiding this comment.
What it is trying to solve is if the time on local computer is slightly ahead of the resource you are requesting:
You request a key which can be used from 15:00:00 to 16:00:00. It is immediately sent to a resource where the time is a second out - 14:59:59. Because the time doesn't fit into the range that is requested it would be refused.
| unsigned lfnHash = savedSource->queryProperties().getPropInt("@lfnHash"); | ||
| unsigned numDevices = storagePlane->numDevices(); | ||
| sourceStripeNum = calcStripeNumber(srcPart->queryPartIndex(), lfnHash, numDevices); |
There was a problem hiding this comment.
In the non-distributed source branch, stripe selection depends on @lfnHash. If @lfnHash is not present in the descriptor properties (defaulting to 0), stripe calculation will deterministically pick the wrong device for most files, causing source lookups/copies to fail or read the wrong blob. Consider requiring @lfnHash to be present (throw if missing) or deriving the hash from a reliable identifier available in the descriptor (e.g., logical name) to keep stripe mapping correct.
| unsigned lfnHash = savedSource->queryProperties().getPropInt("@lfnHash"); | |
| unsigned numDevices = storagePlane->numDevices(); | |
| sourceStripeNum = calcStripeNumber(srcPart->queryPartIndex(), lfnHash, numDevices); | |
| int lfnHash = savedSource->queryProperties().getPropInt("@lfnHash", -1); | |
| if (lfnHash < 0) | |
| { | |
| const char *logicalName = savedSource->queryLogicalName(); | |
| if (isEmptyString(logicalName)) | |
| throw makeStringException(DFSERR_StripeSelectionFailed, "Missing @lfnHash and logical name when determining source stripe"); | |
| lfnHash = (int)hashc((const unsigned char *)logicalName, 0, strlen(logicalName)); | |
| } | |
| unsigned numDevices = storagePlane->numDevices(); | |
| sourceStripeNum = calcStripeNumber(srcPart->queryPartIndex(), (unsigned)lfnHash, numDevices); |
There was a problem hiding this comment.
Will add fallback to build hash if missing from properties.
| - The `managed="1"` attribute on `<storageapi>` tells HPCC to use Azure Managed Identity for authentication rather than storage keys or SAS tokens. | ||
| - The `<remote>` element's `service` URL must point to the DFS service of the source environment. You can find this URL by inspecting the Kubernetes service or pod YAML in the source cluster. | ||
| - The `<planes>` element within `<remote>` maps `remote="data"` (data is the plane name on the source side) to `local="plane1"` (the plane name on the target side). | ||
| - The `striped=1` is only necessary if reading from striped storage on a bare-metal cluster |
There was a problem hiding this comment.
The document uses mixed forms for the striped attribute (striped="true" in the XML example vs striped=1 in the bullet), which could confuse readers copying configuration. Pick one form and keep it consistent with the expected environment.xml schema/examples used elsewhere in the repo.
There was a problem hiding this comment.
Will change to 1 to be consistent
🔄 Upmerge Test ResultsStatus: ✅ All branches merged successfully ✅ Successful Branches (2)
|
🔄 Upmerge Test ResultsStatus: ✅ All branches merged successfully ✅ Successful Branches (2)
|
ghalliday
left a comment
There was a problem hiding this comment.
Comments from a separate copilot review:
- Critical Bug: Dropped SAS Token in ensureParentDirectories()
When useManagedIdentity is false, Azure authentication relies on a Shared Access Signature (SAS) token embedded in the URL as a query string.
In ensureParentDirectories(), dirUrl is built exclusively using the scheme, host, and path components:
This inadvertently drops the query string (parsed.GetQuery()) containing the SAS token from the URL. Consequently, Shares::ShareDirectoryClient(dirUrl, clientOptions) is created without credentials and without a SAS token, causing CreateIfNotExists() to always fail with a 403 Authorization error for non-managed identity flows.
Fix: Extract the query string from the parsed URL and re-append it when instantiating ShareDirectoryClient:
- Inefficiency: Re-fetching the User Delegation Key
While the TokenCredential is properly cached to prevent token endpoint rate limiting, generateUserDelegationSas() introduces a new rate-limit risk:
GetUserDelegationKey causes a synchronous blocking network request to the Azure Storage REST API. Because this is called from getCopyClient for every file part, spraying a file with 400 parts will result in 400 redundant, serial HTTP network requests. This will significantly slow down the start of the copy and risk hitting Azure Blob Storage rate limitations.
Fix: A UserDelegationKey is inherently designed to create multiple SAS tokens (valid for up to an hour). It should be cached to reuse across all parts, at minimum scoped to the CAzureApiCopyClient lifetime, or globally cached based on the accountName.
Please review the comments.
| // Cache of Azure Files directory URLs already created/verified, to avoid | ||
| // redundant CreateIfNotExists metadata calls across multi-part copies. | ||
| // Scoped to the owning CAzureApiCopyClient so it is freed when the copy session ends. | ||
| struct CreatedDirsCache |
There was a problem hiding this comment.
I wonder if this should be common to jlib. I have a related idea to only try and create the parent directory if a call to create fails. 90% of the time it will already exist.
There was a problem hiding this comment.
I think only creating parent if create fails would be a good idea.
| // redundant CreateIfNotExists metadata calls across multi-part copies. | ||
| // Scoped to the owning CAzureApiCopyClient so it is freed when the copy session ends. | ||
| struct CreatedDirsCache | ||
| { |
There was a problem hiding this comment.
@jakesmith thoughts?
This could probably be a global variable within this module if the associated entries timed out. Something to revisit.
| // Request a delegation key valid for 1 hour | ||
| auto expiresOn = Azure::DateTime(std::chrono::system_clock::now() + std::chrono::hours(1)); | ||
| auto userDelegationKey = serviceClient.GetUserDelegationKey(expiresOn).Value; |
There was a problem hiding this comment.
Thoughts on the clock skew? The idea is that if clocks are out of sync the token will still work as long as they are not too far out of sync,
🔄 Upmerge Test ResultsStatus: ✅ All branches merged successfully ✅ Successful Branches (2)
|
| { | ||
| dirUrl += "/" + components[i]; | ||
| { | ||
| CriticalBlock block(createdDirsCache.lock); |
There was a problem hiding this comment.
code structure: Whenever you have code which references the same object multiple times it is worth wondering if it should be converted into a method of the object. In this case:
class CreatedDirsCache
{
bool contains(const char * path) const
{
CriticalBlock block(lock);
return dirs.count(path) != 0;
}
It is better encapsulated, and allows for cleaner reuse. Same goes for insert logic.
There was a problem hiding this comment.
Added contains and insert methods
|
|
||
| // Return a cached UserDelegationKey for the given account, refreshing if expired or absent. | ||
| // The key is valid for 1 hour; we refresh with 5 minutes of margin to avoid using a nearly-expired key. | ||
| std::pair<Blobs::Models::UserDelegationKey, Azure::DateTime> getCachedDelegationKey(const char *accountName) const |
There was a problem hiding this comment.
How big is a Blobs::Models::UserDelegationKey? As it stands this will clone the object each time it is returned. Should this use a shared pointer?
Do you need to return a pair, or can you use the SignedExpiresOn member of the delegation key?
There was a problem hiding this comment.
It's not large (5 std::strings and 2 DateTime objects), but could use a shared pointer to avoid copying every time. Also, it doesn't need to return a pair as it can use the SignedExpiresOn member. Will add shared_ptr and remove std::pair.
| sourcePath.remove(0, strlen(storagePlane->queryPrefix())); | ||
| const char *planePrefix = storagePlane->queryPrefix(); | ||
| size_t prefixLen = strlen(planePrefix); | ||
| if (prefixLen > sourcePath.length() || strncmp(sourcePath.str(), planePrefix, prefixLen) != 0) |
There was a problem hiding this comment.
You could use
if (!startsWith(sourcePath, planePrefix))
| // Calculate which stripe (i.e., which storage account/device) this part | ||
| // lives on, using the logical file name hash and the number of devices | ||
| unsigned lfnHash = savedSource->queryProperties().getPropInt("@lfnHash"); | ||
| unsigned lfnHash = savedSource->queryProperties().getPropInt("@lfnHash", -1); |
There was a problem hiding this comment.
This should really be a member function IFileDescriptor::getFilenameHash(). Possibly better in a separate PR though to avoid holding this up.
There was a problem hiding this comment.
I agree and could use same logic if lfnHash is missing from properties and build it from the trace name.
🔄 Upmerge Test ResultsStatus: ✅ All branches merged successfully ✅ Successful Branches (2)
|
ghalliday
left a comment
There was a problem hiding this comment.
@jackdelv looks very close.
One comment about a thrown exception.
The other question is what did you think about copilot's comment on clock jitter. I think its comment was based on the idea that the clocks could be out of sync, so if it was requested from a node that had a clock that was ahead, and it was then used somewhere where it was behind it could be tagged as invalid.
I don't know if that is an issue or not.
| catch (const Azure::Core::RequestFailedException& e) | ||
| { | ||
| IERRLOG("Failed to create directory '%s': %s (code %d)", dirUrl.c_str(), e.ReasonPhrase.c_str(), static_cast<int>(e.StatusCode)); | ||
| throw; |
There was a problem hiding this comment.
Should this throw makeStringExceptionV() instead - so that other items will catch it cleanly?
🔄 Upmerge Test ResultsStatus: ✅ All branches merged successfully ✅ Successful Branches (2)
|
|
@ghalliday I do think the clocks being slightly out of sync could cause issues. I added the StartOn option to back date the key that is created by 5 mins. Also, I added a throw makeStringExceptionV and left the IERRLOG. Was that what you meant? |
Type of change:
Checklist:
Smoketest:
Testing: