Caching Rework, Timeouts & Memory Improvements#1317
Conversation
astaykov
left a comment
There was a problem hiding this comment.
This is not a stable PR.
Neither Import-Module, nor Connect-ZtAssessment, nor Invoke-ZtAssessment complete successfully. The main issue is a warning:
PS C:\tdb\zerotrustassessment> Import-Module .\src\powershell\ZeroTrustAssessment.psd1 -Force
Resolving 7 dependencies...
✅ Module PSFramework found (v1.14.449).
# trimmed for clarity
Asserting MSAL loading order for dependencies...
✅ Loading Microsoft.IdentityModel.Abstractions.dll
# trimmed for clarity
WARNING: [12:38:43][ZeroTrustAssessment.psm1] Failed to import file C:\tdb\zerotrustassessment\src\powershell\scripts\variables.ps1 | Cannot convert argument "Lifetime", with value: "PSFramework.Configuration.Config", for "SetLifetime" to type "PSFramework.Parameter.TimeSpanParameter": "Cannot convert value "PSFramework.Configuration.Config" to type "PSFramework.Parameter.TimeSpanParameter". Error: "Could not interpret PSFramework.Configuration.Config""
PS C:\tdb\zerotrustassessment> Connect-ZtAssessment -TenantId abc.com
🔑 Authentication to Graph, Azure, AipService, ExchangeOnline, SecurityCompliance, SharePointOnline.
During the next steps, you may be prompted to authenticate separately for several services.
Connecting to Microsoft Graph
✅ Connected
WARNING: [12:41:38][Connect-ZtAssessment] Could not resolve the initial tenant domain from Graph. EXO, S&C, and SPO CBA connections may fail.
WARNING: [12:41:38][Test-ZtContext] Unable to verify user roles | Cannot index into a null array.
Ref. automated validation run (but I also tested this on a clean VM): https://github.com/microsoft/zerotrustassessment/actions/runs/27681160160/job/81868692154
|
Thanks for catching that! Fixed. |
There was a problem hiding this comment.
Pull request overview
This PR reworks in-memory persistence and parallel test execution to reduce memory pressure in large tenants, leveraging newer PSFramework runspace workflow features (timeouts/retries) and moving request caching to PSFramework cache primitives.
Changes:
- Raised the required PSFramework version and migrated Graph/Azure request caching to PSFramework cache objects with configurable max-items and lifetime.
- Updated parallel test execution to use PSFramework runspace workflow timeouts (including timeout type) and added retry support for failed tests.
- Added memory controls around test statistics message retention and refactored per-test completion/error handling into shared helpers.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/powershell/ZeroTrustAssessment.psd1 | Bumps required PSFramework version to enable new runspace workflow/cache features. |
| src/powershell/scripts/variables.ps1 | Switches Graph/Azure caches to PSF cache objects and applies configured cache limits. |
| src/powershell/scripts/configuration.ps1 | Adds config surface for retries, timeout type, stats message limits, and cache limits. |
| src/powershell/public/Invoke-ZtGraphRequest.ps1 | Simplifies paging loop nextLink detection. |
| src/powershell/public/Invoke-ZtAzureRequest.ps1 | Simplifies Az REST parameter forwarding via ConvertTo-PSFHashtable. |
| src/powershell/public/Invoke-ZtAssessment.ps1 | Moves timeouts to [PSFTimeSpan] and improves config file validation behavior. |
| src/powershell/private/tests/Write-ZtTestFinish.ps1 | New helper to finalize test statistics, progress updates, and per-test logs with message soft-limit. |
| src/powershell/private/tests/Write-ZtTestError.ps1 | New helper to standardize error handling/reporting for failed tests. |
| src/powershell/private/tests/Start-ZtTestExecution.ps1 | Reworks parallel test worker to use workflow timeout/retry features. |
| src/powershell/private/tests/Invoke-ZtTests.ps1 | Updates test timeout type, passes timespans through, and improves workflow cleanup handling. |
| src/powershell/private/tests/Invoke-ZtTestParallel.ps1 | New parallel-runspace test executor that integrates with shared result/statistics handling. |
| src/powershell/private/tests/Invoke-ZtTest.ps1 | Refactors end-of-test handling to the new finish helper. |
| src/powershell/private/tests/Get-ZtTestResult.ps1 | New helper to persist a per-test result object across retry attempts. |
| src/powershell/private/export/Export-ZtTenantData.ps1 | Adds Debug.KeepWorkflows guard to keep/remove export workflows. |
| src/powershell/private/core/Invoke-ZtGraphRequestCache.ps1 | Avoids “Using graph cache” debug log unless a cached value is actually returned. |
Alright, this took a bit longer than expected, but it's finally there:
A major update to how data is persisted within memory to help with large tenant setups and require less configuration. Should work better out-of-the-box.
Changes
Test Timeout update
This update only applies to the tests executed in the background.
It uses the latest PSFramework update to the Runspace Workflows to support timeouts. The key difference to the previous implementation is that these timeouts can be based on activity, rather than "since the start". Activity is measured whenever you use
Write-PSFMessage.As the tests running in the main runspace cannot use this feature, the test timeout is kept at the previous one hour by default.
A future update will address that if possible.
Retry Failed Tests
This update only applies to the tests executed in the background.
It uses the latest PSFramework update to the Runspace Workflows to support retries on test failure. The number of retries can be defined via configuration and defaults to 0 extra attempts.
A future update will support customizing this count on a per-test basis.
Migrated Caches to PSFCache
We cache all requests to Graph or Azure unless configured otherwise.
This can add a huge load in large tenants.
The new system will automatically expire values too old or when too many values are cached.
Both Azure and Graph request caches default to 30 minutes of age or 50,000 entries, whichever happens first. All limits can be customized via configuration.
Added soft-limit for Test Statistics messages
When you run
Get-ZtTestStatistics, you can collect information about the last test execution.This includes all log messages generated.
Since some of our tests have scaling message output based on tenant size, this could lead to lots of memory being used by these messages.
With this new configuration setting, once too many messages are logged for a test - 1024 by default - the data retained is implified significantly to reduce memory overhead, at the loss of some data.
Warning: Untested
Due to technical limitations I am currently unable to test this code on my own. Given the pressing nature of the large tenant performance, I chose to push this now, rather than wait until I can make the time to restore my own testing capability.
Do not merge right away without testing it first!!!
Thoughts on Cache Changes & Paged Requests
As part of the design change, I considered whether this change will lead to issues with our habit of caching paged requests separately, one cache entry per page.
This might theoretically lead to issues with partial cache invalidation.
In case you want to consider the same issues, here's the potential issues my mind went to and my conclusions on them:
No.
Whether by max items or by lifetime, cache content is invalidated from the oldest entries.
In this situation, the first request is never going to be from cache, its "next page" link will have an entirely different continuation token and thus not read from cache at all.
The only situation where an issue might occur is when reconfiguring cache limits while it is reading from cache.
A situation that is purely theoretical, since the cache is only used during the full processing where the console is locked.
Theoretically, other runspaces could add new data too fast and - while one runspace is reading cached pages, others could add new entries and expire the next page from the cache.
The continuation token is unlikely to still be valid and the read fails.
Theoretically an issue, in practice exceedingly unlikely.
Reads from cache are far faster than rest requests, which means enough other runspaces would need to add new cache entries in parallel to outpace that.
This would not be an issue in almost any conceivable scale configuration.