Skip to content

Caching Rework, Timeouts & Memory Improvements#1317

Open
FriedrichWeinmann wants to merge 8 commits into
microsoft:devfrom
FriedrichWeinmann:dev-hang
Open

Caching Rework, Timeouts & Memory Improvements#1317
FriedrichWeinmann wants to merge 8 commits into
microsoft:devfrom
FriedrichWeinmann:dev-hang

Conversation

@FriedrichWeinmann

Copy link
Copy Markdown
Member

Alright, this took a bit longer than expected, but it's finally there:
A major update to how data is persisted within memory to help with large tenant setups and require less configuration. Should work better out-of-the-box.

Changes

  • Upd: Raised minimum PSFramework version, due to feature usage
  • Upd: Migrated test timeouts for parallelized tests to Runspace Workflow timeouts, enabling activity-based timeouts
  • Upd: Added support to retry failed tests
  • Upd: Migrated Caches to PSFCache - limiting data retention by count and time
  • Upd: Added soft-limit for Test Statistics messages
  • More minor tweaks

Test Timeout update

This update only applies to the tests executed in the background.
It uses the latest PSFramework update to the Runspace Workflows to support timeouts. The key difference to the previous implementation is that these timeouts can be based on activity, rather than "since the start". Activity is measured whenever you use Write-PSFMessage.

As the tests running in the main runspace cannot use this feature, the test timeout is kept at the previous one hour by default.
A future update will address that if possible.

Retry Failed Tests

This update only applies to the tests executed in the background.
It uses the latest PSFramework update to the Runspace Workflows to support retries on test failure. The number of retries can be defined via configuration and defaults to 0 extra attempts.
A future update will support customizing this count on a per-test basis.

Note: In the implementation I retry once more than specified. This is so a last round of the retry validation can process the "test failed" handling.

Migrated Caches to PSFCache

We cache all requests to Graph or Azure unless configured otherwise.
This can add a huge load in large tenants.

The new system will automatically expire values too old or when too many values are cached.
Both Azure and Graph request caches default to 30 minutes of age or 50,000 entries, whichever happens first. All limits can be customized via configuration.

Added soft-limit for Test Statistics messages

When you run Get-ZtTestStatistics, you can collect information about the last test execution.
This includes all log messages generated.

Since some of our tests have scaling message output based on tenant size, this could lead to lots of memory being used by these messages.
With this new configuration setting, once too many messages are logged for a test - 1024 by default - the data retained is implified significantly to reduce memory overhead, at the loss of some data.

Warning: Untested

Due to technical limitations I am currently unable to test this code on my own. Given the pressing nature of the large tenant performance, I chose to push this now, rather than wait until I can make the time to restore my own testing capability.

Do not merge right away without testing it first!!!

Thoughts on Cache Changes & Paged Requests

As part of the design change, I considered whether this change will lead to issues with our habit of caching paged requests separately, one cache entry per page.
This might theoretically lead to issues with partial cache invalidation.

In case you want to consider the same issues, here's the potential issues my mind went to and my conclusions on them:

When only part of a paged request is invalidated from cache, will results become inconsistent?

No.
Whether by max items or by lifetime, cache content is invalidated from the oldest entries.
In this situation, the first request is never going to be from cache, its "next page" link will have an entirely different continuation token and thus not read from cache at all.

The only situation where an issue might occur is when reconfiguring cache limits while it is reading from cache.
A situation that is purely theoretical, since the cache is only used during the full processing where the console is locked.

What about concurrency messing with paged cache entries?

Theoretically, other runspaces could add new data too fast and - while one runspace is reading cached pages, others could add new entries and expire the next page from the cache.
The continuation token is unlikely to still be valid and the read fails.

Theoretically an issue, in practice exceedingly unlikely.
Reads from cache are far faster than rest requests, which means enough other runspaces would need to add new cache entries in parallel to outpace that.
This would not be an issue in almost any conceivable scale configuration.

@astaykov astaykov left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a stable PR.
Neither Import-Module, nor Connect-ZtAssessment, nor Invoke-ZtAssessment complete successfully. The main issue is a warning:

PS C:\tdb\zerotrustassessment> Import-Module .\src\powershell\ZeroTrustAssessment.psd1 -Force

Resolving 7 dependencies...
    ✅ Module PSFramework found (v1.14.449).
# trimmed for clarity

Asserting MSAL loading order for dependencies...
    ✅ Loading Microsoft.IdentityModel.Abstractions.dll
# trimmed for clarity
WARNING: [12:38:43][ZeroTrustAssessment.psm1] Failed to import file C:\tdb\zerotrustassessment\src\powershell\scripts\variables.ps1 | Cannot convert argument "Lifetime", with value: "PSFramework.Configuration.Config", for "SetLifetime" to type "PSFramework.Parameter.TimeSpanParameter": "Cannot convert value "PSFramework.Configuration.Config" to type "PSFramework.Parameter.TimeSpanParameter". Error: "Could not interpret PSFramework.Configuration.Config""

PS C:\tdb\zerotrustassessment> Connect-ZtAssessment -TenantId abc.com
🔑 Authentication to Graph, Azure, AipService, ExchangeOnline, SecurityCompliance, SharePointOnline.
During the next steps, you may be prompted to authenticate separately for several services.

Connecting to Microsoft Graph
   ✅ Connected
WARNING: [12:41:38][Connect-ZtAssessment] Could not resolve the initial tenant domain from Graph. EXO, S&C, and SPO CBA connections may fail.
WARNING: [12:41:38][Test-ZtContext] Unable to verify user roles | Cannot index into a null array.

Ref. automated validation run (but I also tested this on a clean VM): https://github.com/microsoft/zerotrustassessment/actions/runs/27681160160/job/81868692154

@FriedrichWeinmann

Copy link
Copy Markdown
Member Author

Thanks for catching that! Fixed.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reworks in-memory persistence and parallel test execution to reduce memory pressure in large tenants, leveraging newer PSFramework runspace workflow features (timeouts/retries) and moving request caching to PSFramework cache primitives.

Changes:

  • Raised the required PSFramework version and migrated Graph/Azure request caching to PSFramework cache objects with configurable max-items and lifetime.
  • Updated parallel test execution to use PSFramework runspace workflow timeouts (including timeout type) and added retry support for failed tests.
  • Added memory controls around test statistics message retention and refactored per-test completion/error handling into shared helpers.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/powershell/ZeroTrustAssessment.psd1 Bumps required PSFramework version to enable new runspace workflow/cache features.
src/powershell/scripts/variables.ps1 Switches Graph/Azure caches to PSF cache objects and applies configured cache limits.
src/powershell/scripts/configuration.ps1 Adds config surface for retries, timeout type, stats message limits, and cache limits.
src/powershell/public/Invoke-ZtGraphRequest.ps1 Simplifies paging loop nextLink detection.
src/powershell/public/Invoke-ZtAzureRequest.ps1 Simplifies Az REST parameter forwarding via ConvertTo-PSFHashtable.
src/powershell/public/Invoke-ZtAssessment.ps1 Moves timeouts to [PSFTimeSpan] and improves config file validation behavior.
src/powershell/private/tests/Write-ZtTestFinish.ps1 New helper to finalize test statistics, progress updates, and per-test logs with message soft-limit.
src/powershell/private/tests/Write-ZtTestError.ps1 New helper to standardize error handling/reporting for failed tests.
src/powershell/private/tests/Start-ZtTestExecution.ps1 Reworks parallel test worker to use workflow timeout/retry features.
src/powershell/private/tests/Invoke-ZtTests.ps1 Updates test timeout type, passes timespans through, and improves workflow cleanup handling.
src/powershell/private/tests/Invoke-ZtTestParallel.ps1 New parallel-runspace test executor that integrates with shared result/statistics handling.
src/powershell/private/tests/Invoke-ZtTest.ps1 Refactors end-of-test handling to the new finish helper.
src/powershell/private/tests/Get-ZtTestResult.ps1 New helper to persist a per-test result object across retry attempts.
src/powershell/private/export/Export-ZtTenantData.ps1 Adds Debug.KeepWorkflows guard to keep/remove export workflows.
src/powershell/private/core/Invoke-ZtGraphRequestCache.ps1 Avoids “Using graph cache” debug log unless a cached value is actually returned.

Comment thread src/powershell/private/tests/Start-ZtTestExecution.ps1
Comment thread src/powershell/private/tests/Invoke-ZtTests.ps1
Comment thread src/powershell/public/Invoke-ZtAssessment.ps1 Outdated
Comment thread src/powershell/private/tests/Get-ZtTestResult.ps1
Comment thread src/powershell/private/tests/Invoke-ZtTestParallel.ps1 Outdated
Comment thread src/powershell/scripts/variables.ps1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants