| 1 | +# Stale LB DSR Rules Cleanup |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This mitigation script automatically detects and removes stale Load Balancer Direct Server Return (LB DSR) rules from VFP (Virtual Filtering Platform) that reference non-existent backend endpoints. It runs continuously to maintain network health by cleaning up orphaned rules that can cause connectivity issues. |
| 6 | + |
| 7 | +## Problem Statement |
| 8 | + |
| 9 | +When backend endpoints are removed or become unavailable, the corresponding LB DSR rules in VFP may not be cleaned up properly. These stale rules can: |
| 10 | +- Cause packet routing failures |
| 11 | +- Lead to connection timeouts |
| 12 | +- Create unnecessary overhead in the networking stack |
| 13 | +- Result in traffic being sent to non-existent endpoints |
| 14 | + |
| 15 | +## Solution |
| 16 | + |
| 17 | +The `cleanup-stale-lb-rules.ps1` script: |
| 18 | +1. Checks and sets the required registry configuration for LB DSR feature management |
| 19 | +2. Continuously monitors VFP LB DSR rules (both IPv4 and IPv6) |
| 20 | +3. Compares rule destination IPs (DIPs) against active HNS endpoints |
| 21 | +4. Automatically removes rules that reference non-existent endpoints |
| 22 | + |
| 23 | +## Prerequisites |
| 24 | + |
| 25 | +- Windows Server with HNS (Host Networking Service) enabled |
| 26 | +- VFP control utilities (`vfpctrl.exe`) available |
| 27 | +- PowerShell with administrator privileges |
| 28 | +- HNS PowerShell module |
| 29 | + |
| 30 | +## Usage |
| 31 | + |
| 32 | +### Running the Script on a Single Node |
| 33 | + |
| 34 | +```powershell |
| 35 | +.\cleanup-stale-lb-rules.ps1 |
| 36 | +``` |
| 37 | + |
| 38 | +The script will: |
| 39 | +1. Check registry key `HKLM:\SYSTEM\CurrentControlSet\Policies\Microsoft\FeatureManagement\Overrides\140377743` |
| 40 | +2. If the value is 1, set it to 0 and restart the node. This disables the change from PR 13179278, which causes delete-LB RPC calls from KubeProxy to fail with an Invalid IP error (ICM 719903780). |
| 41 | +3. Start a continuous monitoring loop with 10-second intervals |
| 42 | +4. Clean up any stale LB DSR rules found |
| 43 | + |
| 44 | +**Note:** This approach fixes issues on a single node. If the issue is widespread across the cluster, deploy the solution using a DaemonSet: |
| 45 | + |
| 46 | +```powershell |
| 47 | +kubectl create -f cleanup-stale-lb-rules.yaml |
| 48 | +``` |
| 49 | + |
| 50 | +This runs the mitigation script in HostProcess container (HPC) pods on all affected nodes. |
| 51 | + |
| 52 | +### Configuration |
| 53 | + |
| 54 | +You can modify these parameters at the top of the script: |
| 55 | + |
| 56 | +- **`$groups`**: VFP groups to monitor (default: `LB_DSR_IPv4_OUT`, `LB_DSR_IPv6_OUT`) |
| 57 | +- **`$refreshIntervalSeconds`**: Time between cleanup iterations (default: 10 seconds) |
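As a rough illustration, the parameter block might look like the following. Only the two variable names and their defaults come from this README; the exact layout inside `cleanup-stale-lb-rules.ps1` is an assumption.

```powershell
# Hypothetical layout of the parameter block at the top of the script.
# Variable names and default values are taken from the list above.
$groups = @('LB_DSR_IPv4_OUT', 'LB_DSR_IPv6_OUT')   # VFP groups to scan
$refreshIntervalSeconds = 10                        # pause between cleanup iterations
```

Raising `$refreshIntervalSeconds` lowers how often VFP is queried, at the cost of stale rules surviving longer between passes.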
| 58 | + |
| 59 | +## How It Works |
| 60 | + |
| 61 | +### 1. Registry Check |
| 62 | +The script first ensures that the feature flag registry key (140377743) is set to 0. If it is not, the script sets the value to 0 and restarts the node. |
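The decision can be sketched as a small, read-only check. This is not the script's actual code: `Test-OverrideNeedsReset` is a hypothetical helper, and the sketch assumes the override is stored as a DWORD value named `140377743` under the `Overrides` key. It follows the Usage section's wording, where only a value of 1 triggers the reset.

```powershell
# Assumption: the override is a DWORD value named 140377743 under this key.
$overridePath = 'HKLM:\SYSTEM\CurrentControlSet\Policies\Microsoft\FeatureManagement\Overrides'
$overrideName = '140377743'

# Hypothetical helper: decides whether the reset-and-restart path applies.
function Test-OverrideNeedsReset {
    param([AllowNull()][object]$CurrentValue)
    # A missing value is treated as "nothing to reset"; only an explicit 1 qualifies.
    return ($null -ne $CurrentValue) -and ([int]$CurrentValue -eq 1)
}

# Read-only usage on a node (does not change the registry or restart anything):
# $item = Get-ItemProperty -Path $overridePath -Name $overrideName -ErrorAction SilentlyContinue
# Test-OverrideNeedsReset -CurrentValue $item.$overrideName
```

Keeping the decision separate from the registry write makes it possible to preview whether a node would restart before running the real script.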
| 63 | + |
| 64 | +### 2. Endpoint Collection |
| 65 | +- Retrieves all HNS policies |
| 66 | +- Extracts endpoint references |
| 67 | +- Builds a dictionary of valid endpoint IP addresses |
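The steps above can be sketched with plain objects standing in for HNS output. The function name `Get-ValidEndpointIp` and the property names (`References`, `ID`, `IPAddress`) are assumptions about the data shape, not confirmed details of the script, and IPv6 addresses may be exposed through a different property on real endpoints.

```powershell
# Hypothetical helper: maps policy references to the IPs of the endpoints they name.
function Get-ValidEndpointIp {
    param([object[]]$Policies, [object[]]$Endpoints)

    # Endpoint IDs referenced by at least one policy (last path segment of each reference).
    $referenced = @{}
    foreach ($policy in $Policies) {
        foreach ($ref in $policy.References) {
            $referenced[($ref -split '/')[-1]] = $true
        }
    }

    # Dictionary keyed by IP address, so rule validation can do constant-time lookups.
    $validIps = @{}
    foreach ($ep in $Endpoints) {
        if ($referenced.ContainsKey($ep.ID)) { $validIps[$ep.IPAddress] = $ep.ID }
    }
    return $validIps
}

# Sample data standing in for real HNS policies and endpoints.
$policies  = @([pscustomobject]@{ References = @('/endpoints/aaa', '/endpoints/bbb') })
$endpoints = @(
    [pscustomobject]@{ ID = 'aaa'; IPAddress = '10.244.0.4' },
    [pscustomobject]@{ ID = 'ccc'; IPAddress = '10.244.0.9' }
)
$valid = Get-ValidEndpointIp -Policies $policies -Endpoints $endpoints
$valid.Keys   # 10.244.0.4
```

Endpoint `ccc` is left out because no policy references it, and reference `bbb` contributes nothing because no endpoint with that ID exists.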
| 68 | + |
| 69 | +### 3. Rule Validation |
| 70 | +For each VFP port and LB DSR group: |
| 71 | +- Lists all rules in the `LB_DSR` layer |
| 72 | +- Extracts DIP (Destination IP) ranges from each rule |
| 73 | +- Compares DIPs against the valid endpoint dictionary |
| 74 | + |
| 75 | +### 4. Cleanup |
| 76 | +- Rules with DIPs not found in active endpoints are flagged as stale |
| 77 | +- Stale rules are automatically deleted using `vfpctrl /remove-rule` |
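The stale-rule check itself amounts to a set-membership test. The sketch below is a dry run under stated assumptions: `Get-StaleRule` and the rule shape (`RuleId`, `Port`, `Group`, `Dips`) are hypothetical, a rule is treated as stale if any of its DIPs is missing, and nothing is passed to `vfpctrl`.

```powershell
# Hypothetical helper: returns the rules that have at least one DIP missing from $ValidIps.
function Get-StaleRule {
    param([object[]]$Rules, [hashtable]$ValidIps)
    foreach ($rule in $Rules) {
        # Collect every DIP of this rule that has no matching active endpoint.
        $missing = @($rule.Dips | Where-Object { -not $ValidIps.ContainsKey($_) })
        if ($missing.Count -gt 0) {
            [pscustomobject]@{
                RuleId      = $rule.RuleId
                Port        = $rule.Port
                Group       = $rule.Group
                MissingDips = $missing
            }
        }
    }
}

# Sample data: one rule points at a missing endpoint, one at a live endpoint.
$validIps = @{ '10.244.0.4' = $true }
$rules = @(
    [pscustomobject]@{ RuleId = 'ABC123'; Port = 'Port1'; Group = 'LB_DSR_IPv4_OUT'; Dips = @('10.244.0.25') },
    [pscustomobject]@{ RuleId = 'DEF456'; Port = 'Port1'; Group = 'LB_DSR_IPv4_OUT'; Dips = @('10.244.0.4') }
)
$stale = @(Get-StaleRule -Rules $rules -ValidIps $validIps)
foreach ($s in $stale) {
    # Dry run: report what would be removed instead of deleting anything.
    Write-Output ("Would delete rule: ruleId: {0}, port: {1}, group: {2}" -f $s.RuleId, $s.Port, $s.Group)
}
```

Running the detection as a dry run first is a cheap way to confirm that only genuinely orphaned rules are selected before any deletion takes place.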
| 78 | + |
| 79 | +## Output Examples |
| 80 | + |
| 81 | +### Healthy State |
| 82 | +``` |
| 83 | +All DIP ranges are present in the dictionary. |
| 84 | +``` |
| 85 | + |
| 86 | +### Stale Rules Detected |
| 87 | +``` |
| 88 | +Missing DIP ranges: |
| 89 | + - 10.244.0.25 |
| 90 | + - fdf5:5d67:b9ce:b28f::13f |
| 91 | +Deleting rule : ruleId: ABC123, port: Port1, group: LB_DSR_IPv4_OUT |
| 92 | +``` |
| 93 | + |
| 94 | +## Monitoring |
| 95 | + |
| 96 | +The script provides color-coded output: |
| 97 | +- **Green**: Healthy state, all rules valid |
| 98 | +- **Yellow**: Configuration changes or rule deletion in progress |
| 99 | +- **Red**: Stale rules detected |
| 100 | +- **Cyan**: Status updates and iteration markers |
| 101 | + |
| 102 | +## Important Notes |
| 103 | + |
| 104 | +- The script runs indefinitely until manually stopped (Ctrl+C) |
| 105 | +- Node restart may occur on first run if registry configuration is incorrect |
| 106 | +- Avoid running cleanup while legitimate endpoint updates are in progress; a rule whose endpoint is not yet listed in HNS could be flagged as stale (a false positive) |
| 107 | +- The script requires elevated privileges to modify VFP rules and registry settings |
| 108 | + |
| 109 | +## Troubleshooting |
| 110 | + |
| 111 | +### Script doesn't detect stale rules |
| 112 | +- Verify VFP and HNS are functioning correctly |
| 113 | +- Check that `vfpctrl.exe` is accessible in the system PATH |
| 114 | +- Ensure HNS endpoints are properly registered |
| 115 | + |
| 116 | +### Node restarts unexpectedly |
| 117 | +- This is expected behavior if the registry key is not set to 0 |
| 118 | +- After restart, the script will continue normal operation |
| 119 | + |
| 120 | +### Permission errors |
| 121 | +- Run PowerShell as Administrator |
| 122 | +- Verify that the account has rights to modify VFP rules and the registry |
| 123 | + |
| 124 | +## Related Documentation |
| 125 | + |
| 126 | +- [VFP Documentation](../../helper/VFP.psm1) |
| 127 | +- [HNS Module](../HNS/) |
| 128 | +- [Network Health Monitoring](../../networkhealth/) |
| 129 | + |
| 130 | +## Support |
| 131 | + |
| 132 | +For issues or questions, please refer to the main repository documentation or open an issue. |