ActivationRebalancerMonitor.OnStart can throw an unhandled exception if the initial rebalancer report fails (the grain timer may have the same problem). When that happens, silo startup fails.
After an unclean shutdown of all silos, the membership table contains only invalid entries. The initial report then always fails, because it tries to activate the grain on an invalid silo. Startup is cancelled before the silo has a chance to check for expired membership entries, so every silo fails to start, perpetually, until someone intervenes manually.
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Connection attempt to endpoint S10.52.3.99:22222:132113324 timed out after 00:00:05
at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 201
at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 223
at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 90
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 701
at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29
at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 373
at Orleans.Runtime.Placement.PlacementService.PlacementWorker.AddressWaitingMessages(GrainPlacementWorkItem completedWorkItem) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 339
--- End of stack trace from previous location ---
at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync|40_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 476
at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 90
at Orleans.Runtime.Placement.Rebalancing.ActivationRebalancerMonitor.<>c__DisplayClass11_0.<<OnStart>b__0>d.MoveNext() in /_/src/Orleans.Runtime/Placement/Rebalancing/ActivationRebalancerMonitor.cs:line 85
--- End of stack trace from previous location ---
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute() in /_/src/Orleans.Runtime/Scheduler/ClosureWorkItem.cs:line 33
at Orleans.Runtime.Placement.Rebalancing.ActivationRebalancerMonitor.OnStart(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Placement/Rebalancing/ActivationRebalancerMonitor.cs:line 72
at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct) in /_/src/Orleans.Runtime/Lifecycle/SiloLifecycleSubject.cs:line 113
at Orleans.LifecycleSubject.OnStart(CancellationToken cancellationToken) in /_/src/Orleans.Core/Lifecycle/LifecycleSubject.cs:line 110
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute() in /_/src/Orleans.Runtime/Scheduler/ClosureWorkItem.cs:line 33
at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Silo/Silo.cs:line 150
at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Hosting/SiloHostedService.cs:line 28
at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>b__14_1(IHostedService service, CancellationToken token)
at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>g__LogAndRethrow|14_3(<>c__DisplayClass14_0&)
at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
at PerBlue.Game.Server.Silo.Program.Main() in /src/Server/Silo/Program.cs:line 229
at PerBlue.Game.Server.Silo.Program.<Main>()
Environment:
- Orleans 9.2.1
- DynamoDB membership
- Activation repartitioning and rebalancing enabled
- Default directory
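One possible mitigation, sketched here without having checked it against the Orleans source, is to treat the initial rebalancer report as best-effort so that a failure is logged rather than propagated out of the startup stage. The helper below is purely illustrative: `BestEffortStartup`, `TryRunAsync`, and the `onFailure` callback are hypothetical names, not Orleans APIs, and the delegate stands in for whatever call the monitor makes to the rebalancer grain.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BestEffortStartup
{
    // Hypothetical helper: runs a startup step and reports a failure to the
    // caller instead of letting the exception abort the surrounding startup.
    public static async Task<bool> TryRunAsync(
        Func<CancellationToken, Task> step,
        Action<Exception> onFailure,
        CancellationToken cancellationToken)
    {
        try
        {
            await step(cancellationToken);
            return true;
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // Startup itself was cancelled: let that propagate unchanged.
            throw;
        }
        catch (Exception ex)
        {
            // Any other failure (for example, an unreachable target silo)
            // is handed to the callback and swallowed.
            onFailure(ex);
            return false;
        }
    }
}
```

With something like this, a failed initial report would leave the silo free to finish starting and clean up expired membership entries, and the report could be retried on the next timer tick. Whether that is the right fix for ActivationRebalancerMonitor is for the maintainers to judge.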