Unhandled exception in ActivationRebalancerMonitor after an unclean shutdown #9964

@jsteinich

Description

ActivationRebalancerMonitor.OnStart can throw an unhandled exception if the initial rebalancer report fails (the grain timer callback may have the same problem). When that happens, silo startup fails.

After an unclean shutdown of all silos, the membership table contains only invalid entries. The initial report then always fails, because it tries to activate the grain on an invalid silo. Startup is cancelled before the silo has a chance to check for expired membership entries, so silos perpetually fail to start until someone intervenes manually.

Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Connection attempt to endpoint S10.52.3.99:22222:132113324 timed out after 00:00:05
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 201
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 223
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 90
   at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 701
   at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29
   at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 373
   at Orleans.Runtime.Placement.PlacementService.PlacementWorker.AddressWaitingMessages(GrainPlacementWorkItem completedWorkItem) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 339
--- End of stack trace from previous location ---
   at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync|40_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 476
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 90
   at Orleans.Runtime.Placement.Rebalancing.ActivationRebalancerMonitor.<>c__DisplayClass11_0.<<OnStart>b__0>d.MoveNext() in /_/src/Orleans.Runtime/Placement/Rebalancing/ActivationRebalancerMonitor.cs:line 85
--- End of stack trace from previous location ---
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute() in /_/src/Orleans.Runtime/Scheduler/ClosureWorkItem.cs:line 33
   at Orleans.Runtime.Placement.Rebalancing.ActivationRebalancerMonitor.OnStart(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Placement/Rebalancing/ActivationRebalancerMonitor.cs:line 72
   at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct) in /_/src/Orleans.Runtime/Lifecycle/SiloLifecycleSubject.cs:line 113
   at Orleans.LifecycleSubject.OnStart(CancellationToken cancellationToken) in /_/src/Orleans.Core/Lifecycle/LifecycleSubject.cs:line 110
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute() in /_/src/Orleans.Runtime/Scheduler/ClosureWorkItem.cs:line 33
   at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Silo/Silo.cs:line 150
   at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Hosting/SiloHostedService.cs:line 28
   at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>b__14_1(IHostedService service, CancellationToken token)
   at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
   at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>g__LogAndRethrow|14_3(<>c__DisplayClass14_0&)
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at PerBlue.Game.Server.Silo.Program.Main() in /src/Server/Silo/Program.cs:line 229
   at PerBlue.Game.Server.Silo.Program.<Main>()
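
One way to avoid this kind of startup failure is to treat the initial report as best-effort: catch the failure, log it, and let startup continue so that later stages (such as membership cleanup) still get to run. The sketch below shows only that general pattern. It does not use Orleans types and is not the actual ActivationRebalancerMonitor code or a proposed patch; GuardedStartupStage, initialReport, and logWarning are hypothetical names introduced for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a startup participant. This is NOT the Orleans
// ActivationRebalancerMonitor; it only illustrates guarding a best-effort call.
public sealed class GuardedStartupStage
{
    private readonly Func<CancellationToken, Task> _initialReport;
    private readonly Action<Exception> _logWarning;

    public GuardedStartupStage(
        Func<CancellationToken, Task> initialReport,
        Action<Exception> logWarning)
    {
        _initialReport = initialReport;
        _logWarning = logWarning;
    }

    // True only if the initial report completed without throwing.
    public bool InitialReportSucceeded { get; private set; }

    public async Task OnStart(CancellationToken cancellationToken)
    {
        try
        {
            await _initialReport(cancellationToken);
            InitialReportSucceeded = true;
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Assumption: the initial report is not required for a healthy
            // start, so a failure is logged and startup continues.
            // Cancellation is deliberately not swallowed.
            _logWarning(ex);
        }
    }
}
```

With this shape, a report that throws (for example, a connection timeout to a silo that no longer exists) is logged and OnStart still completes, while cancellation of the startup token still propagates to the caller. Whether the report should be retried later, for example on the next timer tick, is a separate design decision.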

Environment:

  • Orleans 9.2.1
  • DynamoDB membership
  • Activation repartitioning and rebalancing enabled
  • Default directory
