Some windows container pods restarting on AKS & throwing RabbitMq connection failure from 60 pods and after several restarts come to running state

3/3/2020

After deploying around 60 pods on AKS which uses Rebus RabbitMq. During the initialization, some around 15 pods restart several times and then come into running state. Below error thrown by the components,

*Unhandled Exception: Rebus.Injection.ResolutionException: Could not resolve Rebus.Bus.IBus with decorator depth 0 - registrations: Rebus.Injection.Injectionist+Handler ---> RabbitMQ.Client.Exceptions.BrokerUnreachableException: None of the specified endpoints were reachable ---> System.AggregateException: One or more errors occurred. ---> RabbitMQ.Client.Exceptions.ConnectFailureException: Connection failed ---> System.Net.Sockets.SocketException: No such host is known
   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
   at System.Net.Dns.EndGetHostAddresses(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at RabbitMQ.Client.TcpClientAdapter.<ConnectAsync>d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at RabbitMQ.Client.Impl.TaskExtensions.<TimeoutAfter>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at RabbitMQ.Client.Impl.SocketFrameHandler.ConnectOrFail(ITcpClient socket, AmqpTcpEndpoint endpoint, Int32 timeout)
   --- End of inner exception stack trace ---
   at RabbitMQ.Client.Impl.SocketFrameHandler.ConnectUsingAddressFamily(AmqpTcpEndpoint endpoint, Func`2 socketFactory, Int32 timeout, AddressFamily family)
   at RabbitMQ.Client.Impl.SocketFrameHandler..ctor(AmqpTcpEndpoint endpoint, Func`2 socketFactory, Int32 connectionTimeout, Int32 readTimeout, Int32 writeTimeout)
   at RabbitMQ.Client.ConnectionFactory.CreateFrameHandler(AmqpTcpEndpoint endpoint)
   at RabbitMQ.Client.EndpointResolverExtensions.SelectOne[T](IEndpointResolver resolver, Func`2 selector)
   --- End of inner exception stack trace ---
   at RabbitMQ.Client.EndpointResolverExtensions.SelectOne[T](IEndpointResolver resolver, Func`2 selector)
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.Init(IEndpointResolver endpoints)
   at RabbitMQ.Client.ConnectionFactory.CreateConnection(IEndpointResolver endpointResolver, String clientProvidedName)
   --- End of inner exception stack trace ---
   at RabbitMQ.Client.ConnectionFactory.CreateConnection(IEndpointResolver endpointResolver, String clientProvidedName)
   at Rebus.Internals.ConnectionManager.GetConnection()
   at Rebus.RabbitMq.RabbitMqTransport.CreateQueue(String address)
   at Rebus.Config.RebusConfigurer.<>c__DisplayClass12_0.<Start>b__26(IResolutionContext c)
   at Rebus.Injection.Injectionist.ResolutionContext.Get[TService]()
   --- End of inner exception stack trace ---
   at Rebus.Injection.Injectionist.ResolutionContext.Get[TService]()
   at Rebus.Injection.Injectionist.Get[TService]()
   at Rebus.Config.RebusConfigurer.Start()
   at Castle.Windsor.Installer.AssemblyInstaller.Install(IWindsorContainer container, IConfigurationStore store)
   at Castle.Windsor.WindsorContainer.Install(IWindsorInstaller[] installers, DefaultComponentInstaller scope)
   at Castle.Windsor.WindsorContainer.Install(IWindsorInstaller[] installers)
   at RebusHost.Main(String[] args)*

Although there is a connection available to RabbitMq server but some pods on start give this error and after 3 to 5 restarts they are in successful running state. So not sure what will be causing pod to not get connected on first attempt itself. Any clue will be appreciated.

We are using Rebus 4.0 & RabbitMq 5.1.0.0 versions. Deploying the components(pods) on windows node of AKS. And on AKS running docker image of "rabbitmq:3-management" under linux node ofcourse.

-- Nirav31
azure-aks
kubernetes
rabbitmq
rebus
windows-container

0 Answers