What are the best practices for a health check API and probes in a microservices Kubernetes environment?

10/22/2019

We are developing tons of micro-services. They all run in Kubernetes. As ops, I need to define probes for each micro-service. So we will create a health check API for each micro-service. What are the best practices for this API? What are the best practices for probes? Do we need to check the service's health only or the database connection too (and more)? Is it redundant? The databases are in Kubernetes too, and have their own probes too. Can we just use the /version API as the probe?

I'm looking for feedback and documentation. Thank you.

-- Antoine
health-monitoring
kubernetes
kubernetes-health-check
microservices
probe

2 Answers

10/22/2019

A microservice generally calls other microservices/services to retrieve data, and there is a chance that a downstream service may be down. You can use the "Circuit Breaker Pattern" for this. The pattern prevents an application from trying to invoke a remote service or access a shared resource when that operation is highly likely to fail.
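
As a rough illustration of the circuit breaker idea, here is a minimal Python sketch. The names `CircuitBreaker`, `CircuitOpenError`, `max_failures` and `reset_timeout` are made up for this example and do not come from any particular library; in a real service you would more likely reach for an existing resilience library than write your own.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # Hypothetical helper: opens after `max_failures` consecutive failures
    # and rejects calls until `reset_timeout` seconds have passed.
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of hitting a service that is likely down.
                raise CircuitOpenError("circuit open; skipping remote call")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a downstream call would then look like `breaker.call(fetch_orders, user_id)`, where `fetch_orders` is a hypothetical function that talks to the remote service.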

The Health Check pattern, one of the observability patterns for microservices, covers this. Each service needs to have an endpoint, such as /health, that can be used to check the health of the application.
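
To make the /health endpoint concrete, here is a minimal sketch using only the Python standard library. The `{"status": "UP"}` response body, the `HealthHandler` and `make_server` names, and port 8080 are assumptions for illustration, not a required format; Kubernetes HTTP probes only look at the status code.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # The process is up and able to answer HTTP requests.
            body = json.dumps({"status": "UP"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def make_server(host="0.0.0.0", port=8080):
    # Hypothetical helper; the host and port defaults are assumptions.
    return HTTPServer((host, port), HealthHandler)

# To run it: make_server().serve_forever()
```

A probe would then point at path /health on whatever port the container actually exposes.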

-- madhuka
Source: StackOverflow

10/22/2019

An argument for including databases and other downstream dependencies in the health check is the following:

Assume you have a load balancer exposing some number of micro-services to the outside world. If the database of one of these micro-services goes down under heavy load, and this is not reflected in the micro-service's health check, the load balancer will keep directing traffic to the micro-service, making the database's problem worse.

If instead the health check included downstream dependencies, the load balancer would stop directing traffic to the micro-service (and hopefully show a nice error message to the user). This would give the database time to recover from the increase in load (and give ops time to react).

So I would argue that using a basic /version is not a good idea.
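
A health check that includes downstream dependencies could be sketched like this in Python. The `readiness` function and the shape of its report are hypothetical; each check is assumed to be a callable that raises an exception when its dependency is unreachable. In Kubernetes terms this kind of check fits a readiness probe, which takes the pod out of Service endpoints when it fails, rather than a liveness probe, which restarts the container.

```python
def readiness(checks):
    """Run each named dependency check and return (http_status, report).

    `checks` maps a dependency name to a callable that raises on failure.
    """
    report = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "UP"
        except Exception as exc:
            report[name] = f"DOWN: {exc}"
    healthy = all(state == "UP" for state in report.values())
    # 503 tells the load balancer / probe to stop sending traffic here.
    return (200 if healthy else 503), report
```

For example, `readiness({"database": ping_db})` with a hypothetical `ping_db` that runs a cheap query would return 503 while the database is down, which is what lets traffic drain away from the service.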

-- Blokje5
Source: StackOverflow