Cassandra Kubernetes Statefulset NoHostAvailableException

10/24/2020

I have an application deployed in Kubernetes; it consists of Cassandra, a Go client, and a Java client (and other things, but they are not relevant to this discussion). We have used Helm to do our deployment. We are using a StatefulSet and a headless service for Cassandra. We have configured the clients to use the headless service DNS name as a contact point for cluster creation.
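
For context, the client-side setup is roughly the following. This is not from the original post; it is a minimal sketch assuming the DataStax Java driver 3.x (matching the exceptions quoted below), a hypothetical headless service name cassandra.mynamespace.svc.cluster.local, and a hypothetical keyspace my_keyspace:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CassandraClient {
        public static Session connect() {
            // The headless service DNS name resolves to the pod IPs of the
            // StatefulSet members; the driver resolves it once at startup
            // and from then on tracks the individual node IPs it discovered.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("cassandra.mynamespace.svc.cluster.local") // hypothetical service name
                    .withPort(9042)
                    .build();
            return cluster.connect("my_keyspace"); // hypothetical keyspace
        }
    }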

Everything works great until all of the nodes go down (or some other unfortunate combination of nodes goes down). I am simulating this by running kubectl delete in succession on all of the Cassandra pods.

When I do this the clients throw errors. In Java it is a NoHostAvailableException:

    "java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.200.23.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (1 required but only 0 alive)), /10.200.152.130:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"

which eventually becomes

    "java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)"

In Go it is:

    "gocql: no hosts available in the pool"

I can query Cassandra using cqlsh, and the nodes seem fine according to nodetool status; all of the new IPs are there. The image I am using doesn't have netstat, so I have not yet confirmed it is listening on the expected port.

By executing bash on the two client pods I can see that the DNS resolution makes sense using nslookup, but netstat does not show any established connections to Cassandra (they are present before I take the nodes down).

If I restart my clients everything works fine.

I have googled a lot (I mean a lot); most of what I have found relates to never having had a working connection in the first place, and the most relevant things seem very old (from around 2014 or 2016).

A node going down is a very basic scenario, and I would expect everything to work: the Cassandra cluster manages itself, it discovers new nodes as they come online, it balances the load, and so on.

If I take all of my Cassandra nodes down slowly, one at a time, everything works fine (I have not confirmed that the load is distributed appropriately and to the correct nodes, but at least it works).

So, is there a point where this behaviour is expected? i.e. if I have taken everything down, and no replacement node was up and running before the last node from the original cluster was taken down, is this behaviour expected?

To me it seems like it should be an easy issue to resolve, but I am not sure what is missing or incorrect. I am surprised that both clients show the same symptoms, which makes me think something is wrong with our StatefulSet and service rather than with the drivers.

-- Mike
cassandra
kubernetes
statefulset

1 Answer

10/26/2020

I think the problem might lie with the headless service DNS. If all of the nodes go down completely, there are no nodes available via the service at all until the pods are replaced, which could cause the drivers to hang.
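
There is no code-level fix in this answer, but as a rough workaround sketch (assuming the DataStax Java driver 3.x and the hypothetical service name used above, and that it is acceptable to re-resolve the service DNS on failure), the client could tear down and rebuild the Cluster when it sees that no hosts are available, since a fresh Cluster resolves the contact point through the headless service again and picks up the replacement pod IPs:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.exceptions.NoHostAvailableException;

    public class ReconnectingSession {
        private static final String CONTACT_POINT =
                "cassandra.mynamespace.svc.cluster.local"; // hypothetical service name
        private Cluster cluster;
        private Session session;

        public synchronized Session session() {
            if (session == null || session.isClosed()) {
                rebuild();
            }
            return session;
        }

        // Call this when a query fails with NoHostAvailableException so the
        // next call to session() reconnects through the service DNS.
        public synchronized void onNoHostAvailable(NoHostAvailableException e) {
            rebuild();
        }

        // Closing and rebuilding the Cluster forces the contact point to be
        // resolved again, which is how the new pod IPs get picked up.
        private void rebuild() {
            if (cluster != null) {
                cluster.close();
            }
            cluster = Cluster.builder()
                    .addContactPoint(CONTACT_POINT)
                    .withPort(9042)
                    .build();
            session = cluster.connect();
        }
    }

As far as I know, the 3.x driver's reconnection policy only retries the node IPs it already knows about rather than re-resolving the contact point, which would be consistent with the symptom that restarting the clients fixes the problem.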

I've noted that you've used Helm for your deployments, but you may be interested in this document on connecting to Cassandra clusters in Kubernetes from the authors of the cass-operator.

I'm going to contact some of the authors and get them to respond here. Cheers!

-- Erick Ramirez
Source: StackOverflow