GridGain partition loss with node scale down

5/16/2021

We deployed a GridGain cluster on Google Kubernetes Engine and it works properly with persistence enabled. We need to enable autoscaling. Scaling up produces no errors, but scaling down results in "Partition loss". We have to recover the lost partitions using the control.sh script, but that does not work every time.

What is the solution for this? Does scale down not work for GridGain nodes?

-- Nuwan Sameera
google-cloud-platform
gridgain
ignite
kubernetes

2 Answers

5/17/2021

Usually you should have a backup factor sufficient to offset lost nodes (for example, if you have backups=2, you can lose at most 2 nodes at the same time).

Coupled with baselineAutoAdjust set to a reasonable value, this should provide a scale-down capability.
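
For instance, a minimal sketch of such a configuration (the cache name, the backup count, and the 60-second auto-adjust timeout are illustrative values; it assumes the cluster is already active so the baseline calls are permitted):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ScaleDownFriendlyNode {
    public static void main(String[] args) {
        // Keep 2 backup copies of every partition so the cluster can
        // tolerate losing up to 2 nodes at once without partition loss.
        CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");
        cacheCfg.setBackups(2);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCacheConfiguration(cacheCfg);

        Ignite ignite = Ignition.start(cfg);

        // Let the baseline topology follow the current set of server nodes,
        // waiting 60 seconds after a topology change before adjusting.
        ignite.cluster().baselineAutoAdjustEnabled(true);
        ignite.cluster().baselineAutoAdjustTimeout(60_000);
    }
}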

Scale down with data loss and persistence enabled will indeed require resetting lost partitions.
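
For reference, lost partitions are reset per cache with the control script, for example (the cache names here are placeholders):

control.sh --cache reset_lost_partitions myCache1,myCache2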

-- alamar
Source: StackOverflow

5/17/2021

In addition to @alamar's answer, you need to ensure that your nodes are being stopped gracefully. A graceful shutdown performs additional verification of your data and ensures that you won't have lost partitions in your cluster when a node leaves.

You might verify graceful shutdown by searching for the following messages:

Invoking shutdown hook...
...
Ensuring that caches have sufficient backups and local rebalance completion...

You can enable graceful shutdown with the following options:
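
A minimal sketch of both, assuming Ignite/GridGain 2.9 or later, where graceful shutdown is driven either by the IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN system property or by ShutdownPolicy.GRACEFUL in the node configuration:

// Option 1: JVM system property, e.g. added to the pod's JVM options:
//   -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true

// Option 2: explicit shutdown policy in the Ignite configuration.
import org.apache.ignite.Ignition;
import org.apache.ignite.ShutdownPolicy;
import org.apache.ignite.configuration.IgniteConfiguration;

public class GracefulShutdownNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Wait for partition backups to be fully rebalanced to other nodes
        // before this node is allowed to stop.
        cfg.setShutdownPolicy(ShutdownPolicy.GRACEFUL);

        Ignition.start(cfg);
    }
}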

It seems that sometimes the #1 approach with the system property might not work well in the Kubernetes world. Please check the latter one, with an explicit Ignite config adjustment, if the first one does not seem to work properly.

In addition to the above, you might want to disable baselineAutoAdjust (if enabled) to prevent data rebalancing during short-lived scale up and scale down events.
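
A small sketch of doing that from code, assuming a running node handle (the same setting can also be changed through the control script):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class DisableBaselineAutoAdjust {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Turn baseline auto-adjust off so short-lived topology changes
        // do not trigger rebalancing.
        ignite.cluster().baselineAutoAdjustEnabled(false);
    }
}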

-- Alexandr Shapkin
Source: StackOverflow