The tdengine image (2.4.0.3) cannot be shut down normally when started with Kubernetes

2/14/2022

I created a three-node cluster through a k8s Deployment using the TDengine 2.4.0.3 image. Viewing the pod information:

kubectl get pods -n mytdengine
NAME          READY       STATUS        RESTART       AGE
tdengine-01    1/1         Running       0             1m45s
tdengine-02    1/1         Running       0             1m45s
tdengine-03    1/1         Running       0             1m45s

Everything was going well.

However, when I tried to stop a pod with a delete operation:

kubectl delete pod tdengine-03 -n mytdengine

The target pod was not deleted as expected. Its status turned to:

NAME          READY       STATUS        RESTART       AGE
tdengine-01    1/1         Running       0             2m35s
tdengine-02    1/1         Running       0             2m35s
tdengine-03    1/1         Terminating   0             2m35s

After several tests, the pod was not successfully deleted until about 3 minutes had passed, which is abnormal. I wasn't actually using the TDengine instance, so there was no heavy load or storage usage. I could not find a reason why it took 3 minutes to shut down.

-- naissance
kubernetes
tdengine

1 Answer

2/14/2022

After testing, I ruled out a problem with the Kubernetes configuration. I also found that the parameter 'terminationGracePeriodSeconds' in the Pod's YAML file was set to 180:

terminationGracePeriodSeconds: 180

This means that the pod was not shut down gracefully, but was forcibly removed after the timeout.
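
For reference, here is a minimal sketch of where this parameter sits in a pod manifest (the names and image tag below are illustrative, not my actual deployment):

apiVersion: v1
kind: Pod
metadata:
  name: tdengine-03            # illustrative name
  namespace: mytdengine
spec:
  # Kubernetes waits this long after sending SIGTERM before sending SIGKILL
  terminationGracePeriodSeconds: 180
  containers:
  - name: tdengine
    image: tdengine/tdengine:2.4.0.3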

Generally speaking, stopping a pod sends a SIGTERM signal to the container. If the container handles the signal correctly, it can shut down gracefully. However, if it does not stop, or does not respond to the signal within the timeout set by the 'terminationGracePeriodSeconds' parameter above, the container receives SIGKILL and is forcibly killed. Ref: https://tasdikrahman.me/2019/04/24/handling-singals-for-applications-in-kubernetes-docker/

The reason is that the startup script of the TDengine 2.4.0.3 image starts taosadapter first and then taosd, but it does not install a handler for the SIGTERM signal. Because of the special role of PID 1 on Linux, when k8s sends SIGTERM to the container in the pod, only PID 1 receives it (as shown in the process list below, PID 1 is the startup script); the script never passes the signal on to taosadapter and taosd, so they are never notified and keep running.

    PID USER     PR   NI VIRT     RES    SHR  S  %CPU  %MEM  TIME+    COMMAND
     9 root      20   0 2873404  80144   2676 S   2.3  0.5 112:30.81 taosadapter                     
     8 root      20   0 2439240  41364   2996 S   1.0  0.3 130:53.67 taosd                           
     1 root      20   0   20044   1648   1368 S   0.0  0.0   0:00.01 run_taosd.sh                    
     7 root      20   0   20044    476    200 S   0.0  0.0   0:00.00 run_taosd.sh                    
   135 root      20   0   20176   2052   1632 S   0.0  0.0   0:00.00 bash                            
   146 root      20   0   38244   1788   1356 R   0.0  0.0   0:00.00 top 
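
For comparison, a rough sketch of a startup script that does forward the signal might look like the following (this is only an assumption for illustration, not the actual run_taosd.sh shipped in the image):

#!/bin/bash
# Illustrative only -- not the real run_taosd.sh from the 2.4.0.3 image.

# Start both daemons in the background and remember their PIDs.
taosadapter &
adapter_pid=$!
taosd &
taosd_pid=$!

# As PID 1, forward SIGTERM to the children so they can shut down cleanly.
trap 'kill -TERM "$adapter_pid" "$taosd_pid" 2>/dev/null' TERM

# Wait for the children; the second wait lets them finish after the trap fires.
wait "$adapter_pid" "$taosd_pid"
wait "$adapter_pid" "$taosd_pid"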

I personally chose to add a preStop hook in the k8s YAML file so that the container can be stopped immediately:

lifecycle:
   preStop:
      exec:
         command:
         - /bin/bash
         - -c
         - procnum=`ps aux | grep taosd | grep -v -e grep -e entrypoint -e run_taosd
               | awk '{print $2}'`; kill -15 $procnum; if [ "$?" -eq 0 ]; then echo "kill
               taosd success"; fi
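
The idea is that the preStop hook sends SIGTERM to taosd directly (which is exactly what the entrypoint script never does), so the process can exit on its own before the grace period expires instead of waiting for the 180-second timeout and the final SIGKILL.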

Of course, once we know the cause of the problem, there are other solutions which are not discussed here.

-- naissance
Source: StackOverflow