MySQL master nodes on Kubernetes (AKS) dies for no apparent reason

5/2/2019

EDIT: I'm using MySQL version is 8.0.14 and using official MySQL docker images.

I have a master-slave MySQL topology running on top of AKS. After a some time running, master and intermediate-master containers die (exit code 137 == SIGKILL). Slaves are just fine, having all nodes (master, intermediate-master and slaves) the exact same configuration and VM capacities (Standard DS14 v2 - 16 vcpus, 112 GB memory). There is no indication that the container is OOMKilled and overall memory usage is never more that 60%.

kubectl describe pod xxx Only shows that container was restarted because container exited with error code 137 and error type Error.

There is no error shown in MySQL logs, and the only errors I see on the node's kernel logs are the following:

May 01 14:40:23 aks-myeuscluster-14926762-0 dockerd[1268]: time="2019-05-01T14:40:23.033211822Z" level=info msg="shim reaped" id=48e8b87334bbb0353e9d04230790e154ad68fa01c7159201ffc50c2eee61067a
May 01 14:40:23 aks-myeuscluster-14926762-0 dockerd[1268]: time="2019-05-01T14:40:23.033259823Z" level=warning msg="cleaning up after killed shim" id=48e8b87334bbb0353e9d04230790e154ad68fa01c7159201ffc50c2eee61067a namespace=moby
May 01 14:40:23 aks-myeuscluster-14926762-0 dockerd[1268]: time="2019-05-01T14:40:23.231095965Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
May 01 14:40:23 aks-myeuscluster-14926762-0 dockerd[1268]: time="2019-05-01T14:40:23.231730670Z" level=warning msg="48e8b87334bbb0353e9d04230790e154ad68fa01c7159201ffc50c2eee61067a cleanup: failed to unmount IPC: umount /var/lib/docker/containers/48e8b87334bbb0353e9d04230790e154ad68fa01c7159201ffc50c2eee61067a/mounts/shm, flags: 0x2: no such file or directory"
May 01 14:40:23 aks-myeuscluster-14926762-0 kubelet[1481]: I0501 14:40:23.973507    1481 kubelet.go:1928] SyncLoop (PLEG): "mysql-master-0_mysql-masters(ce93a7a5-6543-11e9-b093-062e6bba92a2)", event: &pleg.PodLifecycleEvent{ID:"ce93a7a5-6543-11e9-b093-062e6bba92a2", Type:"ContainerDied", Data:"48e8b87334bbb0353e9d04230790e154ad68fa01c7159201ffc50c2eee61067a"}
May 01 14:40:23 aks-myeuscluster-14926762-0 dockerd[1268]: time="2019-05-01T14:40:23.986599174Z" level=warning msg="Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."
May 01 14:40:24 aks-myeuscluster-14926762-0 dockerd[1268]: time="2019-05-01T14:40:24.246558869Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/b2d111216341c80b7b5d5aba9f3d81bb81923da583b2c57d99803831575b0596/shim.sock" debug=false pid=23364
May 01 14:40:25 aks-myeuscluster-14926762-0 kubelet[1481]: I0501 14:40:25.007553    1481 kubelet.go:1928] SyncLoop (PLEG): "mysql-master-0_mysql-masters(ce93a7a5-6543-11e9-b093-062e6bba92a2)", event: &pleg.PodLifecycleEvent{ID:"ce93a7a5-6543-11e9-b093-062e6bba92a2", Type:"ContainerStarted", Data:"b2d111216341c80b7b5d5aba9f3d81bb81923da583b2c57d99803831575b0596"}

A docker inspect on the node for the terminated container shows:

[
    {
        "Id": "48e8b87334bbb0353e9d04230790e154ad68fa01c7159201ffc50c2eee61067a",
        "Created": "2019-05-01T04:07:23.816102331Z",
        "Path": "docker-entrypoint.sh",
        "Args": [
            "mysqld"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 137,
            "Error": "",
            "StartedAt": "2019-05-01T04:07:24.192620731Z",
            "FinishedAt": "2019-05-01T14:40:23.230693162Z"
        },
        "Image": "sha256:71b5c7e10f9b20f4bd37c348872899cac828b1d2edad269fc8b93c9d43682241",
        "ResolvConfPath": "/var/lib/docker/containers/d4c359fa3c05b26eed1259288f2073ae98c7dee0d050fb141d5c6f40b2000c09/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/d4c359fa3c05b26eed1259288f2073ae98c7dee0d050fb141d5c6f40b2000c09/hostname",
        "HostsPath": "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/etc-hosts",
        "LogPath": "/var/lib/docker/containers/48e8b87334bbb0353e9d04230790e154ad68fa01c7159201ffc50c2eee61067a/48e8b87334bbb0353e9d04230790e154ad68fa01c7159201ffc50c2eee61067a-json.log",
        "Name": "/k8s_mysql_mysql-master-0_mysql-masters_ce93a7a5-6543-11e9-b093-062e6bba92a2_3",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "docker-default",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [
                "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/volumes/kubernetes.io~empty-dir/data:/var/lib/mysql",
                "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/volumes/kubernetes.io~empty-dir/conf:/etc/mysql/conf.d",
                "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/volumes/kubernetes.io~secret/default-token-96h8m:/var/run/secrets/kubernetes.io/serviceaccount:ro",
                "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/etc-hosts:/etc/hosts",
                "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/containers/mysql/65c11b38:/dev/termination-log"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {
                    "max-file": "5",
                    "max-size": "50m"
                }
            },
            "NetworkMode": "container:d4c359fa3c05b26eed1259288f2073ae98c7dee0d050fb141d5c6f40b2000c09",
            "PortBindings": null,
            "RestartPolicy": {
                "Name": "",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": null,
            "CapDrop": null,
            "Dns": null,
            "DnsOptions": null,
            "DnsSearch": null,
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "container:d4c359fa3c05b26eed1259288f2073ae98c7dee0d050fb141d5c6f40b2000c09",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 999,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": [
                "seccomp=unconfined"
            ],
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 102,
            "Memory": 120259084288,
            "NanoCpus": 0,
            "CgroupParent": "/kubepods/burstable/podce93a7a5-6543-11e9-b093-062e6bba92a2",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 100000,
            "CpuQuota": 1600000,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DeviceCgroupRules": null,
            "DiskQuota": 0,
            "KernelMemory": 0,
            "KernelMemoryTCP": 0,
            "MemoryReservation": 0,
            "MemorySwap": -1,
            "MemorySwappiness": null,
            "OomKillDisable": false,
            "PidsLimit": 0,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": [
                "/proc/acpi",
                "/proc/kcore",
                "/proc/keys",
                "/proc/latency_stats",
                "/proc/timer_list",
                "/proc/timer_stats",
                "/proc/sched_debug",
                "/proc/scsi",
                "/sys/firmware"
            ],
            "ReadonlyPaths": [
                "/proc/asound",
                "/proc/bus",
                "/proc/fs",
                "/proc/irq",
                "/proc/sys",
                "/proc/sysrq-trigger"
            ]
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/e881867313c0c44e1c344f051c1ed2bdbb2d7060987503ee654bdbb52298e22f-init/diff:/var/lib/docker/overlay2/02dbd10b6ce555ab3e7adfa20c38a11e6efa59ca2a48630f9d9511982c6ad679/diff:/var/lib/docker/overlay2/f90a222989df3b42aecf7bed9005b1126a73b8b0fd4842bd5c7ecdd7f9f011e6/diff:/var/lib/docker/overlay2/08d3f18230cdb2f65212d1a30e7f6b06b5720778dbb1c3d4a672fc4d804f7704/diff:/var/lib/docker/overlay2/aed1b032114dd084eae7b4d4427a3805f96427be0dce0f8b729840f1f230fd91/diff:/var/lib/docker/overlay2/966072e7d9ffe343e1f00b87c8cb8399f05084052cced72644123b19269aa80e/diff:/var/lib/docker/overlay2/154369c09c2853f5d77d05ba917c7c7f5cdb071f5c328ac403b30c7b4ab98a62/diff:/var/lib/docker/overlay2/067d055c87bf0ccb8e151cd8a2ac7c53da7215f4e5bb99e72f56d321b9f236f5/diff:/var/lib/docker/overlay2/22ddb21cd2fdfffc9e3757f89d866b94e471dd5baf81ef2481742a55ca04f3d3/diff:/var/lib/docker/overlay2/4523ae9e068b9dd7d0d6fb9e280cadb1c17e7101f7f280bfd9d4b9b54ca0616d/diff:/var/lib/docker/overlay2/32e5c4244779f2008e9f020f97afd80809c3549f956432e7116e446a59bc079d/diff:/var/lib/docker/overlay2/bfe5c64218049c0c90d4eb2f9bf2356a83e8c44802df0e48d6f55204efaac6a6/diff:/var/lib/docker/overlay2/a0c23bd39e1c83e67f3af090af495de289297d5332d8f7766ee197e0d4ca6a0c/diff",
                "MergedDir": "/var/lib/docker/overlay2/e881867313c0c44e1c344f051c1ed2bdbb2d7060987503ee654bdbb52298e22f/merged",
                "UpperDir": "/var/lib/docker/overlay2/e881867313c0c44e1c344f051c1ed2bdbb2d7060987503ee654bdbb52298e22f/diff",
                "WorkDir": "/var/lib/docker/overlay2/e881867313c0c44e1c344f051c1ed2bdbb2d7060987503ee654bdbb52298e22f/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/containers/mysql/65c11b38",
                "Destination": "/dev/termination-log",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/volumes/kubernetes.io~empty-dir/data",
                "Destination": "/var/lib/mysql",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/volumes/kubernetes.io~empty-dir/conf",
                "Destination": "/etc/mysql/conf.d",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/volumes/kubernetes.io~secret/default-token-96h8m",
                "Destination": "/var/run/secrets/kubernetes.io/serviceaccount",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/var/lib/kubelet/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/etc-hosts",
                "Destination": "/etc/hosts",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
        "Config": {
            "Hostname": "mysql-master-0",
            "Domainname": "",
            "User": "0",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "3306/tcp": {},
                "33060/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "KUBERNETES_PORT_443_TCP_ADDR=mysql-east-mysql-eastus-bdfb38-0e053371.hcp.eastus.azmk8s.io",
                "KUBERNETES_PORT=tcp://mysql-east-mysql-eastus-bdfb38-0e053371.hcp.eastus.azmk8s.io:443",
                "KUBERNETES_PORT_443_TCP=tcp://mysql-east-mysql-eastus-bdfb38-0e053371.hcp.eastus.azmk8s.io:443",
                "KUBERNETES_SERVICE_HOST=mysql-east-mysql-eastus-bdfb38-0e053371.hcp.eastus.azmk8s.io",
                "MYSQL_ROOT_PASSWORD=YpBwWhZyUnQ8ERR8",
                "KUBERNETES_PORT_443_TCP_PROTO=tcp",
                "KUBERNETES_PORT_443_TCP_PORT=443",
                "MYSQL_MASTER_NODE_0_PORT_3306_TCP_PROTO=tcp",
                "MYSQL_MASTER_NODE_0_PORT_30008_TCP=tcp://10.0.204.173:30008",
                "MYSQL_MASTER_NODE_0_PORT_30008_TCP_PROTO=tcp",
                "MYSQL_MASTER_NODE_0_SERVICE_HOST=10.0.204.173",
                "MYSQL_MASTER_NODE_0_SERVICE_PORT_MYSQL=3306",
                "MYSQL_MASTER_NODE_0_PORT_3306_TCP=tcp://10.0.204.173:3306",
                "MYSQL_MASTER_NODE_0_PORT_3306_TCP_PORT=3306",
                "MYSQL_MASTER_NODE_0_PORT_3306_TCP_ADDR=10.0.204.173",
                "MYSQL_MASTER_NODE_0_PORT_30008_TCP_PORT=30008",
                "MYSQL_MASTER_NODE_0_SERVICE_PORT_NCAT=30008",
                "MYSQL_MASTER_NODE_0_PORT_30008_TCP_ADDR=10.0.204.173",
                "MYSQL_MASTER_NODE_0_SERVICE_PORT=3306",
                "KUBERNETES_SERVICE_PORT_HTTPS=443",
                "MYSQL_MASTER_NODE_0_PORT=tcp://10.0.204.173:3306",
                "KUBERNETES_SERVICE_PORT=443",
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "GOSU_VERSION=1.7",
                "MYSQL_MAJOR=8.0",
                "MYSQL_VERSION=8.0.14-1debian9"
            ],
            "Cmd": [
                "mysqld"
            ],
            "Healthcheck": {
                "Test": [
                    "NONE"
                ]
            },
            "ArgsEscaped": true,
            "Image": "sha256:71b5c7e10f9b20f4bd37c348872899cac828b1d2edad269fc8b93c9d43682241",
            "Volumes": {
                "/var/lib/mysql": {}
            },
            "WorkingDir": "",
            "Entrypoint": [
                "docker-entrypoint.sh"
            ],
            "OnBuild": null,
            "Labels": {
                "annotation.io.kubernetes.container.hash": "d5e52a59",
                "annotation.io.kubernetes.container.ports": "[{\"name\":\"mysql\",\"containerPort\":3306,\"protocol\":\"TCP\"}]",
                "annotation.io.kubernetes.container.restartCount": "3",
                "annotation.io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
                "annotation.io.kubernetes.container.terminationMessagePolicy": "File",
                "annotation.io.kubernetes.pod.terminationGracePeriod": "30",
                "io.kubernetes.container.logpath": "/var/log/pods/ce93a7a5-6543-11e9-b093-062e6bba92a2/mysql/3.log",
                "io.kubernetes.container.name": "mysql",
                "io.kubernetes.docker.type": "container",
                "io.kubernetes.pod.name": "mysql-master-0",
                "io.kubernetes.pod.namespace": "mysql-masters",
                "io.kubernetes.pod.uid": "ce93a7a5-6543-11e9-b093-062e6bba92a2",
                "io.kubernetes.sandbox.id": "d4c359fa3c05b26eed1259288f2073ae98c7dee0d050fb141d5c6f40b2000c09"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {},
            "SandboxKey": "",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {}
        }
    }
]

MySQL configuration for all nodes (master, intermediate master and slaves) is as follows:

[mysqld]
report_host=master-00.masters.eastus.mysql.shared.bestday.net
server_id=1000
read_only=OFF
default_authentication_plugin=mysql_native_password
innodb_random_read_ahead=ON
log_bin=binlog
relay_log=relay-bin
log_slave_updates=true
skip_slave_start=true
skip_name_resolve=true
binlog_format=ROW
gtid_mode=ON
enforce_gtid_consistency=true
transaction_write_set_extraction=XXHASH64
binlog_transaction_dependency_tracking=WRITESET
binlog_group_commit_sync_delay=10000
binlog_group_commit_sync_no_delay_count=1000
slave_parallel_workers=16
slave_parallel_type=LOGICAL_CLOCK
slave_preserve_commit_order=ON
master_info_repository=TABLE
max_connections=100000
max_connect_errors=4294967295
innodb_buffer_pool_size=64G
innodb_buffer_pool_instances=16
innodb_log_file_size=1G
innodb_log_buffer_size=64M
innodb_read_io_threads=16
innodb_write_io_threads=16
innodb_flush_method=O_DIRECT
temptable_max_ram=26G
max_heap_table_size=512M
tmp_table_size=512M
join_buffer_size=1M
character_set_server=utf8mb4
collation_server=utf8mb4_general_ci

Requests and limits for containers are set as follows:

requests:
  cpu: 100m
  memory: 128Mi
limits:
  cpu: 16
  memory: 112Gi

Requests are set low because pods have a hard anti-affinity setup so a single MySQL instance is placed per node. I suspect this might be an issue (see https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/resource-qos.md), but then again, containers are not being OOMKilled.

Again, masters and slaves have the exact same configuration and slaves do not die.

My guess is that some setting I'm using comes into play when a node is serving slaves that in conjunction with docker / vm kernel setup cause a crash in mysqld process.

If anyone can help me track this odd behaviour, I will much appreciate it.

Thanks!

-- Nicolas Garfinkiel
azure-aks
docker
kubernetes
mysql

2 Answers

5/23/2020

Hope you already fixed this issue. By chance, I studied on this error, and it turned out to be a bug of containerd and was fixed since containerd 1.2.6 (docker 19.03.1).

Key information from your log:

level=info msg="shim reaped"
level=warning msg="cleaning up after killed shim"

References:

-- Geoffrey Chen
Source: StackOverflow

5/3/2019

You must check "kubectl get events" when this happens. Your pods use Burstable QoS type, so it can be Evicted in response to resource pressure. Evicted is not the same as OOMKill. Generally Evicted pods are shown in "kubectl get pod" output but only starting with some version of Kubernetes.

What you should do is to make Mysql pods Guaranteed, that means make requests and limits equal. Such important services like Mysql simply must not run as Burstable or Best Effort type.

-- Vasily Angapov
Source: StackOverflow