SQL Server AG on Kubernetes

4/16/2019

I have a Kubernetes cluster on which I deployed the SQL Server Always On Availability Groups operator. After 2 or 3 days, the SQL Server pods start restarting rapidly and stop working until I delete them. The StatefulSets then redeploy the pods, and they work for another 2 or 3 days before the problem recurs.

What is happening to them?

These are my logs:

[health] ERROR: 2019/04/16 14:49:11 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:11 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[health] ERROR: 2019/04/16 14:49:12 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:12 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[health] ERROR: 2019/04/16 14:49:13 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:13 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[health] ERROR: 2019/04/16 14:49:14 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:14 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:15 Getting replica name...
[supervisor] 2019/04/16 14:49:15 Replica name [mssql3-0]
[supervisor] 2019/04/16 14:49:16 Getting replica name...
[supervisor] 2019/04/16 14:49:16 Received a notification of type ADDED for secret mssql3-statefulset-secret with ResourceVersion 328866
[supervisor] 2019/04/16 14:49:16 Updating Ag Secret for ag ag1
[supervisor] 2019/04/16 14:49:16 Cached resource version: 0, current resource version: 639780
[health] 2019/04/16 14:49:16 Attempt 1 to connect to the instance at 127.0.0.1:1433 and run sp_server_diagnostics
[supervisor] 2019/04/16 14:49:16 Synchronizing users and certificates from cert secret...
[supervisor] 2019/04/16 14:49:16 Reading cert secret for mssql1-0...
[supervisor] 2019/04/16 14:49:16 Creating login dbm-mssql1...
[health] 2019/04/16 14:49:16 Connected to the instance at 127.0.0.1:1433
[supervisor] 2019/04/16 14:49:16 Creating user dbm-mssql1...
[supervisor] 2019/04/16 14:49:17 Local certificate matches the one in the cert secret
[supervisor] 2019/04/16 14:49:17 Reading cert secret for mssql2-0...
[supervisor] 2019/04/16 14:49:17 Creating login dbm-mssql2...
[supervisor] 2019/04/16 14:49:17 Creating user dbm-mssql2...
[supervisor] 2019/04/16 14:49:18 Local certificate matches the one in the cert secret
[supervisor] 2019/04/16 14:49:18 Target AGs: [{ag1 1 false}]
[supervisor] 2019/04/16 14:49:18 There is already a pod, mssql3-0, on node worker2 in the ag ag1, this statefulset will be updated with the necessary pod anti-affinity
[supervisor] 2019/04/16 14:49:18 existingAgAffinities: map[ag-service.mssql.microsoft.com/ag1:true]
[supervisor] 2019/04/16 14:49:18 agLabelsToAdd: []
[supervisor] 2019/04/16 14:49:18 Updating statefulset mssql3
[supervisor] 2019/04/16 14:49:18 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:19 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:20 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:21 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:22 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:23 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:24 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:25 Waiting for pod to be restarted...

The output of kubectl get all looks like this:

root@master:/home/ubuntu# kubectl get all -n ag1 
NAME                                  READY   STATUS             RESTARTS   AGE
pod/mssql-initialize-mssql1-hd6rd     0/1     Completed          0          3d20h
pod/mssql-initialize-mssql2-gd9hz     0/1     Completed          0          3d20h
pod/mssql-operator-6f9c99cc89-hzlsb   1/1     Running            15         2d1h
pod/mssql1-0                          1/2     CrashLoopBackOff   179        2d
pod/mssql2-0                          1/2     CrashLoopBackOff   165        3d20h
pod/mssql3-0                          1/2     CrashLoopBackOff   163        3d20h

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/ag1             ClusterIP   None             <none>        1433/TCP,5022/TCP   3d20h
service/ag1-primary     NodePort    10.106.244.51    <none>        1433:31080/TCP      3d20h
service/ag1-secondary   NodePort    10.105.101.171   <none>        1433:32497/TCP      3d20h
service/mssql1          NodePort    10.97.52.124     <none>        1433:31859/TCP      3d20h
service/mssql2          NodePort    10.100.173.32    <none>        1433:30943/TCP      3d20h
service/mssql3          NodePort    10.99.238.238    <none>        1433:32406/TCP      3d20h

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mssql-operator   1/1     1            1           3d20h

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/mssql-operator-6f9c99cc89   1         1         1       3d20h

NAME                      READY   AGE
statefulset.apps/mssql1   0/1     3d20h
statefulset.apps/mssql2   0/1     3d20h
statefulset.apps/mssql3   0/1     3d20h

NAME                                COMPLETIONS   DURATION   AGE
job.batch/mssql-initialize-mssql1   1/1           5m38s      3d20h
job.batch/mssql-initialize-mssql2   1/1           5m35s      3d20h
job.batch/mssql-initialize-mssql3   1/1           5m22s      3d20h

The manifest of one of the StatefulSets:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2019-04-12T18:43:23Z"
  generation: 1
  labels:
    name: mssql1
    type: sqlservr
  name: mssql1
  namespace: ag1
  ownerReferences:
  - apiVersion: mssql.microsoft.com/v1
    controller: false
    kind: ReplicationController
    name: mssql1
    uid: d88e739e-5d52-11e9-9f0d-5254001850dc
  resourceVersion: "1064877"
  selfLink: /apis/apps/v1/namespaces/ag1/statefulsets/mssql1
  uid: d9c01112-5d52-11e9-9f0d-5254001850dc
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      mssql.microsoft.com/sql-instance: mssql1
  serviceName: ""
  template:
    metadata:
      creationTimestamp: null
      labels:
        ag-service.mssql.microsoft.com/ag1: ""
        mssql.microsoft.com/sql-instance: mssql1
        name: mssql1
        type: sqlservr
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: ag-service.mssql.microsoft.com/ag1
                operator: Exists
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: ACCEPT_EULA
          value: "y"
        - name: MSSQL_PID
          value: Developer
        - name: MSSQL_SA_PASSWORD
          valueFrom:
            secretKeyRef:
              key: initsapassword
              name: mssql1-statefulset-secret
        - name: MSSQL_ENABLE_HADR
          value: "1"
        image: mcr.microsoft.com/mssql/server:2019-CTP2.1-ubuntu
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 1
        name: mssql-server
        ports:
        - containerPort: 1433
          name: tds
          protocol: TCP
        - containerPort: 5022
          name: dbm
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/opt/mssql
          name: instance-root
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: no-api-access
          readOnly: true
      - command:
        - /mssql-server-k8s-ag-agent-supervisor
        env:
        - name: MSSQL_K8S_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: MSSQL_K8S_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: MSSQL_K8S_SQL_SERVER_NAME
          value: mssql1
        - name: MSSQL_K8S_POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: MSSQL_K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: MSSQL_K8S_MONITOR_POLICY
          value: "3"
        - name: MSSQL_K8S_HEALTH_CONNECTION_REBOOT_TIMEOUT
        - name: MSSQL_K8S_SKIP_AG_ANTI_AFFINITY
        - name: MSSQL_K8S_MONITOR_PERIOD_SECONDS
        - name: MSSQL_K8S_LEASE_DURATION_SECONDS
        - name: MSSQL_K8S_RENEW_DEADLINE_SECONDS
        - name: MSSQL_K8S_RETRY_PERIOD_SECONDS
        - name: MSSQL_K8S_ACQUIRE_PERIOD_SECONDS
        - name: MSSQL_K8S_SQL_WRITE_LEASE_PERIOD_SECONDS
        image: mcr.microsoft.com/mssql/ha:2019-CTP2.1-ubuntu
        imagePullPolicy: IfNotPresent
        name: mssql-ha-supervisor
        ports:
        - containerPort: 8080
          name: liveliness
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: mssql1
      serviceAccountName: mssql1
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: no-api-access
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: instance-root
      namespace: ag1
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: mssql1-795bb7f749
  observedGeneration: 1
  replicas: 1
  updateRevision: mssql1-795bb7f749
  updatedReplicas: 1
-- meisam bahrami
alwayson
availability-group
kubernetes
sql-server
