I have a Kubernetes cluster on which I deployed the SQL Server Always On Availability Groups operator. After 2 or 3 days the SQL Server pods start restarting rapidly, and they do not recover until I delete them. The StatefulSets then recreate the pods, and they work for another 2 or 3 days before the problem returns.
What is happening to them?
These are my logs:
[health] ERROR: 2019/04/16 14:49:11 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:11 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[health] ERROR: 2019/04/16 14:49:12 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:12 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[health] ERROR: 2019/04/16 14:49:13 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:13 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[health] ERROR: 2019/04/16 14:49:14 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:14 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:15 Getting replica name...
[supervisor] 2019/04/16 14:49:15 Replica name [mssql3-0]
[supervisor] 2019/04/16 14:49:16 Getting replica name...
[supervisor] 2019/04/16 14:49:16 Received a notification of type ADDED for secret mssql3-statefulset-secret with ResourceVersion 328866
[supervisor] 2019/04/16 14:49:16 Updating Ag Secret for ag ag1
[supervisor] 2019/04/16 14:49:16 Cached resource version: 0, current resource version: 639780
[health] 2019/04/16 14:49:16 Attempt 1 to connect to the instance at 127.0.0.1:1433 and run sp_server_diagnostics
[supervisor] 2019/04/16 14:49:16 Synchronizing users and certificates from cert secret...
[supervisor] 2019/04/16 14:49:16 Reading cert secret for mssql1-0...
[supervisor] 2019/04/16 14:49:16 Creating login dbm-mssql1...
[health] 2019/04/16 14:49:16 Connected to the instance at 127.0.0.1:1433
[supervisor] 2019/04/16 14:49:16 Creating user dbm-mssql1...
[supervisor] 2019/04/16 14:49:17 Local certificate matches the one in the cert secret
[supervisor] 2019/04/16 14:49:17 Reading cert secret for mssql2-0...
[supervisor] 2019/04/16 14:49:17 Creating login dbm-mssql2...
[supervisor] 2019/04/16 14:49:17 Creating user dbm-mssql2...
[supervisor] 2019/04/16 14:49:18 Local certificate matches the one in the cert secret
[supervisor] 2019/04/16 14:49:18 Target AGs: [{ag1 1 false}]
[supervisor] 2019/04/16 14:49:18 There is already a pod, mssql3-0, on node worker2 in the ag ag1, this statefulset will be updated with the necessary pod anti-affinity
[supervisor] 2019/04/16 14:49:18 existingAgAffinities: map[ag-service.mssql.microsoft.com/ag1:true]
[supervisor] 2019/04/16 14:49:18 agLabelsToAdd: []
[supervisor] 2019/04/16 14:49:18 Updating statefulset mssql3
[supervisor] 2019/04/16 14:49:18 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:19 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:20 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:21 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:22 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:23 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:24 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:25 Waiting for pod to be restarted...
The output of kubectl get all looks like this:
root@master:/home/ubuntu# kubectl get all -n ag1
NAME                                  READY   STATUS             RESTARTS   AGE
pod/mssql-initialize-mssql1-hd6rd     0/1     Completed          0          3d20h
pod/mssql-initialize-mssql2-gd9hz     0/1     Completed          0          3d20h
pod/mssql-operator-6f9c99cc89-hzlsb   1/1     Running            15         2d1h
pod/mssql1-0                          1/2     CrashLoopBackOff   179        2d
pod/mssql2-0                          1/2     CrashLoopBackOff   165        3d20h
pod/mssql3-0                          1/2     CrashLoopBackOff   163        3d20h

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/ag1             ClusterIP   None             <none>        1433/TCP,5022/TCP   3d20h
service/ag1-primary     NodePort    10.106.244.51    <none>        1433:31080/TCP      3d20h
service/ag1-secondary   NodePort    10.105.101.171   <none>        1433:32497/TCP      3d20h
service/mssql1          NodePort    10.97.52.124     <none>        1433:31859/TCP      3d20h
service/mssql2          NodePort    10.100.173.32    <none>        1433:30943/TCP      3d20h
service/mssql3          NodePort    10.99.238.238    <none>        1433:32406/TCP      3d20h

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mssql-operator   1/1     1            1           3d20h

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/mssql-operator-6f9c99cc89   1         1         1       3d20h

NAME                      READY   AGE
statefulset.apps/mssql1   0/1     3d20h
statefulset.apps/mssql2   0/1     3d20h
statefulset.apps/mssql3   0/1     3d20h

NAME                                COMPLETIONS   DURATION   AGE
job.batch/mssql-initialize-mssql1   1/1           5m38s      3d20h
job.batch/mssql-initialize-mssql2   1/1           5m35s      3d20h
job.batch/mssql-initialize-mssql3   1/1           5m22s      3d20h
The manifest of one of the StatefulSets:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2019-04-12T18:43:23Z"
  generation: 1
  labels:
    name: mssql1
    type: sqlservr
  name: mssql1
  namespace: ag1
  ownerReferences:
  - apiVersion: mssql.microsoft.com/v1
    controller: false
    kind: ReplicationController
    name: mssql1
    uid: d88e739e-5d52-11e9-9f0d-5254001850dc
  resourceVersion: "1064877"
  selfLink: /apis/apps/v1/namespaces/ag1/statefulsets/mssql1
  uid: d9c01112-5d52-11e9-9f0d-5254001850dc
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      mssql.microsoft.com/sql-instance: mssql1
  serviceName: ""
  template:
    metadata:
      creationTimestamp: null
      labels:
        ag-service.mssql.microsoft.com/ag1: ""
        mssql.microsoft.com/sql-instance: mssql1
        name: mssql1
        type: sqlservr
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: ag-service.mssql.microsoft.com/ag1
                operator: Exists
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: ACCEPT_EULA
          value: "y"
        - name: MSSQL_PID
          value: Developer
        - name: MSSQL_SA_PASSWORD
          valueFrom:
            secretKeyRef:
              key: initsapassword
              name: mssql1-statefulset-secret
        - name: MSSQL_ENABLE_HADR
          value: "1"
        image: mcr.microsoft.com/mssql/server:2019-CTP2.1-ubuntu
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 1
        name: mssql-server
        ports:
        - containerPort: 1433
          name: tds
          protocol: TCP
        - containerPort: 5022
          name: dbm
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/opt/mssql
          name: instance-root
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: no-api-access
          readOnly: true
      - command:
        - /mssql-server-k8s-ag-agent-supervisor
        env:
        - name: MSSQL_K8S_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: MSSQL_K8S_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: MSSQL_K8S_SQL_SERVER_NAME
          value: mssql1
        - name: MSSQL_K8S_POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: MSSQL_K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: MSSQL_K8S_MONITOR_POLICY
          value: "3"
        - name: MSSQL_K8S_HEALTH_CONNECTION_REBOOT_TIMEOUT
        - name: MSSQL_K8S_SKIP_AG_ANTI_AFFINITY
        - name: MSSQL_K8S_MONITOR_PERIOD_SECONDS
        - name: MSSQL_K8S_LEASE_DURATION_SECONDS
        - name: MSSQL_K8S_RENEW_DEADLINE_SECONDS
        - name: MSSQL_K8S_RETRY_PERIOD_SECONDS
        - name: MSSQL_K8S_ACQUIRE_PERIOD_SECONDS
        - name: MSSQL_K8S_SQL_WRITE_LEASE_PERIOD_SECONDS
        image: mcr.microsoft.com/mssql/ha:2019-CTP2.1-ubuntu
        imagePullPolicy: IfNotPresent
        name: mssql-ha-supervisor
        ports:
        - containerPort: 8080
          name: liveliness
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: mssql1
      serviceAccountName: mssql1
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: no-api-access
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: instance-root
      namespace: ag1
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: mssql1-795bb7f749
  observedGeneration: 1
  replicas: 1
  updateRevision: mssql1-795bb7f749
  updatedReplicas: 1
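
For reference, here is a rough sketch of the timing implied by the livenessProbe values on the mssql-server container in the manifest. It assumes the usual kubelet behaviour: no probes run during initialDelaySeconds, and the container is restarted after failureThreshold consecutive failed probes spaced periodSeconds apart. The function names are my own and are only for illustration. I am not sure whether this is related to the restarts.

```python
# Liveness probe values copied from the mssql-server container in the manifest.
probe = {
    "initialDelaySeconds": 60,
    "periodSeconds": 2,
    "failureThreshold": 3,
    "timeoutSeconds": 1,
}


def restart_budget_seconds(p):
    """Approximate time /healthz may keep failing before the kubelet
    restarts the container.

    Assumption: a restart happens after failureThreshold consecutive
    failed probes, with one probe every periodSeconds.
    """
    return p["failureThreshold"] * p["periodSeconds"]


def startup_budget_seconds(p):
    """Approximate time a freshly started container has to become healthy.

    Assumption: no probes are sent during initialDelaySeconds, then the
    restart budget above applies.
    """
    return p["initialDelaySeconds"] + restart_budget_seconds(p)


print("unhealthy time tolerated (s):", restart_budget_seconds(probe))
print("time allowed after a start (s):", startup_budget_seconds(probe))
```

With these values, /healthz can fail for only about 6 seconds before a restart, and a restarted container has roughly 66 seconds to report healthy.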