How to fix incomplete (Pending/CrashLoopBackOff) pods when installing Kubeflow on Kubernetes set up with Rancher?

6/21/2021

Hello. I have set up a single-node Kubernetes cluster with Rancher (running as a single Docker container), and I want to install Kubeflow on top of it.

https://raw.githubusercontent.com/kubeflow/manifests/master/distributions/kfdef/kfctl_k8s_istio.v1.2.0.yaml

I downloaded the YAML manifest above and installed it with kfctl. The installation completes, but several components come up only partially and the main features do not work.
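For reference, the install was roughly as follows (a sketch; the release asset name is for kfctl v1.2.0 on Linux amd64 and may differ for other platforms):

```shell
# Download and unpack the kfctl v1.2.0 release binary (assumed Linux amd64 asset)
wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar -xvf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz

# Apply the kfdef manifest linked above
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/master/distributions/kfdef/kfctl_k8s_istio.v1.2.0.yaml"
./kfctl apply -V -f "${CONFIG_URI}"
```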

taeil-kubeflow# kubectl get all -n kubeflow
NAME                                                         READY   STATUS             RESTARTS   AGE
pod/admission-webhook-bootstrap-stateful-set-0               1/1     Running            2          32m
pod/admission-webhook-deployment-5cd7dc96f5-j8ptn            1/1     Running            0          31m
pod/application-controller-stateful-set-0                    1/1     Running            0          32m
pod/argo-ui-65df8c7c84-2bfhd                                 1/1     Running            0          31m
pod/cache-deployer-deployment-5f4979f45-kfhfg                1/2     CrashLoopBackOff   4          31m
pod/cache-server-7859fd67f5-9lmrt                            0/2     Init:0/1           0          31m
pod/centraldashboard-67767584dc-flb2n                        1/1     Running            0          31m
pod/jupyter-web-app-deployment-67fb955745-49vzj              1/1     Running            0          31m
pod/katib-controller-7fcc95676b-s9lwd                        1/1     Running            1          31m
pod/katib-db-manager-85db457c64-s4q5p                        0/1     Error              4          31m
pod/katib-mysql-6c7f7fb869-4228x                             0/1     Pending            0          31m
pod/katib-ui-65dc4cf6f5-pxs8g                                1/1     Running            0          31m
pod/kfserving-controller-manager-0                           2/2     Running            0          31m
pod/kubeflow-pipelines-profile-controller-797fb44db9-vstfv   1/1     Running            0          31m
pod/metacontroller-0                                         1/1     Running            0          32m
pod/metadata-db-6dd978c5b-qwbnz                              0/1     Pending            0          31m
pod/metadata-envoy-deployment-67bd5954c-dw6rx                1/1     Running            0          31m
pod/metadata-grpc-deployment-577c67c96f-872lf                0/1     CrashLoopBackOff   1          31m
pod/metadata-writer-756dbdd478-dwwpc                         2/2     Running            2          31m
pod/minio-54d995c97b-md886                                   0/1     Pending            0          31m
pod/ml-pipeline-7c56db5db9-856mr                             1/2     CrashLoopBackOff   14         31m
pod/ml-pipeline-persistenceagent-d984c9585-248b4             2/2     Running            0          31m
pod/ml-pipeline-scheduledworkflow-5ccf4c9fcc-mv4vs           2/2     Running            0          31m
pod/ml-pipeline-ui-7ddcd74489-qv2gp                          2/2     Running            0          31m
pod/ml-pipeline-viewer-crd-56c68f6c85-7rf6f                  2/2     Running            3          31m
pod/ml-pipeline-visualizationserver-5b9bd8f6bf-jj5st         2/2     Running            0          31m
pod/mpi-operator-d5bfb8489-nzl2k                             1/1     Running            0          31m
pod/mxnet-operator-7576d697d6-z2dc8                          1/1     Running            0          31m
pod/mysql-74f8f99bc8-rpxcc                                   0/2     Pending            0          31m
pod/notebook-controller-deployment-5bb6bdbd6d-dq5nl          1/1     Running            0          31m
pod/profiles-deployment-56bc5d7dcb-k5cph                     2/2     Running            0          31m
pod/pytorch-operator-847c8d55d8-8f79m                        1/1     Running            0          31m
pod/seldon-controller-manager-6bf8b45656-jd682               1/1     Running            0          31m
pod/spark-operatorsparkoperator-fdfbfd99-6mhst               1/1     Running            0          32m
pod/spartakus-volunteer-558f8bfd47-l67zg                     1/1     Running            0          31m
pod/tf-job-operator-58477797f8-qg2tl                         1/1     Running            0          31m
pod/workflow-controller-64fd7cffc5-md54d                     1/1     Running            0          31m

As you can see, several pods are failing: cache-server, katib-mysql, metadata-db, minio, mysql, metadata-grpc, ml-pipeline, and so on.

I'm guessing it's a persistent volume problem, but I don't know how to solve it specifically.
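To check that guess, the first step would be to look at the PVCs and whether the cluster has a default StorageClass at all (a sketch; the PVC name is an example and may differ in your cluster):

```shell
# List PersistentVolumeClaims in the kubeflow namespace; stuck ones show STATUS=Pending
kubectl get pvc -n kubeflow

# Check whether any StorageClass exists and whether one is marked "(default)"
kubectl get storageclass

# Inspect one Pending claim for provisioning events
kubectl describe pvc katib-mysql -n kubeflow
```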

Please help me.


Edit: adding `kubectl describe` and `kubectl logs` output for each failing pod.

pod : cache-server-7859fd67f5-9lmrt

status : Init:0/1

describe:

  Warning  FailedMount  10m (x72 over 18h)     kubelet  Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-token kubeflow-pipelines-cachethe condition
  Warning  FailedMount  3m47s (x550 over 18h)  kubelet  MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found

log :

error: a container name must be specified for pod cache-server-7859fd67f5-9lmrt, choose one of: [server istio-proxy] or one of the init containers: [istio-init]
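That error just means the pod has more than one container (Istio sidecar injection adds `istio-proxy`), so `kubectl logs` needs the `-c` flag to pick one:

```shell
# Fetch logs from the application container rather than the sidecar
kubectl logs cache-server-7859fd67f5-9lmrt -n kubeflow -c server

# Init containers can be inspected the same way
kubectl logs cache-server-7859fd67f5-9lmrt -n kubeflow -c istio-init
```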

pod : katib-mysql-6c7f7fb869-4228x

status : Pending 0/1

describe :

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  18h   default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.

pod : metadata-db-6dd978c5b-qwbnz

status : Pending 0/1

describe :

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  18h   default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.

pod : minio-54d995c97b-md886

status : Pending 0/1

describe :

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  18h   default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.

pod : mysql-74f8f99bc8-rpxcc

status : Pending 0/2

describe :

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  18h   default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
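The four Pending pods above all report the same event, so none of their PVCs can be bound: the cluster apparently has no default StorageClass with a dynamic provisioner behind it. One common fix on a single-node Rancher cluster is Rancher's local-path-provisioner (a sketch; the URL points at the project's master-branch manifest):

```shell
# Install Rancher's local-path dynamic provisioner
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

# Mark it as the default StorageClass so PVCs without an explicit class bind to it
kubectl patch storageclass local-path \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'

# PVCs created before the class existed may need to be deleted and recreated;
# afterwards, verify they reach STATUS=Bound
kubectl get pvc -n kubeflow
```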

pod : cache-deployer-deployment-5f4979f45-kfhfg

status : CrashLoopBackOff 1/2

describe :

Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Normal   Pulling  36m (x213 over 18h)   kubelet  Pulling image "gcr.io/ml-pipeline/cache-deployer:1.0.4"
  Warning  BackOff  68s (x4916 over 18h)  kubelet  Back-off restarting failed container

log :

error: a container name must be specified for pod cache-deployer-deployment-5f4979f45-kfhfg, choose one of: [main istio-proxy] or one of the init containers: [istio-init]

pod : katib-db-manager-85db457c64-s4q5p

status : CrashLoopBackOff 0/1

describe :

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  BackOff    7m16s (x4260 over 18h)  kubelet  Back-off restarting failed container
  Warning  Unhealthy  2m13s (x1190 over 18h)  kubelet  Readiness probe failed:

log :

E0622 02:00:41.686168       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:00:46.674159       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:00:51.666117       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:00:56.690171       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:01.682194       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:06.674132       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:11.666146       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:16.690230       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:21.686129       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:26.674431       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:31.670133       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:36.690492       1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
F0622 02:01:36.690581       1 main.go:83] Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.
goroutine 1 [running]:
github.com/kubeflow/katib/vendor/k8s.io/klog.stacks(0xc000216200, 0xc000230000, 0x89, 0xd0)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:830 +0xb9
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).output(0xc72b20, 0xc000000003, 0xc00022a000, 0xc14079, 0x7, 0x53, 0x0)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:781 +0x2da
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).printf(0xc72b20, 0x3, 0x92bfcd, 0x20, 0xc0001edf48, 0x1, 0x1)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:678 +0x153
github.com/kubeflow/katib/vendor/k8s.io/klog.Fatalf(...)
        /go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:1209
main.main()
        /go/src/github.com/kubeflow/katib/cmd/db-manager/v1beta1/main.go:83 +0x166
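These connection-refused errors look like a downstream symptom rather than a separate bug: katib-db-manager, metadata-grpc, and ml-pipeline all fail because the databases they depend on (katib-mysql, metadata-db, mysql, minio) are still Pending. This can be confirmed by checking that the corresponding Services have no ready endpoints (a sketch):

```shell
# The Services exist, but with the backing pods Pending they have no endpoints,
# which is why connections such as 10.43.25.138:3306 are refused
kubectl get endpoints katib-mysql metadata-db mysql minio -n kubeflow
```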

pod : metadata-grpc-deployment-577c67c96f-872lf

status : CrashLoopBackOff 0/1, or Error

describe :

Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  BackOff  3m46s (x5136 over 18h)  kubelet  Back-off restarting failed container

log :

2021-06-22 02:03:14.130165: F ml_metadata/metadata_store/metadata_store_server_main.cc:219] Non-OK-status: status status: Internal: mysql_real_connect failed: errno: 2002, error: Can't connect to MySQL server on 'metadata-db' (115)MetadataStore cannot be created with the given connection config.

pod : metadata-writer-756dbdd478-dwwpc

status : CrashLoopBackOff 1/2

describe :

Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Warning  BackOff  6s (x3807 over 18h)  kubelet  Back-off restarting failed container

log :

error: a container name must be specified for pod metadata-writer-756dbdd478-dwwpc, choose one of: [main istio-proxy] or one of the init containers: [istio-init]

pod : ml-pipeline-7c56db5db9-856mr

status : CrashLoopBackOff 1/2

describe :

Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Normal   Pulled     56m (x324 over 18h)    kubelet  Container image "gcr.io/ml-pipeline/api-server:1.0.4" already present on machine
  Warning  BackOff    6m1s (x4072 over 18h)  kubelet  Back-off restarting failed container
  Warning  Unhealthy  67s (x2939 over 18h)   kubelet  Readiness probe failed:

log :

error: a container name must be specified for pod ml-pipeline-7c56db5db9-856mr, choose one of: [ml-pipeline-api-server istio-proxy] or one of the init containers: [istio-init]

I found out that the problem is with dynamic volume provisioning, but I can't solve it. I tried to configure an NFS server and client, but it still doesn't work.
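If NFS is the preferred backend, note that an NFS server and client alone are not enough: Kubernetes also needs a provisioner that turns PVCs into NFS-backed PVs. The nfs-subdir-external-provisioner Helm chart is one option (a sketch; the server IP and export path are placeholders for your environment):

```shell
# Add the provisioner's chart repo and install it, pointing at the NFS export
helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=<NFS_SERVER_IP> \
  --set nfs.path=/exported/path \
  --set storageClass.defaultClass=true  # make it the default StorageClass
```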

-- 윤태일
istio
kubernetes
rancher

0 Answers