Rook OSD after node failure

1/23/2020

In my Kubernetes cluster (v1.14.7), one node didn't recover correctly after a cluster update. The Rook OSD from that node didn't get rescheduled (as explained in the documentation), so I'm trying to add a new OSD manually.

My ceph status returns this:

here

and my ceph osd tree returns this:

here

I tried to link the new OSD to the node using ceph osd crush set osd.0 0.29199 root=default host=gke-dev-dev-110dd9ec-ntww

but it returns: Error ENOENT: unable to set item id 0 name 'osd.0' weight 0.29199 at location {host=gke-dev-dev-110dd9ec-ntww,root=default}: does not exist
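
If it helps, a quick way to double-check whether osd.0 is actually registered in the cluster at all (a sketch using only standard ceph commands; the id 0 is just the one from my error) would be:

    ceph osd ls                  # list the OSD ids the cluster knows about
    ceph osd dump | grep osd.0   # show osd.0's entry in the OSD map, if any
    ceph auth get osd.0          # check whether its auth key still exists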

Do you have any idea how to fix this? Thanks in advance.

-- Rémi F
ceph
kubernetes
rook-storage

2 Answers

5/15/2020

For Rook users, see the OSD management docs: https://rook.io/docs/rook/master/ceph-osd-mgmt.html

A blog post with an explanation (in Chinese): https://zhuanlan.zhihu.com/p/140486398

-- gemfield
Source: StackOverflow

1/23/2020

Here's what I suggest: instead of trying to add a new OSD right away, fix/remove the defective one and it should be re-created.

Try this (a combined sketch of the commands follows the list):

1 - mark out osd: ceph osd out osd.0
2 - remove from crush map: ceph osd crush remove osd.0
3 - delete caps: ceph auth del osd.0
4 - remove osd: ceph osd rm osd.0
5 - delete the deployment: kubectl delete deployment -n your-cluster-namespace rook-ceph-osd-0
6 - edit out the config entry for your OSD id and its underlying device:
      kubectl edit configmap -n your-cluster-namespace rook-ceph-osd-nodename-config
      delete the {"/var/lib/rook":x} entry
7 - restart the rook-operator by deleting its pod
8 - verify the health of your cluster: ceph -s; ceph osd tree
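
Put together, the sequence looks roughly like the sketch below. The OSD id (0), the namespace placeholder and the node name are assumptions pulled from the question; replace them with your own values and treat this as a sketch, not a script to run blindly.

    # assumed values from the question; adjust to your cluster
    OSD_ID=0
    NS=your-cluster-namespace
    NODE=gke-dev-dev-110dd9ec-ntww

    # steps 1-4: take the dead OSD out and purge it from the maps
    ceph osd out osd.$OSD_ID
    ceph osd crush remove osd.$OSD_ID
    ceph auth del osd.$OSD_ID
    ceph osd rm osd.$OSD_ID

    # step 5: remove the stale OSD deployment
    kubectl delete deployment -n $NS rook-ceph-osd-$OSD_ID

    # step 6: manually delete the {"/var/lib/rook":x} entry for this OSD
    kubectl edit configmap -n $NS rook-ceph-osd-$NODE-config

    # step 7: restart the operator so it re-detects the device and re-creates the OSD
    # (label and namespace assume the upstream operator manifest; older setups may
    #  run the operator in a separate namespace such as rook-ceph-system)
    kubectl delete pod -n $NS -l app=rook-ceph-operator

    # step 8: verify cluster health
    ceph -s
    ceph osd tree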

Hope this helps!

-- openJT
Source: StackOverflow