How to handle "etcdserver: unhealthy cluster"

9/13/2019

When I add a node to the etcd cluster from the master using this command:

curl http://127.0.0.1:2379/v3beta/members \
-XPOST -H "Content-Type: application/json" \
-d '{"peerURLs": ["http://172.19.104.230:2380"]}'

It returns {"error":"etcdserver: unhealthy cluster","code":14}.

Then I checked the cluster status:

[root@iZuf63refzweg1d9dh94t8Z ~]# etcdctl member list
55a782166ce91d01, started, infra3, https://172.19.150.82:2380, https://172.19.150.82:2379
696a771758a889c4, started, infra1, https://172.19.104.231:2380, https://172.19.104.231:2379

It looks fine. What should I do to make it work?

-- Dolphin
etcd
kubernetes

1 Answer

9/13/2019

According to the etcd source code, the server returns the ErrUnhealthy error if the longestConnected function fails (returns false):

// longestConnected chooses the member with longest active-since-time.
// It returns false, if nothing is active.
func longestConnected(tp rafthttp.Transporter, membs []types.ID) (types.ID, bool) {
    var longest types.ID
    var oldest time.Time
    for _, id := range membs {
        tm := tp.ActiveSince(id)
        if tm.IsZero() { // inactive
            continue
        }

        if oldest.IsZero() { // first longest candidate
            oldest = tm
            longest = id
        }

        if tm.Before(oldest) {
            oldest = tm
            longest = id
        }
    }
    if uint64(longest) == 0 {
        return longest, false
    }
    return longest, true
}

So etcd can't find a suitable member to connect to: none of the candidate members has an active peer connection.

The candidates passed to longestConnected are the cluster's voting members, returned by the VotingMemberIDs method:

transferee, ok := longestConnected(s.r.transport, s.cluster.VotingMemberIDs())
if !ok {
    return ErrUnhealthy
}

// VotingMemberIDs returns the ID of voting members in cluster.
func (c *RaftCluster) VotingMemberIDs() []types.ID {
    c.Lock()
    defer c.Unlock()
    var ids []types.ID
    for _, m := range c.members {
        if !m.IsLearner {
            ids = append(ids, m.ID)
        }
    }
    sort.Sort(types.IDSlice(ids))
    return ids
}

As your output shows, the cluster does have members:

$ etcdctl member list
55a782166ce91d01, started, infra3, https://172.19.150.82:2380, https://172.19.150.82:2379
696a771758a889c4, started, infra1, https://172.19.104.231:2380, https://172.19.104.231:2379

So check whether these members are voting members rather than learners (see the etcd docs on Raft learners). The learner flag is part of each member's raft attributes:

// RaftAttributes represents the raft related attributes of an etcd member.
type RaftAttributes struct {
    // PeerURLs is the list of peers in the raft cluster.
    // TODO(philips): ensure these are URLs
    PeerURLs []string `json:"peerURLs"`
    // IsLearner indicates if the member is raft learner.
    IsLearner bool `json:"isLearner,omitempty"`
}

So try increasing the number of members so that the cluster can keep a quorum (see the etcd docs on quorum).
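The quorum arithmetic behind this advice is a strict majority of voting members, floor(n/2) + 1. A small sketch (the function names are illustrative only) shows that a two-member cluster like the one in the question needs both members healthy and tolerates zero failures, while three members tolerate one.

```go
package main

import "fmt"

// quorum returns how many voting members must be healthy for a Raft
// cluster of n voting members to make progress: a strict majority.
func quorum(n int) int {
	return n/2 + 1
}

// faultTolerance is how many voting members can fail while keeping quorum.
func faultTolerance(n int) int {
	return n - quorum(n)
}

func main() {
	for n := 1; n <= 5; n++ {
		fmt.Printf("members=%d quorum=%d tolerates=%d failure(s)\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Note that an even member count adds no fault tolerance over the odd count below it (2 tolerates 0, like 1; 4 tolerates 1, like 3), which is why odd-sized clusters are usually recommended.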

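As a last resort, you can force etcd to start a new cluster with ETCD_FORCE_NEW_CLUSTER="true" (the environment-variable form of --force-new-cluster). Be aware that this creates a new one-member cluster from the existing data and removes all other members, so it is meant for disaster recovery.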


See also this post: Understanding cluster and pool quorum

-- Yasen
Source: StackOverflow