I am trying to use the container from https://github.com/cybermaggedon/accumulo-docker to create a 3-node deployment on Google Kubernetes Engine. My main problem is how to make the nodes aware of each other. For example, the accumulo/conf/slaves config file contains a list of all the nodes (either names or IPs, one per line) and needs to be replicated across all the nodes. Also, a single Accumulo node is designated as the master, and all slave nodes point to it by making it the only name/IP in the conf/masters file.
The documentation for the Accumulo docker container configures each container in this manner by providing environment variables, which are in turn used by the container startup script to rewrite the configuration files for that container, e.g.
docker run -d --ip=10.10.10.11 --net my_network \
-e ZOOKEEPERS=10.10.5.10,10.10.5.11,10.10.5.12 \
-e HDFS_VOLUMES=hdfs://hadoop01:9000/accumulo \
-e NAMENODE_URI=hdfs://hadoop01:9000/ \
-e MY_HOSTNAME=10.10.10.11 \
-e GC_HOSTS=10.10.10.10 \
-e MASTER_HOSTS=10.10.10.10 \
-e SLAVE_HOSTS=10.10.10.10,10.10.10.11,10.10.10.12 \
-e MONITOR_HOSTS=10.10.10.10 \
-e TRACER_HOSTS=10.10.10.10 \
--link hadoop01:hadoop01 \
--name acc02 cybermaggedon/accumulo:1.8.1h
This starts one of the slave nodes: it includes itself in SLAVE_HOSTS and points to the master in MASTER_HOSTS.
If I implement my scaling as a StatefulSet under Kubernetes, how can I achieve a similar result? I can modify the container as needed; I have no problem creating my own version.
Disclaimer: Just because it runs on Docker doesn't necessarily mean that it can run on Kubernetes. Accumulo is part of the Hadoop/HDFS ecosystem and lots of the components are not necessarily production ready. Check my other answers: 1, 2.
Kubernetes assigns pod IPs from a pod CIDR range that is only reachable from within the cluster. Also, a pod's IP address is not fixed: it can change as the pod moves from one node to another or as pods are stopped and started. Services and pods in a cluster are generally discovered using DNS. So for the master and slave options you will probably have to specify Kubernetes DNS names. Since you are using a StatefulSet, which names its pods with ordinal numbers, each pod gets a stable DNS name through the StatefulSet's headless Service, for example:
MASTER_HOSTS=accumulo-0.accumulo.default.svc.cluster.local
SLAVE_HOSTS=accumulo-0.accumulo.default.svc.cluster.local,accumulo-1.accumulo.default.svc.cluster.local,accumulo-2.accumulo.default.svc.cluster.local
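Rather than hard-coding these lists, a wrapper around the container's startup script could build them from the StatefulSet's naming pattern. The following is only a sketch under several assumptions: the StatefulSet and its headless Service are both named accumulo, the namespace is default, the cluster domain is the default cluster.local, and the variable names STS_NAME, SERVICE_NAME, NAMESPACE and REPLICAS are hypothetical ones you would set in the pod spec. None of these come from the accumulo-docker image itself.

```shell
#!/bin/sh
# Build a comma-separated list of StatefulSet pod DNS names:
#   <statefulset>-<ordinal>.<service>.<namespace>.svc.cluster.local
# Usage: pod_hosts <statefulset> <headless-service> <namespace> <count>
pod_hosts() {
    name=$1
    service=$2
    namespace=$3
    count=$4
    hosts=""
    i=0
    while [ "$i" -lt "$count" ]; do
        host="${name}-${i}.${service}.${namespace}.svc.cluster.local"
        # Prepend a comma only when the list is already non-empty.
        hosts="${hosts:+${hosts},}${host}"
        i=$((i + 1))
    done
    printf '%s\n' "$hosts"
}

# Hypothetical variables (not defined by the image); defaults are assumptions.
STS_NAME=${STS_NAME:-accumulo}
SERVICE_NAME=${SERVICE_NAME:-accumulo}
NAMESPACE=${NAMESPACE:-default}
REPLICAS=${REPLICAS:-3}

# Pod 0 acts as the master; every pod, including pod 0, is listed as a slave.
MASTER_HOSTS=$(pod_hosts "$STS_NAME" "$SERVICE_NAME" "$NAMESPACE" 1)
SLAVE_HOSTS=$(pod_hosts "$STS_NAME" "$SERVICE_NAME" "$NAMESPACE" "$REPLICAS")
export MASTER_HOSTS SLAVE_HOSTS
```

After exporting these, the wrapper would hand over to the image's existing startup script, which rewrites conf/masters and conf/slaves from the environment as described in the question. REPLICAS has to be kept in sync with the StatefulSet's replica count by hand in this sketch.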
Since Accumulo is a distributed K/V store, you can take cues from how Cassandra could be deployed on a Kubernetes cluster. Hope it helps!