YARN as resource manager in Spark for a Linux cluster - inside Kubernetes and outside Kubernetes

2/14/2021

If I am using a Kubernetes cluster to run Spark, then I am using the Kubernetes resource manager in Spark.

If I am using a Hadoop cluster to run Spark, then I am using the YARN resource manager in Spark.

But my question is: if I am spawning multiple Linux nodes in Kubernetes and using one of the nodes as the Spark master and the three others as workers, what resource manager should I use? Can I use YARN here?

Second question: in the case of any 4-node Linux Spark cluster (not in Kubernetes and not Hadoop, just connected Linux machines), even if I do not have HDFS, can I use YARN as the resource manager? If not, what resource manager should be used for Spark?

Thanks.

-- Rock
apache-spark
google-kubernetes-engine
hadoop
hadoop-yarn
kubernetes

1 Answer

2/14/2021

if I am spawning multiple Linux nodes in Kubernetes,

Then you'd obviously use Kubernetes as the resource manager, since it's already available.
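As a rough sketch (the API server address, port, container image, and Spark version are placeholders, not values from the question), submitting directly against the Kubernetes API looks like this:

    spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<port> \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=3 \
      --conf spark.kubernetes.container.image=<your-spark-image> \
      local:///opt/spark/examples/jars/spark-examples_2.12-<version>.jar

Spark then creates the driver and executor pods itself, so you don't need to dedicate fixed nodes as "master" and "workers" the way you would in a Standalone cluster.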

in the case of any 4-node Linux Spark cluster (not in Kubernetes and not Hadoop, just connected Linux machines), even if I do not have HDFS, can I use YARN here

You can, or you can use the Spark Standalone scheduler instead. However, Spark requires a shared filesystem for reading and writing data, so while you could attempt to use NFS or S3/GCS for this, HDFS is faster.
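For the Standalone option, a minimal sketch (assuming Spark 3.x, where the worker launch script is start-worker.sh; <master-host> is a placeholder):

    # On the designated master node
    $SPARK_HOME/sbin/start-master.sh

    # On each of the three worker nodes, pointing at the master's URL
    $SPARK_HOME/sbin/start-worker.sh spark://<master-host>:7077

    # Submit an application against the Standalone master
    spark-submit \
      --master spark://<master-host>:7077 \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_2.12-<version>.jar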

-- OneCricketeer
Source: StackOverflow