Do JVMs create significant overhead in distributed/ parallel processing?

1/12/2019

If a distributed computing framework spins up nodes for running Java/ Scala operations then it has to include the JVM in every container. E.g. every Map and Reduce step spawns its own JVM.

How does the efficiency of this instantiation compare to spinning up containers for languages like Python? Is it a question of milliseconds, few seconds, 30 seconds? Does this cost add up in frameworks like Kubernetes where you need to spin up many containers?

I've heard that, much like Alpine Linux is just a few MB, there are stripped down JVMs, but still, there must be a cost. Yet, Scala is the first class citizen in Spark and MR is written in Java.

-- HashRocketSyntax
apache-spark
java
jvm
kubernetes
scala

1 Answer

1/12/2019

Linux container technology uses layered filesystems so bigger container images don't generally have a ton of runtime overhead, though you do have to download the image the first time it is used on a node which can potentially add up on truly massive clusters. In general this is not usually a thing to worry about, aside from the well known issues of most JVMs being a bit slow to start up. Spark, however, does not spin up a new container for every operation as you describe. It creates a set of executor containers (pods) which are used for the whole Spark execution run.

-- coderanger
Source: StackOverflow