Building fat Spark jars & bundles for Kubernetes deployment

7/8/2019

I've been building fat jars for spark-submits for quite a while and they work like a charm.

Now I'd like to deploy Spark jobs on top of Kubernetes.

The approach described on the Spark site (https://spark.apache.org/docs/latest/running-on-kubernetes.html) just calls a script, docker-image-tool.sh, to bundle a basic jar into a Docker container.

I was wondering:

Could this be done more nicely by using sbt-native-packager in combination with sbt-assembly to build Docker images that contain all the code needed for starting the Spark driver and running the job (with all libraries bundled), and that perhaps also offer a way to bundle extra classpath libraries (like the Postgres jar) into a single image?

This way, running the pod would spin up the Spark Kubernetes master (in client or cluster mode, whichever works best), trigger the creation of worker pods, spark-submit the local jar (with all needed libraries included) and run until completion.

Maybe I'm missing why this can't work or is a bad idea, but I feel like configuration would be more centralised and straightforward than with the current approach.

Or are there other best practices?

-- Tom Lous
apache-spark
docker
kubernetes
scala

1 Answer

7/29/2019

So in the end I got everything working using Helm, the spark-on-k8s-operator and sbt-docker.

First I extract some of the config into variables in build.sbt, so they can be used by both the assembly and the Docker image generation:

// define some dependencies that should not be compiled into the fat jar, but copied into the Docker image instead
val externalDependencies = Seq(
  "org.postgresql" % "postgresql" % postgresVersion,
  "io.prometheus.jmx" % "jmx_prometheus_javaagent" % jmxPrometheusVersion
)

// Settings
val team = "hazelnut"
val importerDescription = "..."
val importerMainClass = "..."
val targetDockerJarPath = "/opt/spark/jars"
val externalPaths = externalDependencies.map { module =>
  val parts = module.toString.split(""":""")
  val orgDir = parts(0).replaceAll("""\.""", "/")
  val moduleName = parts(1).replaceAll("""\.""", "/")
  val version = parts(2)
  val jarFile = s"$moduleName-$version.jar"
  (orgDir, moduleName, version, jarFile)
}
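
To make this concrete: assuming, purely for illustration, postgresVersion = "42.2.5" and jmxPrometheusVersion = "0.11.0" (the real values are defined elsewhere in the build), externalPaths evaluates to:

// Hypothetical result of externalPaths for the versions assumed above
Seq(
  ("org/postgresql", "postgresql", "42.2.5", "postgresql-42.2.5.jar"),
  ("io/prometheus/jmx", "jmx_prometheus_javaagent", "0.11.0", "jmx_prometheus_javaagent-0.11.0.jar")
)

These tuples are used below both to build Maven Central download URLs and to derive the local:// paths inside the image.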

Next I define the assembly settings to create the fat jar (which can be whatever you need):

lazy val assemblySettings = Seq(
  // Assembly options
  assembly / assemblyOption := (assembly / assemblyOption).value.copy(includeScala = false),
  assembly / assemblyMergeStrategy := {
    case PathList("reference.conf") => MergeStrategy.concat
    case PathList("META-INF", _@_*) => MergeStrategy.discard
    case "log4j.properties" => MergeStrategy.concat
    case _ => MergeStrategy.first
  },
  assembly / logLevel := sbt.util.Level.Error,
  assembly / test := {},
  pomIncludeRepository := { _ => false }
)
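
Note that these settings only control how the fat jar is merged; they don't by themselves keep the externalDependencies out of it. One way to do that, for instance (a sketch, not spelled out above), is to declare them as provided, so they are on the compile classpath but sbt-assembly leaves them out while the Docker step still downloads them:

// Sketch: mark the externally-copied jars as "provided" so they are available
// at compile time but excluded from the assembly output
libraryDependencies ++= externalDependencies.map(_ % "provided")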

Then the docker settings are defined:

lazy val dockerSettings = Seq(
  docker / imageNames := Seq(
    ImageName(s"$team/${name.value}:latest"),
    ImageName(s"$team/${name.value}:${version.value}")
  ),
  docker / dockerfile := {
    // The assembly task generates a fat JAR file
    val artifact: File = assembly.value
    val artifactTargetPath = s"$targetDockerJarPath/$team-${name.value}.jar"
    externalPaths.map {
      case (extOrgDir, extModuleName, extVersion, jarFile) =>
        val url = List("https://repo1.maven.org/maven2", extOrgDir, extModuleName, extVersion, jarFile).mkString("/")
        val target = s"$targetDockerJarPath/$jarFile"
        Instructions.Run.exec(List("curl", url, "--output", target, "--silent"))
    }
      .foldLeft(new Dockerfile {
        //       https://hub.docker.com/r/lightbend/spark/tags
        from(s"lightbend/spark:${openShiftVersion}-OpenShift-${sparkVersion}-ubuntu-${scalaBaseVersion}")
      }) {
        case (df, run) => df.addInstruction(run)
      }.add(artifact, artifactTargetPath)    
  }
)
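
Unrolled, the foldLeft above builds a Dockerfile equivalent to the following sketch (again using the hypothetical versions from earlier, with team = "hazelnut" and name := "etl-importer" as in the project definition below):

// Sketch of the Dockerfile the settings above generate, with hypothetical versions
new Dockerfile {
  from(s"lightbend/spark:${openShiftVersion}-OpenShift-${sparkVersion}-ubuntu-${scalaBaseVersion}")
}
  .addInstruction(Instructions.Run.exec(List(
    "curl", "https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.5/postgresql-42.2.5.jar",
    "--output", "/opt/spark/jars/postgresql-42.2.5.jar", "--silent")))
  .addInstruction(Instructions.Run.exec(List(
    "curl", "https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.11.0/jmx_prometheus_javaagent-0.11.0.jar",
    "--output", "/opt/spark/jars/jmx_prometheus_javaagent-0.11.0.jar", "--silent")))
  .add(artifact, "/opt/spark/jars/hazelnut-etl-importer.jar")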

And I create a task to generate the Helm chart and values files:

lazy val createImporterHelmChart: Def.Initialize[Task[Seq[File]]] = Def.task {
  val chartFile = baseDirectory.value / "../helm" / "Chart.yaml"
  val valuesFile = baseDirectory.value / "../helm" / "values.yaml"
  val jarDependencies = externalPaths.map {
    case (_, extModuleName, _, jarFile) =>
      extModuleName -> s""""local://$targetDockerJarPath/$jarFile""""
  }.toMap

  val chartContents =
    s"""# Generated by build.sbt. Please don't manually update
       |apiVersion: v1
       |name: $team-${name.value}
       |version: ${version.value}
       |description: $importerDescription 
       |""".stripMargin

  val valuesContents =
    s"""# Generated by build.sbt. Please don't manually update      
       |version: ${version.value}
       |sparkVersion: $sparkVersion
       |image: $team/${name.value}:${version.value}
       |jar: local://$targetDockerJarPath/$team-${name.value}.jar
       |mainClass: $importerMainClass
       |jarDependencies: [${jarDependencies.values.mkString(", ")}]
       |fileDependencies: []
       |jmxExporterJar: ${jarDependencies.getOrElse("jmx_prometheus_javaagent", "null").replace("local://","")}
       |""".stripMargin

  IO.write(chartFile, chartContents)
  IO.write(valuesFile, valuesContents)
  Seq(chartFile, valuesFile)
}
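
With the same hypothetical versions, jarDependencies contains values that are already wrapped in double quotes (that's what the quadruple quotes above do), so they land as quoted YAML strings, while jmxExporterJar only loses its local:// prefix:

// Hypothetical result of jarDependencies (note the embedded quotes)
Map(
  "postgresql"               -> "\"local:///opt/spark/jars/postgresql-42.2.5.jar\"",
  "jmx_prometheus_javaagent" -> "\"local:///opt/spark/jars/jmx_prometheus_javaagent-0.11.0.jar\""
)
// jmxExporterJar in values.yaml then renders as
// "/opt/spark/jars/jmx_prometheus_javaagent-0.11.0.jar"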

Finally, it all comes together in the project definition in build.sbt:

lazy val importer = (project in file("importer"))
  .enablePlugins(JavaAppPackaging)
  .enablePlugins(sbtdocker.DockerPlugin)
  .enablePlugins(AshScriptPlugin)
  .dependsOn(util)
  .settings(
    commonSettings,
    testSettings,
    assemblySettings,
    dockerSettings,
    scalafmtSettings,
    name := "etl-importer",
    Compile / mainClass := Some(importerMainClass),
    Compile / resourceGenerators += createImporterHelmChart.taskValue
  )

This is combined with values files per environment and a Helm template:

apiVersion: sparkoperator.k8s.io/v1beta1
kind: SparkApplication
metadata:
  name: {{ .Chart.Name | trunc 64 }}
  labels:
    name: {{ .Chart.Name | trunc 63 | quote }}
    release: {{ .Release.Name | trunc 63 | quote }}
    revision: {{ .Release.Revision | quote }}
    sparkVersion: {{ .Values.sparkVersion | quote }}
    version: {{ .Chart.Version | quote }}
spec:
  type: Scala
  mode: cluster
  image: {{ .Values.image | quote }}
  imagePullPolicy: {{ .Values.imagePullPolicy }}
  mainClass: {{ .Values.mainClass | quote }}
  mainApplicationFile: {{ .Values.jar | quote }}
  sparkVersion: {{ .Values.sparkVersion | quote }}
  restartPolicy:
    type: Never
  deps:
    {{- if .Values.jarDependencies }}
    jars:
    {{- range .Values.jarDependencies }}
      - {{ . | quote }}
    {{- end }}
    {{- end }}
...

I can now build the Docker images using

sbt [project name]/docker

and deploy them using

helm install ./helm -f ./helm/values-minikube.yaml --namespace=[ns] --name [name]

It can probably be made prettier, but for now it works like a charm.

-- Tom Lous
Source: StackOverflow