I've been building fat jars for spark-submit for quite a while and they work like a charm.
Now I'd like to deploy Spark jobs on top of Kubernetes.
The approach described on the Spark site (https://spark.apache.org/docs/latest/running-on-kubernetes.html) just calls the docker-image-tool.sh script to bundle a basic jar into a Docker container.
I was wondering: couldn't this be done more cleanly with sbt-native-packager in combination with sbt-assembly, building Docker images that contain everything needed to start the Spark driver and run the code (with all libraries bundled), and perhaps offering a way to bundle extra classpath libraries (like the Postgres jar) into a single image?
That way, running the pod would spin up the Spark Kubernetes master (client mode or cluster mode, whichever works best), trigger the creation of worker pods, spark-submit the local jar (with all required libraries included) and run until completion.
Maybe I'm missing why this can't work or is a bad idea, but I feel configuration would be more centralised and straightforward than with the current approach.
Or are there other best practices?
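Concretely, I had something like this in mind (a rough sketch only; the project name, main class and base image are hypothetical):
// sketch: bundle the sbt-assembly fat jar into an image via sbt-native-packager
lazy val sparkJob = (project in file("spark-job"))
  .enablePlugins(JavaAppPackaging, DockerPlugin)
  .settings(
    Compile / mainClass := Some("com.example.SparkJob"), // hypothetical main class
    // ship the fat jar into the image alongside the individual dependency jars
    Universal / mappings += assembly.value -> s"jars/${name.value}-assembly.jar",
    dockerBaseImage := "openjdk:8-jre-slim" // placeholder base image
  )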
So in the end I got everything working using helm, the spark-on-k8s-operator and sbt-docker.
First, I extract some of the config into variables in build.sbt so they can be used by both the assembly and the docker generator:
// define some dependencies that should not be compiled, but copied into docker
val externalDependencies = Seq(
  "org.postgresql" % "postgresql" % postgresVersion,
  "io.prometheus.jmx" % "jmx_prometheus_javaagent" % jmxPrometheusVersion
)

// Settings
val team = "hazelnut"
val importerDescription = "..."
val importerMainClass = "..."
val targetDockerJarPath = "/opt/spark/jars"

// derive (organisation dir, module name, version, jar file name) per module,
// following the Maven Central repository layout
val externalPaths = externalDependencies.map { module =>
  val parts = module.toString.split(""":""")
  val orgDir = parts(0).replaceAll("""\.""", """/""")
  val moduleName = parts(1)
  val version = parts(2)
  val jarFile = s"$moduleName-$version.jar"
  (orgDir, moduleName, version, jarFile)
}
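The snippets also reference a handful of version vals that aren't shown; they look something like this (the actual values here are placeholders, pick your own):
val postgresVersion      = "42.2.5" // placeholder version
val jmxPrometheusVersion = "0.11.0" // placeholder version
val sparkVersion         = "2.4.0"  // placeholder version
val openShiftVersion     = "..."    // must match the lightbend/spark image tags
val scalaBaseVersion     = "2.11"   // placeholder version

// with these, externalPaths contains e.g.
// ("org/postgresql", "postgresql", "42.2.5", "postgresql-42.2.5.jar")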
Next I define the assembly settings to create the fat jar (which can be whatever you need):
lazy val assemblySettings = Seq(
  // Assembly options
  assembly / assemblyOption := (assembly / assemblyOption).value.copy(includeScala = false),
  assembly / assemblyMergeStrategy := {
    case PathList("reference.conf") => MergeStrategy.concat
    case PathList("META-INF", _ @ _*) => MergeStrategy.discard
    case "log4j.properties" => MergeStrategy.concat
    case _ => MergeStrategy.first
  },
  assembly / logLevel := sbt.util.Level.Error,
  assembly / test := {},
  pomIncludeRepository := { _ => false }
)
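With these merge strategies in place, the fat jar can be sanity-checked on its own, before any image is built:
sbt [project name]/assembly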
Then the docker settings are defined:
lazy val dockerSettings = Seq(
  docker / imageNames := Seq(
    ImageName(s"$team/${name.value}:latest"),
    ImageName(s"$team/${name.value}:${version.value}")
  ),
  docker / dockerfile := {
    // The assembly task generates a fat JAR file
    val artifact: File = assembly.value
    val artifactTargetPath = s"$targetDockerJarPath/$team-${name.value}.jar"
    // download each external dependency straight from Maven Central into the image
    externalPaths
      .map { case (extOrgDir, extModuleName, extVersion, jarFile) =>
        val url = List("https://repo1.maven.org/maven2", extOrgDir, extModuleName, extVersion, jarFile).mkString("/")
        val target = s"$targetDockerJarPath/$jarFile"
        Instructions.Run.exec(List("curl", url, "--output", target, "--silent"))
      }
      .foldLeft(new Dockerfile {
        // https://hub.docker.com/r/lightbend/spark/tags
        from(s"lightbend/spark:$openShiftVersion-OpenShift-$sparkVersion-ubuntu-$scalaBaseVersion")
      }) {
        case (df, run) => df.addInstruction(run)
      }
      .add(artifact, artifactTargetPath)
  }
)
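For reference, the Dockerfile that sbt-docker stages from these settings comes out roughly like this (versions elided; sbt-docker stages the fat jar next to the Dockerfile before building):
FROM lightbend/spark:<openShiftVersion>-OpenShift-<sparkVersion>-ubuntu-<scalaBaseVersion>
RUN ["curl", "https://repo1.maven.org/maven2/org/postgresql/postgresql/<version>/postgresql-<version>.jar", "--output", "/opt/spark/jars/postgresql-<version>.jar", "--silent"]
RUN ["curl", "https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/<version>/jmx_prometheus_javaagent-<version>.jar", "--output", "/opt/spark/jars/jmx_prometheus_javaagent-<version>.jar", "--silent"]
ADD <staged fat jar> /opt/spark/jars/hazelnut-etl-importer.jar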
And I create a task to generate the helm Chart and values files:
lazy val createImporterHelmChart: Def.Initialize[Task[Seq[File]]] = Def.task {
  val chartFile = baseDirectory.value / "../helm" / "Chart.yaml"
  val valuesFile = baseDirectory.value / "../helm" / "values.yaml"
  // map module name -> quoted "local://" path of the jar inside the image
  val jarDependencies = externalPaths.map {
    case (_, extModuleName, _, jarFile) =>
      extModuleName -> s""""local://$targetDockerJarPath/$jarFile""""
  }.toMap
  val chartContents =
    s"""# Generated by build.sbt. Please don't manually update
       |apiVersion: v1
       |name: $team-${name.value}
       |version: ${version.value}
       |description: $importerDescription
       |""".stripMargin
  val valuesContents =
    s"""# Generated by build.sbt. Please don't manually update
       |version: ${version.value}
       |sparkVersion: $sparkVersion
       |image: $team/${name.value}:${version.value}
       |jar: local://$targetDockerJarPath/$team-${name.value}.jar
       |mainClass: $importerMainClass
       |jarDependencies: [${jarDependencies.values.mkString(", ")}]
       |fileDependencies: []
       |jmxExporterJar: ${jarDependencies.getOrElse("jmx_prometheus_javaagent", "null").replace("local://", "")}
       |""".stripMargin
  IO.write(chartFile, chartContents)
  IO.write(valuesFile, valuesContents)
  Seq(chartFile, valuesFile)
}
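Assuming version 0.1.0 and the placeholder dependency versions from above, the generated files come out roughly like this:
# helm/Chart.yaml
apiVersion: v1
name: hazelnut-etl-importer
version: 0.1.0
description: ...

# helm/values.yaml
version: 0.1.0
sparkVersion: 2.4.0
image: hazelnut/etl-importer:0.1.0
jar: local:///opt/spark/jars/hazelnut-etl-importer.jar
mainClass: ...
jarDependencies: ["local:///opt/spark/jars/postgresql-42.2.5.jar", "local:///opt/spark/jars/jmx_prometheus_javaagent-0.11.0.jar"]
fileDependencies: []
jmxExporterJar: "/opt/spark/jars/jmx_prometheus_javaagent-0.11.0.jar"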
Finally, it all comes together in the project definition in build.sbt:
lazy val importer = (project in file("importer"))
  .enablePlugins(JavaAppPackaging)
  .enablePlugins(sbtdocker.DockerPlugin)
  .enablePlugins(AshScriptPlugin)
  .dependsOn(util)
  .settings(
    commonSettings,
    testSettings,
    assemblySettings,
    dockerSettings,
    scalafmtSettings,
    name := "etl-importer",
    Compile / mainClass := Some(importerMainClass),
    Compile / resourceGenerators += createImporterHelmChart.taskValue
  )
Because the chart task is wired in as a resource generator, Chart.yaml and values.yaml are regenerated on every build. Together with values files per environment, a helm template brings it all together:
apiVersion: sparkoperator.k8s.io/v1beta1
kind: SparkApplication
metadata:
  name: {{ .Chart.Name | trunc 64 }}
  labels:
    name: {{ .Chart.Name | trunc 63 | quote }}
    release: {{ .Release.Name | trunc 63 | quote }}
    revision: {{ .Release.Revision | quote }}
    sparkVersion: {{ .Values.sparkVersion | quote }}
    version: {{ .Chart.Version | quote }}
spec:
  type: Scala
  mode: cluster
  image: {{ .Values.image | quote }}
  imagePullPolicy: {{ .Values.imagePullPolicy }}
  mainClass: {{ .Values.mainClass | quote }}
  mainApplicationFile: {{ .Values.jar | quote }}
  sparkVersion: {{ .Values.sparkVersion | quote }}
  restartPolicy:
    type: Never
  deps:
    {{- if .Values.jarDependencies }}
    jars:
      {{- range .Values.jarDependencies }}
      - {{ . | quote }}
      {{- end }}
    {{- end }}
...
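Note the template reads imagePullPolicy from the per-environment values files, since the generated values.yaml doesn't set it. A minimal values-minikube.yaml could be as small as this (assuming images are built directly into minikube's Docker daemon):
# helm/values-minikube.yaml (sketch)
imagePullPolicy: Never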
I can now build Docker images using
sbt [project name]/docker
and deploy them using
helm install ./helm -f ./helm/values-minikube.yaml --namespace=[ns] --name [name]
It can probably be made prettier, but for now this works like a charm.