I have a simple groovy script that leverages the GPars
library's withPool
functionality to launch HTTP GET requests to two internal API endpoints in parallel.
The script runs fine locally, both directly as well as a docker container.
When I deploy it as a Kubernetes Job
(in our internal EKS cluster: 1.20
), it runs there as well, but the moment it hits the first withPool
call, I see a giant thread dump, but the execution continues, and completes successfully.
NOTE: Containers in our cluster run with the following pod security context:
securityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
Environment
# From the k8s job container
groovy@app-271df1d7-15848624-mzhhj:/app$ groovy --version
WARNING: Using incubator modules: jdk.incubator.foreign, jdk.incubator.vector
Groovy Version: 4.0.0 JVM: 17.0.2 Vendor: Eclipse Adoptium OS: Linux
groovy@app-271df1d7-15848624-mzhhj:/app$ ps -ef
UID PID PPID C STIME TTY TIME CMD
groovy 1 0 0 21:04 ? 00:00:00 /bin/bash bin/run-script.sh
groovy 12 1 42 21:04 ? 00:00:17 /opt/java/openjdk/bin/java -Xms3g -Xmx3g --add-modules=ALL-SYSTEM -classpath /opt/groovy/lib/groovy-4.0.0.jar -Dscript.name=/usr/bin/groovy -Dprogram.name=groovy -Dgroovy.starter.conf=/opt/groovy/conf/groovy-starter.conf -Dgroovy.home=/opt/groovy -Dtools.jar=/opt/java/openjdk/lib/tools.jar org.codehaus.groovy.tools.GroovyStarter --main groovy.ui.GroovyMain --conf /opt/groovy/conf/groovy-starter.conf --classpath . /tmp/script.groovy
groovy 116 0 0 21:05 pts/0 00:00:00 bash
groovy 160 116 0 21:05 pts/0 00:00:00 ps -ef
Script (relevant parts)
@Grab('org.codehaus.gpars:gpars:1.2.1')
import static groovyx.gpars.GParsPool.withPool
import groovy.json.JsonSlurper
final def jsl = new JsonSlurper()
//...
while (!(nextBatch = getBatch(batchSize)).isEmpty()) {
def devThread = Thread.start {
withPool(poolSize) {
nextBatch.eachParallel { kw ->
String url = dev + "&" + "query=$kw"
try {
def response = jsl.parseText(url.toURL().getText(connectTimeout: 10.seconds, readTimeout: 10.seconds,
useCaches: true, allowUserInteraction: false))
devResponses[kw] = response
} catch (e) {
println("\tFailed to fetch: $url | error: $e")
}
}
}
}
def stgThread = Thread.start {
withPool(poolSize) {
nextBatch.eachParallel { kw ->
String url = stg + "&" + "query=$kw"
try {
def response = jsl.parseText(url.toURL().getText(connectTimeout: 10.seconds, readTimeout: 10.seconds,
useCaches: true, allowUserInteraction: false))
stgResponses[kw] = response
} catch (e) {
println("\tFailed to fetch: $url | error: $e")
}
}
}
}
devThread.join()
stgThread.join()
}
Dockerfile
FROM groovy:4.0.0-jdk17 as builder
USER root
RUN apt-get update && apt-get install -yq bash curl wget jq
WORKDIR /app
COPY bin /app/bin
RUN chmod +x /app/bin/*
USER groovy
ENTRYPOINT ["/bin/bash"]
CMD ["bin/run-script.sh"]
The bin/run-script.sh
simply downloads the above groovy script at runtime and executes it.
wget "$GROOVY_SCRIPT" -O "$LOCAL_FILE"
...
groovy "$LOCAL_FILE"
As soon as the execution hits the first call to withPool(poolSize)
, there's a giant thread dump, but execution continues.
I'm trying to figure out what could be causing this behavior. Any ideas 🤷🏽♂️?
For posterity, answering my own question here.
The issue turned out to be this log4j2 JVM hot-patch that we're currently leveraging to fix the recent log4j2 vulnerability. This agent (running as a DaemonSet
) patches all running JVMs in all our k8s clusters.
This, somehow, causes my OpenJDK 17
based app to thread dump. I found the same issue with an ElasticSearch 8.1.0
deployment as well (also uses a pre-packaged OpenJDK 17
). This one is a service, so I could see a thread dump happening pretty much every half hour! Interestingly, there are other JVM services (and some SOLR 8
deployments) that don't have this issue 🤷🏽♂️.
Anyway, I worked with our devops team to temporarily exclude the node that deployment was running on, and lo and behold, the thread dumps disappeared!
Balance in the universe has been restored 🧘🏻♂️.