Possible reasons for a Groovy program running as a Kubernetes Job dumping threads during execution

2/25/2022

I have a simple groovy script that leverages the GPars library's withPool functionality to launch HTTP GET requests to two internal API endpoints in parallel.

The script runs fine locally, both directly and as a Docker container.

When I deploy it as a Kubernetes Job (in our internal EKS cluster, v1.20), it runs there as well, but the moment it hits the first withPool call I see a giant thread dump. Execution continues regardless and completes successfully.

NOTE: Containers in our cluster run with the following pod security context:

      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000

Environment

# From the k8s job container
groovy@app-271df1d7-15848624-mzhhj:/app$ groovy --version
WARNING: Using incubator modules: jdk.incubator.foreign, jdk.incubator.vector
Groovy Version: 4.0.0 JVM: 17.0.2 Vendor: Eclipse Adoptium OS: Linux


groovy@app-271df1d7-15848624-mzhhj:/app$ ps -ef 
UID        PID  PPID  C STIME TTY          TIME CMD
groovy       1     0  0 21:04 ?        00:00:00 /bin/bash bin/run-script.sh
groovy      12     1 42 21:04 ?        00:00:17 /opt/java/openjdk/bin/java -Xms3g -Xmx3g --add-modules=ALL-SYSTEM -classpath /opt/groovy/lib/groovy-4.0.0.jar -Dscript.name=/usr/bin/groovy -Dprogram.name=groovy -Dgroovy.starter.conf=/opt/groovy/conf/groovy-starter.conf -Dgroovy.home=/opt/groovy -Dtools.jar=/opt/java/openjdk/lib/tools.jar org.codehaus.groovy.tools.GroovyStarter --main groovy.ui.GroovyMain --conf /opt/groovy/conf/groovy-starter.conf --classpath . /tmp/script.groovy
groovy     116     0  0 21:05 pts/0    00:00:00 bash
groovy     160   116  0 21:05 pts/0    00:00:00 ps -ef

Script (relevant parts)

@Grab('org.codehaus.gpars:gpars:1.2.1')
import static groovyx.gpars.GParsPool.withPool

import groovy.json.JsonSlurper
final def jsl = new JsonSlurper()


//...

while (!(nextBatch = getBatch(batchSize)).isEmpty()) {
    def devThread = Thread.start {
        withPool(poolSize) {
            nextBatch.eachParallel { kw ->
                String url = dev + "&" + "query=$kw"
                try {
                    def response = jsl.parseText(url.toURL().getText(connectTimeout: 10.seconds, readTimeout: 10.seconds,
                            useCaches: true, allowUserInteraction: false))
                    devResponses[kw] = response
                } catch (e) {
                    println("\tFailed to fetch: $url | error: $e")
                }
            }
        }
    }

    def stgThread = Thread.start {
        withPool(poolSize) {
            nextBatch.eachParallel { kw ->
                String url = stg + "&" + "query=$kw"
                try {
                    def response = jsl.parseText(url.toURL().getText(connectTimeout: 10.seconds, readTimeout: 10.seconds,
                            useCaches: true, allowUserInteraction: false))
                    stgResponses[kw] = response
                } catch (e) {
                    println("\tFailed to fetch: $url | error: $e")
                }
            }
        }
    }
    devThread.join()
    stgThread.join()
}

Dockerfile

FROM groovy:4.0.0-jdk17 as builder

USER root
RUN apt-get update && apt-get install -yq bash curl wget jq

WORKDIR /app

COPY bin /app/bin

RUN chmod +x /app/bin/*

USER groovy
ENTRYPOINT ["/bin/bash"]
CMD ["bin/run-script.sh"]

The bin/run-script.sh simply downloads the above groovy script at runtime and executes it.

wget "$GROOVY_SCRIPT" -O "$LOCAL_FILE"

...

groovy "$LOCAL_FILE"

As soon as the execution hits the first call to withPool(poolSize), there's a giant thread dump, but execution continues.
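For context, a SIGQUIT-style thread dump is purely diagnostic: the JVM prints every live thread's stack to stdout and carries on, which matches the "execution continues" behavior here. The same information can be captured in-process with the standard `ThreadMXBean` API (a minimal sketch, separate from the script above):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Print the stack of every live thread, roughly what a SIGQUIT
// (kill -3) thread dump contains; the JVM keeps running afterwards.
public class ThreadDumpDemo {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            System.out.print(info.toString());
        }
        System.out.println("still running after the dump");
    }
}
```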

I'm trying to figure out what could be causing this behavior. Any ideas 🤷🏽‍♂️?

Thread dump

-- kodeninja
bash
groovy
jvm
kubernetes
threadpool

1 Answer

3/28/2022

For posterity, answering my own question here.

The issue turned out to be the log4j2 JVM hot-patch agent that we're currently leveraging to fix the recent log4j2 vulnerability (Log4Shell, CVE-2021-44228). This agent (running as a DaemonSet) patches all running JVMs in all our k8s clusters.

This somehow causes my OpenJDK 17-based app to produce a thread dump (my best guess: the agent's dynamic-attach handshake sends SIGQUIT to the target JVM, and when the attach listener can't be initialized, e.g. because of user/permission mismatches in the container, the JVM's default SIGQUIT handler fires and prints a thread dump). I found the same issue with an Elasticsearch 8.1.0 deployment as well (it also ships a bundled OpenJDK 17). That one is a long-running service, so I could see a thread dump happening pretty much every half hour! Interestingly, there are other JVM services (and some SOLR 8 deployments) that don't have this issue 🤷🏽‍♂️.
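One way to check whether a given JVM has actually been patched, assuming the agent in question is Amazon's hotpatch-for-apache-log4j2 (which, per its README, records a marker system property named `log4jFixerAgentVersion` after patching; treat that property name as an assumption), is to read the property from inside the process:

```java
// Hedged sketch: report whether this JVM carries the Log4jHotPatch
// marker property. The property name comes from the agent's README
// and is an assumption, not something stated in the original post.
public class CheckHotPatch {
    public static void main(String[] args) {
        String version = System.getProperty("log4jFixerAgentVersion");
        if (version == null) {
            System.out.println("hot-patch marker not present");
        } else {
            System.out.println("patched by agent version " + version);
        }
    }
}
```

From outside the process, `jcmd <pid> VM.system_properties` shows the same system properties without modifying the app.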

Anyway, I worked with our devops team to temporarily exclude the node that deployment was running on from the hot-patch DaemonSet, and lo and behold, the thread dumps disappeared!
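A more targeted way to do this kind of exclusion, rather than taking a whole node out of service, is a label-based node-affinity rule on the DaemonSet. A hypothetical sketch (the label key/value are made up for illustration, not from the original post):

```yaml
# DaemonSet pod template fragment: skip any node labeled
# jvm-hotpatch=disabled. Label name is illustrative only.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: jvm-hotpatch
                    operator: NotIn
                    values: ["disabled"]
```

Nodes would then be opted out with `kubectl label node <node-name> jvm-hotpatch=disabled`.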

Balance in the universe has been restored 🧘🏻‍♂️.

-- kodeninja
Source: StackOverflow