spark-submit encounters "PKIX path building failed"

11/16/2018

So I have a cluster on Google Kubernetes Engine and I do spark-submit to run some Spark jobs. (I don't use the spark-submit script exactly; I launch the submit from Java code, but both approaches end up invoking the same Scala class, SparkSubmit.class.)
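
Roughly, the Java launch looks like this (a minimal sketch; the master URL, job class, and jar path are hypothetical placeholders):

import org.apache.spark.deploy.SparkSubmit;

public class Launcher {
    public static void main(String[] args) {
        String[] submitArgs = {
            "--master", "k8s://https://<IP>:443",  // hypothetical API server address
            "--deploy-mode", "cluster",
            "--class", "com.example.MyJob",        // hypothetical job class
            "local:///opt/jars/my-job.jar"         // hypothetical jar location
        };
        // Same entry point the spark-submit script invokes
        SparkSubmit.main(submitArgs);
    }
}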

And in my case, there are two clusters I can connect to from my laptop using the gcloud command.

e.g.

  1. gcloud container clusters get-credentials cluster-1
  2. gcloud container clusters get-credentials cluster-2

When I connect to cluster-1 and spark-submit submits to cluster-1, it works. But when I run the second gcloud command and then still submit to cluster-1, it fails, and the following stack trace appears (abridged version):

io.fabric8.kubernetes.client.KubernetesClientException: Failed to start websocket
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:194)
at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543)
at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:208)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:148)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)

I've been searching for a while without success. My guess is that when spark-submit launches, it looks for some sort of Kubernetes credential on the local machine, and the context change caused by the two gcloud commands above messed it up.

I'm just curious: when we do spark-submit, how exactly does the remote K8s server know who I am? What's the auth process involved in all this?

Thank you in advance.

-- dex
apache-spark
kubernetes
pkix

2 Answers

11/16/2018

If you want to see what the gcloud container clusters get-credentials cluster-1 command does, you can start from scratch again and look at the content of ~/.kube/config:

rm -rf ~/.kube
gcloud container clusters get-credentials cluster-1
cat ~/.kube/config
gcloud container clusters get-credentials cluster-2
cat ~/.kube/config

Something is probably not matching, or there's a conflict, perhaps in the users or contexts. For example, you may have credentials for both clusters but be using the context for cluster-1 to access cluster-2. You can inspect them with:

$ kubectl config get-contexts
$ kubectl config get-clusters
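
If the current context is pointing at the wrong cluster, switching it back should fix the submit. For example (the context name below is a placeholder; use whatever get-contexts lists):

$ kubectl config current-context
$ kubectl config use-context <context-for-cluster-1>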

The structure of the ~/.kube/config file should be something like this:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <redacted> or file
    server: https://<IP>:6443
  name: cluster-1
- cluster:
    certificate-authority: <redacted> or file
    server: https://<IP>:8443
  name: cluster-2
contexts:
- context:
    cluster: cluster-1
    user: youruser
  name: access-to-cluster-1
- context:
    cluster: cluster-2
    user: youruser
  name: access-to-cluster-2
current-context: access-to-cluster-1
kind: Config
preferences: {}
users:
- name: ....
  user:
   ...
- name: ....
  user:
   ...

Looking at the code, Spark appears to use the io.fabric8.kubernetes.client.KubernetesClient library; see, for example, KubernetesDriverBuilder.scala.
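
As an aside, here is a minimal sketch (not Spark's actual code; it assumes the fabric8 client is on the classpath) showing how fabric8's auto-configuration reads ~/.kube/config, so whatever current-context points at decides which API server and CA certificate get used:

import io.fabric8.kubernetes.client.Config;

public class KubeConfigCheck {
    public static void main(String[] args) {
        // autoConfigure(null) reads ~/.kube/config and follows current-context
        Config config = Config.autoConfigure(null);
        System.out.println("Master URL:   " + config.getMasterUrl());
        System.out.println("CA cert file: " + config.getCaCertFile());
    }
}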

-- Rico
Source: StackOverflow

11/16/2018

A "PKIX path building failed" error means Java tried to open an SSL connection but was unable to find a chain of certificates (the path) that validates the certificate the server offered.

The environment your code runs in does not trust the certificate offered by the cluster. The clusters are probably using self-signed certificates.

Run from the command line, Java looks for the chain in the truststore located at jre/lib/security/cacerts. Run as part of a larger environment (Tomcat, Glassfish, etc.), it will use that environment's certificate truststore.
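
You can inspect the default truststore with keytool (path shown for a Java 8 layout; the default password for cacerts is "changeit"):

keytool -list -keystore "$JAVA_HOME/jre/lib/security/cacerts" -storepass changeit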

Since you're starting spark-submit manually, you're likely missing an option to specify where to find the keystore (server certificate and private key) and truststore (CA certificates). These are usually specified as:

-Djavax.net.ssl.trustStore=/somepath/truststore.jks 
-Djavax.net.ssl.keyStore=/somepath/keystore.jks
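
Since you launch the submit from your own Java code, these flags go on that JVM. For example (classpath and main class are placeholders):

java -Djavax.net.ssl.trustStore=/somepath/truststore.jks \
     -Djavax.net.ssl.keyStore=/somepath/keystore.jks \
     -cp <your-classpath> <your.launcher.MainClass>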

If you're running on Java 9+, you may also need to specify the store type:

-Djavax.net.ssl.keyStoreType=<TYPE>
-Djavax.net.ssl.trustStoreType=<TYPE>

Up through Java 8, the default keystore type was JKS. Since Java 9, the default is PKCS12.

In the case of a self-signed certificate, you can export it from the keystore and import it into the truststore as a trusted certificate. There are several sites with instructions for how to do this; I find Jakob Jenkov's site to be quite readable.
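
The keytool steps look roughly like this (alias and file names are placeholders; you'll be prompted to confirm trust on import):

keytool -exportcert -alias myserver -keystore /somepath/keystore.jks -file server.crt
keytool -importcert -alias myserver -file server.crt -keystore /somepath/truststore.jks

On GKE, the cluster's CA is also available as the base64-encoded certificate-authority-data value in ~/.kube/config; decoding that to a .crt file and importing it the same way should let the JVM trust the API server.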

-- Devon_C_Miller
Source: StackOverflow