I have built a Docker container that calls SageMaker through the Java SDK. The container is deployed on a k8s cluster with several replicas.
The container makes simple requests to SageMaker to list some models that we have trained and deployed. However, we are now running into a Java certificate error. I am fairly new to k8s and certificates, so I would appreciate any help fixing this.
Here are some traces from the log when it tries to list the endpoints:
org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:394)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:353)
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:132)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
at com.amazonaws.http.conn.$Proxy67.connect(Unknown Source)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1236)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
... 70 common frames omitted
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:397)
at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:302)
at sun.security.validator.Validator.validate(Validator.java:262)
at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621)
... 97 common frames omitted
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:392)
... 103 common frames omitted
I think I have found the answer to my problem. I set up another k8s cluster and deployed the container there as well. The containers there work fine and the certificate issue does not occur. Investigating further, I noticed that there were DNS resolution issues on the first k8s cluster: the containers with certificate issues could not even ping google.com. I fixed the DNS issue by not relying on core-dns and setting the DNS configuration in the deployment.yaml file instead. I am not sure I understand exactly why, but this seems to have fixed the certificate issue.
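For anyone wanting to confirm the same symptom from inside a pod, here is a minimal sketch of a DNS check in Java. The class name DnsCheck and the default host name are assumptions for illustration (the host follows the usual SageMaker API endpoint pattern with a placeholder region); pass your real endpoint as the first argument.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the AWS SDK.
class DnsCheck {

    /** Returns the addresses the JVM resolves for host, or an empty list if resolution fails. */
    static List<String> resolve(String host) {
        List<String> addresses = new ArrayList<>();
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                addresses.add(address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // Resolution failed: leave the list empty so the caller can report it.
        }
        return addresses;
    }

    public static void main(String[] args) {
        // Assumed placeholder endpoint; replace with the endpoint for your region.
        String host = args.length > 0 ? args[0] : "api.sagemaker.us-east-1.amazonaws.com";
        List<String> addresses = resolve(host);
        if (addresses.isEmpty()) {
            System.out.println("Could not resolve " + host);
        } else {
            System.out.println(host + " resolves to " + addresses);
        }
    }
}
```

If the addresses differ between the working and the failing cluster, or resolution fails outright on one of them, that points at DNS rather than at the truststore.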
This most likely has to do with a custom SSL certificate chain added to your network by your admin. You can inspect the root certificate by opening any secured website in your browser and clicking the Secure link to the left of the address bar (at least this is how it works in Chrome). A popup will show the certificate and certification information. Go to its Certificate Path and look at the root certificate; if it is a custom certificate, you will need to add it to your cacerts file. Read this link for more details.
The error message you're receiving occurs when Java does not trust the root certificate of the chain returned by a TLS endpoint. This often happens when the set of root certificates available to the JVM has been changed.
Per https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/JSSERefGuide.html#Customization:
"If a truststore named <java-home>/lib/security/jssecacerts is found, it is used.
If not, then a truststore named <java-home>/lib/security/cacerts is searched for and used (if it exists).
Finally, if a truststore is still not found, then the truststore managed by the TrustManager will be a new empty truststore."
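To see which of these files your JVM would pick up, the lookup order can be sketched in Java as below. This is only an approximation of the documented order, not the JSSE implementation itself. The class name TrustStoreLocator is made up for illustration, and the sketch also checks the javax.net.ssl.trustStore system property first, which the same guide describes as taking precedence over the two files quoted above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that mirrors the documented lookup order.
class TrustStoreLocator {

    /** Describes the truststore the default lookup would be expected to select. */
    static String locate(String trustStoreProperty, Path javaHome) {
        // 1. An explicit javax.net.ssl.trustStore property wins.
        if (trustStoreProperty != null && !trustStoreProperty.isEmpty()) {
            return "javax.net.ssl.trustStore=" + trustStoreProperty;
        }
        Path securityDir = javaHome.resolve("lib").resolve("security");
        // 2. <java-home>/lib/security/jssecacerts, if present.
        Path jssecacerts = securityDir.resolve("jssecacerts");
        if (Files.exists(jssecacerts)) {
            return jssecacerts.toString();
        }
        // 3. <java-home>/lib/security/cacerts, if present.
        Path cacerts = securityDir.resolve("cacerts");
        if (Files.exists(cacerts)) {
            return cacerts.toString();
        }
        // 4. Nothing found: the TrustManager ends up with an empty truststore.
        return "(none found: empty truststore)";
    }

    public static void main(String[] args) {
        String property = System.getProperty("javax.net.ssl.trustStore");
        Path javaHome = Paths.get(System.getProperty("java.home"));
        System.out.println(locate(property, javaHome));
    }
}
```

Running this inside the failing container and inside a working one shows quickly whether the two resolve to different truststores.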
OpenSSL is a good tool for debugging such certificate issues. The following command retrieves the certificates returned by an endpoint, which may help you determine what the certificate chain looks like.
openssl s_client -showcerts -connect www.example.com:443 </dev/null
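You can view the list of certificates that Java knows about using keytool, a utility shipped with the JRE. Note that the -cacerts option is available from Java 9 onwards; on Java 8, which your stack trace suggests, point keytool at the file instead with -keystore <java-home>/lib/security/cacerts.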
keytool -list -cacerts
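If keytool is not available in the container image, roughly the same information can be obtained from inside the JVM. The sketch below asks the default TrustManagerFactory for the issuers it accepts; the class name TrustedIssuers is an assumption for illustration, and the output reflects whatever truststore the running JVM actually loaded.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper for listing the CAs the running JVM trusts.
class TrustedIssuers {

    /** Returns the subject names of all certificates in the JVM's default truststore. */
    static List<String> trustedSubjects() {
        try {
            TrustManagerFactory factory =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            // Passing null tells the factory to load the JVM's default truststore.
            factory.init((KeyStore) null);
            List<String> subjects = new ArrayList<>();
            for (TrustManager manager : factory.getTrustManagers()) {
                if (manager instanceof X509TrustManager) {
                    for (X509Certificate cert : ((X509TrustManager) manager).getAcceptedIssuers()) {
                        subjects.add(cert.getSubjectX500Principal().getName());
                    }
                }
            }
            return subjects;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not load the default truststore", e);
        }
    }

    public static void main(String[] args) {
        List<String> subjects = trustedSubjects();
        System.out.println(subjects.size() + " trusted certificates");
        for (String subject : subjects) {
            System.out.println(subject);
        }
    }
}
```

If the root of the chain shown by openssl does not appear in this list, that would explain the PKIX path building failure.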
Some system administrators override the default certificates by writing an alternative truststore file into the default location. In other cases, teams override the default using the javax.net.ssl.trustStore system property.
Finally, you can use the jps utility, shipped with the JDK, to see the JVM arguments passed to a running Java process, including any system properties set with -D.
jps -v
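When jps is not installed in the image (it is not included in JRE-only images), a similar check can be done from within the application. The sketch below is one possible approach: it filters the JVM's input arguments for javax.net.ssl settings and also prints the effective property value, since a property set programmatically with System.setProperty would not appear among the arguments. The class name SslJvmArgs is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for spotting truststore overrides on the running JVM.
class SslJvmArgs {

    /** Keeps only the JVM arguments that set a javax.net.ssl.* system property. */
    static List<String> sslArgs(List<String> jvmArgs) {
        List<String> matches = new ArrayList<>();
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Djavax.net.ssl.")) {
                matches.add(arg);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The same JVM arguments that jps -v reports for this process.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("SSL-related JVM arguments: " + sslArgs(jvmArgs));
        // Effective value, which also covers properties set in code after startup.
        System.out.println("javax.net.ssl.trustStore = "
                + System.getProperty("javax.net.ssl.trustStore"));
    }
}
```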