When running Spark on Kubernetes to access a Kerberized Hadoop cluster, how do you resolve a "SIMPLE authentication is not enabled" error on executors?

1/14/2019

I'm trying to run Spark on Kubernetes, with the aim of processing data from a Kerberized Hadoop cluster. My application consists of simple SparkSQL transformations. While I'm able to run the process successfully on a single driver pod, I cannot do this when attempting to use any executors. Instead, I get:

org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]

Since the Hadoop environment is Kerberized, I've provided a valid keytab, the core-site.xml, hive-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml files, and a krb5.conf file inside the Docker image.

I set up the environment settings with the following method:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

trait EnvironmentConfiguration {

  def configureEnvironment(): Unit = {
    val conf = new Configuration
    conf.set("hadoop.security.authentication", "kerberos")
    conf.set("hadoop.security.authorization", "true")
    conf.set("com.sun.security.auth.module.Krb5LoginModule", "required")
    System.setProperty("java.security.krb5.conf", ConfigurationProperties.kerberosConfLocation)
    // UGI must be configured for Kerberos before the keytab login
    UserGroupInformation.setConfiguration(conf)
    UserGroupInformation.loginUserFromKeytab(ConfigurationProperties.keytabUser, ConfigurationProperties.keytabLocation)
  }
}

I also pass the *-site.xml files through the following method:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

trait SparkConfiguration {

  def createSparkSession(): SparkSession = {
    val spark = SparkSession.builder
      .appName("MiniSparkK8")
      .enableHiveSupport()
      .master("local[*]")
      .config("spark.sql.hive.metastore.version", ConfigurationProperties.hiveMetastoreVersion)
      .config("spark.executor.memory", ConfigurationProperties.sparkExecutorMemory)
      .config("spark.sql.hive.version", ConfigurationProperties.hiveVersion)
      .config("spark.sql.hive.metastore.jars", ConfigurationProperties.hiveMetastoreJars)
      .getOrCreate()
    // Register the Hadoop/Hive client configuration files shipped inside the image
    spark.sparkContext.hadoopConfiguration.addResource(new Path(ConfigurationProperties.coreSiteLocation))
    spark.sparkContext.hadoopConfiguration.addResource(new Path(ConfigurationProperties.hiveSiteLocation))
    spark.sparkContext.hadoopConfiguration.addResource(new Path(ConfigurationProperties.hdfsSiteLocation))
    spark.sparkContext.hadoopConfiguration.addResource(new Path(ConfigurationProperties.yarnSiteLocation))
    spark.sparkContext.hadoopConfiguration.addResource(new Path(ConfigurationProperties.mapredSiteLocation))
    spark
  }
}
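
For context, the two traits are mixed into the entry point roughly along these lines (a minimal sketch only; the argument handling and the query are placeholders, not my actual Driver code):

package org.spark

// Hypothetical wiring of the two traits shown above.
object Driver extends EnvironmentConfiguration with SparkConfiguration {
  def main(args: Array[String]): Unit = {
    configureEnvironment()             // Kerberos login from the keytab
    val spark = createSparkSession()   // SparkSession with Hive support
    spark.sql("SHOW DATABASES").show() // placeholder for the SparkSQL transformations
  }
}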

I run the whole process with the following spark-submit command:

spark-submit ^
--master k8s://https://kubernetes.example.environment.url:8443 ^
--deploy-mode cluster ^
--name mini-spark-k8 ^
--class org.spark.Driver ^
--conf spark.executor.instances=2 ^
--conf spark.kubernetes.namespace=<company-openshift-namespace> ^
--conf spark.kubernetes.container.image=<company_image_registry.image> ^
--conf spark.kubernetes.driver.pod.name=minisparkk8-cluster ^
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark ^
local:///opt/spark/examples/target/MiniSparkK8-1.0-SNAPSHOT.jar ^
/opt/spark/mini-spark-conf.properties

The above configuration is enough to get my Spark application running and successfully connecting to the Kerberized Hadoop cluster. Although the spark-submit command declares the creation of two executor pods, this does not happen because I have set the master to local[*]. Consequently, only one pod is created, which manages to connect to the Kerberized Hadoop cluster and successfully run my Spark transformations on Hive tables.

However, when I remove .master(local[*]), two executor pods are created. I can see from the logs that these executors connect successfully to the driver pod and are assigned tasks. It doesn't take long after this point for both of them to fail with the error mentioned above, and the failed executor pods are then terminated. This is despite the executors already having all the files necessary to connect to the Kerberized Hadoop cluster inside their image. I believe that the executors are not using the keytab, which they would be doing if they were running the JAR themselves; instead, they're only running tasks handed to them by the driver.

I can see from the logs that the driver manages to authenticate itself correctly with the keytab for user USER123:

INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, USER123); groups with view permissions: Set(); users with modify permissions: Set(spark, USER123); groups with modify permissions: Set()

On the other hand, the executor's log shows that user USER123 is not authenticated:

INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()

I have looked at various sources, including here. It mentions that HIVE_CONF_DIR needs to be defined, but I can see from my program (which prints the environment variables) that this variable is not present, even when the driver pod manages to authenticate itself and run the Spark process fine.
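
(If HIVE_CONF_DIR did turn out to matter, it could presumably be set on both the driver and executor pods through Spark's environment-variable pass-through configs; the /etc/hadoop/conf path below is only an assumption about where the *-site.xml files sit in my image.)

--conf spark.kubernetes.driverEnv.HIVE_CONF_DIR=/etc/hadoop/conf ^
--conf spark.executorEnv.HIVE_CONF_DIR=/etc/hadoop/conf ^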

I've tried running with the following added to the previous spark-submit command:

--conf spark.kubernetes.kerberos.enabled=true ^
--conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf ^
--conf spark.kubernetes.kerberos.keytab=/var/keytabs/USER123.keytab ^
--conf spark.kubernetes.kerberos.principal=USER123@REALM ^

But this made no difference.

My question is: how can I get the executors to authenticate themselves with the keytab they have in their image? I'm hoping this will allow them to perform their delegated tasks.

-- K.Naga
apache-spark
docker
kerberos
kubernetes
openshift

4 Answers

6/28/2019

If you don't mind running Hive instead of SparkSQL for your SQL analytics (and also having to learn Hive), Hive on MR3 offers a solution for running Hive on Kubernetes while a secure (Kerberized) HDFS serves as a remote data source. As an added bonus, as of Hive 3, Hive is much faster than SparkSQL.

https://mr3.postech.ac.kr/hivek8s/home/

-- glapark
Source: StackOverflow

6/10/2019

Try running kinit with your keytab to obtain a TGT from the KDC in advance.

For example, you could run kinit in the container before starting the application.
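
Using the keytab path and principal from the question's spark-submit flags (so these values are only illustrative), that could be as simple as adding this to the container entrypoint:

kinit -kt /var/keytabs/USER123.keytab USER123@REALM
klist   # verify that a ticket-granting ticket was obtained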

-- runzhliu
Source: StackOverflow

9/5/2019

First, get a delegation token from Hadoop using the commands below.

  1. Do a kinit -kt with your keytab and principal.
  2. Execute the following to store the HDFS delegation token in a tmp path:
     spark-submit --class org.apache.hadoop.hdfs.tools.DelegationTokenFetcher "" --renewer null /tmp/spark.token
  3. Do your actual spark-submit, adding this configuration:
     --conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION=/tmp/spark.token \

The above is how YARN executors authenticate. Do the same for Kubernetes executors too.
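
For illustration, applied to the question's command, step 3 means adding a single flag (this assumes /tmp/spark.token has been made available inside the executor pods, for example mounted or baked into the image; HADOOP_TOKEN_FILE_LOCATION is the environment variable Hadoop's UserGroupInformation reads a credentials file from):

spark-submit ^
--master k8s://https://kubernetes.example.environment.url:8443 ^
--deploy-mode cluster ^
--conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION=/tmp/spark.token ^
<remaining flags as in the question> ^
local:///opt/spark/examples/target/MiniSparkK8-1.0-SNAPSHOT.jar ^
/opt/spark/mini-spark-conf.properties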

-- hari vignesh
Source: StackOverflow

2/15/2019

Spark on k8s does not support Kerberos yet. This may help you: https://issues.apache.org/jira/browse/SPARK-23257

-- hustcat
Source: StackOverflow