Flink 1.5.4 is not registering Google Cloud Storage (GCS) filesystem in Kubernetes, although it works in docker container

10/4/2018

Note: Baking keys into an image is the worst you can do, I did this here to have a binary equal filesystem between Docker and Kubernetes while debugging.

I am trying to start up a flink-jobmanager that persists its state in GCS, so I added a high-availability.storageDir: gs://BUCKET/ha line to my flink-conf.yaml and I am building my Dockerfile as described here

This is my Dockerfile:

FROM flink:1.5-hadoop28

ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar /opt/flink/lib/gcs-connector-latest-hadoop2.jar

RUN mkdir /opt/flink/etc-hadoop
COPY flink-conf.yaml /opt/flink/conf/flink-conf.yaml
COPY key.json /opt/flink/etc-hadoop/key.json
COPY core-site.xml /opt/flink/etc-hadoop/core-site.xml

Now if I build this container via docker build -t flink:dev . and start an interactive shell in it like docker run -ti flink:dev /bin/bash, I am able to start the flink jobmanager via:

flink-console.sh jobmanager --configDir=/opt/flink/conf/ --executionMode=cluster

Flink is picking up the jar's and starting normally. However, when I use the following yaml for starting it on Kubernetes, based on the one here:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
      - name: jobmanager
        image: flink:dev
        imagePullPolicy: Always
        resources:
          requests:
            memory: "1024Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        args: ["jobmanager"]
        ports:
        - containerPort: 6123
          name: rpc
        - containerPort: 6124
          name: blob
        - containerPort: 6125
          name: query
        - containerPort: 8081
          name: ui
        - containerPort: 46110
          name: ha
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /opt/flink/etc-hadoop/key.json
        - name: JOB_MANAGER_RPC_ADDRESS
          value: flink-jobmanager

Flink seems to be unable to register the filesystem:

2018-10-04 09:20:51,357 DEBUG org.apache.flink.runtime.util.HadoopUtils                     - Cannot find hdfs-default configuration-file path in Flink config.
2018-10-04 09:20:51,358 DEBUG org.apache.flink.runtime.util.HadoopUtils                     - Cannot find hdfs-site configuration-file path in Flink config.
2018-10-04 09:20:51,359 DEBUG org.apache.flink.runtime.util.HadoopUtils                     - Adding /opt/flink/etc-hadoop//core-site.xml to hadoop configuration
2018-10-04 09:20:51,767 DEBUG org.apache.hadoop.security.UserGroupInformation               - PrivilegedActionException as:flink (auth:SIMPLE) cause:java.io.IOException: Could not create FileSystem for highly available storage (high-availability.storageDir)
2018-10-04 09:20:51,767 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Cluster initialization failed.
java.io.IOException: Could not create FileSystem for highly available storage (high-availability.storageDir)
        at org.apache.flink.runtime.blob.BlobUtils.createFileSystemBlobStore(BlobUtils.java:122)
        at org.apache.flink.runtime.blob.BlobUtils.createBlobStoreFromConfig(BlobUtils.java:95)
        at org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.createHighAvailabilityServices(HighAvailabilityServicesUtils.java:115)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.createHaServices(ClusterEntrypoint.java:402)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:270)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:225)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:189)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:188)
        at org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:91)
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'gs'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
        at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:405)
        at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:320)
        at org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
        at org.apache.flink.runtime.blob.BlobUtils.createFileSystemBlobStore(BlobUtils.java:119)
        ... 12 more
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Hadoop File System abstraction does not support scheme 'gs'. Either no file system implementation exists for that scheme, or the relevant classes are missing from the classpath.
        at org.apache.flink.runtime.fs.hdfs.HadoopFsFactory.create(HadoopFsFactory.java:102)
        at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:401)
        ... 15 more
Caused by: java.io.IOException: No FileSystem for scheme: gs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
        at org.apache.flink.runtime.fs.hdfs.HadoopFsFactory.create(HadoopFsFactory.java:99)
        ... 16 more

As Kubernetes should be using the same image, I am confused how this is possible. Am I overseeing something here?

-- Tobi
apache-flink
docker
google-cloud-storage
kubernetes

1 Answer

10/4/2018

The problem was using the dev tag. Using specific version tags fixed the issue.

-- Tobi
Source: StackOverflow