Flink deployment in kubernetes failed to start jobs

10/29/2020

I created a flink cluster in kubernetes following this guide: https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html

The job manager was running. When a job was submitted to the job manager, it spawned an task manager pod, but the task manager failed to connect to job manager.

2020-10-29 13:22:51,069 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Connecting to ResourceManager akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
2020-10-29 13:22:51,180 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
2020-10-29 13:22:51,183 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Could not resolve ResourceManager address akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*.
2020-10-29 13:23:01,203 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
-- Liangde Chen
apache-flink
kubernetes

1 Answer

10/29/2020

If your problem is that the TM cannot find the JM, it is possible that you have to use a Statefulset type of pod. So you can configure their hostname static. Here is my example from https://github.com/felipegutierrez/explore-flink/tree/master/k8s

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: flink-taskmanager
  namespace: kafka
spec:
  replicas: 3
  serviceName: flink-taskmanager
  selector:
    matchLabels:
      app: flink-taskmanager # has to match .spec.template.metadata.labels
  template:
    metadata:
      labels:
        app: flink-taskmanager # has to match .spec.selector.matchLabels
    spec:
      hostname: flink-taskmanager
      volumes:
      - name: flink-config-volume
        configMap:
          name: flink-config
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-console.properties
            path: log4j-console.properties
      - name: tpch-dbgen-data
        persistentVolumeClaim:
          claimName: tpch-dbgen-data-pvc
      - name: tpch-dbgen-datarate
        persistentVolumeClaim:
          claimName: tpch-dbgen-datarate-pvc
      containers:
      - name: taskmanager
        image: felipeogutierrez/explore-flink:1.11.2-scala_2.12
        imagePullPolicy: Always # Always/IfNotPresent
        env:
        args: ["taskmanager"]
        ports:
        - containerPort: 6122
          name: rpc
        - containerPort: 6125
          name: query-state
        - containerPort: 9250
        livenessProbe:
          tcpSocket:
            port: 6122
          initialDelaySeconds: 30
          periodSeconds: 60
        volumeMounts:
        - name: flink-config-volume
          mountPath: /opt/flink/conf/
        - name: tpch-dbgen-data
          mountPath: /opt/tpch-dbgen/data
          subPath: data
        - mountPath: /tmp
          name: tpch-dbgen-datarate
          subPath: tmp
        securityContext:
          runAsUser: 9999  # 
-- Knoblauch
Source: StackOverflow