I created a flink cluster in kubernetes following this guide: https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
The job manager was running. When a job was submitted to the job manager, it spawned an task manager pod, but the task manager failed to connect to job manager.
2020-10-29 13:22:51,069 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2020-10-29 13:22:51,176 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
2020-10-29 13:22:51,176 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
2020-10-29 13:22:51,180 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
2020-10-29 13:22:51,183 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*.
2020-10-29 13:23:01,203 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
If your problem is that the TM cannot find the JM, it is possible that you have to use a Statefulset type of pod. So you can configure their hostname static. Here is my example from https://github.com/felipegutierrez/explore-flink/tree/master/k8s
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: flink-taskmanager
namespace: kafka
spec:
replicas: 3
serviceName: flink-taskmanager
selector:
matchLabels:
app: flink-taskmanager # has to match .spec.template.metadata.labels
template:
metadata:
labels:
app: flink-taskmanager # has to match .spec.selector.matchLabels
spec:
hostname: flink-taskmanager
volumes:
- name: flink-config-volume
configMap:
name: flink-config
items:
- key: flink-conf.yaml
path: flink-conf.yaml
- key: log4j-console.properties
path: log4j-console.properties
- name: tpch-dbgen-data
persistentVolumeClaim:
claimName: tpch-dbgen-data-pvc
- name: tpch-dbgen-datarate
persistentVolumeClaim:
claimName: tpch-dbgen-datarate-pvc
containers:
- name: taskmanager
image: felipeogutierrez/explore-flink:1.11.2-scala_2.12
imagePullPolicy: Always # Always/IfNotPresent
env:
args: ["taskmanager"]
ports:
- containerPort: 6122
name: rpc
- containerPort: 6125
name: query-state
- containerPort: 9250
livenessProbe:
tcpSocket:
port: 6122
initialDelaySeconds: 30
periodSeconds: 60
volumeMounts:
- name: flink-config-volume
mountPath: /opt/flink/conf/
- name: tpch-dbgen-data
mountPath: /opt/tpch-dbgen/data
subPath: data
- mountPath: /tmp
name: tpch-dbgen-datarate
subPath: tmp
securityContext:
runAsUser: 9999 #