Problem with running Apache Beam jobs on remote Flink cluster on Kubernetes

7/13/2020

I have a Flink SessionCluster deployed on a remote Kubernetes cluster (as per the docs), available at <FLINK_MASTER_URL>:8081, and I'm trying to run Apache Beam's example wordcount job on it.

However, I get an error every time: it looks like the job cannot be submitted for execution. The error logs and Beam's pipeline options are below. I'd be grateful for any tips on how to solve this issue (I'm not an experienced Flink/Beam user, so please forgive any obvious mistakes).

Pipeline options:

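from apache_beam.options.pipeline_options import PipelineOptions

# PipelineOptions takes the command-line flags as a single list
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=<FLINK_MASTER_URL>:8081",
])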

Error logs (truncated):

WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
INFO:root:Using Python SDK docker image: apache/beam_python3.7_sdk:2.21.0. If the image is not available at local, we will try to pull from hub.docker.com
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function lift_combiners at 0x7f954007e710> ====================
INFO:apache_beam.runners.portability.flink_runner:Adding HTTP protocol scheme to flink_master parameter: http://<FLINK_MASTER_URL>:8081
INFO:apache_beam.utils.subprocess_server:Using cached job server jar from https://repo.maven.apache.org/maven2/org/apache/beam/beam-runners-flink-1.10-job-server/2.21.0/beam-runners-flink-1.10-job-server-2.21.0.jar
INFO:apache_beam.utils.subprocess_server:Starting service with ['java' '-jar' '/home/rjurczak/.apache_beam/cache/jars/beam-runners-flink-1.10-job-server-2.21.0.jar' '--flink-master' 'http://<FLINK_MASTER_URL>:8081' '--artifacts-dir' '/tmp/beam-tempht7lpipz/artifactsotk2otzl' '--job-port' '48375' '--artifact-port' '0' '--expansion-port' '0']
INFO:apache_beam.utils.subprocess_server:b'[main] INFO org.apache.beam.runners.fnexecution.jobsubmission.JobServerDriver - LegacyArtifactStagingService started on localhost:37645'
INFO:apache_beam.utils.subprocess_server:b'[main] INFO org.apache.beam.runners.fnexecution.jobsubmission.JobServerDriver - Java ExpansionService started on localhost:35547'
INFO:apache_beam.utils.subprocess_server:b'[main] INFO org.apache.beam.runners.fnexecution.jobsubmission.JobServerDriver - JobService started on localhost:48375'
INFO:apache_beam.utils.subprocess_server:b'[grpc-default-executor-0] INFO org.apache.beam.runners.flink.FlinkJobInvoker - Invoking job BeamApp-rjurczak-0713164027-2a729669_d00db59c-cda9-46be-9bd8-1b8406d155a5 with pipeline runner org.apache.beam.runners.flink.FlinkPipelineRunner@25fa0e2d'
INFO:apache_beam.utils.subprocess_server:b'[grpc-default-executor-0] INFO org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation - Starting job invocation BeamApp-rjurczak-0713164027-2a729669_d00db59c-cda9-46be-9bd8-1b8406d155a5'
INFO:apache_beam.runners.portability.portable_runner:Job state changed to STOPPED
INFO:apache_beam.runners.portability.portable_runner:Job state changed to STARTING
INFO:apache_beam.runners.portability.portable_runner:Job state changed to RUNNING
INFO:apache_beam.utils.subprocess_server:b'[flink-runner-job-invoker] INFO org.apache.beam.runners.flink.FlinkPipelineRunner - Translating pipeline to Flink program.'
INFO:apache_beam.utils.subprocess_server:b'[flink-runner-job-invoker] INFO org.apache.beam.runners.flink.FlinkExecutionEnvironments - Creating a Batch Execution Environment.'
INFO:apache_beam.utils.subprocess_server:b'[flink-runner-job-invoker] INFO org.apache.beam.runners.flink.FlinkExecutionEnvironments - Using Flink Master URL 10.70.227.141:8081.'
INFO:apache_beam.utils.subprocess_server:b'[flink-runner-job-invoker] INFO org.apache.flink.api.java.ExecutionEnvironment - The job has 0 registered types and 0 default Kryo serializers'
INFO:apache_beam.utils.subprocess_server:b'[Flink-RestClusterClient-IO-thread-4] WARN org.apache.flink.util.ExecutorUtils - ExecutorService did not terminate in time. Shutting it down now.'
INFO:apache_beam.utils.subprocess_server:b'[flink-runner-job-invoker] ERROR org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation - Error during job invocation BeamApp-rjurczak-0713164027-2a729669_d00db59c-cda9-46be-9bd8-1b8406d155a5.'
INFO:apache_beam.utils.subprocess_server:b'java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:952)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:860)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.beam.runners.flink.FlinkBatchPortablePipelineTranslator$BatchTranslationContext.execute(FlinkBatchPortablePipelineTranslator.java:194)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.beam.runners.flink.FlinkPipelineRunner.runPipelineWithTranslator(FlinkPipelineRunner.java:116)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.beam.runners.flink.FlinkPipelineRunner.run(FlinkPipelineRunner.java:83)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation.runPipeline(JobInvocation.java:83)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.lang.Thread.run(Thread.java:834)'
INFO:apache_beam.utils.subprocess_server:b'Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:947)'
INFO:apache_beam.utils.subprocess_server:b'\t... 11 more'
INFO:apache_beam.utils.subprocess_server:b'Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:359)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:274)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1085)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)'
INFO:apache_beam.utils.subprocess_server:b'\t... 3 more'
INFO:apache_beam.utils.subprocess_server:b'Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Failed to deserialize JobGraph.]'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:390)'
INFO:apache_beam.utils.subprocess_server:b'\tat org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:374)'
INFO:apache_beam.utils.subprocess_server:b'\tat java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)'
INFO:apache_beam.utils.subprocess_server:b'\t... 4 more'
ERROR:root:org.apache.flink.runtime.rest.util.RestClientException: [Failed to deserialize JobGraph.]
INFO:apache_beam.utils.subprocess_server:b'[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.AbstractLegacyArtifactRetrievalService - Manifest at /tmp/beam-tempht7lpipz/artifactsotk2otzl/job_364b1df0-7e66-4759-997f-91f87179932b/MANIFEST has 1 artifact locations'
INFO:apache_beam.runners.portability.portable_runner:Job state changed to FAILED
Traceback (most recent call last):
  File "examples/wordcount.py", line 152, in <module>
    run()
  File "examples/wordcount.py", line 132, in run
    result.wait_until_finish()
  File "/home/rjurczak/envs/env/lib/python3.7/site-packages/apache_beam/runners/portability/portable_runner.py", line 550, in wait_until_finish
    (self._job_id, self._state, self._last_error_message()))
RuntimeError: Pipeline BeamApp-rjurczak-0713164027-2a729669_d00db59c-cda9-46be-9bd8-1b8406d155a5 failed in state FAILED: org.apache.flink.runtime.rest.util.RestClientException: [Failed to deserialize JobGraph.]
-- Albert Murphy
apache-beam
apache-flink
kubernetes

1 Answer

7/14/2020

This is a rather short answer, but it looks like the same error as here.

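Make sure that the client-side Flink version matches the version of your Flink master running on Kubernetes. In your log, Beam starts beam-runners-flink-1.10-job-server-2.21.0.jar, i.e. a job server built against Flink 1.10, so the session cluster should run Flink 1.10.x as well. When the two versions differ, the cluster generally cannot read the JobGraph the client sends, which matches the "Failed to deserialize JobGraph." error you are seeing.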
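A minimal stdlib-only sketch of how you could check this; all function names are hypothetical. It parses the Flink version out of the job server jar name shown in the log, asks the cluster which version it runs through the Flink REST API's `/config` endpoint (assumed here to return a JSON object with a `flink-version` field), and builds the Maven URL of the job server jar for the cluster's version, following the same pattern as the cached jar URL in the log.

```python
import json
import re
import urllib.request

# Job server jar names look like the one in the log:
# beam-runners-flink-1.10-job-server-2.21.0.jar
JAR_RE = re.compile(r"beam-runners-flink-(\d+\.\d+)-job-server-([\d.]+)\.jar")


def job_server_flink_version(jar_name):
    """Return the Flink major.minor version a Beam job server jar targets."""
    match = JAR_RE.search(jar_name)
    if match is None:
        raise ValueError(f"not a Beam Flink job server jar: {jar_name}")
    return match.group(1)


def major_minor(version):
    """Reduce a full Flink version such as '1.10.1' to '1.10'."""
    return ".".join(version.split(".")[:2])


def cluster_flink_version(master_url, timeout=10.0):
    """Ask the JobManager which Flink version it runs.

    Assumption: GET <master_url>/config returns JSON with a 'flink-version' key.
    """
    with urllib.request.urlopen(f"{master_url}/config", timeout=timeout) as resp:
        return json.load(resp)["flink-version"]


def versions_match(jar_name, cluster_version):
    """True if the job server jar and the cluster agree on Flink major.minor."""
    return job_server_flink_version(jar_name) == major_minor(cluster_version)


def job_server_jar_url(flink_version, beam_version):
    """Build the Maven Central URL of a job server jar for a Flink/Beam pair.

    Follows the URL pattern shown in the log; the jar only exists if this
    Beam release actually publishes a job server for that Flink version.
    """
    artifact = f"beam-runners-flink-{major_minor(flink_version)}-job-server"
    return (
        "https://repo.maven.apache.org/maven2/org/apache/beam/"
        f"{artifact}/{beam_version}/{artifact}-{beam_version}.jar"
    )
```

For example, you could pass the result of cluster_flink_version("http://<FLINK_MASTER_URL>:8081") together with "beam-runners-flink-1.10-job-server-2.21.0.jar" to versions_match. If they differ, either redeploy the session cluster with Flink 1.10.x, or point Beam at a job server built for the cluster's Flink version (if your Beam release publishes one), for instance through the --flink_version or --flink_job_server_jar pipeline options, assuming your Beam version supports them.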

-- Rico
Source: StackOverflow