Stackdriver Profiler fails to create profile because of a permission error when running Java on GKE

7/11/2019

We are trying to connect our app to Stackdriver Profiler, but it fails with what appears to be a permission issue.

We are running a Java app on GKE.

Below is the Dockerfile

FROM gcr.io/google-appengine/jetty

RUN mkdir -p /opt/cprof && \
  wget -q -O- https://storage.googleapis.com/cloud-profiler/java/latest/profiler_java_agent.tar.gz \
  | tar xzv -C /opt/cprof

RUN java \
    -agentpath:/opt/cprof/profiler_java_agent.so=-cprof_service=gke,-logtostderr,-minloglevel=0,-cprof_service_version=1.0.0,-cprof_gce_metadata_server_retry_sleep_sec=10,-cprof_gce_metadata_server_retry_count=12 \
    -jar "$JETTY_HOME/start.jar" --create-startd --add-to-start=gcloud,http2c --approve-all-licenses

ENV JETTY_ARGS -Djava.util.logging.config.file=WEB-INF/flex.logging.properties

ENV DBG_ENABLE true

ADD . $APP_DESTINATION_EXPLODED_WAR

ENV JAVA_USER_OPTS -XX:-OmitStackTraceInFastThrow -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics -Xloggc:/tmp/logs.gc

The cluster was created with the following command:

gcloud beta container clusters create $cluster_name --machine-type=n1-highmem-8 --project=$project_id --zone=us-central1-c --scopes="cloud-platform" --num-nodes=2

We followed the steps in profiling-java to configure the Dockerfile; however, the profiler fails with the following message:

Failed to create profile, will retry: 7 (The caller does not have permission)

The full deploy logs are below:

12:53:42 starting build "18a3a03c-6dc8-4683-8fae-92ec22aa84a8"
12:53:42 
12:53:42 FETCHSOURCE
12:53:42 Fetching storage object: gs://tradeos-test1_cloudbuild/source/1562838709.77-78c301e261714cd2a46391b235d5edc5.tgz#1562838738548166
12:53:42 Copying gs://tradeos-test1_cloudbuild/source/1562838709.77-78c301e261714cd2a46391b235d5edc5.tgz#1562838738548166...
12:53:42 / [0 files][    0.0 B/239.8 MiB]
- [0 files][ 56.0 MiB/239.8 MiB]
| [0 files][165.0 MiB/239.8 MiB]
/ [1 files][239.8 MiB/239.8 MiB]
12:53:42 Operation completed over 1 objects/239.8 MiB.                                    
12:53:42 BUILD
12:53:42 Already have image (with digest): gcr.io/cloud-builders/docker
12:53:42 Sending build context to Docker daemon  396.6MB

12:53:42 Step 1/8 : FROM gcr.io/google-appengine/jetty
12:53:42 latest: Pulling from google-appengine/jetty
12:53:42 Digest: sha256:7e37b8561b2f25660d1aa492dea4f09a6121fe7f8b7f6b2e9f8c65e1cf33328e
12:53:42 Status: Downloaded newer image for gcr.io/google-appengine/jetty:latest
12:53:42  ---> dac4353b3a0c
12:53:42 Step 2/8 : RUN mkdir -p /opt/cprof &&   wget -q -O- https://storage.googleapis.com/cloud-profiler/java/latest/profiler_java_agent.tar.gz   | tar xzv -C /opt/cprof
12:53:42  ---> Running in 3799f9af9c33
12:53:42 NOTICES
12:53:42 profiler_java_agent.so
12:53:42 Removing intermediate container 3799f9af9c33
12:53:42  ---> 8cd480829757
12:53:42 Step 3/8 : RUN java    -agentpath:/opt/cprof/profiler_java_agent.so=-cprof_service=gke,-logtostderr,-minloglevel=0,-cprof_service_version=1.0.0,-cprof_gce_metadata_server_retry_sleep_sec=10,-cprof_gce_metadata_server_retry_count=12    -jar "$JETTY_HOME/start.jar" --create-startd --add-to-start=gcloud,http2c --approve-all-licenses
12:53:42  ---> Running in bad2457cd65c
12:53:42 I0711 09:52:57.100730     7 entry.cc:268] Profiler agent loaded
12:53:42 I0711 09:52:57.105134     7 entry.cc:154] Prepare JVMTI
12:53:42 I0711 09:52:57.301863     7 entry.cc:108] On VM init
12:53:42 I0711 09:52:57.304172     7 cloud_env.cc:136] Project ID is not set via flag or environment, will get from the metadata server
12:53:42 I0711 09:52:57.304981     7 throttler_api.cc:269] Will use profiler service cloudprofiler.googleapis.com to create and upload profiles
12:53:42 I0711 09:52:57.315691    15 throttler_api.cc:202] Initialized deployment: project_id=tradeos-test1, service=gke, service_version=1.0.0, zone_name=us-central1-f
12:53:42 I0711 09:52:57.320428    15 throttler_api.cc:302] Creating a new profile via profiler service
12:53:42 W0711 09:52:57.539041    15 throttler_api.cc:382] Failed to create profile, will retry: 7 (The caller does not have permission)
12:53:42 INFO  : All Licenses Approved via Command Line Option
12:53:42 INFO  : gcloud          initialized in ${jetty.base}/start.d/gcloud.ini
12:53:42 INFO  : http2c          initialized in ${jetty.base}/start.d/http2c.ini
12:53:42 INFO  : Base directory was modified
12:53:42 I0711 09:52:58.506006     7 entry.cc:143] On VM death
12:53:42 I0711 09:53:25.504830    15 throttler_api.cc:302] Creating a new profile via profiler service
12:53:42 I0711 09:53:25.505025    15 worker.cc:177] Exiting the profiling loop
12:53:42 Removing intermediate container bad2457cd65c
12:53:42  ---> e9a969345ed1
12:53:42 Step 4/8 : ENV JETTY_ARGS -Djava.util.logging.config.file=WEB-INF/flex.logging.properties
12:53:42  ---> Running in d3e4b6456694
12:53:42 Removing intermediate container d3e4b6456694
12:53:42  ---> 52deb5efe19d
12:53:42 Step 5/8 : ENV DBG_ENABLE true
12:53:42  ---> Running in 6bd2b94ae0fc
12:53:42 Removing intermediate container 6bd2b94ae0fc
12:53:42  ---> 9800d1e2d074
12:53:42 Step 6/8 : ADD . $APP_DESTINATION_EXPLODED_WAR
12:53:42  ---> e1d7881f4558
12:53:42 Step 7/8 : ENV JAVA_USER_OPTS -XX:-OmitStackTraceInFastThrow -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics -Xloggc:/tmp/logs.gc
12:53:42  ---> Running in 81247ccde74f
12:53:42 Removing intermediate container 81247ccde74f
12:53:42  ---> d66dbce64869
12:53:42 Step 8/8 : ENV GCLOUD_PROJECT tradeos-test1
12:54:07  ---> Running in af00f02783b7
12:54:07 Removing intermediate container af00f02783b7
12:54:07  ---> 77826e33ef3c
12:54:07 Successfully built 77826e33ef3c
12:54:07 Successfully tagged gcr.io/tradeos-test1/ram-image:1
12:54:07 PUSH
12:54:07 Pushing gcr.io/tradeos-test1/ram-image:1
12:54:07 The push refers to repository [gcr.io/tradeos-test1/ram-image]

The service account we use to deploy has roles/cloudprofiler.agent, but it is still failing. Any idea which permission we are missing?

Update: The GKE nodes use the Compute Engine default service account. I added roles/cloudprofiler.agent to it, but I still get the same error.

-- montss
google-kubernetes-engine
java
stackdriver

1 Answer

7/12/2019

I had the same issue. In my case, this particular GKE service had its own service account with permissions to access BigQuery datasets. Its key was mounted as a secret at /var/secrets/google/key.json inside the container, and the GOOGLE_APPLICATION_CREDENTIALS environment variable pointed to that path so that the application's BigQuery library could find it at launch. The issue may be that the profiler agent also appears to check the path set in this variable when it starts and, if it finds a key there, uses it for all requests to cloudprofiler.googleapis.com. Some information on this can be found here - https://cloud.google.com/profiler/docs/profiling-external#using_service_accounts One solution may be to grant this service account the additional permissions it needs to access the Profiler service.
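To make the suggestion above concrete, here is a minimal sketch for finding out which service account a mounted key file belongs to and printing the command that would grant it the Profiler agent role. The function names and the project ID argument are hypothetical, and the sketch assumes the key file has the usual "client_email" field. It only prints the gcloud command so that you can review it before running it.

```shell
#!/bin/sh
# Print the service account e-mail stored in a JSON key file.
# Assumes the standard key layout with a "client_email" field.
key_file_account() {
  sed -n 's/.*"client_email"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Print (do not run) the gcloud command that would grant the
# Profiler agent role to the account found in the key file.
# $1 = project ID (hypothetical placeholder), $2 = path to the key file
print_grant_command() {
  account=$(key_file_account "$2")
  if [ -z "$account" ]; then
    echo "no client_email found in $2" >&2
    return 1
  fi
  echo "gcloud projects add-iam-policy-binding $1 --member=serviceAccount:$account --role=roles/cloudprofiler.agent"
}
```

Inside the container, this could be called as print_grant_command your-project-id "$GOOGLE_APPLICATION_CREDENTIALS", and the printed command then run from a machine with sufficient IAM rights.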

Also, if this is not your case, check the permissions of the default service account attached to your GKE nodes. More information on this is available here - https://cloud.google.com/kubernetes-engine/docs/how-to/hardening-your-cluster

Each GKE node has an IAM Service Account associated with it. By default, nodes are given the Compute Engine default service account, which you can find by navigating to the IAM section of the Cloud Console. 
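As a rough way to check the node service account's roles from the command line, the sketch below tests whether a list of roles contains roles/cloudprofiler.agent. The function name is hypothetical. The gcloud invocation in the comment is a commonly used pattern for listing one member's roles, and the variables project_id and sa_email are placeholders you would set yourself.

```shell
#!/bin/sh
# Succeeds if the role list on stdin (one role per line)
# contains the Profiler agent role, fails otherwise.
has_profiler_agent_role() {
  grep -qxF 'roles/cloudprofiler.agent'
}

# Possible usage (assumes project_id and sa_email are set by you):
#   gcloud projects get-iam-policy "$project_id" \
#     --flatten="bindings[].members" \
#     --filter="bindings.members:serviceAccount:$sa_email" \
#     --format="value(bindings.role)" | has_profiler_agent_role \
#     && echo "role present" || echo "role missing"
```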

Hope this can help someone.

-- BLiN
Source: StackOverflow