Problem setting up Ray using Google Cloud

10/28/2019

I am trying to set up a Ray cluster using Kubernetes, following https://ray.readthedocs.io/en/latest/autoscaling.html#kubernetes. Here are my steps:

  1. Create a Kubernetes cluster in Google Cloud Platform
  2. Connect to the cluster through Cloud Shell
  3. Run the following commands: sudo pip install -U ray, sudo pip install kubernetes
  4. Run ray up with the example config file

I was then asked whether to create a cluster and answered yes. It kept outputting "error from server (badrequest): pod ray-head-242dd does not have a host assigned".

I then tried the https://ray.readthedocs.io/en/latest/autoscaling.html#gcp approach. I changed the project name in example-full.yaml and ran ray up with that YAML file. Here is the output:

   WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
2019-10-28 17:06:58,254 WARNING __init__.py:44 -- file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
2019-10-28 17:06:58,258 INFO discovery.py:271 -- URL being requested: GET https://www.googleapis.com/discovery/v1/apis/cloudresourcemanager/v1/rest
2019-10-28 17:06:58,397 WARNING __init__.py:44 -- file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
2019-10-28 17:06:58,398 INFO discovery.py:271 -- URL being requested: GET https://www.googleapis.com/discovery/v1/apis/iam/v1/rest
2019-10-28 17:06:58,448 WARNING __init__.py:44 -- file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
2019-10-28 17:06:58,448 INFO discovery.py:271 -- URL being requested: GET https://www.googleapis.com/discovery/v1/apis/compute/v1/rest
2019-10-28 17:06:58,609 INFO discovery.py:867 -- URL being requested: GET https://cloudresourcemanager.googleapis.com/v1/projects/project?alt=json
2019-10-28 17:06:58,700 INFO discovery.py:867 -- URL being requested: GET https://iam.googleapis.com/v1/projects/project/serviceAccounts/ray-autoscaler-sa-v1@project.iam.gserviceaccount.com?alt=json
2019-10-28 17:06:58,764 INFO config.py:165 -- _configure_iam_role: Creating new service account ray-autoscaler-sa-v1
2019-10-28 17:06:58,772 INFO discovery.py:867 -- URL being requested: POST https://iam.googleapis.com/v1/projects/project/serviceAccounts?alt=json
2019-10-28 17:06:59,449 INFO discovery.py:867 -- URL being requested: POST https://cloudresourcemanager.googleapis.com/v1/projects/project:getIamPolicy?alt=json
2019-10-28 17:06:59,591 INFO discovery.py:867 -- URL being requested: POST https://cloudresourcemanager.googleapis.com/v1/projects/project:setIamPolicy?alt=json
2019-10-28 17:07:00,095 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project?alt=json
2019-10-28 17:07:00,319 INFO config.py:238 -- _configure_key_pair: Creating new key pair ray-autoscaler_gcp_us-west1_project_ubuntu
2019-10-28 17:07:00,409 INFO discovery.py:867 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/project/setCommonInstanceMetadata?alt=json
2019-10-28 17:07:01,025 INFO config.py:59 -- wait_for_compute_global_operation: Waiting for operation operation-1572296820417-595fee1766329-d528523f-5b1ebecc to finish...
2019-10-28 17:07:01,031 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/global/operations/operation-1572296820417-595fee1766329-d528523f-5b1ebecc?alt=json
2019-10-28 17:07:06,261 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/global/operations/operation-1572296820417-595fee1766329-d528523f-5b1ebecc?alt=json
2019-10-28 17:07:11,491 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/global/operations/operation-1572296820417-595fee1766329-d528523f-5b1ebecc?alt=json
2019-10-28 17:07:11,744 INFO config.py:70 -- wait_for_compute_global_operation: Operation done.
2019-10-28 17:07:11,745 INFO config.py:265 -- _configure_key_pair: Private key not specified in config, using/home/zh2408/.ssh/ray-autoscaler_gcp_us-west1_project_ubuntu.pem
2019-10-28 17:07:11,755 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/regions/us-west1/subnetworks?alt=json
2019-10-28 17:07:11,908 WARNING __init__.py:44 -- file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
2019-10-28 17:07:11,909 INFO discovery.py:271 -- URL being requested: GET https://www.googleapis.com/discovery/v1/apis/compute/v1/rest
2019-10-28 17:07:12,040 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28labels.ray-node-type+%3D+head%29%29+AND+%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
This will create a new cluster [y/N]: y
2019-10-28 17:07:17,457 INFO commands.py:201 -- get_or_create_head_node: Launching new head node...
2019-10-28 17:07:17,472 INFO discovery.py:867 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?alt=json
2019-10-28 17:07:19,474 INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1572296837479-595fee27abde7-e9b428db-4d0e22ec to finish...
2019-10-28 17:07:19,476 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296837479-595fee27abde7-e9b428db-4d0e22ec?alt=json
2019-10-28 17:07:24,717 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296837479-595fee27abde7-e9b428db-4d0e22ec?alt=json
2019-10-28 17:07:25,039 INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1572296837479-595fee27abde7-e9b428db-4d0e22ec finished.
2019-10-28 17:07:25,055 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28labels.ray-launch-config+%3D+07f3c1fd9b3e0be05984f720952adf2b99563d9d%29+AND+%28labels.ray-node-type+%3D+head%29+AND+%28labels.ray-node-name+%3D+ray-default-head%29%29+AND+%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
2019-10-28 17:07:25,802 INFO commands.py:214 -- get_or_create_head_node: Updating files on head node...
2019-10-28 17:07:25,806 INFO updater.py:356 -- NodeUpdater: ray-default-head-f3ed05cc: Updating to 2ae7e7f3db51902552832d843b3db964635184e5
2019-10-28 17:07:25,820 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
2019-10-28 17:07:26,030 INFO discovery.py:867 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances/ray-default-head-f3ed05cc/setLabels?alt=json
2019-10-28 17:07:26,766 INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1572296846037-595fee2fd53e7-f3e51edb-17229134 to finish...
2019-10-28 17:07:26,768 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296846037-595fee2fd53e7-f3e51edb-17229134?alt=json
2019-10-28 17:07:32,033 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296846037-595fee2fd53e7-f3e51edb-17229134?alt=json
2019-10-28 17:07:32,336 INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1572296846037-595fee2fd53e7-f3e51edb-17229134 finished.
2019-10-28 17:07:32,337 INFO updater.py:398 -- NodeUpdater: ray-default-head-f3ed05cc: Waiting for remote shell...
2019-10-28 17:07:32,337 INFO updater.py:210 -- NodeUpdater: ray-default-head-f3ed05cc: Waiting for IP...
2019-10-28 17:07:32,337 INFO log_timer.py:21 -- NodeUpdater: ray-default-head-f3ed05cc: Got IP [LogTimer=0ms]
2019-10-28 17:07:32,354 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:07:38,502 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:07:43,602 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:07:48,686 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:07:53,792 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:07:58,878 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:08:03,965 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:08:09,053 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:08:14,143 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
Warning: Permanently added '34.82.120.14' (ECDSA) to the list of known hosts.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
 21:08:15 up 0 min,  0 users,  load average: 1.10, 0.32, 0.11
2019-10-28 17:08:15,103 INFO log_timer.py:21 -- NodeUpdater: ray-default-head-f3ed05cc: Got remote shell [LogTimer=42766ms]
2019-10-28 17:08:15,129 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
2019-10-28 17:08:15,348 INFO discovery.py:867 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances/ray-default-head-f3ed05cc/setLabels?alt=json
2019-10-28 17:08:16,008 INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1572296895356-595fee5edde25-16887d46-c522d063 to finish...
2019-10-28 17:08:16,011 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296895356-595fee5edde25-16887d46-c522d063?alt=json
2019-10-28 17:08:21,313 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296895356-595fee5edde25-16887d46-c522d063?alt=json
2019-10-28 17:08:21,581 INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1572296895356-595fee5edde25-16887d46-c522d063 finished.
2019-10-28 17:08:21,582 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running mkdir -p ~ on 34.82.120.14...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-10-28 17:08:21,741 INFO updater.py:460 -- NodeUpdater: ray-default-head-f3ed05cc: Syncing /tmp/ray-bootstrap-5XD_Sh to ~/ray_bootstrap_config.yaml...
2019-10-28 17:08:21,755 INFO log_timer.py:21 -- NodeUpdater: ray-default-head-f3ed05cc: Synced /tmp/ray-bootstrap-5XD_Sh to ~/ray_bootstrap_config.yaml [LogTimer=174ms]
2019-10-28 17:08:21,756 INFO log_timer.py:21 -- NodeUpdater: ray-default-head-f3ed05cc: Applied config 2ae7e7f3db51902552832d843b3db964635184e5 [LogTimer=55949ms]
2019-10-28 17:08:21,756 ERROR updater.py:367 -- NodeUpdater: ray-default-head-f3ed05cc: Error updating [Errno 2] No such file or directory
2019-10-28 17:08:21,770 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
2019-10-28 17:08:22,006 INFO discovery.py:867 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances/ray-default-head-f3ed05cc/setLabels?alt=json
2019-10-28 17:08:22,649 INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1572296902019-595fee65389b8-c0cc26c3-1813a77e to finish...
2019-10-28 17:08:22,651 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296902019-595fee65389b8-c0cc26c3-1813a77e?alt=json
2019-10-28 17:08:27,936 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296902019-595fee65389b8-c0cc26c3-1813a77e?alt=json
2019-10-28 17:08:28,180 INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1572296902019-595fee65389b8-c0cc26c3-1813a77e finished.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/ray/autoscaler/updater.py", line 370, in run
    raise e
OSError: [Errno 2] No such file or directory
2019-10-28 17:08:28,214 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28labels.ray-launch-config+%3D+07f3c1fd9b3e0be05984f720952adf2b99563d9d%29+AND+%28labels.ray-node-type+%3D+head%29+AND+%28labels.ray-node-name+%3D+ray-default-head%29%29+AND+%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
2019-10-28 17:08:28,431 ERROR commands.py:277 -- get_or_create_head_node: Updating 34.82.120.14 failed

The only result I can see is that a Ray VM instance has been created. I have no idea what the errors mean or how to set up a Ray cluster on Google Cloud.

-- Zachary HUANG
kubernetes
ray

1 Answer

10/30/2019

The error message related to the host:

error from server (badrequest): pod ray-head-242dd does not have a host assigned

means that the pod hasn't been scheduled onto a node.

According to the documentation linked in your question, this Ray example is meant to run on a 2-vCPU machine (n1-standard-2):

The provided ray/python/ray/autoscaler/gcp/example-full.yaml cluster config file will create a small cluster with a n1-standard-2 head node

The Pod definition requests 1 vCPU. However, it needs a machine with more than 1 vCPU: other processes, pods, and resources run on the same node, so the node cannot allocate all of its CPU to this one pod.

You can try again after setting a different machine type for your node pool.
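
The reasoning above comes down to simple arithmetic, which can be sketched as follows. This is only an illustration: the function name fits and all the numbers are assumptions chosen for the example (a 1-vCPU node exposing roughly 940 millicores, a 2-vCPU node exposing roughly 1930, and system pods requesting 300 in total). The real values for your nodes are shown by kubectl describe node.

```python
# Illustrative numbers only; the real figures come from `kubectl describe node`.
def fits(allocatable_millicores, existing_requests_millicores, pod_request_millicores):
    """Return True if the pod's CPU request fits in what the node has left."""
    free = allocatable_millicores - sum(existing_requests_millicores)
    return pod_request_millicores <= free


# Assumed: system pods already running on the node request 300m in total.
system_pods = [100, 100, 100]

# Assumed: a 1-vCPU node exposes about 940m of allocatable CPU.
small_node = fits(940, system_pods, 1000)

# Assumed: a 2-vCPU node exposes about 1930m of allocatable CPU.
larger_node = fits(1930, system_pods, 1000)

print(small_node, larger_node)  # False True
```

With these assumed values, the 1-vCPU request never fits on the smaller node, so the pod stays without a host, while the larger node has room for it.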

On a side note, you can check why a pod is failing by issuing the following command:

$ kubectl describe pod {YOUR-RAY-POD-NAME}

That will hint at the cause of the problem, for example why scheduling was prevented.
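
If you would rather check this from a script, here is a minimal sketch that reads the same information from the pod's JSON. It assumes kubectl is installed and configured for the cluster. The helper names scheduling_problem and fetch_pod, the "default" namespace, and the sample message text are all hypothetical and not taken from the question. The sketch relies on the PodScheduled condition that Kubernetes reports under status.conditions.

```python
import json
import subprocess
from typing import Optional


def scheduling_problem(pod: dict) -> Optional[str]:
    """Return the scheduler's message if the pod is unscheduled, else None."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return cond.get("message") or cond.get("reason") or "unschedulable"
    return None


def fetch_pod(name: str, namespace: str = "default") -> dict:
    """Fetch a pod as JSON via kubectl (needs kubectl and cluster access)."""
    out = subprocess.check_output(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"]
    )
    return json.loads(out)


# Hand-written sample that mimics the status of a pending pod.
# The message text is illustrative, not output from the question.
sample = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/1 nodes are available: 1 Insufficient cpu.",
            }
        ],
    }
}

print(scheduling_problem(sample))
```

Against a real cluster you would call scheduling_problem(fetch_pod("ray-head-242dd")), passing the namespace your Ray pods actually run in.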

-- yyyyahir
Source: StackOverflow