Reading logs of autoscaled Ray worker nodes

6/17/2021

We're running Ray tasks on Kubernetes with autoscaling. From time to time, a worker dies, and we get the following warning:

WARNING worker.py:1114 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: 874253cc82fde8d4ffffffffffffffffffffffff02000000 Worker ID: 98dcd78c9bebbc2176a14fc230b40afdd574dea608dd573aebeae00e Node ID: d4a7b594aec2970b02dde480f2eb2c6070b7886b333d8f1cb1087758 Worker IP address: 10.40.4.2 Worker port: 10010 Worker PID: 307

We're using a config.yaml similar to the autoscaler/kubernetes/default.yaml.

The problem we're facing is that the nodes where the workers run are autoscaled, so a node is often scaled away before we have time to read its logs (in /tmp/ray).

Is there any way to persist the logs of the autoscaled (worker) nodes?

I tried setting the temp dir to point to a filestore that's shared between the workers, by calling ray.init(_temp_dir="/our_mounted_filestore/persistent_ray_logs"), but this does not seem to work.

I also tried adding --temp-dir=<path> in the worker_start_ray_commands, but this just gives the following error:

Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1808, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 580, in start
    node = ray.node.Node(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/node.py", line 166, in __init__
    self._init_temp(redis_client)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/node.py", line 278, in _init_temp
    try_to_create_directory(self._temp_dir)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/utils.py", line 797, in try_to_create_directory
    os.makedirs(directory_path, exist_ok=True)
  File "/home/ray/anaconda3/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/ray/anaconda3/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/filestore/ray_temp'
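
For anyone trying to narrow down the PermissionError, here is a minimal diagnostic sketch that could be run inside the worker container as the same user that starts Ray. It is not part of Ray; the helper names first_existing_ancestor and explain_mkdir_permissions are hypothetical, and it only assumes a Unix-like system. It finds the closest existing ancestor of the target directory (the place where os.makedirs has to start creating directories) and reports its owner, its mode, and whether the current process may create a subdirectory there.

```python
import os
from pathlib import Path


def first_existing_ancestor(path):
    """Return the closest ancestor of `path` (or `path` itself) that exists."""
    candidate = Path(path).absolute()
    # The filesystem root always exists, so this loop terminates.
    while not candidate.exists():
        candidate = candidate.parent
    return candidate


def explain_mkdir_permissions(path):
    """Summarize whether this process could create `path` via os.makedirs."""
    ancestor = first_existing_ancestor(path)
    stat = os.stat(ancestor)
    return {
        "target": str(path),
        "existing_ancestor": str(ancestor),
        "ancestor_owner_uid": stat.st_uid,
        "ancestor_mode": oct(stat.st_mode & 0o777),
        "process_uid": os.getuid(),
        # Creating a subdirectory needs write and execute permission
        # on the directory it is created in.
        "can_create": os.access(ancestor, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Path taken from the error message above.
    report = explain_mkdir_permissions("/filestore/ray_temp")
    for key, value in report.items():
        print(f"{key}: {value}")
```

If can_create comes back False and process_uid differs from ancestor_owner_uid, the mount is probably owned by a different user than the one running ray start, which would be consistent with the error above. That is only a guess about this setup, not something the traceback confirms.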
-- simen-andresen
kubernetes
ray
