Has anyone managed to use a TPU Pod (e.g. 32 v2 TPU cores) in a Polyaxon experiment on GKE? I run into a lot of problems when trying to do that. I'm using PyTorch.
For example, you need to know the TPU Pod name to start training on TPUs (https://github.com/pytorch/xla#start-distributed-training), but you don't know it before running the experiment because the TPU is created only after the experiment is scheduled.
But even if I find out the name somehow after the experiment starts (which I did), I can't connect to the TPU Pod (ssh permission error) because the user in the experiment container doesn't have permission to connect to it ...
When I start a TPU Pod manually (without Polyaxon), I can connect to it (ssh succeeds) because I have the correct user, my GCP user (the ssh call is made from an instance in the instance group created for this purpose). I followed these instructions for manually starting a TPU Pod: https://cloud.google.com/tpu/docs/tutorials/pytorch-pod#create-instance-template
EDIT
This is the experiment configuration I used to start the TPU Pod in Polyaxon:
version: 1
kind: experiment
environment:
  resources:
    tpu:
      limits: 32
      requests: 32
  node_selector:
    polyaxon: tpu
build:
  image: gcr.io/tpu-pytorch/xla:r1.5
run:
  cmd:
    - TPU_NAME=FETCHING_TPU_NAME_COMMAND
    - python -m torch_xla.distributed.xla_dist
      --tpu=$TPU_NAME
      --conda-env=pytorch
      --env XLA_USE_BF16=1
      --
      python /pytorch/xla/test/test_train_mp_imagenet.py
      --fake_data
Also, on the Polyaxon settings page, under Hardware Accelerators, I have set K8S:TPU_TF_VERSION to pytorch-1.5. I'm using Polyaxon 0.6.0.
FETCHING_TPU_NAME_COMMAND is just a call to a Python k8s API that fetches the name of the TPU Pod, e.g. via https://github.com/polyaxon/polyaxon-k8s/blob/master/polyaxon_k8s/manager.py (it would be nice if the name could be fetched more easily, because it is needed to start the PyTorch code that uses the TPU Pod).
In practice, FETCHING_TPU_NAME_COMMAND stores the fetched TPU Pod name in a file and the name is then read from that file; I wrote it as above to make it clearer what I'm doing.
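To make the "fetch the name, store it in a file, read it back" step concrete, here is a minimal Python sketch. It is not my actual command. The helper names, the output file, and in particular the annotation prefix are assumptions for illustration: I am assuming the TPU name shows up as a pod annotation whose key starts with that prefix, which should be checked with kubectl describe pod before relying on it.

```python
from pathlib import Path
from typing import Mapping, Optional

# ASSUMPTION: the TPU name is exposed as a pod annotation whose key starts
# with this prefix. Verify on your own cluster; this is not confirmed here.
ASSUMED_TPU_NAME_PREFIX = "name.cloud-tpus.google.com/"


def tpu_name_from_annotations(
    annotations: Mapping[str, str],
    prefix: str = ASSUMED_TPU_NAME_PREFIX,
) -> Optional[str]:
    """Return the first non-empty annotation value whose key starts with prefix."""
    for key in sorted(annotations):
        value = annotations[key]
        if key.startswith(prefix) and value:
            return value
    return None


def store_tpu_name(
    annotations: Mapping[str, str],
    out_file: Path,
    prefix: str = ASSUMED_TPU_NAME_PREFIX,
) -> str:
    """Write the TPU name to out_file so a later shell step can read it."""
    name = tpu_name_from_annotations(annotations, prefix)
    if name is None:
        raise LookupError(f"no annotation starting with {prefix!r} found")
    out_file.write_text(name + "\n")
    return name


def read_tpu_name(in_file: Path) -> str:
    """Read the TPU name back, dropping the trailing newline."""
    return in_file.read_text().strip()
```

With the official kubernetes Python client, the annotations mapping could come from CoreV1Api().read_namespaced_pod(name, namespace).metadata.annotations; the run command would then use something like TPU_NAME=$(cat /tmp/tpu_name), where /tmp/tpu_name is again just a placeholder path.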
Running python -m torch_xla.distributed.xla_dist ... causes the ssh permission error, because the user in the container doesn't have permission to connect to the TPU Pod.
So I wanted to create a user in the container that has permission to connect to the TPU. I tried two things, and both failed:
First, I tried to create a user with my GCP uid and gid inside the container during the build, but that build caused all Polyaxon pods to get stuck, and it then takes some time for GKE to restart those pods ...
Then I tried to set the uid and gid in the Polyaxon deployment configuration (described here: https://legacy-docs.polyaxon.com/configuration/security-context/#enable-security-context), but the result was the same: now Polyaxon itself tried to create a user with the given uid and gid (this can be seen in the build log) and got stuck, again causing all Polyaxon pods to get stuck ...
I created an instance group with 4 instances because of the 32 TPU cores (as the instructions say here: https://cloud.google.com/tpu/docs/tutorials/pytorch-pod#create-instance-template).
EDIT 2
I tried again to manually start a pod that uses TPUs, this time creating a new user, because the root user (the default in the container) was causing the ssh permission denied error.
Here is the job .yaml config:
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-tpu-train-mnist
spec:
  template:
    metadata:
      annotations:
        # The runtime version that the TPU will run with.
        # Note: It's called "tf-version" for historical reasons.
        tf-version.cloud-tpus.google.com: "pytorch-1.5"
    spec:
      restartPolicy: Never
      volumes:
      # Increase size of tmpfs /dev/shm to avoid OOM.
      - name: dshm
        emptyDir:
          medium: Memory
      containers:
      - name: mnist-pytorch-tpu
        # This is the image we publish nightly with our package pre-installed.
        image: gcr.io/tpu-pytorch/xla:r1.5
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        command: [
          'bash', '-c',
          'USERNAME=antonio_juric && USER_UID=5001 && USER_GID=$USER_UID && groupadd --gid $USER_GID $USERNAME && useradd -s /bin/bash --uid $USER_UID --gid $USER_GID -m $USERNAME && apt-get update && apt-get install -y sudo && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME && chmod 0440 /etc/sudoers.d/$USERNAME && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/* && echo PATH="$PATH" >> /etc/environment && su - $USERNAME -c "sudo chown $USERNAME /root && source /etc/environment && python -m torch_xla.distributed.xla_dist --tpu=tpu-test-manual --conda-env=pytorch --env XLA_USE_BF16=1 -- python /pytorch/xla/test/test_train_mp_imagenet.py --fake_data"'
        ]
        resources:
          requests:
            memory: 30Gi
            cpu: 10
      nodeSelector:
        polyaxon: tpu
The nodeSelector polyaxon: tpu is the label of the instance group used for this purpose (the size of that instance group is fixed to 4 because of the 32 TPU cores, no autoscaling). No TPUs were requested in the resources section because I created the TPU VM manually so that I would know the name of the TPU Pod (tpu-test-manual); if TPUs were requested in the resources section, that would trigger creation of a TPU VM with a random name ... In the container command, I first create a new user named antonio_juric. It turned out that the uid and gid don't need to be those of your GCP account, so I used 5001. Then I switch from root to the new user and run the command. /etc/environment is sourced so that the right python is called (the one from the Anaconda package, which has PyTorch XLA installed ...).
This manages to ssh to the TPU Pod, so there is no ssh permission error anymore. But the new problem is that executing bash scripts on the instances in the instance group fails, e.g.:
2020-06-17 10:53:57 10.128.0.54 [2] bash: /tmp/326-remote/dist_training_ptxla_2.sh: Permission denied
I see that this file is created by https://github.com/pytorch/xla/blob/master/torch_xla/distributed/xla_dist.py. I'm not sure why it now fails with permission denied. If the process creates that file (I can see it is created when I ssh manually into a VM from the instance group), why wouldn't it have permission to run it ...?
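To narrow this down, a small diagnostic like the one below could be run on one of the instances against the generated script. It is only a sketch for inspecting the usual suspects behind "Permission denied" on a script (no execute bit, a file owned by a different user, a parent directory that cannot be traversed); it does not fix anything, and the function name exec_report is made up for this post.

```python
import os
import stat
from pathlib import Path


def exec_report(script_path: str) -> dict:
    """Collect facts about why the current user may not be able to run a script."""
    path = Path(script_path).resolve()
    st = path.stat()
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return {
        "path": str(path),
        # Symbolic mode string, e.g. "-rw-r--r--" for a file without execute bits.
        "mode": stat.filemode(st.st_mode),
        "owner_uid": st.st_uid,
        "owner_gid": st.st_gid,
        "current_uid": os.getuid(),
        # True if at least one of the user/group/other execute bits is set.
        "any_exec_bit": bool(st.st_mode & exec_bits),
        # What the kernel reports for the user running this check.
        "executable_by_me": os.access(path, os.X_OK),
        # Parent directories the current user cannot traverse.
        "blocked_dirs": [str(p) for p in path.parents if not os.access(p, os.X_OK)],
    }


if __name__ == "__main__":
    import sys

    for key, value in exec_report(sys.argv[1]).items():
        print(f"{key}: {value}")
```

If owner_uid differs from current_uid, or the mode shows no x for the current user, that would at least explain the message. A noexec mount on /tmp would not show up in the mode bits, so checking the mount options separately may also be worthwhile.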