I ran a Job in Kubernetes overnight. When I checked it in the morning, it had failed. Normally I'd check the pod logs or the events to determine why, but the pod was deleted and there are no events.
kubectl describe job topics-etl --namespace dnc
Here is the describe output:
Name:           topics-etl
Namespace:      dnc
Selector:       controller-uid=391cb7e5-b5a0-11e9-a905-0697dd320292
Labels:         controller-uid=391cb7e5-b5a0-11e9-a905-0697dd320292
                job-name=topics-etl
Annotations:    kubectl.kubernetes.io/last-applied-configuration:
                  {"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"topics-etl","namespace":"dnc"},"spec":{"template":{"spec":{"con...
Parallelism:    1
Completions:    1
Start Time:     Fri, 02 Aug 2019 22:38:56 -0500
Pods Statuses:  0 Running / 0 Succeeded / 1 Failed
Pod Template:
  Labels:  controller-uid=391cb7e5-b5a0-11e9-a905-0697dd320292
           job-name=topics-etl
  Containers:
   docsund-etl:
    Image:      acarl005/docsund-topics-api:0.1.4
    Port:       <none>
    Host Port:  <none>
    Command:
      ./create-topic-data
    Requests:
      cpu:     1
      memory:  1Gi
    Environment:
      AWS_ACCESS_KEY_ID:      <set to the key 'access_key_id' in secret 'aws-secrets'>      Optional: false
      AWS_SECRET_ACCESS_KEY:  <set to the key 'secret_access_key' in secret 'aws-secrets'>  Optional: false
      AWS_S3_CSV_PATH:        <set to the key 's3_csv_path' in secret 'aws-secrets'>        Optional: false
    Mounts:
      /app/state from topics-volume (rw)
  Volumes:
   topics-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  topics-volume-claim
    ReadOnly:   false
Events:         <none>
Here is the Job config YAML. It has restartPolicy: OnFailure, but it never restarted. I also have no TTL set, so the pods should never get cleaned up.
apiVersion: batch/v1
kind: Job
metadata:
  name: topics-etl
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: docsund-etl
        image: acarl005/docsund-topics-api:0.1.6
        command: ["./create-topic-data"]
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-secrets
              key: access_key_id
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-secrets
              key: secret_access_key
        - name: AWS_S3_CSV_PATH
          valueFrom:
            secretKeyRef:
              name: aws-secrets
              key: s3_csv_path
        resources:
          requests:
            cpu: 1
            memory: 1Gi
        volumeMounts:
        - name: topics-volume
          mountPath: /app/state
      volumes:
      - name: topics-volume
        persistentVolumeClaim:
          claimName: topics-volume-claim
How can I debug this?
A TTL would clean up the Job itself and all of its child objects, but ttlSecondsAfterFinished is unset here, so the Job hasn't been cleaned up.
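For reference, this is roughly where the field would sit if you did want finished Jobs and their pods removed automatically; the 300-second value is just an illustration, and depending on your cluster version the field may require the TTLAfterFinished feature gate:

# Sketch only: ttlSecondsAfterFinished lives at the Job spec level
apiVersion: batch/v1
kind: Job
metadata:
  name: topics-etl
spec:
  ttlSecondsAfterFinished: 300   # illustrative: delete the Job and its pods 5 minutes after it finishes
  template:
    spec:
      restartPolicy: OnFailure
      # ...rest of the spec unchanged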
From the Job documentation:
Note: If your job has restartPolicy = "OnFailure", keep in mind that your container running the Job will be terminated once the job backoff limit has been reached. This can make debugging the Job's executable more difficult. We suggest setting restartPolicy = "Never" when debugging the Job or using a logging system to ensure output from failed Jobs is not lost inadvertently.
The Job spec you posted doesn't set backoffLimit, which defaults to 6, so it should try to run the underlying task up to 6 times.
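Following the note quoted above, one way to keep the failed pod and its logs around is to temporarily run the Job with restartPolicy: Never and a small backoffLimit. This is only a sketch of the relevant fields; the rest of the spec stays as you posted it:

apiVersion: batch/v1
kind: Job
metadata:
  name: topics-etl
spec:
  backoffLimit: 1              # fail fast instead of retrying up to 6 times
  template:
    spec:
      restartPolicy: Never     # failed pods are kept, so their logs survive
      containers:
      - name: docsund-etl
        image: acarl005/docsund-topics-api:0.1.6
        command: ["./create-topic-data"]
        # ...env, resources, volumeMounts as in your spec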
If the container process exits with a non-zero status, the pod counts as failed, and that can be entirely silent in the logs.
The spec doesn't set activeDeadlineSeconds either, so there is no Job-level timeout in play; I assume this was a hard failure inside the container rather than a timeout.
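To actually debug it, re-run the Job with the settings sketched above and then inspect the pods it creates; something along these lines (the pod name is a placeholder you'd fill in from the first command):

# list the pods this Job created; with restartPolicy: Never the failed ones stay around
kubectl get pods --namespace dnc -l job-name=topics-etl

# read the logs of a failed pod, even after its container has exited
kubectl logs <pod-name> --namespace dnc

# check events, exit codes, and termination reasons for that pod
kubectl describe pod <pod-name> --namespace dnc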