Running tasks in pods using jobs - Using jobs and daemon sets | Nodes

Understanding jobs and cron jobs
Creating jobs
Creating cron jobs

Jobs execute one-time or scheduled tasks in your cluster, tracking completion status and managing pod lifecycles for batch workloads, parallel processing, and recurring operations.

A job executes a task in your OKD cluster.

A job tracks the overall progress of a task and updates its status with information about active, succeeded, and failed pods. Deleting a job will clean up any pod replicas it created. Jobs are part of the Kubernetes API, which can be managed with oc commands like other object types.

Sample Job specification

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 1
  completions: 1
  activeDeadlineSeconds: 1800
  backoffLimit: 6
  ttlSecondsAfterFinished: 100
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: OnFailure
#...

where:

spec.parallelism: Specifies the pod replicas a job should run in parallel.
spec.completions: Specifies the successful pod completions needed to mark a job completed.
spec.activeDeadlineSeconds: Specifies the maximum duration the job can run.
spec.backoffLimit: Specifies the number of retries for a job.
spec.ttlSecondsAfterFinished: Specifies the period of time in seconds after which the job should be automatically deleted upon completion.
spec.template: Specifies the template for the pod the controller creates.
spec.template.spec.restartPolicy: Specifies the restart policy of the pod.

Additional resources

Jobs (Kubernetes documentation)

Understanding jobs and cron jobs

Jobs run one-time tasks and track completion status, while cron jobs schedule recurring tasks using cron expressions, supporting non-parallel, parallel with fixed completion count, and work queue execution patterns.

A job tracks the overall progress of a task and updates its status with information about active, succeeded, and failed pods. Deleting a job cleans up any pods it created. Jobs are part of the Kubernetes API, which can be managed with oc commands like other object types.

There are two possible resource types that allow creating run-once objects in OKD:

Job

A regular job is a run-once object that creates a task and ensures the job finishes.

There are three main types of task suitable to run as a job:

Non-parallel jobs:
- A job that starts only one pod, unless the pod fails.
- The job is complete as soon as its pod terminates successfully.
Parallel jobs with a fixed completion count:
- a job that starts multiple pods.
- The job represents the overall task and is complete when there is one successful pod for each value in the range 1 to the completions value.
Parallel jobs with a work queue:
- A job with multiple parallel worker processes in a given pod.
- OKD coordinates pods to determine what each should work on or use an external queue service.
- Each pod is independently capable of determining whether or not all peer pods are complete and that the entire job is done.
- When any pod from the job terminates with success, no new pods are created.
- When at least one pod has terminated with success and all pods are terminated, the job is successfully completed.
- When any pod has exited with success, no other pod should be doing any work for this task or writing any output. Pods should all be in the process of exiting.
  
  For more information about how to make use of the different types of job, see Job Patterns in the Kubernetes documentation.

Cron job

A job can be scheduled to run multiple times, using a cron job.

A cron job builds on a regular job by allowing you to specify how the job should be run. Cron jobs are part of the Kubernetes API, which can be managed with oc commands like other object types.

Cron jobs are useful for creating periodic and recurring tasks, like running backups or sending emails. Cron jobs can also schedule individual tasks for a specific time, such as if you want to schedule a job for a low activity period. A cron job creates a Job object based on the timezone configured on the control plane node that runs the cronjob controller.

A cron job creates a Job object approximately once per execution time of its schedule, but there are circumstances in which it fails to create a job or two jobs might be created. Therefore, jobs must be idempotent and you must configure history limits.

Understanding how to create jobs

Both resource types require a job configuration that consists of the following key parts:

A pod template, which describes the pod that OKD creates.
The parallelism parameter, which specifies how many pods running in parallel at any point in time should execute a job.
- For non-parallel jobs, leave unset. When unset, defaults to 1.
The completions parameter, specifying how many successful pod completions are needed to finish a job.
- For non-parallel jobs, leave unset. When unset, defaults to 1.
- For parallel jobs with a fixed completion count, specify a value.
- For parallel jobs with a work queue, leave unset. When unset defaults to the parallelism value.

Understanding how to set a maximum duration for jobs

When defining a job, you can define its maximum duration by setting the activeDeadlineSeconds field. It is specified in seconds and is not set by default. When not set, there is no maximum duration enforced.

The maximum duration is counted from the time when a first pod gets scheduled in the system, and defines how long a job can be active. It tracks overall time of an execution. After reaching the specified timeout, the job is terminated by OKD.

Understanding how to set a job back off policy for pod failure

A job can be considered failed, after a set amount of retries due to a logical error in configuration or other similar reasons. Failed pods associated with the job are recreated by the controller with an exponential back off delay (10s, 20s, 40s …) capped at six minutes. The limit is reset if no new failed pods appear between controller checks.

Use the spec.backoffLimit parameter to set the number of retries for a job.

Understanding how to configure a cron job to remove artifacts

Cron jobs can leave behind artifact resources such as jobs or pods. As a user it is important to configure history limits so that old jobs and their pods are properly cleaned. There are two fields within cron job’s spec responsible for that:

.spec.successfulJobsHistoryLimit. The number of successful finished jobs to retain (defaults to 3).
.spec.failedJobsHistoryLimit. The number of failed finished jobs to retain (defaults to 1).

Delete cron jobs that you no longer need:
```
$ oc delete cronjob/<cron_job_name>
```
Doing this prevents them from generating unnecessary artifacts.
You can suspend further executions by setting the spec.suspend to true. All subsequent executions are suspended until you reset to false.

Understanding how to automatically remove completed jobs

You can specify that OKD should delete jobs when they are finished, either Complete or Failed, in order to limit the number of objects in your cluster.

By default, for jobs that are not associated with a cron job or other controlling object, OKD does not delete the job or its related pods when the job is finished. If you keep these artifacts in your cluster, you can view the status of finished jobs or view job logs for errors, warnings, or other diagnostic output.

However, having too many of these objects in the cluster can lead to etcd issues, latency issues, performance degradation, and other problems.

You can configure a time-to-live (TTL) duration for jobs without a controlling object by setting the ttlSecondsAfterFinished parameter in the job spec. When set, OKD deletes the jobs and any related pods immediately after the job is finished or after a specified period of time in seconds.

Known limitations

The job specification restart policy only applies to the pods, and not the job controller. However, the job controller is hard-coded to keep retrying jobs to completion.

As such, restartPolicy: Never or --restart=Never results in the same behavior as restartPolicy: OnFailure or --restart=OnFailure. That is, when a job fails it is restarted automatically until it succeeds (or is manually discarded). The policy only sets which subsystem performs the restart.

With the Never policy, the job controller performs the restart. With each attempt, the job controller increments the number of failures in the job status and create new pods. This means that with each failed attempt, the number of pods increases.

With the OnFailure policy, kubelet performs the restart. Each attempt does not increment the number of failures in the job status. In addition, kubelet will retry failed jobs starting pods on the same nodes.

Creating jobs

Create a job to run a one-time task, configuring parallelism, completion count, time limits, retry policies, and automatic cleanup behavior.

You create a job in OKD by creating a job object.

You can also create and launch a job from a single command using oc create job. The following command creates and launches a job similar to the one specified in the following example:

$ oc create job pi --image=perl -- perl -Mbignum=bpi -wle 'print bpi(2000)'

Procedure

Create a YAML file similar to the following:
```
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 1
  completions: 1
  activeDeadlineSeconds: 1800
  backoffLimit: 6
  ttlSecondsAfterFinished: 100
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: OnFailure
#...
```
where:

spec.parallelism

Specifies how many pod replicas a job should run in parallel; defaults to 1. For non-parallel jobs, leave unset. When unset, defaults to 1. This value is optional.

spec.completions

Specifies how many successful pod completions are needed to mark a job completed. For non-parallel jobs, leave unset. When unset, defaults to 1. For parallel jobs with a fixed completion count, specify the number of completions. For parallel jobs with a work queue, leave unset. When unset defaults to the parallelism value. This value is optional.

spec.activeDeadlineSeconds

Specifies the maximum duration the job can run. This value is optional.

spec.backoffLimit

Specifies the number of retries for a job. This field defaults to six. This value is optional.

spec.ttlSecondsAfterFinished

Specifies the period of time in seconds after which the job should be automatically deleted upon completion. If set to '0', the job is immediately deleted after completion. If this field is not included, the job is not automatically deleted. This value is optional.

spec.template

Specifies the template for the pod the controller creates.

spec.template.spec.restartPolicy

Specifies the restart policy of the pod: Never (Do not restart the job), OnFailure (Restart the job only if it fails), or Always (Always restart the job). For details on how OKD uses restart policy with failed containers, see the Example States in the Kubernetes documentation.
Create the job:
```
$ oc create -f <file-name>.yaml
```

Creating cron jobs

Create a cron job to schedule recurring tasks using cron expressions, configuring time zone, concurrency policies, history limits, and suspension behavior.

You create a cron job in OKD by creating a job object.

You can also create and launch a cron job from a single command using oc create cronjob. The following command creates and launches a cron job similar to the one specified in the following example:

$ oc create cronjob pi --image=perl --schedule='*/1 * * * *' -- perl -Mbignum=bpi -wle 'print bpi(2000)'

With oc create cronjob, the --schedule option accepts schedules in cron format.

Procedure

Create a YAML file similar to the following:
```
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pi
spec:
  schedule: "*/1 * * * *"
  timeZone: Etc/UTC
  concurrencyPolicy: "Replace"
  startingDeadlineSeconds: 200
  suspend: true
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            parent: "cronjobpi"
        spec:
          containers:
          - name: pi
            image: perl
            command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
          restartPolicy: OnFailure
#...
```
where:

spec.schedule

Specifies the schedule for the job in cron format. In this example, the job will run every minute.

spec.timeZone

Specifies a time zone for the schedule. See List of tz database time zones for valid options. If not specified, the Kubernetes controller manager interprets the schedule relative to its local time zone. This value is optional.

spec.concurrencyPolicy

Specifies how to treat concurrent jobs within a cron job. Only one of the following concurrent policies may be specified. If not specified, this defaults to allowing concurrent executions: Allow (allows cron jobs to run concurrently), Forbid (forbids concurrent runs, skipping the next run if the previous has not finished yet), or Replace (cancels the currently running job and replaces it with a new one). This value is optional.

spec.startingDeadlineSeconds

Specifies a deadline (in seconds) for starting the job if it misses its scheduled time for any reason. Missed jobs executions will be counted as failed ones. If not specified, there is no deadline. This value is optional.

spec.suspend

Specifies a flag allowing the suspension of a cron job. If set to true, all subsequent executions will be suspended. This value is optional.

spec.successfulJobsHistoryLimit

Specifies the number of successful finished jobs to retain (defaults to 3).

spec.failedJobsHistoryLimit

Specifies the number of failed finished jobs to retain (defaults to 1).

spec.jobTemplate

Specifies the job template. This is similar to the job example.

spec.jobTemplate.spec.template.metadata.labels

Specifies labels for jobs spawned by this cron job.

spec.jobTemplate.spec.template.spec.restartPolicy

Specifies the restart policy of the pod. This does not apply to the job controller.
Create the cron job:
```
$ oc create -f <file-name>.yaml
```