If your organization uses Kubernetes for running Docker containers or other containerized applications, you’re probably aware that the native CronJob scheduler can be balky, uncooperative, and tough to fathom. So what can you do to find out what’s happening with your crons and ensure they’re running when they need to run?
Kubernetes itself acknowledges that “CronJobs can have limitations and uncertainties.” More specifically, as Lyft engineers discovered in adopting CronJobs to replace Unix cron, CronJobs can experience significant start delays because jobs need several events to occur before any application code begins to run. These events, combined with the scale of CronJobs in a multi-tenant environment, can bring major and unanticipated start delays, causing CronJobs to miss runs.
For example, at the outset, cronjobcontroller processes and decides to invoke the CronJob. In a subsequent event, cronjobcontroller creates a Job out of the CronJob’s Job spec. Then jobcontroller notices the newly created Job and creates a Pod. The Kubernetes admission controllers inject sidecar Container specs into the Pod spec. The kube-scheduler schedules the Pod onto a kubelet. Kubelet then runs the Pod (pulling all container images), starts all sidecar containers, and starts the application container.
Once a certain scale of CronJobs is reached, latency tends to kick in. Through Kubernetes 1.18, the cronjobcontroller simply lists all CronJobs every 10 seconds and runs some controller logic over each. The cronjobcontroller implementation does so synchronously, issuing one or more extra API calls for every CronJob. When the number of CronJobs reaches a certain level, these API calls begin to be rate-limited client-side, according to Kevin Yang, author of Lyft’s report.
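To get a feel for why a synchronous, rate-limited loop falls behind, here is a back-of-the-envelope sketch in Python. Only the 10-second sync period comes from the description above. The rate limit of 20 calls per second and the one-call-per-CronJob figure are assumptions picked for illustration, and the model ignores burst allowances and the time the controller logic itself takes.

```python
import math

SYNC_PERIOD_SECONDS = 10      # from the article: CronJobs are listed every 10 seconds
ASSUMED_CLIENT_QPS = 20       # hypothetical client-side rate limit, for illustration only
ASSUMED_CALLS_PER_CRONJOB = 1  # hypothetical; the article says "one or more"


def sync_pass_seconds(num_cronjobs, calls_per_cronjob, client_qps):
    """Rough time for one synchronous pass when every API call is rate-limited."""
    return num_cronjobs * calls_per_cronjob / client_qps


def max_cronjobs_per_period(period_seconds, calls_per_cronjob, client_qps):
    """Largest number of CronJobs one pass can cover within a single sync period."""
    return math.floor(period_seconds * client_qps / calls_per_cronjob)


if __name__ == "__main__":
    limit = max_cronjobs_per_period(
        SYNC_PERIOD_SECONDS, ASSUMED_CALLS_PER_CRONJOB, ASSUMED_CLIENT_QPS
    )
    print(f"One pass keeps up with about {limit} CronJobs")
    for count in (100, 1000, 5000):
        seconds = sync_pass_seconds(count, ASSUMED_CALLS_PER_CRONJOB, ASSUMED_CLIENT_QPS)
        print(f"{count} CronJobs -> roughly {seconds:.0f} s per pass")
```

Under these assumed numbers, a pass over 1,000 CronJobs takes several sync periods, so a CronJob near the end of the list is examined well after its scheduled time.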
Hotspots can occur at times of high demand for Kubernetes scheduler jobs, such as the top of the minute or the start of every hour, when lots of crons need to be invoked simultaneously. Why do these issues matter? Because they can cause CronJobs to miss their invocation and not be run.
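A quick way to see where your own hotspots are is to count how many schedules fire on each minute of the hour. The sketch below is a simplified illustration, not a full cron parser: it reads only the minute field of a five-field cron expression and assumes the remaining fields match. The sample schedules are made up.

```python
from collections import Counter
from typing import Iterable, Set


def minutes_matched(minute_field: str) -> Set[int]:
    """Expand a cron minute field into the set of minutes (0-59) it matches.

    Handles "*", single values, ranges ("10-20"), lists ("0,30"), and
    steps ("*/5", "10-50/10", "5/15"). Names and timezones are not handled.
    """
    matched = set()
    for part in minute_field.split(","):
        has_step = "/" in part
        if has_step:
            base, step_text = part.split("/")
            step = int(step_text)
        else:
            base, step = part, 1
        if base == "*":
            start, end = 0, 59
        elif "-" in base:
            start_text, end_text = base.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = int(base)
            # "5/15" is read as "from minute 5 to 59 in steps of 15"
            end = 59 if has_step else start
        matched.update(range(start, end + 1, step))
    return matched


def invocations_per_minute(schedules: Iterable[str]) -> Counter:
    """Count how many schedules fire on each minute, looking at the minute field only."""
    load = Counter()
    for schedule in schedules:
        minute_field = schedule.split()[0]
        load.update(minutes_matched(minute_field))
    return load


if __name__ == "__main__":
    # Hypothetical schedules, for illustration only.
    sample = ["0 * * * *", "*/5 * * * *", "*/15 * * * *", "30 * * * *", "0 2 * * *"]
    for minute, count in invocations_per_minute(sample).most_common(3):
        print(f"minute {minute:02d}: {count} schedules")
```

Even in this tiny sample, minute 0 collects the most schedules. Spreading schedules across less popular minutes is one way to take pressure off the top of the hour.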
Enterprise users have identified a number of issues, too, with monitoring the failure of jobs and containers, including “long-standing, technical issues in Kubernetes” that require manual intervention to remedy.
Is my cron running?
Manual interventions are available, too, for obtaining some status information on cron schedules. Kubernetes’ own documentation gives instructions as to how to find out whether a cron is running. However, Jack Wallen, a columnist for Tech Republic, gives a clearer and slightly different explanation in an article on how to use CronJobs for scheduling in Kubernetes.
Wallen explains how, after deploying a Kubernetes cluster and creating a YAML file for the task, you can use a series of commands for deploying the job that includes the scheduled task, making sure the job is running, and watching for the job by obtaining a list of every deployed job.
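If you would rather not eyeball a job list, one option is to read the CronJob’s own status. The Python sketch below is not taken from Wallen’s article or the Kubernetes documentation. It assumes you have saved the JSON printed by `kubectl get cronjob <name> -o json` and that the object carries a `status.lastScheduleTime` timestamp with whole seconds. The sample JSON, the CronJob name, and the five-minute interval are all made up for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def last_schedule_time(cronjob: dict):
    """Return status.lastScheduleTime as an aware datetime, or None if never scheduled."""
    raw = cronjob.get("status", {}).get("lastScheduleTime")
    if raw is None:
        return None
    # Assumes a timestamp such as "2024-01-01T12:00:00Z" (no fractional seconds).
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def looks_stale(cronjob: dict, expected_interval: timedelta, now: datetime) -> bool:
    """True if the CronJob has not been scheduled within the expected interval."""
    last = last_schedule_time(cronjob)
    if last is None:
        return True
    return now - last > expected_interval


# Made-up sample standing in for `kubectl get cronjob nightly-report -o json`.
SAMPLE_JSON = """
{
  "metadata": {"name": "nightly-report"},
  "spec": {"schedule": "*/5 * * * *"},
  "status": {"lastScheduleTime": "2024-01-01T12:00:00Z", "active": []}
}
"""

if __name__ == "__main__":
    cronjob = json.loads(SAMPLE_JSON)
    now = datetime.now(timezone.utc)
    name = cronjob["metadata"]["name"]
    if looks_stale(cronjob, timedelta(minutes=5), now):
        print(f"{name}: no run scheduled in the last 5 minutes")
    else:
        print(f"{name}: scheduled recently")
```

This only tells you that Kubernetes created a Job on schedule. It says nothing about whether the application code inside the Pod actually started or finished.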
Did my cron run successfully?
Some users extensively re-engineer their implementation by taking a number of steps, including fixing underlying bugs in Kubernetes and instrumenting their platform with built-in metrics and alerts.
The new tooling for a proper implementation includes a series of counters that answer these questions:
- Did the application code execute?
- Did the cron run successfully?
- Why is my cron not running?
Lyft also created its own timers for measuring start delay and the amount of time it takes for the code to execute.
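To make the idea concrete, here is a minimal Python sketch of what counters and timers around a cron task could look like. It is not Lyft’s implementation. The metric names, the `CronMetrics` class, and the `run_instrumented` helper are hypothetical, and a real setup would ship these values to a metrics backend instead of keeping them in memory.

```python
import time
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CronMetrics:
    """In-memory stand-in for a metrics client (hypothetical, for illustration)."""
    counters: Counter = field(default_factory=Counter)
    timers: dict = field(default_factory=dict)


def run_instrumented(task, scheduled_at, metrics, clock=time.time):
    """Run task and record start delay, run time, and outcome counters.

    scheduled_at is the Unix timestamp at which the cron was supposed to fire.
    """
    started_at = clock()
    metrics.timers["start_delay_seconds"] = started_at - scheduled_at
    metrics.counters["started"] += 1  # answers: did the application code execute?
    try:
        task()
    except Exception:
        metrics.counters["failed"] += 1
        raise
    else:
        metrics.counters["succeeded"] += 1  # answers: did the cron run successfully?
    finally:
        metrics.timers["run_seconds"] = clock() - started_at


if __name__ == "__main__":
    metrics = CronMetrics()
    # Pretend the cron was due two seconds ago.
    run_instrumented(lambda: time.sleep(0.1), time.time() - 2, metrics)
    print(dict(metrics.counters))
    print(metrics.timers)
```

An alert on a scheduled run with no matching "started" count is one way to approach the third question, why a cron is not running, though finding the actual cause still means digging into the events described earlier.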
But do you really want to do all that custom coding?
Most companies don’t have the developer resources to make that kind of ad hoc automation cost-effective anyway. Fortunately, there’s a way to get the benefits of Kubernetes that doesn’t involve building it yourself. What if you could replace the native CronJob scheduler?
Workload automation platforms seem boring and old school, but they’re really good at the stuff that the native Kubernetes scheduler isn’t. They’ve evolved into modern automation platforms, like OpCon, that can handle event-driven (contingency-driven) workflows with extremely granular frequency scheduling options that avoid the traffic jams that happen when you try to scale up with cronjobcontroller.
Cutting-edge IT managers might be skeptical, but let’s zoom out and look at it as an IT infrastructure and operations problem that needs to be solved. You have a lot of IT processes that need to happen, and you need an orchestration platform that can talk to all of your apps and move data between environments, and it needs to work 24/7/365 without complaining or failing. A platform like OpCon can meet this need because it was built for this exact purpose. You can manage all of your monitoring, notifications, scheduling, and implementation of new workflows from a single interface. Oh, and it’s also available on Docker.
Kubernetes CronJob scheduling issues seem like a new thing, but they’re actually an old problem on new infrastructure. SMA Technologies can use OpCon to help your organization overcome the limitations of the native Kubernetes scheduling tools and save a lot of time and money while doing it.