
Triaging Cron Failures

Problem identification

$ skpr cron job list <env>
 ────────────────────────────── ──────── ─────────────────────────────── ────────── 
  NAME (1)                       PHASE    START TIME                      DURATION  
 ────────────────────────────── ──────── ─────────────────────────────── ──────────       
  drupal-prod-drush-1600650000   Failed   2024-04-18 04:00:00 +0000 UTC   1m1s      
 ────────────────────────────── ──────── ─────────────────────────────── ────────── 

Hypotheses

  • There may have been recent changes to the application that coincide with this failure.
  • The cron job may be consistently failing at a specific point.

Debugging

Check for recently packaged/deployed updates.

  1. List recent releases and check whether any were packaged shortly before the failure (example below).
  2. Check your CI provider for recent deployments of those releases.

$ skpr release list
 ──────────── ─────────────────────────────── ────────────── 
  VERSION (3)  DATE                            ENVIRONMENTS  
 ──────────── ─────────────────────────────── ────────────── 
  0.0.4        2024-05-15 22:38:49 +0000 UTC   dev            
  0.0.3        2024-05-15 22:38:09 +0000 UTC   stg            
  0.0.2        2024-05-15 21:41:33 +0000 UTC   prod           
 ──────────── ─────────────────────────────── ────────────── 
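
To compare a release against the failure time, it can help to pull out only the row for the affected environment. The following is a minimal sketch that filters output in the table layout shown above. `release_for_env` is a hypothetical helper name, not part of the Skpr CLI, and it assumes ENVIRONMENTS is the last field on each row and holds a single environment name.

```shell
# Hypothetical helper: print the version and date of the release listed
# against one environment, reading `skpr release list` output from stdin.
# Assumes the table layout shown above, with ENVIRONMENTS as the last field.
release_for_env() {
  awk -v env="$1" '$NF == env { print $1, $2, $3 }'
}

# Usage (requires the skpr CLI):
#   skpr release list | release_for_env prod
```

If the date printed for the affected environment falls shortly before the failed cron job's start time, that release is a good first suspect.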

Reviewing CLI Logs

How to access the Application Dashboard:

A direct link from the Dashboard is coming soon. Until then:

  • Log into the AWS Console
  • Navigate to CloudWatch
  • Browse the Dashboards Section
  • Locate your application Dashboard (the name is in the format CLUSTER-PROJECT-ENVIRONMENT)
  • Review the Command Line (Cron / Shell) section for possible leads.

When was the last successful cron job?

Use the Skpr CLI to list the cron jobs configured for each environment.

The output shows when each cron job was last scheduled and when it last completed successfully.

$ skpr cron list <env>
────────── ────────── ─────────── ─────────────────────────────── ─────────────────────────────── ───────────
 NAME (1)   SCHEDULE   COMMAND     LAST SCHEDULE                   LAST SUCCESSFUL EXECUTION       SUSPENDED  
────────── ────────── ─────────── ─────────────────────────────── ─────────────────────────────── ───────────
 drush      @hourly    drush cron  2024-04-18 04:00:00 +0000 UTC   2020-03-20 04:02:45 +0000 UTC   No          
────────── ────────── ─────────── ─────────────────────────────── ─────────────────────────────── ───────────
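
When an environment has many cron jobs, the rows where the last success is older than the last schedule are the ones worth a closer look. The sketch below flags them. `stale_crons` is a hypothetical helper name, and it assumes the layout shown above: columns separated by two or more spaces, both timestamp columns populated, and all timestamps in the same time zone so that they sort as plain text.

```shell
# Hypothetical helper: read `skpr cron list` output from stdin and print
# each cron job whose LAST SUCCESSFUL EXECUTION is older than its
# LAST SCHEDULE. Assumes columns are separated by two or more spaces.
stale_crons() {
  awk -F '  +' '
    { sub(/^ +/, "") }   # drop the leading indent; awk re-splits the row
    ($4 ~ /^[0-9][0-9][0-9][0-9]-/) && (($5 "") < ($4 "")) {
      print $1 ": last success " $5
    }
  '
}

# Usage (requires the skpr CLI):
#   skpr cron list prod | stale_crons
```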

How long did this cron job take to fail?

Listing the cron jobs for an environment shows how long the failed job ran before failing and, from earlier successful runs, how long the job normally takes. Comparing the two gives a rough percentage of completion, which helps narrow down where to look in the CLI logs for the job.

$ skpr cron job list <env>
 ────────────────────────────── ─────────── ─────────────────────────────── ────────── 
  NAME (1)                       PHASE       START TIME                      DURATION  
 ────────────────────────────── ─────────── ─────────────────────────────── ──────────       
  drupal-prod-drush-1600650000   Failed      2024-04-18 04:00:00 +0000 UTC   1m1s
  drupal-prod-drush-1600646400   Succeeded   2024-04-18 03:00:00 +0000 UTC   2m40s
 ────────────────────────────── ─────────── ─────────────────────────────── ────────── 
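
As a worked example of the percentage calculation, the sketch below converts the DURATION values from the listing above into seconds and divides them. Both function names are hypothetical, and the parser assumes durations made only of whole `h`, `m` and `s` units, as in `1m1s` and `2m40s`.

```shell
# Hypothetical helper: convert a duration such as "1m1s" or "2m40s"
# to whole seconds. Handles whole h, m and s units only.
duration_to_seconds() {
  d=$1 total=0
  case $d in *h*) total=$(( total + ${d%%h*} * 3600 )); d=${d#*h} ;; esac
  case $d in *m*) total=$(( total + ${d%%m*} * 60 )); d=${d#*m} ;; esac
  case $d in *s*) total=$(( total + ${d%%s*} )) ;; esac
  echo "$total"
}

# Hypothetical helper: how far through a normal run the failed job got,
# as a whole-number percentage (rounded down).
percent_complete() {
  failed=$(duration_to_seconds "$1")
  expected=$(duration_to_seconds "$2")
  echo $(( failed * 100 / expected ))
}

percent_complete 1m1s 2m40s   # prints 38 (61s of 160s)
```

In this example the job failed roughly 38% of the way through a normal run, so the relevant log lines are likely to be about a minute after the start time.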

How many environments have this problem?

Checking whether other environments are affected shows how widespread the issue is. An unaffected environment may also give you a baseline for A/B testing any fix.

$ skpr list | cut -d ' ' -f3 | grep '^[a-z]' | xargs -I {} bash -c 'skpr cron job list "{}" | grep Failed'
drupal-prod-drush-1600650000   Failed      2024-04-18 04:00:00 +0000 UTC   1m1s

From here you can decide which failing cron jobs to investigate, and see how many there are and which environments are affected.
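
If a count per environment is more useful than the raw rows, the one-liner can be split into two small functions. This is a sketch only: `count_failed` and `failed_per_env` are hypothetical names, and the filter assumes PHASE is the second whitespace-separated field, as in the listings above.

```shell
# Hypothetical helper: count rows whose PHASE column is "Failed" in
# `skpr cron job list` output read from stdin. Prints 0 when none match.
count_failed() {
  awk '$2 == "Failed" { n++ } END { print n + 0 }'
}

# Hypothetical wrapper: print "<env>: <count>" for each environment name
# passed as an argument. Requires the skpr CLI.
failed_per_env() {
  for env in "$@"; do
    printf '%s: %s\n' "$env" "$(skpr cron job list "$env" | count_failed)"
  done
}

# Usage:
#   failed_per_env dev stg prod
```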

Mitigations

  • Can the Development Team disable a feature?
  • Can the Development Team roll back a release?
  • Can the Development Team run a drush command to fix the root cause?
  • Can the Skpr Platform Team roll back a change?
  • Can the Skpr Platform Team create a hotfix?