Postmortem: AAP/AWX Job Migration Failure - October 31, 2023
Postmortem for 10/31/2023 - AAP/AWX Job transfers did not work as expected
Good morning from the Omni Platform Automation Team; we wish you well. Below is a writeup of the events surrounding AAP and AWX scheduled jobs in the early morning hours of 10/31/2023, from approximately 2am to 5am.
The catalyst for the evening’s work was a request to move some specific jobs for Cassandra nodeinfo data collection (which feeds Splunk) to AWX (an oversight from building). However, all of those jobs were already in AWX, so the request was assumed to mean “all scheduled jobs”, and that is the action that was taken.
The non-nodeinfo jobs had been transcribed verbatim from AAP, and with them came a configuration for a Credential (Ansible secret) used to access the hosts and execute the job. Due to an architectural shift in our move to AWX, this Credential had not been built and interwoven in the same way as on AAP, though the job configuration still pointed to it.
The team noticed this immediately, and the resolution was to execute these jobs manually on the AAP platform, as their time sensitivity is understood. Diagnosis continued, and later in the morning the Credential issue was discovered and remediated.
The team has gone back through all of the AWX Configurations, Jobs and Schedules and has validated that the Credential issue has been resolved entirely. We feel confident that we understand the miscommunications that triggered this initially, as well as the implications of the Job movements and all related configurations, and that they have been addressed appropriately.
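For illustration only, a check of this kind can be sketched as below. This is a minimal sketch and not the team's actual validation procedure: it assumes job configurations have already been exported into plain dictionaries, and the function name, field names, and sample job and credential names are all hypothetical. It flags any job whose configuration points to a Credential that does not exist on the target platform.

```python
def find_missing_credentials(job_templates, known_credentials):
    """Return {job name: [credential names not defined on the platform]}.

    Both arguments are hypothetical exports: a list of dicts with
    "name" and "credentials" keys, and an iterable of credential names.
    """
    known = set(known_credentials)
    missing = {}
    for template in job_templates:
        # A job with no "credentials" key is treated as needing none.
        absent = [c for c in template.get("credentials", []) if c not in known]
        if absent:
            missing[template["name"]] = absent
    return missing


# Hypothetical sample data: one job points at a credential that was
# never built on the new platform.
templates = [
    {"name": "nodeinfo-collection", "credentials": ["awx-ssh"]},
    {"name": "nightly-cleanup", "credentials": ["aap-ssh"]},
]
report = find_missing_credentials(templates, ["awx-ssh"])
print(report)
```

Running a comparison like this before moving schedules would surface a job that still references a Credential defined only on the old platform.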
In addition to the above, and coincidentally during this same window, there was an issue with the CL3 nodeinfo scheduled job in AWX. It was not caused by any of the Job moves, but by a failure in Docker that caused the execution of the job to fail.
You can find here the xlsx summary of our analysis for all of the jobs.