Introducing VM Automated Incident Response with Event Driven Automation on the ITS Private Cloud
Our goal is to provide enterprise-grade, high-performance, reliable, and stable infrastructure, yet some incidents are outside our control, such as a Windows guest operating system encountering a blue screen of death (BSOD) after the virtual machine is vMotioned to ESXi 7.0 or later. To address this, we have started building Virtual Machine (VM) Automated Incident Response (AIR), which orchestrates the detection, triage, and resolution of incidents within the ITS Private Cloud. AIR helps our community detect and respond to issues quickly, reduce the mean time to resolve (MTTR) incidents, and minimize the impact on services.
Our first iteration includes components such as the VMware Event Broker Appliance (VEBA) for event-driven automation. When a “Guest OS Crash” event occurs, VEBA calls the ITS Private Cloud API, which by default notifies the VM administrators.
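To show how the event-driven piece might fit together, here is a minimal Python sketch of the filtering step a VEBA function could perform before calling the ITS Private Cloud API. It assumes VEBA delivers vCenter events as CloudEvents whose `subject` is the event type and whose `data` carries the vCenter event fields (`Vm`, `FullFormattedMessage`). The event type name `VmGuestOSCrashedEvent`, the `CrashNotification` type, and the `parse_crash_event` function are hypothetical, and the actual API call is left out.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# ASSUMPTION: the subject VEBA attaches to a "Guest OS Crash" vCenter event.
# The real event type name may differ; check the events your vCenter emits.
GUEST_OS_CRASH_EVENT = "VmGuestOSCrashedEvent"


@dataclass(frozen=True)
class CrashNotification:
    """Hypothetical payload handed to the ITS Private Cloud API client."""
    vm_moref: str  # vCenter managed object ID, e.g. "vm-1234"
    vm_name: str
    message: str


def parse_crash_event(cloud_event: Mapping[str, Any]) -> Optional[CrashNotification]:
    """Return a notification for Guest OS Crash events, or None for any other event."""
    if cloud_event.get("subject") != GUEST_OS_CRASH_EVENT:
        return None  # not a crash event: ignore it
    data = cloud_event.get("data") or {}
    vm = data.get("Vm") or {}
    # ASSUMED layout: {"Vm": {"Name": ..., "Vm": {"Type": "VirtualMachine", "Value": "vm-..."}}}
    moref = (vm.get("Vm") or {}).get("Value", "")
    return CrashNotification(
        vm_moref=moref,
        vm_name=vm.get("Name", ""),
        message=data.get("FullFormattedMessage", "Guest OS crash detected"),
    )
```

Keeping the parsing separate from the API call makes it easy to test against recorded events without touching vCenter or the portal.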
Automated Mitigation (Optional)
To enable automated mitigation, set the reset_on_restore VSS Option on the VM via the ITS Private Cloud Portal or the Command Line Interface (vss-cli) as follows:
vss-cli --wait compute vm set <id> vss-option reset_on_restore
If the reset_on_restore VSS Option is NOT set, the administrator is only notified. This keeps our users promptly informed of any “Guest OS Crash” so they can act if necessary.
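The opt-in rule above reduces to one decision per crash event. The Python sketch below illustrates that decision; only the `reset_on_restore` option name comes from this article, while `choose_action` and the `Action` values are hypothetical. It also assumes the administrator is still notified when mitigation runs, as the word “only” suggests, and it does not describe what the mitigation itself does.

```python
from enum import Enum
from typing import Iterable

# The VSS Option name comes from the article; everything else is hypothetical.
RESET_ON_RESTORE = "reset_on_restore"


class Action(Enum):
    NOTIFY_ONLY = "notify_only"
    MITIGATE_AND_NOTIFY = "mitigate_and_notify"


def choose_action(vss_options: Iterable[str]) -> Action:
    """Pick the response to a Guest OS Crash from the VM's VSS Options."""
    # Opt-in: mitigation only runs when the administrator has set the option.
    if RESET_ON_RESTORE in set(vss_options):
        return Action.MITIGATE_AND_NOTIFY
    # Default: the administrator is informed and decides what to do.
    return Action.NOTIFY_ONLY
```

Making mitigation opt-in keeps the default behaviour safe: nothing happens to a VM automatically unless its administrator has asked for it.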
University of Toronto - Since 1827