Introducing VM Automated Incident Response with Event-Driven Automation on the ITS Private Cloud

Motivated by our goal of providing enterprise-grade, high-performance, reliable, and stable infrastructure, and having experienced incidents outside our control such as https://eis-vss.atlassian.net/wiki/spaces/VSSPublic/blog/2022/06/20/1109852161/Windows+Guest+OS+encounters+BSOD+after+Virtual+Machine+is+vMotioned+to+ESXi+7.0?src=search, we have begun our journey toward Virtual Machine (VM) Automated Incident Response (AIR), which orchestrates the detection, triage, and resolution of incidents within the ITS Private Cloud. This helps our community detect and respond to issues quickly, reduce the mean time to resolve (MTTR) incidents, and minimize the impact on services.

Our first iteration uses the VMware Event Broker Appliance (VEBA, http://vmweventbroker.io/) for event-driven automation. By default, when a “Guest OS Crash” event occurs, it calls the ITS Private Cloud API to notify the VM administrators.

Automated Mitigation (Optional)

To enable automated mitigation, set the VSS Option reset_on_restore on the VM, either through the ITS Private Cloud Portal or the Command Line Interface, as follows:

ITS Private Cloud Portal - Edit VM form

 

vss-cli --wait compute vm set <id> vss-option reset_on_restore

 

If the reset_on_restore VSS Option is not set on the VM, the administrator is only notified. This way, our users are promptly informed of any “Guest OS Crash” and can act if necessary.
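The behaviour described above can be sketched as a small decision function. This is only an illustration in Python, not the actual VEBA function or the ITS Private Cloud API: the VirtualMachine class, the plan_response function, and the "notify" and "reset" action names are hypothetical, and we assume that the mitigation is a VM reset that follows the notification. Only the reset_on_restore option name and the “Guest OS Crash” event label come from this post.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Names taken from the post.
RESET_OPTION = "reset_on_restore"
GUEST_OS_CRASH = "Guest OS Crash"


@dataclass
class VirtualMachine:
    """Hypothetical stand-in for a VM record; not the real API model."""
    vm_id: str
    vss_options: set[str] = field(default_factory=set)


def plan_response(event_type: str, vm: VirtualMachine) -> list[str]:
    """Return the actions to take for an event, in order (illustrative)."""
    if event_type != GUEST_OS_CRASH:
        return []  # other events are outside the scope of this sketch
    actions = ["notify"]  # administrators are notified by default
    if RESET_OPTION in vm.vss_options:
        # Assumption: the opt-in mitigation is a reset of the VM.
        actions.append("reset")
    return actions
```

For example, plan_response("Guest OS Crash", VirtualMachine("vm-1", {"reset_on_restore"})) yields both actions, while the same call for a VM without the option yields the notification only.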

 

University of Toronto - Since 1827