Recently, I was working with a customer who makes extensive use of Linux workloads in Azure, deploying many heavily customized virtual machines (VMs). After some reboots, package installations, and system configuration tasks, the customer ran into an issue where the Azure Linux agent was unable to communicate with the Azure fabric to report the health and availability of the operating system. While the exact interactions that caused this are unclear, the customer needed to redeploy the agent onto this ‘broken’ VM, which would not boot properly in Azure. The customer was able to revive the machine by mounting its disk onto another VM running the same Linux distribution and reinstalling the agent, which this guide walks through.
The VM agents in Azure provide health information to the Azure fabric and also surface diagnostic information, such as capturing the serial/console stream and presenting it in the portal for troubleshooting. The agent is provisioned automatically on newly created VMs in Azure and has a number of additional configuration options, such as controlling whether the swap/paging file is enabled on the temporary disk. Additionally, this agent is used to deploy additional extensions to the VM, such as the custom script extension, which invokes scripts on machines booting in Azure.
Additional details on the Linux agent are available here and, if you’re interested to see exactly how it works, the source code for the agent is available on GitHub.
This agent can be installed or upgraded with the standard RPM or DEB Linux package management utilities, such as yum or apt-get. It can also be deployed manually, as outlined in the details above, using a Python install script.
In almost all cases, the agent can be updated using these same package management utilities and this agent is like the old Ronco commercials – ‘set it and forget it’.
Repairing a VM with an agent in a corrupt state looks painful, but it isn’t difficult. Here’s how to get started:
Note: This assumes you’re using unmanaged disks in Azure — see the addendum below for the additional steps required to obtain access to the underlying VM’s virtual hard disk. If you’re unsure whether your VM is using unmanaged disks, see the screenshot below:
5. Boot the new VM and change to a root user (using “su –” if the root password is known, or “sudo su –” if this image was created from an Azure VM gallery image)
6. Create a mount point for the new data disk such as: mkdir /repair
7. Mount the data disk to this folder: mount /dev/sdc1 /repair
8. Change to the mount point: cd /repair
9. Run the following commands to bind-mount the system folders needed to reinstall the agent, then chroot into the mounted disk so that subsequent commands run against the newly attached data disk, in essence giving you console access to that disk
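A typical bind-mount and chroot sequence, assuming the disk is mounted at /repair as in the steps above, looks like this:

```shell
# Bind the running system's device, process, and sysfs trees into the mounted disk
mount --bind /dev /repair/dev
mount --bind /proc /repair/proc
mount --bind /sys /repair/sys

# Enter the mounted disk's root so package tools operate on the broken VM's installation
chroot /repair /bin/bash
```

Once inside the chroot, yum or apt-get reads the broken VM’s package database rather than the repair VM’s, so the reinstall in the next step lands on the correct disk.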
10. Re-install the agent (waagent) using package manager tools: yum install WALinuxAgent on RPM-based distributions, or apt-get install walinuxagent on Debian-based distributions
11. Check /etc/resolv.conf and ensure that the Azure DNS server is properly defined, if necessary, to allow the VM to communicate once booted
resolv.conf should have an entry for a nameserver as follows:
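For Azure’s internal DNS, the nameserver entry points at the well-known Azure wire server address:

```
nameserver 168.63.129.16
```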
12. Run the following commands to unmount the repaired disk:
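Assuming the /repair mount point and bind mounts from the earlier steps, the teardown looks like this:

```shell
# Leave the chroot shell
exit

# Unmount the bind mounts, then the data disk itself
umount /repair/dev /repair/proc /repair/sys
umount /repair
```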
13. Shut down the temporary VM and remove the data disk from the portal — now we need to re-associate this disk with the broken VM, which we can easily do using the Cloud Shell built into the portal to open a new PowerShell prompt:
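As a rough sketch of that re-association in Azure PowerShell (all resource names, sizes, and the VHD URI below are placeholders — since this walkthrough assumes an unmanaged disk, the broken VM is recreated from a new configuration with the repaired VHD attached as its OS disk):

```powershell
# Hypothetical names; substitute your own resource group, VM name, size, and VHD URI
$vm = New-AzVMConfig -VMName "brokenVM" -VMSize "Standard_DS2_v2"
$vm = Set-AzVMOSDisk -VM $vm -Name "brokenVM-osdisk" `
        -VhdUri "https://<storageaccount>.blob.core.windows.net/vhds/brokenVM-osdisk.vhd" `
        -CreateOption Attach -Linux
New-AzVM -ResourceGroupName "myResourceGroup" -Location "eastus" -VM $vm
```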
Hopefully that will get you back up and running, allowing the VM to boot up and properly report its health!