IT and data professionals, I implore you: know thy platform. All of it. Not just the layer your job is tasked with. Modern public cloud platforms (or any infrastructure, for that matter) are made up of more than just one layer.

Here’s an example from just recently. A long-term client called up and said one of my favorite phrases – “our test system is running slow.” The test system is on Azure IaaS.

Once I got further details, it became a bit clearer. On the test system, they were modifying an integration with an external system, and something hiccupped on a test run and overwrote some data incorrectly. No biggie – they recovered the database from the previous night’s backups and fixed the problem. But – about ten minutes after the restore, the system just slowed to a crawl, and stayed that way for three days.

After digging into it, we found a continuous loop of full database backups, each one being sent to Azure blob storage. Azure was not handling the refresh appropriately: instead of taking one full backup and then resuming log backups, it just kept looping through a full backup each time.
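A loop like this shows up in the backup history as a run of full backups with no log backups in between. Here is a minimal Python sketch of that check, not the tooling we actually used. It assumes you have already exported rows of (type, start time) from msdb.dbo.backupset, where 'D' is a full backup, 'I' a differential, and 'L' a log backup. The function name, the minimum run length, and the one-hour window are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

def find_full_backup_loops(history, min_run=3, window=timedelta(hours=1)):
    """Return runs of back-to-back full backups with nothing else in between.

    history: iterable of (backup_type, start_time) tuples using the
    msdb.dbo.backupset type codes: 'D' = full, 'I' = differential, 'L' = log.
    min_run and window are illustrative defaults, not recommended values.
    """
    runs, current = [], []
    for backup_type, started in sorted(history, key=lambda row: row[1]):
        if backup_type == "D":
            # A long gap since the previous full backup ends the current run.
            if current and started - current[-1] > window:
                if len(current) >= min_run:
                    runs.append(current)
                current = []
            current.append(started)
        else:
            # Any log or differential backup ends the run of full backups.
            if len(current) >= min_run:
                runs.append(current)
            current = []
    if len(current) >= min_run:
        runs.append(current)
    return runs
```

A healthy history alternates one full backup with many log backups, so it returns an empty list. Several full backups within minutes of each other come back as one run.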

The symptoms were clear:

  • Sustained disk reads over 200MB/s
  • Storage read latency over 640ms
  • Significant CPU utilization
  • Network transmit throughput over 900Mb/s
  • sp_WhoIsActive reporting an active database backup to an Azure blob URL

The backup loop was saturating the VM, and the scale of either the virtual disk or the VM itself was imposing a storage throughput ceiling that capped overall performance and tremendously slowed down read access. An application login would normally take five to six seconds, but now took a whopping six to seven minutes to complete.
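Spotting a ceiling like this comes down to comparing what you observe against the limits published for the VM size and disk tier. The Python sketch below shows that comparison under stated assumptions. The cap values are hypothetical placeholders, not the limits of the client's actual VM or disk. The metric names and the 90% threshold are likewise made up for illustration.

```python
def near_cap(observed, caps, threshold=0.9):
    """Return {metric: utilization} for metrics at or above threshold of their cap."""
    flagged = {}
    for metric, value in observed.items():
        cap = caps.get(metric)
        if cap:  # skip metrics that have no known cap
            utilization = value / cap
            if utilization >= threshold:
                flagged[metric] = round(utilization, 2)
    return flagged

# Observed values loosely follow the symptoms in this post.
observed = {"disk_read_MBps": 200, "network_tx_Mbps": 900}

# Hypothetical caps: look up the real ones for your VM size and disk tier.
caps = {"disk_read_MBps": 200, "network_tx_Mbps": 1000}

print(near_cap(observed, caps))
```

A metric that sits flat at its cap for hours is a strong hint that the platform is throttling you, rather than the workload simply being busy.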

We ended up killing the backup SPID and adjusting the backup policy on the Azure VM from automatic to manual, with one full backup being performed at night and log backups running every five minutes. Immediately, the problem disappeared, app performance went back to normal, and everything stabilized.

Taken individually, these symptoms would not tell you much, but together they painted a pretty clear picture of what was going on. The folks there are seasoned pros, sharp and well-intentioned, but they hadn’t shared the symptoms from the various layers with each other, so none of it made sense to them.

So, please, spend time digging in and understanding the basics of each layer at and underneath the data. Knowing the architecture and components and how they tie together, being aware of any performance limits at each layer, and being able to spot when one of the artificial caps is triggered are all critical to daily management of these platforms.