Server firmware is fragile! Why haven’t we fixed it? The HPE SSD patch challenge.

Image: clocks (https://www.pexels.com/photo/art-city-clock-clock-face-277458/)

Last week, HPE alerted us to a critical firmware bug that would cause SSDs to fail after 3.7 years of use. The problem is not that there was a firmware bug (those happen all the time) but how little HPE provided to fix it. It basically throws the fix back to the operator, who has to run scripts on each server.

SSDs are one example of firmware that is not typically managed by vendor out-of-band tooling (HPE iLO, Dell DRAC). PDU, fan, GPU, FPGA, and NIC firmware are other examples of deep firmware that must be managed.

When we formed RackN five years ago, we thought that operators would want to make sure their server firmware was patched. However, in spite of a string of critical patches, firmware continually falls below the threshold of must-manage software. This is remarkable because firmware is a well-known performance and failure point for infrastructure.

We’ve spent the last few years trying to understand why hardware is so often ignored in the infrastructure equation, and we refuse to accept “it’s too hard” as the answer.

Instead, our industry has failed to treat firmware as part of an integrated infrastructure stack. We’ve accepted poor APIs and controls from vendors. We’ve fostered a belief that virtualization shields us from hardware. We’ve accepted the cloud “you’re not smart enough to run a data center” marketing. We’ve allowed these fallacies to make us complacent.

We need to treat firmware patching as part of a continuously integrated data center mindset, one where we can perform source-controlled migrations of systems that let us safely and quickly make deep changes as part of regular deployment cycles.
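
To make the idea of source-controlled firmware a little more concrete, here is a minimal sketch in Python. It assumes a firmware baseline that lives in version control next to the rest of your infrastructure code, and an inventory that discovery has already collected. Every name in it (`Component`, `BASELINE`, `plan_updates`, the server and model names, the version strings) is hypothetical and chosen for illustration; it is not RackN's tooling or any vendor's API, and it only reports drift rather than applying a patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One firmware-bearing device found during discovery (hypothetical shape)."""
    server: str
    kind: str      # e.g. "ssd", "nic"
    model: str
    firmware: str  # version string as reported by the device


# Hypothetical baseline: the firmware version each (kind, model) pair should run.
# In practice this would be a reviewed file in version control.
BASELINE = {
    ("ssd", "example-ssd-a"): "2.0",
    ("nic", "example-nic-b"): "1.4",
}


def plan_updates(inventory, baseline):
    """Return (component, wanted_version) pairs for devices that differ from the baseline."""
    plan = []
    for component in inventory:
        wanted = baseline.get((component.kind, component.model))
        # Devices with no baseline entry are skipped, not flagged.
        if wanted is not None and component.firmware != wanted:
            plan.append((component, wanted))
    return plan


# Hypothetical inventory, as discovery might report it.
inventory = [
    Component("rack1-node01", "ssd", "example-ssd-a", "1.9"),
    Component("rack1-node01", "nic", "example-nic-b", "1.4"),
    Component("rack1-node02", "ssd", "example-ssd-a", "2.0"),
]

for component, wanted in plan_updates(inventory, BASELINE):
    print(f"{component.server}: {component.kind} {component.model} "
          f"{component.firmware} -> {wanted}")
```

The comparison is a plain equality check because firmware version strings do not reliably sort. The point is that a change to the baseline becomes a reviewed commit, and the resulting plan can feed the same deployment cycle as any other change.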

Fully integrated provisioning is not fantasy. At RackN, we’re helping operators implement exactly these processes. It’s not magic, but it does require system-level processes that prioritize workflows, inventory, discovery, and integration to coordinate physical and application lifecycles. It’s time for everyone to start thinking this way because this SSD patch is just the latest in an increasingly urgent string of firmware updates.
