RackN

Redfish API Woes and the Stress of Bare Metal Management

This story, like most Operations stories, starts in the most routine way: a customer called us because their new server hardware didn’t install correctly. They had just received the latest-generation Dell servers with iDRAC 10, racked them, powered them on, and ran the same automation they’d been running successfully for years. We work hard to ensure that this is a no-drama event for RackN customers.

But it failed.

But Redfish is a Standard!?

When we dug in, the failure pointed at something fundamental: boot order management. That’s not an optional detail in bare metal automation. Boot order is one of the core levers you pull to control how systems provision, reprovision, and recover. If that breaks, everything above it becomes unreliable.

The customer contacted Dell. Dell’s response was straightforward: everything works as intended. They changed a call from writable to read-only and it was our fault for not reviewing the release notes (see if you can find it). On review, we found that parts of the change were made in iDRAC 9 but only enforced in iDRAC 10, so it didn’t raise any flags.

Dell, which helps drive the official Redfish spec, introduced a change in its Redfish implementation between iDRAC 9 and iDRAC 10. But the Redfish specification had NOT changed yet! Dell pre-implemented the behavior in the new iDRAC version. Telling us to “follow the spec” is not an option here. Worse, the older iDRAC versions (and other OEMs), of course, still follow the current Redfish behavior.

So we have two problems here: 1) the spec has not formally changed, and 2) we now have to support two Redfish behaviors. Technically, there’s a nasty third problem: there’s no way to tell that the behavior has changed until you try and fail.
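Since the only reliable signal is a failed call, one way to live with two behaviors is to try the familiar approach first and fall back when the BMC refuses it. The sketch below is a minimal illustration of that idea, not our actual implementation. The `Rejected` exception, the `set_boot_order` function, and the idea of a strategy as a named callable are all hypothetical names invented for this example; real strategies would wrap the Redfish HTTP calls for each behavior.

```python
from typing import Callable, List, Sequence, Tuple


class Rejected(Exception):
    """Hypothetical error a strategy raises when the BMC refuses the request."""


# A strategy is a (name, callable) pair. The callable applies the boot order
# or raises Rejected. Real strategies would wrap Redfish HTTP calls.
Strategy = Tuple[str, Callable[[List[str]], None]]


def set_boot_order(strategies: Sequence[Strategy], boot_order: List[str]) -> str:
    """Try each strategy in order and return the name of the one that worked."""
    failures = []
    for name, apply in strategies:
        try:
            apply(boot_order)
            return name
        except Rejected as exc:
            # Only "the BMC said no" triggers a fallback. Other errors
            # (network, authentication) propagate so they are not masked.
            failures.append(f"{name}: {exc}")
    raise RuntimeError("no boot order strategy worked: " + "; ".join(failures))
```

Returning the name of the strategy that worked lets the caller record it, so the next run against the same class of hardware can start with the right behavior instead of failing first.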

Surprised, but Prepared

While we didn’t see this specific change coming, surprises like it are a regular occurrence in operations environments. The previous server management protocol, the Intelligent Platform Management Interface (IPMI), had a published specification, but it left so much room for vendor-specific commands and extensions that IPMI was notorious for vendor differences and constant change. Even Redfish, which is based on an API specification, offers only a marginal improvement because OEM vendors often inject specialized settings, find and fix issues, and generally change how the API responds. Even more confusing, different vendors respond to errors with different codes and status data.

Programming against IPMI and Redfish is a lesson in quirk management and “try it and see” operations. 

This example illustrates a more general principle in operations: you can’t assume that an API or CLI implementation is consistent even within the same vendor or version. The customer didn’t miss a migration window. They didn’t skip an upgrade. They bought new hardware and discovered that the ground had shifted underneath them.

Building Durable Automation is Hard

Here’s the problem that bites teams who build their own bare metal automation: supporting multiple behaviors simultaneously is genuinely difficult engineering work. It’s the kind of automation that needs to be constantly tested and updated because your environment changes around you.

Your software has to identify what it’s talking to, understand which rules apply, and do the right thing without breaking anything else that’s already running. That complexity grows over time. New server generations arrive. Old ones stay in service. Firmware updates shift behavior again. Homegrown automation may appear to work at first, but it tends to turn flaky as the environment shifts around it. Sadly, that keeps many operations teams stuck with manual processes and restricted to service windows.
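To make “identify what it’s talking to and understand which rules apply” concrete, here is a minimal sketch of a rule table that maps what a management controller reports about itself to a behavior profile. Everything in it is an assumption made for illustration: the `Profile` and `Rule` types, the `select_profile` function, the strategy names, and the model string are invented, and the `Model` and `FirmwareVersion` keys stand in for whatever identifying fields your hardware actually returns. It is not how Digital Rebar content is written.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Profile:
    """Hypothetical bundle of behaviors to use for one class of hardware."""
    name: str
    boot_strategy: str


@dataclass(frozen=True)
class Rule:
    """Pairs a predicate over the reported manager info with a profile."""
    matches: Callable[[Dict[str, str]], bool]
    profile: Profile


# Used when no rule matches: assume the long-standing behavior.
DEFAULT_PROFILE = Profile("generic", "patch-boot-order")

# Illustrative rule only. The model string is made up and is not a value
# any real BMC is known to report.
RULES: Sequence[Rule] = (
    Rule(
        matches=lambda info: "example bmc gen2" in info.get("Model", "").lower(),
        profile=Profile("example-gen2", "alternate-boot-order"),
    ),
)


def select_profile(info: Dict[str, str], rules: Sequence[Rule] = RULES) -> Profile:
    """Return the profile of the first matching rule, or the default."""
    for rule in rules:
        if rule.matches(info):
            return rule.profile
    return DEFAULT_PROFILE
```

The point of the default is that new rules get added for new behavior while existing hardware keeps taking the path it always took. Because a firmware update can change behavior without changing what the controller reports, a table like this works best alongside a try-and-fall-back path rather than instead of one.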

This is exactly why we built Digital Rebar Content Packs the way we did. When new behavior shows up, we adapt to it without breaking old behavior and package it in our default content. The magic of our IaC content design is that it’s site-neutral, so customers can get updates that work across generations of hardware, platforms, operating systems, and vendors. If you are keeping up with our content releases, a Dell server with iDRAC 10 just works. You don’t have to (re)discover this issue or the next one.

That “it just works” experience is the power of the Digital Rebar bare metal abstraction combined with RackN support. It’s a value proposition that is unique in the IT automation industry.

You Can Protect Yourself

If you’re DIYing bare metal automation, I hope this post helps you prepare for the coming Redfish spec changes. 

But I also hope you’ll look at Digital Rebar and RackN. We know how much effort it takes to support bare metal because that’s our mission. For most companies, keeping up with this is just toil that’s costing you time and money. Give yourself permission to do something that takes your company forward.
