It’s incredibly difficult to build a scalable system, and it does not happen by accident. The unknowns involved in dealing with bare metal infrastructure make resilience and reliability difficult to predict in advance. So, we thought we would share the top 5 things you need to know to automatically deploy 1K bare metal servers.
These things are really DevOps secrets about applying software practices to operational practices. This “infrastructure as code way of thinking” draws on lessons learned from across the industry.
Here are the top five things RackN uses to help customers scale their bare metal deployment practices.
1. Embrace complexity
As technologists, we spend a lot of time complaining about “exploding complexity”. And most of the time we don’t have a practical response. The first secret to scaling is accepting that we are, in fact, building something complex. This means that we won’t be able to fix systems by simplifying them. Also, it is impossible to distinguish between required complexity and unnecessary complexity at the start of this process.
Accept that system variation and multiple APIs are normal when you encounter them. It is tempting to think “oh, that’s an unusual exception! I’ll just code around it in this clever way.” But if you want to automatically deploy 1K bare metal servers, you must handle the multiplicity in a maintainable way.
If you build a robust, logged, and maintainable way of handling the complexity you are bound to encounter, then you are in a much better place when the next “unexpected” exception arrives. Another bonus to this way of handling complexity: less technical debt.
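One way to picture handling multiplicity in a maintainable way is a small, logged dispatch table instead of scattered special cases. The sketch below is illustrative only: the vendor names, the `bmc_handler` decorator, and `configure_bmc` are all hypothetical and are not taken from any real tool.

```python
import logging
from typing import Callable, Dict

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("provision")

# Hypothetical registry: vendor name -> function that configures that vendor's BMC.
BMC_HANDLERS: Dict[str, Callable[[str], str]] = {}


def bmc_handler(vendor: str):
    """Register a handler for one hardware vendor (all names are illustrative)."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        BMC_HANDLERS[vendor] = func
        return func
    return register


@bmc_handler("vendor-a")
def configure_vendor_a(host: str) -> str:
    return f"{host}: configured through vendor-a's REST API"


@bmc_handler("vendor-b")
def configure_vendor_b(host: str) -> str:
    return f"{host}: configured through vendor-b's CLI tool"


def configure_bmc(host: str, vendor: str) -> str:
    """Dispatch to the registered handler and fail loudly on unknown variation."""
    handler = BMC_HANDLERS.get(vendor)
    if handler is None:
        log.error("no BMC handler for vendor=%s host=%s", vendor, host)
        raise LookupError(f"unsupported vendor: {vendor}")
    log.info("configuring host=%s vendor=%s", host, vendor)
    return handler(host)
```

When the next “unexpected” vendor shows up, it becomes one more registered handler rather than another branch buried in existing code, and the unknown case is logged instead of silently worked around.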
2. Make resets easy and reliable
A resilient system is frequently deployed, updated, and destroyed, even when it’s running in a normal state. That means large-scale systems constantly exercise their automation, APIs, and processes. However, the only way that works in practice (even on bare metal servers) is if resets are so easy and reliable that no one thinks twice about running them. In our experience, cheap resets are transformative to operations in the same way frequent deploys are to development teams.
In building resettable systems, we’ve learned some important lessons too. First, automatic retries undermine system reliability. This is because they often hide underlying problems from operators. By turning off retries, we uncover the flaky and infrequent problems that are the bane of building scalable systems.
Secondly, requiring immutable components is much better than building components on the fly. Creating versioned, read-only artifacts with static inputs, from IaC modules to full machine images, reduces variability. Eliminating automatic retries and mutability improves system stability at scale.
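Both lessons can be sketched in a few lines. This is a minimal illustration under our own assumptions: `ImageRef`, `verify_image`, `run_once`, and `ProvisionError` are hypothetical names, and a checksum is just one possible way to pin a static input.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class ProvisionError(Exception):
    """Raised when a provisioning step fails; never swallowed or retried."""


@dataclass(frozen=True)  # frozen: the reference cannot be mutated after creation
class ImageRef:
    name: str
    version: str
    sha256: str


def verify_image(ref: ImageRef, payload: bytes) -> None:
    """Refuse to use an image whose content does not match its pinned checksum."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != ref.sha256:
        raise ProvisionError(
            f"{ref.name}:{ref.version} checksum mismatch: "
            f"expected {ref.sha256}, got {actual}"
        )


def run_once(step_name: str, step: Callable[[], T]) -> T:
    """Run a step exactly once and surface the failure instead of retrying."""
    try:
        return step()
    except Exception as exc:
        raise ProvisionError(f"step {step_name!r} failed on first attempt") from exc
```

Because `run_once` has no retry loop, a flaky step fails visibly the first time and keeps the original exception attached as the cause, which is the information an operator needs to find the underlying problem.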
3. Don’t rely on magic
At scale, something is always changing or breaking. While you cannot eliminate failure, you can go a long way toward making recovery easy and fast by dropping “magic” from your system.
What is magic? It is any part of your process that lacks transparency, clear error responses, and good logging. You don’t have time for spelunking when you’re building a system that can deploy 1K bare metal servers. It is critical to build and use systems that intentionally create transparency.
Some magic is just processes that are not easily understood. Specifically, automation that relies on a dependency graph or complex if-then-do-that flowcharts requires humans to perform more cognitive work than simple linear workflows.
While we can’t eliminate complex logic, it’s useful to keep it simple and lean into more linear processes as the primary workhorses for automation. Because when things break, we need to make it fast and easy to figure out why.
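As a rough sketch of what a linear, transparent workflow can look like, the runner below executes named steps strictly in order, logs each one, and stops at the first failure with the step name attached. The `run_linear` function and the step names used with it are hypothetical.

```python
import logging
from typing import Callable, Dict, List, Tuple

log = logging.getLogger("workflow")

# A step is a (name, action) pair; the action mutates the machine record in place.
Step = Tuple[str, Callable[[Dict[str, str]], None]]


def run_linear(steps: List[Step], machine: Dict[str, str]) -> List[str]:
    """Run steps strictly in order and return the names of the completed steps."""
    completed: List[str] = []
    for index, (name, action) in enumerate(steps, start=1):
        log.info("step %d/%d start: %s", index, len(steps), name)
        try:
            action(machine)
        except Exception:
            # Log exactly where the workflow stopped, then re-raise unchanged.
            log.exception(
                "step %d/%d failed: %s; completed so far: %s",
                index, len(steps), name, completed,
            )
            raise
        completed.append(name)
    return completed
```

There is no graph to resolve here: the order in the list is the order of execution, so the log reads top to bottom the same way the workflow was written.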
4. Share automation
Operations is a team activity, but we often create automation as if we are the only person who will ever use or update it. RackN believes that automation is fundamentally a tooling problem. Unlike development languages, automation tools have not been created to foster a collaborative environment.
Digital Rebar was created specifically to address this shortcoming. However, the operators in charge of developing automation still must design reusable tasks, use parameters, and create documentation. Here are tips to ease this process:
- Automation must be composable in order to be reusable. This means that large workflows need to break down into many small units of work. In Digital Rebar these are called tasks and templates. These small units of work are what make automation reuse possible.
- Create tasks that can be left in place even if they are not needed for an environment or operating system. This type of “skip if no-op” or “turn on by configuration” coding minimizes forking or copying in workflows.
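To make the two tips concrete, here is a minimal sketch of small, parameter-driven tasks that skip themselves when they do not apply. It does not use Digital Rebar's actual task or template format; `Task`, `run_tasks`, and every parameter name are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Params = Dict[str, object]


@dataclass(frozen=True)
class Task:
    """One small unit of work plus the rule that decides whether it applies."""
    name: str
    applies: Callable[[Params], bool]
    run: Callable[[Params], str]


def run_tasks(tasks: List[Task], params: Params) -> List[str]:
    """Run every task in order, skipping (not removing) the ones that are no-ops."""
    results: List[str] = []
    for task in tasks:
        if not task.applies(params):
            results.append(f"{task.name}: skipped")
            continue
        results.append(f"{task.name}: {task.run(params)}")
    return results


# The same task list is shared by every environment; parameters decide what runs.
TASKS = [
    Task("set-hostname", lambda p: True, lambda p: f"hostname={p['hostname']}"),
    Task(
        "configure-raid",
        lambda p: bool(p.get("raid_enabled", False)),  # turn on by configuration
        lambda p: f"raid level {p['raid_level']}",
    ),
    Task(
        "install-apt-packages",
        lambda p: p.get("os_family") == "debian",  # skip on other OS families
        lambda p: "packages installed",
    ),
]
```

Calling `run_tasks(TASKS, {"hostname": "node-1", "os_family": "redhat"})` runs only the hostname task and reports the other two as skipped, so no one needs to fork the workflow for a different operating system.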
5. Build pipelines for bare metal servers
The lack of integrations (aka silos) between APIs and tools contributes significantly to complexity. We’re not going to unwind the problem, especially with the popularity of microservice architecture. Building infrastructure pipelines combats the silo challenge by creating a cross-system scaffolding. Pipelines are designed to collect and forward state information from a declarative start to a managed end state. They manage complexity by supplying a consistent operational control point across disparate APIs and tools. This helps operators watch and manage end-to-end processes even when they touch multiple other systems and tools.
Even more important, pipelines elevate processes above configuration. When the tools and APIs are used in isolation, each one assumes it is the “single source of truth.” As they are connected together, an exponentially increasing configuration and state burden is needed to synchronize a shared state. But a pipeline enforces how work is coordinated across each service or tool so that state complexity can be managed. That has the important side effect of making it easier to inject or replace parts in the pipeline.
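A toy model of this idea, under our own assumptions, is a runner that carries one state record from the declared inputs through each stage and records which stages ran. The stage functions and state keys below are hypothetical stand-ins for calls to separate tools and APIs.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
# A stage reads the current state and returns only the keys it adds or changes.
Stage = Tuple[str, Callable[[State], State]]


def run_pipeline(stages: List[Stage], declared: State) -> State:
    """Forward state from the declared start through every stage, in order."""
    state = dict(declared)  # copy so the declared inputs stay untouched
    trail: List[str] = []
    for name, stage in stages:
        state = {**state, **stage(state)}
        trail.append(name)
    state["trail"] = " > ".join(trail)  # single place to see what ran
    return state


def discover(state: State) -> State:
    return {"inventory": f"inventory-of-{state['machine']}"}


def install_image(state: State) -> State:
    return {"os": f"{state['target_os']} via image"}


def install_netboot(state: State) -> State:
    return {"os": f"{state['target_os']} via netboot"}


def register(state: State) -> State:
    return {"registered": f"{state['machine']} running {state['os']}"}


PIPELINE: List[Stage] = [
    ("discover", discover),
    ("install", install_image),
    ("register", register),
]
```

Because the stages only communicate through the forwarded state, replacing `install_image` with `install_netboot` is a one-line change to the list, and `register` does not need to know which installer ran.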
Now you know the secrets to automatically deploy 1K bare metal servers!
Taken together, these five secrets form the core design principles for Digital Rebar. What are your DevOps secrets for bare metal servers? Have you learned other ways to help scale operations? What’s working? Where do things break for you?
If you are operating systems at scale, then check out how we’ve incorporated these principles into the product. They are field tested by our customers to save you time and heartburn. Don’t believe us? You can try it for yourself with our trial.