By Rob Hirschfeld
I’ve been relatively quiet about the OpenCrowbar Drill release, and that’s not a fair reflection of the capability and maturity of the code base; however, it really just sets the stage for truly groundbreaking ops automation work in the next release (“Epoxy”).
So, what’s in Drill? Scale and Containers on Metal Workloads! https://github.com/opencrowbar/core/releases
The primary focus for this release was proving our functional operations architectural pattern against a wide range of workloads, and that is exactly what the RackN team has been doing with Ceph, Docker, Kubernetes, CloudFoundry and StackEngine workloads.
In addition to workloads, we put the platform through its paces in real ops environments at scale. That resulted in even richer network configurations and options plus performance and tuning. The RackN team continues to adapt OpenCrowbar to match real world ops.
One critical lesson you’ll see more of in the Epoxy release: OpenCrowbar and the team at RackN will keep finding ways to adapt to the ops environment. We believe that tools should adapt to their environments: we’ve encountered some pretty extreme quirks, and our philosophy is to embrace them rather than force change.
Cloud Ops is a brutal business: operators are expected to maintain a stable and robust operating environment while also embracing waves of disruptive changes using unproven technologies. While we want to promote these promising new technologies, the unicorns, operators still have to keep the lights on; consequently, most companies turn to outside experts or internal strike teams to get this new stuff working.
Our experience is that doing an on-site deployment by professional services (PS) is often much harder than expected. Why? Because of inherent mission conflict. The PS paratrooper team sent to accomplish the “install Foo!” mission is at odds with the operators’ maintain-and-defend mission. Where the short-term team is willing to blast open a wall for access, the long-term team is highly averse to collateral damage. Both teams are faced with an impossible situation.
I’ve been promoting Open Ops around a common platform (obviously, OpenCrowbar in my opinion) as a way to address cross-site standardization.
Why would a physical automation standard help? Generally, the pros expect to arrive with everything in a “ready state,” including the OS installed and all the networking ready. Unfortunately, there’s a significant gap between having an OS installed and being truly ready; installs are always a journey of discovery as the teams figure out the real requirements.
Here are some questions that we’ve put together to gauge whether the installs are really going the way you think:
- How often is the customer site ready for deployment? If not, how long does that take to correct?
- How far into a deployment do you get before an error in deployment is detected? How often is that error repeated across all the systems?
- How often is an “error” actually an operational requirement at the site that cannot be changed without executive approval and weeks of work?
- How often are issues found after deployment is started that cause an install restart?
- Can the deployment be recreated on another site? Can the install be recreated in a test environment for troubleshooting?
- How often are systems hand- or custom-updated as part of a routine installation?
- How often are developers needed to troubleshoot issues that end up being configuration related?
- How often are back-doors left in place to help with troubleshooting?
- What is the upgrade process like? Is the state of the “as left” system sufficiently documented to plan an upgrade?
- What happens if there’s a major OS upgrade or patch required that impacts the installed application?
- Can changes to the site be rolled out in stages?
- Can the upgrade be automated and rehearsed?
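Several of the questions above (especially the first one, about whether the site is ready for deployment) lend themselves to simple automation. As a minimal illustration, not part of OpenCrowbar itself, here is a hypothetical pre-flight sketch in Python; the check names, hostnames, and ports are placeholders you would replace with your site’s actual requirements:

```python
import socket

def dns_resolves(hostname):
    """Return True if the hostname resolves -- a basic ready-state check."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical checklist: every entry here is an illustrative placeholder.
    checks = {
        "DNS resolves for package mirror": dns_resolves("localhost"),
        "SSH reachable on admin node": port_open("127.0.0.1", 22),
    }
    for name, ok in checks.items():
        print(f"{name}: {'PASS' if ok else 'FAIL'}")
```

A script like this turns “how long does it take to correct an unready site?” into a pass/fail report that can be run before the PS team ever arrives, and re-run on every site to make deployments comparable.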