Building Pandemic Proof Infrastructure [Rob on TFIR]

It will disrupt the supply chain. It will create a talent crisis. It may bring our servers and cloud to a halt. IT teams should not think that their cloud is immune to this virus. It’s not.

COVID-19 has disrupted business-as-usual for organisations globally.  While the long-term ramifications of the corona pandemic are yet to be ascertained, there are crucial lessons for businesses, vendors, and technology leaders to learn from the fallout.

In an exclusive interaction with TFIR, Rob Hirschfeld, Founder & CEO, RackN, shares his thoughts on how the corona scourge will impact business and technology, and how corporates can be better prepared to overcome the impending challenges. … read the rest

If Private Cloud is dead. Where did it go? How did it get there? [JOINT POST]

TL;DR: Hybrid killed IT.

I’m a regular participant on BWG Roundtable calls and often extend those discussions 1×1.  This post collects questions from one of those follow-up meetings where we explored how data center markets are changing based on new capacity and also the impact of cloud.  

We both believe in the simple answer, “it’s going to be hybrid.” We both feel that this answer does not capture the real challenges that customers are facing.

pexels-photo-325229So who are we?  Haynes Strader, Jr. comes at this from a real estate perspective via CBRE Data Center Solutions.  Rob Hirschfeld comes at this from an ops and automation perspective via RackN.  We are in very different aspects of the data center market.    

Rob: I know that we’re building a lot of data center capacity.  So far, it’s been really hard to move operations to new infrastructure and mobility is a challenge.  Do you see this too?

Haynes: Yes.  Creating a data center network that is both efficient and affordable is challenging. A couple of key data center interconnection providers offer this model, but few companies are in a position to truly leverage the node-cloud-node model, where a company leverages many small data center locations (colo) that all connect to a cloud option for the bulk of their computing requirements. This works well for smaller companies with a spread-out workforce, or brand new companies with no legacy infrastructure, but the Fortune 2000 still have the majority of their compute sitting in-house in owned facilities that weren’t originally designed to serve as data centers. Moving these legacy systems is nearly impossible.

Rob: I see many companies feeling trapped by these facilities and looking to the cloud as an alternative.  You are describing a lot of inertia in that migration.  Is there something that can help improve mobility?

Haynes: Data centers are physical presences to hold virtual environments. The physical aspect can only be optimized when a company truly understands its virtual footprint. IT capacity planning is key to this. System monitoring and usage analytics are critical to make growth and consolidation decisions. Why isn’t this being adopted more quickly? Is it cost? Is it difficulty to implement in complex IT environments? Is it the fear of the unknown?

Rob: I think that it’s technical debt that makes it hard (and scary) to change.  These systems were built manually or assuming that IT could maintain complete control.  That’s really not how cloud-focused operations work.  Is there a middle step between full cloud and legacy?

Haynes: Creating an environment where a company maximizes the use for its owned assets (leveraging sale leasebacks and forward-thinking financing) vs. waiting until end of life and attempting to dispose leads to opportunities to get capital injections early on and move to an OPEX model. This makes the transition to colo much easier, and avoids a large write-down that comes along with most IT transformations. Colocation is an excellent tool if it is properly negotiated because it can provide a flexible environment that can grow or shrink based on your utilization of other services. Sophisticated colo users know when it makes sense to pay top dollar for an environment that requires hyperconnectivity and when to save money for storage and day-to-day compute. They know when to leverage providers for services and when to manage IT tasks in-house. It is a daunting process, but the initial approach is key to getting to that place in the long term.

Rob:  So I’m back to thinking that the challenge for accessing all these colo opportunities is that it’s still way too hard to move operations between facilities and also between facilities and the cloud.  Until we improve mobility, choosing a provider can be a high stakes decision.  What factors do you recommend reviewing?

Haynes: There is an overwhelming number of factors in picking new colos:

  1. Location
  2. Connectivity/Latency
  3. Cloud Connectivity Options
  4. Pricing
  5. Quality of Services
  6. Security
  7. Hazard Risk Mitigation
  8. Comfort with services/provider
  9. Growth potential
  10. Flexibility of spend/portability (this is becoming ever-more important)

Rob: Yikes!  Are there minor operational differences between colos that are causing breaking changes in operations?

Haynes:  We run into this with our clients occasionally, but it is usually because they created two very different environments with different providers. This is a big reason to use a broker. Creating identical terms, pricing models, SLAs and work flows allow for clients to have a lot of leverage when they go to market. A select few of the top cloud providers do a really good job of this. They dominate the markets that they enter because they have a consistent, reliable process that is replicated globally. They also achieve some of the most attractive pricing and terms in the marketplace on a regular basis.

pexels-photo-119661.jpegRob: That makes sense.  Process matters for the operators and consistent practices make it easier to work with a partner.  Even so, moving can save a lot of money.  Is that savings justified against the risk and interruption?

Haynes: This is the biggest hurdle that our enterprise clients face. The risk of moving is risking an IT leader’s job. How do we do this with minimal risk and maximum upside? Long-term strategic planning is one answer, but in today’s world, IT leadership changes often and strategies go along with that. We don’t have a silver bullet for this one – but are always looking to partner with IT leaders that want to give it a shot and hopefully save a lot of money.

Rob: So is migration practical?

Haynes: Migration makes our clients cringe, but the ones that really try to take it on and make it happen strategically (not once it is too late) regularly reap the benefits of saving their company money and making them heroes to the organization.

Rob: I guess that brings us back to mixing infrastructures.  I know that public clouds have interconnect with colos that make it possible to avoid picking a single vendor.  Are you seeing this too?

Haynes: Hybrid, hybrid, hybrid. No one is the best one-stop shop. We all love 7-11 and it provides a lot of great solutions on the run, but I’m not grocery shopping there. Same reason I don’t run into a Kroger every time I need a bottle of water. Pick the right solution for the right application and workload.

Rob: That makes sense to me, but I see something different in practice.  Teams are too busy keeping the lights on to take advantage of longer-term thinking.  They seem so busy fighting fires that it’s hard to improve.

Haynes:  I TOTALLY agree. I don’t know how to change this. I get it, though. The CEO says, “We need to be in the cloud, yesterday,” and the CIO jumps. Suddenly everyone’s strategic planning is out the window and it is off to the races to find a quick-fix. Like most things, time and planning often reap more productive results.

Thanks for sharing our discussion!  

We’d love to hear your opinions about it.  We both agree that creating multi-site management abstractions could make life easier on IT and relatable to real estate and finance. With all of these organizations working in sync the world would be a better place. The challenge is figuring out how to get there!

Shouldn’t we have Standard Automation for Commodity Infrastructure?

sre-seriesOur focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

bookAn entire chapter of the Google SRE book was dedicated to the benefits of improving data center provisioning via automation; however, the description was abstract with a focus on the importance of validation testing and self-healing. That lack of detail is not surprising: Google’s infrastructure automation is highly specialized and considered a competitive advantage.

Shouldn’t everyone be able to do this?

After all, data centers are built from the same basic components with the same protocols.

Unfortunately, the stack of small (but critical) variations between these components makes it very difficult to build a universal solution. Reasonable variations like hardware configuration, vendor out-of-band management protocol, operating system, support systems and networking topologies add up quickly. Even Google, with their tremendous SRE talent and time investments, only built a solution for their specific needs.

To handle this variation, our SRE teams bake assumptions about their infrastructure directly into their automation. That’s expedient because there’s generally little operational reward for creating generic solutions for specific problems. I see this all the time in data centers that have server naming conventions and IP address schemes that are the automation glue between their tools and processes. While this may be a practical tactic for integration, it is fragile and site specific.

Hard coding your operational environment into automation has serious downsides.

First, it creates operational debt [reference] just like hard coding values in regular development. Please don’t mistake this as a call for yak shaving provisioning scripts into open ended models! There’s a happy medium where the scripts can be robust about infrastructure like ips, NIC ordering, system names and operating system behavior without compromising readability and development time.

Second, it eliminates reuse because code that works in one place must be forked (or copied) to be used again.  Forking creates a proliferation of truth and technical debt.  Unlike a shared script, the forked scripts do not benefit from mutual improvements.  This is true for both internal use and when external communities advance.  I have seen many cases where a company’s decision to fork away from open source code to “adjust it for their needs” cause them to forever lose the benefits accrued in the upstream community.

Consequently, Ops debt is quickly created when these infrastructure specific items are coded into the scripts because you have to touch a lot of code to make small changes. You also end up with hidden dependencies

However, until recently, we have not given SRE teams an alternative to site customization.

Of course, the alternative requires some additional investment up front.  Hard coding and forking are faster out of the gate; however, the SRE mandate is to aggressively reduce ongoing maintenance tasks wherever possible.  When core automation is site customized, Ops loses the benefits of reuse both internally and externally.

That’s why we believe SRE teams work to reuse automation whenever possible.

rebar-1Digital Rebar was built from our frustration watching the OpenStack community struggle with exactly this lesson.  We felt that having a platform for sharing code was essential; however, we also observed that differences between sites made it impossible to share code.  Our solution was to isolate those changes into composable units.  That isolation allowed us take a system integration view that did not break when inevitable changes were introduced.

If you are interested in breaking out of the script customization death spiral then review what the RackN team has done with Digital Rebar.

Even if you don’t use the code, the approach could save your SRE team a lot of heartburn down the road.  Of course, if you do want to use it then just contact us at sre@rackn.com.

%d bloggers like this: