1 comment on “Weekly Recap of All Things Site Reliability Engineering (SRE)”

Weekly Recap of All Things Site Reliability Engineering (SRE)

pexels-photo-273011

Welcome to the first post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content please contact us at info@rackn.com.

SRE Items of the Week

Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites by Ian Miell

INTRO TO POST

For several years I managed the 3rd line site reliability operation for many of the world’s busiest gambling sites, working for a little-known company that built and ran the core backend online software for several businesses that each at peak could take tens of millions of pounds in revenue per hour. I left a couple of years ago, so it’s a good time to reflect on what I learned in the process.

In many ways, what we did was similar to what’s now called an SRE function (I’m going to call us SREs, but the acronym didn’t exist at the time). We were on call, had to respond to incidents, made recommendations for re-engineering, provided robust feedback to developers and customer teams, managed escalations and emergency situations, ran monitoring systems, and so on.

The team I joined was around 5 engineers (all former developers and technical leaders), which grew to around 50 of more mixed experience across multiple locations by the time I left.

I’m going to focus here on process and documentation, since I don’t think they’re talked about usefully enough where I do read about them.
_____

2017 trend lines: When DevOps and hybrid collide by Rob Hirschfeld (@zehicle)
IBM Cloud Computing News

INTRO TO POST

What happens when DevOps methods meet hybrid environments? Following are some emerging trends and my commentary on each.

There are two major casualties as the pace of innovation in IT continues to accelerate: manual processes (non-DevOps) and tightly-coupled software stacks (non-hybrid).

We are changing some things much too quickly for developers and operators to keep up using processes that require human intervention in routine activities like integrated testing or deployment. Furthermore, monolithic platforms—our traditional “duck-and-cover” protection from pace of change—are less attractive for numerous reasons, including slower pace, vendor lock-in and lack of choice.

RECENT SRE AND DEVOPS EVENTS

SRECon17 Americas

CloudNativeCon + KubeCon 2017 March 29-30, 2017 in Berlin

IBM Interconnect March in Las Vegas, NV

  • Christopher Ferris, IBM CTO Open Technology and Rob Hirschfeld “Open Cloud Architecture: Think You Can Out-Innovate the Best of the Rest” – SLIDES

DevOps Summit

  • “Best Practices in Operating Hybrid Infrastructure that Spans Clouds and the Data Center” – BLOG / SLIDES

UPCOMING MEETUPS & PODCASTS

Continuous Discussions (#c9d9) Episode 66: Scaling Agile and DevOps in the Enterprise – April 11, 2017 at 10am PT. Rob Hirschfeld a guest in this Electric Cloud podcast.

UPCOMING EVENTS FOR RACKN

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

DockerCon 2017 : April 17 – 20, 2017 in Austin, TX
DevOpsDays Austin : May 4-5, 2017 in Austin TX
OpenStack Summit : May 8 – 11, 2017 in Boston, MA
Interop ITX : May 15 – 19, 2017 in Las Vegas, NV
Open Source IT Summit – Tuesday, May 16, 9:00 – 5:00pm : Rob Hirschfeld to speak
Gluecon : May 24 – 25, 2017 in Denver, CO

  • Surviving Day 2 in Open Source Hybrid Automation – May 23, 2017 : Rob Hirschfeld and Greg Althaus

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly)Issue #66

0 comments on “Don’t Balkanize My Installer, Yo!”

Don’t Balkanize My Installer, Yo!

kubernetesLast week, RackN announced our enterprise support for Kubernetes using nothing but upstream Ansible from the project itself.  This effort represents years of effort by the RackN founders to keep platforms interoperable via open and shareable operations automation.

That’s why our Digital Rebar approach targets underlay challenges and leverages existing automation tools instead of investing yet another install path.

dcosThis week, we added Install Wizard templates to the DC/OS install automation we build in collaboration with Mesosphere last year.  That makes it even easier to run DC/OS on physical infrastructure.  Like our Kubernetes work, the Digital Rebar automation uses the same community dcos_install.sh that’s used in the community documentation.  The difference is that we’re also driving all the underlay prep and configuration automatically.

If this approach appeals to you, contact RackN and join in the open Day 2 revolution.

Interested in seeing the DC/OS install in action?  Here’s a demo video:

 

1 comment on “Are you impatient enough to be an SRE?”

Are you impatient enough to be an SRE?

sre-seriesOur focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

SRE minded teams are very impatient about eliminating manual, routine and non-differentiated work.

I’ve been talking to a lot of people about SRE lately in the context of helping Ops get out of the way while coping with increasing load and complexity.  Why are they so impatient? Because they know that ops demand is constantly increasing, there’s no “good enough” when it comes to finding ways to automate tasks and move up stack. Without consistent improvement in automation, teams will get buried (my post about Ops Debt).

The core SRE mantra needs to be “Own Ops, don’t be owned by Ops.”

Yet, outsourcing ops responsibility to a service is equally problematic for an SRE.  They cannot give up responsibility for the integrated system.  In fact, that’s one of the basic reasons why Google’s SRE teams went from just “web site reliability” to full system thinking.  Every aspect of the infrastructure stack needs to be considered when looking at system performance and reliability.  For example, something deep like SSD drive write behavior or GPU BIOS could make a critical difference.  SREs need to be able to root cause issues and black box infrastructure (a.k.a. Cloud) can get in the way.

SRE teams must balance owning the full stack versus focusing on what makes their job unique.

That’s why we have been rethinking about how SRE teams approach infrastructure.  Instead of trying to turn infrastructure into a black box services; we’ve designed the Digital Rebar composable Ops platform that embraces and contains heterogeneity with a high degree of transparency and control.  This is critical because SREs cannot afford to keep reinventing automation at the bottom of the stack.  We must be able to share and leverage best-practices on infrastructure provisioning and platform deployment.  

Like the hardware that runs it, the foundation automation layer must be commoditized.

That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way.  Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on.  We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification.  That practice and pace should be the norm instead of the exception.

That’s what we are building at RackN.  Our primary goal is to reuse automation whenever possible.  That was our top design priority for Digital Rebar and it drives our customer engagement models.  If you’d like to hear more, download our SRE white paper.

More information:

1 comment on “Shouldn’t we have Standard Automation for Commodity Infrastructure?”

Shouldn’t we have Standard Automation for Commodity Infrastructure?

sre-seriesOur focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

bookAn entire chapter of the Google SRE book was dedicated to the benefits of improving data center provisioning via automation; however, the description was abstract with a focus on the importance of validation testing and self-healing. That lack of detail is not surprising: Google’s infrastructure automation is highly specialized and considered a competitive advantage.

Shouldn’t everyone be able to do this?

After all, data centers are built from the same basic components with the same protocols.

Unfortunately, the stack of small (but critical) variations between these components makes it very difficult to build a universal solution. Reasonable variations like hardware configuration, vendor out-of-band management protocol, operating system, support systems and networking topologies add up quickly. Even Google, with their tremendous SRE talent and time investments, only built a solution for their specific needs.

To handle this variation, our SRE teams bake assumptions about their infrastructure directly into their automation. That’s expedient because there’s generally little operational reward for creating generic solutions for specific problems. I see this all the time in data centers that have server naming conventions and IP address schemes that are the automation glue between their tools and processes. While this may be a practical tactic for integration, it is fragile and site specific.

Hard coding your operational environment into automation has serious downsides.

First, it creates operational debt [reference] just like hard coding values in regular development. Please don’t mistake this as a call for yak shaving provisioning scripts into open ended models! There’s a happy medium where the scripts can be robust about infrastructure like ips, NIC ordering, system names and operating system behavior without compromising readability and development time.

Second, it eliminates reuse because code that works in one place must be forked (or copied) to be used again.  Forking creates a proliferation of truth and technical debt.  Unlike a shared script, the forked scripts do not benefit from mutual improvements.  This is true for both internal use and when external communities advance.  I have seen many cases where a company’s decision to fork away from open source code to “adjust it for their needs” cause them to forever lose the benefits accrued in the upstream community.

Consequently, Ops debt is quickly created when these infrastructure specific items are coded into the scripts because you have to touch a lot of code to make small changes. You also end up with hidden dependencies

However, until recently, we have not given SRE teams an alternative to site customization.

Of course, the alternative requires some additional investment up front.  Hard coding and forking are faster out of the gate; however, the SRE mandate is to aggressively reduce ongoing maintenance tasks wherever possible.  When core automation is site customized, Ops loses the benefits of reuse both internally and externally.

That’s why we believe SRE teams work to reuse automation whenever possible.

rebar-1Digital Rebar was built from our frustration watching the OpenStack community struggle with exactly this lesson.  We felt that having a platform for sharing code was essential; however, we also observed that differences between sites made it impossible to share code.  Our solution was to isolate those changes into composable units.  That isolation allowed us take a system integration view that did not break when inevitable changes were introduced.

If you are interested in breaking out of the script customization death spiral then review what the RackN team has done with Digital Rebar.

Even if you don’t use the code, the approach could save your SRE team a lot of heartburn down the road.  Of course, if you do want to use it then just contact us at sre@rackn.com.

0 comments on “Surgical Ansible & Script Injections before, during or after deployment”

Surgical Ansible & Script Injections before, during or after deployment

RackN CEO, Rob Hirschfeld, has been posting about our unique composable operations approach with Digital Rebar to enable hybrid infrastructure and mix-and-match underlay tooling.

This post shows some remarkable flexibility enabled by the approach that allow operators to take limited, secure operations against running systems.

via Surgical Ansible & Script Injections before, during or after deployment. — Rob Hirschfeld

 

1 comment on “Bugs Bunny, Prince and Enabling True Hybrid Infrastructure Consumption”

Bugs Bunny, Prince and Enabling True Hybrid Infrastructure Consumption

OK- Stay with me on this. I’m drawing parallels again.  🙂

Like many from my generation, my initial exposure to classical music and opera was derived from Bugs Bunny on Saturday mornings (culturally deprived, I know). One of the cartoons I remember well was with Bugs trying to get even with the heavy-set opera singer who disrupts Bugs’ banjo playing. In order to exact his revenge, Bugs infiltrates the opera singer’s concert by impersonating the famous long-hared (hared…get it?) conductor, Leopold Stokowski. He proceeds to force the tenor to hit octaves that structurally compromise the amphitheater and as it crumbles leaves him bruised and battered. Bugs is as always, victorious.

bugs

In examining Bugs’ strategy (let’s assume he actually had one), Bugs took over operations of the orchestra’s musical program to achieve his goal of getting the tenor “in-line” so to speak. As I prepare to head down to the OpenStack Conference in Austin, TX next week, I’m seeing similar patterns develop in the cloud and data center infrastructure space which are very “Bugs/Leopold-like”. With organizations deciding on how to consolidate data centers, containerize apps and move to the cloud, vendors and open source technologies offer value, however true operational, infrastructure and platform independence are not what they appear to be. For example, once you move your apps off the data center to AWS or VMware and then later determine you are paying too much or the workload is no longer is appropriate for the infrastructure, good luck replicating the configuration work done on CloudFormation on another cloud or back in the data center. Same rationale is applicable to other technologies such as converged infrastructure and proprietary private cloud platforms. As the customer, to achieve scale and remove operational pain you must fall in line. That in itself is a big commitment to make in a still-evolving and maturing technology industry and a dynamic business climate.

On an unrelated topic, I was saddened to learn of the passing of Prince this past week. While not a die-hard fan, I liked his music. He was a great composer of songs and had a style all to his own. Beyond his music and sheer talent, I admired his business beliefs and deep desire to maintain creative ownership and control of his music and his brand.

princeDespite his fortune and fame, there was a period in the middle of Prince’s career in which he felt creatively and financially locked-in by the big record companies. Once Prince (and the unpronounceable symbol) broke away from Warner Music, he was able to produce music under his own label. This action enabled him to create music without a major record label dictating when he needed to produce a new album and what it needed to sound like. In addition, he was now able to market his new recordings to the distribution platform that supported his artistic and financial goals. While still having ties to Warner Music, he was no longer bound by their business practices. Along with starting his own music subscription service, Prince cut deals with Arista, Columbia, iTunes and Sony. Prince’s music production had operational portability, business agility and choice (seven Grammy awards and 100 million record sales also help create that kind of leverage.).

While open APIs and containers offer some portability, at RackN we believe they do not offer a completely free market experience to the cloud and infrastructure consumer. If the business decides it is paying too much for AWS, it should not allow for the operational underlay and configuration complexity to lock them to the infrastructure provider. They should be able to transfer their business to Google, Azure, Rackspace or Dreamhost with ease. We believe technologies that create portable, composable operational workflows drive true infrastructure and platform independence and as a benefit, reduces business risk. Choosing a platform and being forced to use it are two very different things.

In conclusion, when considering moving workloads to the cloud, converged infrastructure platforms or using DevOps automation tools, consider how you can achieve programmable operational portability and agility. Think about how you can best absorb new technologies without causing operational disruption in your infrastructure. Furthermore, ensure you can accomplish this in a repeatable, automated fashion. Analyze how you can abstract away complex configurations for security, networking and container orchestration technologies and make them adaptable from one infrastructure platform to another. Attempt to eliminate configuration versioning as much as possible and make upgrades simplistic and automated so your DevOps staff does not have to be experts (they are stressed out enough.).

If you are attending the OpenStack Conference this week, look me up. While I am far from a music expert, i’ll be happy to share with you my insights on how to spot a technology vendor that likes to play a purple guitar as opposed to one that eats carrots and plays the banjo.

-Dan Choquette: Co-Founder, RackN

 

 

 

1 comment on “From Start to Scale: learn faster with heterogenous deployments”

From Start to Scale: learn faster with heterogenous deployments

Why mix VMs and Physical? Having a consistent deploy approach can dramatically speed learning cycles that result in better scale ops. I would never deploy production OpenStack on VMs but I strongly recommend rehearsing that deployment on VMs hundreds of times before I touch metal.

Over the last two months, the RackN team redefined “heterogeneous” infrastructure in Digital Rebar from being “just” multi-vendor hardware to include any server resource from containers and Vagrant/Virtualbox to clouds like AWS or Packet. To support this truly diverse range, there were both technical and operational challenges to overcome.

The technical challenge rises from the fundamental control differences between cloud and physical infrastructure. In cloud, infrastructure is much more prescribed – you cannot change most aspects of your system and especially not your network interfaces or IPs. To provision hardware efficiently, we had to establish control over the very things that Cloud systems manage for you. 

That management diversity exercised the full extent of the Digital Rebar “functional ops” architecture.

Over the last year, we’ve been unwinding baked-in control assumptions from earlier versions of Digital Rebar. That added flexibility allows Digital Rebar to mix control APIs for infrastructure ranging from using Cobbler to Docker, Vagrant and AWS. Since we could already cope with heterogeneous control APIs using Digital Rebar’s unique functional ops design, we retained the ability to mix and match container, virtual and physical infrastructure.

The operational challenge was more subtle. We were motivated to make this change by first hand observations of the fidelity gap. I am a strong believer that container platforms will directly target metal in the next two years. The challenge is how do we get there from our current virtualization-focused infrastructure.

It’s easy to look at the completed work as an obvious step forward. Looking over my shoulder, I know that it took years of learning and perseverance to create a platform that was flexible enough to handle both extremes of control. Even more important was understanding why it was so important for a physical scale deployment platform to provide ops fidelity for developers too.

With the infrastructure work behind us, we’re seeing Digital Rebar deliver real operational transformation. We want to help IT embrace containers and immutable infrastructure without having to discard the hard won battles installing cloud and traditional infrastructure. Most critically, we hope that you’ll join our open community and share your operational journey with us.

%d bloggers like this: