Industry Analysis: AWS Outposts

This is the first interview of a multi-part series about AWS Outposts and its impact on the IT industry. While not the first data center appliance on the market, Outposts represents the most extreme version of tethered, remotely managed infrastructure because the control plane for the system remains with AWS and the hardware itself is manufactured by AWS.

In the series, we’ll dig deeply into how the technology works and its impact on the broader IT market.

 

Via Stark & Wayne: PXE BOOT COREOS USING DIGITAL REBAR

This week, premier Cloud Foundry consultancy Stark & Wayne posted about their Digital Rebar integration for CoreOS, Cloud Foundry, and Kubernetes. This work integrates with their DRP MoltenCore content pack.

Having just a plain CoreOS machine is not as useful in itself, however CoreOS can be used as a building block for building clusters. DRP exposes some powerful primitives for orchestrating common cluster patterns, through the use of shared profiles which are updated by the machines themselves. This simple primitive can be used to implement master election, rolling updates, and distribution of bootstrap token.  – Roben Koster

For more details, please see https://starkandwayne.com/blog/rackn-hd-bosh/
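
To make the shared-profile primitive concrete, here is a minimal Go sketch of one machine trying to claim cluster leadership through a profile on the DRP server. The profile name (cluster-leader), the param name (cluster/leader), and the environment variables are illustrative assumptions, not details from the Stark & Wayne post; the DRP API does accept RFC 6902 JSON Patch, which enables the test-then-add pattern shown here.

```go
package main

import (
	"bytes"
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// profile mirrors only the DRP profile fields this sketch needs.
type profile struct {
	Name   string                 `json:"Name"`
	Params map[string]interface{} `json:"Params"`
}

func main() {
	endpoint := os.Getenv("RS_ENDPOINT") // e.g. https://drp.example.local:8092 (assumed)
	token := os.Getenv("RS_TOKEN")       // DRP API token (assumed)
	self, _ := os.Hostname()             // this machine's identity in the election

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // lab use; DRP self-signs by default
	}}

	// Read the shared profile that every cluster member watches.
	req, _ := http.NewRequest("GET", endpoint+"/api/v3/profiles/cluster-leader", nil)
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var p profile
	if err := json.NewDecoder(resp.Body).Decode(&p); err != nil {
		panic(err)
	}
	if p.Params == nil {
		p.Params = map[string]interface{}{}
	}
	if leader, ok := p.Params["cluster/leader"]; ok {
		fmt.Println("leader already elected:", leader)
		return
	}

	// Claim leadership atomically: the "test" op makes the whole PATCH fail
	// if another machine changed Params after we read them ("~1" escapes "/").
	patch := []map[string]interface{}{
		{"op": "test", "path": "/Params", "value": p.Params},
		{"op": "add", "path": "/Params/cluster~1leader", "value": self},
	}
	body, _ := json.Marshal(patch)
	req, _ = http.NewRequest("PATCH", endpoint+"/api/v3/profiles/cluster-leader", bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	resp2, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp2.Body.Close()
	if resp2.StatusCode == http.StatusOK {
		fmt.Println("won the election:", self)
	} else {
		fmt.Println("lost the election; another machine claimed leadership")
	}
}
```

Machines that lose simply retry or watch the profile; rolling updates and bootstrap-token distribution follow the same read-test-patch loop.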

Digital Rebar Platform v4.2 Release

These release notes span the v4.1 and v4.2 releases. The focus is on features, not the numerous fixes for bugs and minor issues.

4.1 Release Items

  • Contexts – a major feature that allows workflows to transition between runner instances on the machines themselves and other locations (such as Docker containers)
  • Run without root permissions – enables operation in highly secure environments.
  • Tasks can have expanded Role and Claim rights – allows tasks to have a special scope of operations inside DRP
  • Transaction improvements to the WAL (write-ahead log)
  • DRPY runner – Python version of the runner for ESXi environments
  • Ansible Inventory uses Profiles – 100% mapping between DRP and Ansible groups.
  • Redfish Support – IPMI-style out-of-band integration for newer hardware models (see the sketch after this list)
  • ESXi Installation – VMware Integration
  • UX Custom Columns – allows users to define columns in the machine view
  • UX Expand View Filters – expands filters to leverage all API-supported filtering
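
As context for the Redfish item above, the sketch below queries a BMC directly using the DMTF Redfish API, the standard that replaces raw IPMI on newer hardware. This is not DRP’s plugin code; the BMC address and credentials are placeholders. Every Redfish service publishes its systems collection at a well-known path:

```go
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // BMCs commonly use self-signed certs
	}}

	// The Redfish systems collection lives under a standard root.
	req, _ := http.NewRequest("GET", "https://bmc.example.local/redfish/v1/Systems", nil)
	req.SetBasicAuth("admin", "password") // placeholder credentials
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var collection struct {
		Members []struct {
			ID string `json:"@odata.id"`
		} `json:"Members"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&collection); err != nil {
		panic(err)
	}
	for _, m := range collection.Members {
		fmt.Println("system:", m.ID) // e.g. /redfish/v1/Systems/1
	}
}
```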

4.2 Release Items

  • JQ integrated into the DRP CLI – eliminates the need to install jq as a standalone requirement
  • BSDTAR integrated into the DRP Server – eliminates the need to install bsdtar on the server
  • Human/table output formats for the DRP CLI – makes it easier to read DRP output
  • DRP CLI can install itself as a persistent agent – from script contexts, this makes it easier to run the DRP runner as a service
  • Plugins use Websocket events – migrates from the legacy event model to improve stability (see the event-stream sketch after this list)
  • UX Save in Header – makes it easier to have multiple edits on the same item
  • UX diff view – allows operators to compare new vs. installed content (fixes a regression)
  • Updated license retrieval and install process – streamlines the process so that multiple users do not need RackN logins once a license is installed
  • UX bulk actions for Endpoints – useful for multi-site management
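
For the Websocket events item, here is a rough Go sketch of what consuming a DRP event stream over a websocket can look like. Treat the endpoint path (/api/v3/ws) and the "register" message format as assumptions for illustration rather than verified plugin internals; only the gorilla/websocket calls are guaranteed as written.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"

	"github.com/gorilla/websocket" // go get github.com/gorilla/websocket
)

func main() {
	dialer := websocket.Dialer{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // lab use only
	}
	header := http.Header{}
	header.Set("Authorization", "Bearer "+os.Getenv("RS_TOKEN")) // assumed token env var

	// Assumed event-stream endpoint on the DRP server.
	conn, _, err := dialer.Dial("wss://drp.example.local:8092/api/v3/ws", header)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Subscribe to machine events; the "register type.action.key" form is an assumption.
	if err := conn.WriteMessage(websocket.TextMessage, []byte("register machines.*.*")); err != nil {
		panic(err)
	}
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			panic(err)
		}
		fmt.Println("event:", string(msg))
	}
}
```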

Beta Items

  • Bootstrap Agent integrated into DRP Server – allows for operations on the DRP server itself without using the Dangerzone context (v4.2)

 

Server firmware is fragile! Why haven’t we fixed it? The HPE SSD patch challenge.


Last week, HPE alerted us to a critical firmware bug that would cause SSDs to fail after 3.7 years of use. The problem is not that there was a firmware bug (those happen all the time) but how little HPE provided to fix the problem: it basically throws the fix back to the operator to run scripts on each server.

SSDs are an example of firmware that is not typically managed by vendor out-of-band tooling (HPE iLO, Dell DRAC). PDU, fan, GPU, FPGA, and NIC firmware are other examples of deep firmware that must be managed.

When we formed RackN 5 years ago, we thought that operators would want to make sure their server firmware was patched; however, in spite of a string of critical patches, firmware continually falls below the threshold of must-manage software. This is remarkable because firmware is a well-known performance and failure point for infrastructure.

We’ve spent the last few years trying to understand why hardware is so often ignored in the infrastructure equation, and we refuse to accept “it’s too hard” as the answer.

Instead, our industry has failed to treat firmware as part of an integrated infrastructure stack. We’ve accepted poor APIs and controls from vendors. We’ve fostered a belief that virtualization shields us from hardware. We’ve accepted the cloud’s “you’re not smart enough to run a data center” marketing. We’ve allowed these fallacies to make us complacent.

We need to treat firmware patching as part of a continuously integrated data center mindset, one where we can perform source-controlled migrations of systems that allow us to safely and quickly make deep changes as part of regular deployment cycles.

Fully integrated provisioning is not fantasy: at RackN, we’re helping operators implement exactly these processes. It’s not magic, but it does require system-level processes that prioritize workflows, inventory, discovery, and integration to coordinate the physical and application lifecycles. It’s time for everyone to start thinking this way, because this SSD patch is just the latest in an increasingly urgent string of firmware updates.
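
As a sketch of what “firmware in the deployment cycle” can look like with Digital Rebar, the snippet below assigns a firmware workflow to a machine by patching its Workflow field through the DRP API. The workflow name (firmware-flash) and environment variables are hypothetical; starting automation by setting a machine’s Workflow is the general DRP model.

```go
package main

import (
	"bytes"
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	endpoint := os.Getenv("RS_ENDPOINT")  // DRP server URL (assumed)
	token := os.Getenv("RS_TOKEN")        // DRP API token (assumed)
	machine := os.Getenv("MACHINE_UUID")  // target machine's UUID (assumed)

	// Point the machine at a hypothetical firmware-flash workflow; the runner
	// on the machine picks up the new workflow and executes its tasks.
	patch := []map[string]interface{}{
		{"op": "replace", "path": "/Workflow", "value": "firmware-flash"},
	}
	body, _ := json.Marshal(patch)

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // lab use only
	}}
	req, _ := http.NewRequest("PATCH", endpoint+"/api/v3/machines/"+machine, bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("workflow assignment status:", resp.Status)
}
```

The point is that this is an API call in a pipeline, not a one-off script run by hand on each server.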

YES – VM + Containers can be faster than Bare Metal!

Pat Gelsinger, VMware CEO, said during the VMworld keynote that VM-managed containers could be 8% faster than bare metal (@kitcolbert). On the surface, this comment defies logic: bare metal should be the theoretical limit for performance, like the speed of light. While I don’t know the specifics of his test case, the claim of improved performance is credible. Let’s explore how.

RackN specializes in bare metal workloads, so let me explain how, in the right cases, containers in VMs can benchmark faster than containers alone.

The crux of the argument comes down to two factors:

  1. Operating systems degrade when key resources are depleted
  2. CPUs are optimized for virtualization (see NUMA architecture)

Together, these factors conspire to make VMs a design necessity on large bare metal systems.

A system with large RAM and many CPU cores can become saturated by container workloads even in the tens of containers. In these cases, the operating system’s cost of managing resources starts to take away from workload performance. Since typical hypervisor hosts have a lot of resources, the risk of oversaturation is very high.

The solution on high-resource hosts is to leverage a hypervisor to partition the resources into multiple operating system instances. That eliminates oversaturation and improves throughput for the host. We’re talking about 10 VMs with 10 containers each instead of 1 host with 100 containers, as sketched below.
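
Here is a small illustration of the partitioning idea: instead of letting all containers fight over one large host kernel, pin groups of containers to separate NUMA nodes, approximating what a hypervisor does when it carves the host into right-sized VMs. The --cpuset-cpus and --cpuset-mems flags are real Docker options; the CPU ranges and image name are placeholder assumptions for a two-socket host.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Assume a 2-socket host: NUMA node 0 owns CPUs 0-15, node 1 owns CPUs 16-31.
	nodes := []struct {
		id   string
		cpus string
	}{
		{"0", "0-15"},
		{"1", "16-31"},
	}

	for i := 0; i < 10; i++ {
		n := nodes[i%len(nodes)]
		// Keep each container's CPUs and memory on a single NUMA node so the
		// kernel never has to service cross-node memory traffic for it.
		cmd := exec.Command("docker", "run", "-d",
			"--cpuset-cpus", n.cpus,
			"--cpuset-mems", n.id,
			"myapp:latest") // placeholder image
		out, err := cmd.Output()
		if err != nil {
			fmt.Println("docker run failed:", err)
			continue
		}
		fmt.Printf("container %d pinned to NUMA node %s: %s", i, n.id, out)
	}
}
```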

In addition to simple partitioning, most CPUs are optimized for virtualization. That means they can run multiple virtualized operating systems on the same host with minimal overhead. A non-virtualized host does not get to leverage this optimization.

Due to these factors, and with the right tuning, it would be possible to demonstrate improved container performance on hosts that were optimized for running a hypervisor. The same would not hold true for systems that are sized only for container workloads. Since container-optimized machines are also much cheaper, the potential performance gain is likely not a good ROI.

While bare metal’s day will eventually come, this strange optimization reinforces why we expect hypervisors to continue to be desired in container deployments for a long time.

RackN Digital Rebar v4 Release

RackN is excited to announce v4 of Digital Rebar.  In this fourth generation, we recognize the project’s evolution from provisioner into a data center automation platform.  While the external APIs and behavior are NOT changing (v4.0 and v3.13 are API compatible), we are making important structural changes that allow for significant enhancements to scale, performance and collaboration.

This important restructuring enables Digital Rebar to expand as a community platform with open content.

RackN is streamlining and inverting our license model (see chart below) by making our previously closed content and plugins open and taking commercial control of our implementation behind the open Digital Rebar APIs.  This change empowers both expanding community and sustaining commercial engagement.

Over the last 2 years, our operator community around Digital Rebar and RackN content has frequently asked for source access to our closed catalog of content and plugins while leaving core development exclusively to RackN.  We believe that providing open, collaborative access is critical for the growth of the platform; consequently, Generation 4 makes these critical components available.

RackN will curate and support foundational content packs and plugins in the open. These include operating system templates, out-of-band management (IPMI), image deploy, classification, and our server life-cycle (RAID/BIOS) components. We will be working with the Digital Rebar community to determine the right management model for items in this broad portfolio of content.

The RackN implementation powering the Digital Rebar API will be significantly enhanced in this generation.

We are integrating powerful new capabilities like multi-site management, 10,000+ machine scale, single sign-on, and deeper security directly into the implementation. These specialized enhancements to the platform enable an even broader range of applications. Since the community has focused on contributing to platform content instead of the core, we believe this is the right place to manage the commercial side of sustaining the platform.

Please note that our basic commercial models and pricing plans are not changing. We are still keeping Digital Rebar and RackN binaries free for operators with up to 20 machines. The change also allows us to enable new classes of use such as non-profit, academic, and service provider.

Thank you to the Digital Rebar community!  We are handing you the keys to revolutionized data center operations.  Let’s get started!

Generation 4 clarifies the boundary between open and proprietary:

| Layer    | Component                                     | Generation 3        | Generation 4        |
|----------|-----------------------------------------------|---------------------|---------------------|
| Content  | Platforms, Apps, Etc.                         | Mixed               | Digital Rebar APLv2 |
| Ops      | Best Practices                                | RackN               | Digital Rebar APLv2 |
| O/S      | Image Deployer                                | RackN               | Digital Rebar APLv2 |
|          | OOB & Firmware                                | RackN               | Digital Rebar APLv2 |
| Utility  | Workflows                                     | RackN               | Digital Rebar APLv2 |
| O/S      | Install Templates                             | Mixed               | Digital Rebar APLv2 |
| Platform | Agents (Runner)                               | Digital Rebar APLv2 | Digital Rebar APLv2 |
|          | CLI                                           | Digital Rebar APLv2 | Digital Rebar APLv2 |
|          | Sledgehammer                                  | Digital Rebar APLv2 | Digital Rebar APLv2 |
|          | API                                           | Digital Rebar APLv2 | Digital Rebar APLv2 |
|          | Enterprise Extensions (Multi-Site, SSO, RBAC) | RackN               | RackN               |
|          | Web UX                                        | RackN               | RackN               |
|          | API Implementation                            | Digital Rebar APLv2 | RackN               |
|          | Commercial Support                            | RackN               | RackN               |

RackN founding member of LF Edge

Today, the Linux Foundation announced LF Edge to help build open source platforms and community around Edge Infrastructure. We’re just at the start of the effort with projects and vendors working in parallel without coordination. We think this LF Edge initiative is important because the Edge is already a patchwork of different data centers, infrastructure and platforms.

RackN has been designing our DEEP Edge federated control around open Digital Rebar because we know that having a shared abstraction at the physical layer makes everything we build above easier to sustain and more resilient.

We’re excited to be part of this effort and hope that you’ll collaborate with us to build some amazing Edge capabilities.

Note: We’re also watching the OpenStack Foundation’s Edge work. Hopefully, we’ll see collaboration between these groups since there are already overlapping projects.

Podcast – Rob Lalonde on HPC in the Cloud, Machine Learning and Autonomous Cars

Joining us this week is Rob Lalonde, VP & General Manager, Navops at Univa.

About Univa

Univa is the leading independent provider of software-defined computing infrastructure and workload orchestration solutions.

Univa’s intelligent cluster management software increases efficiency while accelerating enterprise migration to hybrid clouds. We help hundreds of companies to manage thousands of applications and run billions of tasks every day.

Highlights

  • 1 min 6 sec: Introduction of Guest
  • 1 min 43 sec: HPC in the Cloud?
    • Huge migration of workloads to public clouds for additional capacity
    • Specialized resources like GPUs, massive memory machines, …
  • 3 min 29 sec: Cost perspective of cloud vs local HPC hardware
    • Primarily a burst to cloud model today
  • 5 min 10 sec: Good for machine learning or analytics?
  • 5 min 40 sec: What does Univa and Navops do?
    • Cloud cluster automation
  • 7 min 35 sec: Role of Scheduling
    • Job layer & infrastructure layer
    • Diversity of jobs across organizations
  • 9 min 30 sec: Machine learning impact on HPC
    • Survey of users ~ results
      • Machine learning not yet in production ~ still research
      • HPC very much linked to machine learning
      • Cloud and hybrid cloud usage is very high
      • GPU usage for machine learning
  • 15 min 09 sec: GPU discussion
    • Similar to early cloud stories
  • 16 min 00 sec: Concurrency in operations in HPC & machine learning
    • Workload dependency ~ weather modeling
  • 18 min 12 sec: Do people bring workloads in-house after running in the cloud?
    • Sophistication in what workloads work best where
    • HPC is very efficient ~ 1 million cores on Amazon: successful when AWS called about taking all their resources for other customers 🙂
  • 23 min 56 sec: Autonomous cars discussion
    • Processing in the car or offloaded?
    • Oil and gas exploration example (edge infrastructure example)
      • Pre-process data on the ship, then upload via satellite to find the required information
  • 29 min 12 sec: Is Kubernetes in the HPC / machine learning world?
    • KubeFlow project
  • 35 min 8 sec: Wrap-Up

Podcast Guest: Rob Lalonde, VP & General Manager, Navops

Rob Lalonde brings over 25 years of executive management experience to lead Univa’s accelerating growth and entry into new markets. Rob has held executive positions in multiple, successful high tech companies and startups. He possesses a unique and multi-disciplined set of skills having held positions in Sales, Marketing, Business Development, and CEO and board positions. Rob has completed MBA studies at York University’s Schulich School of Business and holds a degree in computer science from Laurentian University.

Podcast – Dave Blakey of Snapt on Radically Different ADC

Joining us this week is Dave Blakey, CEO and Co-Founder of Snapt.

About Snapt

Snapt develops high-end solutions for application delivery. We provide load balancing, web acceleration, caching and security for critical services.

Highlights

  • 1 min 28 sec: Introduction of Guest
  • 1 min 59 sec: Overview of Snapt
    • Software solution
  • 3 min 1 sec: New Approach to Firewalls and Load Balancers
    • Driven by customers with micro-services, containers, and dynamic needs
    • Fast scale and massive volume needs
    • Value is in quality of service and visibility into any anomaly
  • 7 min 28 sec: Engaging with DevOps teams for Customer interactions
    • Similar tools across multiple clouds and on-premises drives needs
    • 80% is visibility and 20% is scalability
    • Podcast – Honeycomb Observability
  • 13 min 09 sec: Kubernetes and Istio
    • Use cases remain the same independent of the technology
    • Difference is in the operations not the setup
    • Istio is an API for Snapt to plug into
  • 17 min 29 sec: How do you manage globally delivered application stack?
    • Have to go deep into app services to properly meet demand where needed
    • Immutable deployments?
  • 25 min 24 sec: Eliminate Complexity to Create Operational Opportunity
  • 26 min 29 sec: Corporate Culture Fit in Snapt Team
    • Built Snapt as they needed a product like Snapt
    • Feature and Complexity Creep
  • 28 min 48 sec: Does platform learn?
  • 31 min 20 sec: Lessons about system communication times
    • Lose 25% of audience per 1 second of website load time
  • 34 min 34 sec: Wrap-Up

Podcast Guest: Dave Blakey, CEO and Co-Founder of Snapt

Dave Blakey founded Snapt in 2012 and currently serves as the company’s CEO.

Snapt now provides load balancing and acceleration to more than 10,000 clients in 50 countries. High-profile clients include NASA, Intel, and various other forward-thinking technology companies.

Today, Dave has evolved into a leading open-source software-defined networking thought leader, with deep domain expertise in high performance (carrier grade) network systems, management, and security solutions.

He is a passionate advocate for advancing South Africa’s start-up ecosystem and expanding the global presence of the country’s tech hub.

Podcast – Antonio Pellegrino on Infrastructure as Software at the Edge

Joining us this week is Antonio Pellegrino, Founder & CEO at Mutable.

About Mutable

Mutable helps software developers create scalable and fast web services by automating DevOps and providing edge technology all around the world.

Highlights

  • 0 min 35 sec: Introduction of Guest
  • 1 min 09 sec: Are your products edge-focused or just a market to promote to?
    • Building products to meet customers need for containers including edge
  • 2 min 58 sec: Sounds like a platform?
    • Developers create policies on latency, hardware needs, etc
  • 5 min 51 sec: Be aware of the location and structure of deployment hardware
    • Includes management and tracking of network traffic
    • Services see everything as one network even though it isn’t
  • 7 min 49 sec: Why not Kubernetes instead?
    • Kubernetes built for single network
  • 9 min 16 sec: Edge vs Core
    • Temporary services are a big factor at the edge
    • HBO and Game of Thrones example for supporting access to the feed
  • 13 min 45 sec: Edge is not infinitely elastic
    • Facial recognition example on iPhone
  • 17 min 53 sec: Algorithm to spin up services when needed not always available
  • 19 min 52 sec: How does software learn what to spin up and down
  • 21 min 22 sec: How do you manage storage?
    • Follows standard S3 storage model
  • 24 min 15 sec: NICs are a core component ~ language for packet management
    • Sounds like a micro-kernel
    • Why don’t people leverage this concept more often?
  • 28 min 32 sec: Manipulate and manage networking at application layer
    • Mutable is building a full stack
    • Operator approach vs developer approach on networking
  • 33 min 21 sec: Control Plane challenges
  • 34 min 27 sec: What is the starting experience with Mutable like?
  • 35 min 55 sec: How is cable industry responding to edge computing?
    • Opportunity to deploy 5G
  • 39 min 23 sec: Wrap-Up

Podcast Guest: Antonio Pellegrino, CEO and Founder of Mutable

Antonio is the CEO and founder of Mutable, the next-generation platform as a service for microservices and distributed computing. He is a serial entrepreneur and ran one of the largest e-sports streaming companies of its time. Antonio has built startups centered around developer tools for the past 8 years, saving their customers millions of dollars. Mutable has been a driver of microservices and distributed computing as we know it, allowing developers to push applications to the edge.
