KPI Categories
Infrastructure operations should align to the following general business priorities:
- Agility – The measure of adaptability. How well can your systems embrace change?
- Security – How protected and up-to-date a system is. Are you able to respond to threats?
- Reliability – All about the stability of the ‘as a service’ experience. Is your system’s foundation rock-solid?
- Consistency – The ability to scale. Are you able to produce predictable results?
- Repeatability – The core of automation. Can you reuse code and keep conformity?
- Delivery – The ultimate litmus test. How quickly can you go from concept to execution?
How to use this information
The KPIs presented here are intended as a menu of measures for discussion. Businesses should identify one top-priority area (see the categories above) and focus their efforts there rather than selecting a broad set of objectives.
Outside of measurement, the list below may help operational teams reposition themselves by being more realistic about their performance in key areas. We’ve found that many teams accept poor results from automation by believing that the current state is as good as it can be. The list below is designed to expose overlooked areas of improvement that can have disproportionate impact.
Listen to CEO Rob Hirschfeld give advice on choosing and improving KPIs.
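One way to turn the advice about picking a single priority area into something concrete is to score each measured KPI by how far it has moved from the "Typical" value toward the "Target" value, then average those scores per category. The sketch below is only an illustration: the `Kpi` class, the `progress` and `weakest_category` helpers, the linear scoring, and the measured values are all assumptions made for this example; only the Target and Typical figures come from the tables on this page.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    category: str
    name: str
    target: float    # "Target" column from the KPI tables
    typical: float   # "Typical" column from the KPI tables
    measured: float  # your own measurement (hypothetical values below)


def progress(kpi: Kpi) -> float:
    """Return 0.0 at the typical value and 1.0 at the target, clamped to [0, 1].

    Works whether lower or higher numbers are better, because the sign of
    (target - typical) encodes the direction of improvement.
    """
    span = kpi.target - kpi.typical
    if span == 0:
        return 1.0
    raw = (kpi.measured - kpi.typical) / span
    return max(0.0, min(1.0, raw))


def weakest_category(kpis: list[Kpi]) -> str:
    """Return the category with the lowest average progress score."""
    by_category: dict[str, list[float]] = {}
    for kpi in kpis:
        by_category.setdefault(kpi.category, []).append(progress(kpi))
    averages = {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}
    return min(averages, key=averages.get)


# Measured values are made up for illustration.
kpis = [
    Kpi("Agility", "Lead Time for Changes (days)", target=1, typical=15, measured=8),
    Kpi("Agility", "% re-creatable environments", target=90, typical=5, measured=47.5),
    Kpi("Reliability", "MTTR (hours)", target=1, typical=8, measured=6.25),
    Kpi("Reliability", "Reliability of 2nd run (%)", target=99, typical=75, measured=81),
]

print(weakest_category(kpis))  # Reliability
```

A linear score is a simplification. Teams may prefer to weight KPIs differently or to choose the priority area through discussion, which is how this page frames the list.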
Agility KPIs | Target | Typical | Why Important? |
---|---|---|---|
Lead Time for Changes (LTTC) | 1 day | 15 days | DORA metric. Fast change times enable faster iteration and learning. In production, it enables faster patch and update. |
Average System Age | 30 days | 120 days | Faster turnover rates represent confidence in automation and reduced customization |
# of operations platforms | 1 | 10 | Each platform represents specialized skills and management interfaces. |
% vendor locked infra | 20% | 80% | Reduced vendor risk, improved portability, easier talent acquisition |
% re-creatable environments | 90% | 5% | Improved fault tolerance, improved dev/test/prod movement, and improved reusability. |

Security KPIs | Target | Typical | Why Important? |
---|---|---|---|
Patches lag behind | 1 minor | 2 major | Being behind creates risk since maintainers do not typically backport fixes between major versions. It can take significant effort to upgrade between major versions. |
% exceptions from standard | 5% | 50% | Any exception from standards requires additional management and creates exposure risk |
Time since patch | 1 week | 12 weeks | Patches are incremental, low-risk updates that often address security issues. |
% manual effort to remediate | 5% | 80% | Requiring manual effort when fixing (remediating) issues is toil, pulls people off other priorities, and dramatically extends the time to fix known issues. |

Reliability KPIs | Target | Typical | Why Important? |
---|---|---|---|
Mean Time to Recovery (MTTR) | 1 hour | 8 hours | DORA metric. System outage has both direct (cannot work) and indirect (work deferred) impacts in business delivery. |
½ life of automation | 18 months | 3 months | Automation that frequently requires updates and fixes is less likely to be used |
% Idempotence | 90% | 20% | Automation that changes behavior or makes unexpected changes creates risk for the operator and requires human attention and monitoring. |
Reliability of 2nd run | 99% | 75% | Similar to idempotence but easier to measure. Rerunning automation should either complete or fail cleanly, and should never harm a system. |

Consistency KPIs | Target | Typical | Why Important? |
---|---|---|---|
Change Failure Rate (CFR) | 10% | 25% | DORA metric. Teams must be confident that they can change systems smoothly, or they will resist making changes or batch them together. |
Workflow success rate | 99% | 80% | To build a strong foundation, operators need to be confident that automated requests will complete. Failures typically require toil or manual effort to correct. |
Last System Scan freshness (CMDB) | 7 days | 90 days | Lack of system awareness makes it difficult to plan and execute change within the system. Key in evaluating drift. |
Prod variance from test | 5% | 50% | Production sites should closely match test and staging environment(s) |
Multi-site variance | 5% | 30% | Production sites should be tightly managed so that they use the same components, even if they use different vendors. |

Repeatability KPIs | Target | Typical | Why Important? |
---|---|---|---|
% custom automation (per team) | 5% | 90% | Custom automation cannot be holistically maintained when errors or changes are discovered. This causes toil, hidden security risks, personal lock-in and makes it difficult to promote standards. |
Effort to Reset | 1 hour | 40 hours | Resets are toil. Fast resets free teams to iterate and test before committing changes. |
% conforming day 1 | 99% | 50% | Systems should start in a conforming state when delivered. |
% conforming day 90 | 95% | 25% | Automated processes should ensure conformance is maintained |

Delivery KPIs | Target | Typical | Why Important? |
---|---|---|---|
Delivery Backlog | 1 day | 8 weeks | Slow delivery of infrastructure encourages internal customers to bypass operations, creating silos and technical debt for compliance and management. |
Avg Time to verify | 10 min | 3 days | Many operational errors arise when systems do not match expectations. Being able to verify systems before changes and also on an ongoing horizon (drift) is critical to consistent operations. |
Avg Time to secure | 10 min | 1 day | Bringing up or changing systems, or responding to zero-days, can leave systems exposed. It is important to keep that window as short as possible. |
Configuration issues caught in Test | 90% | 10% | Being able to prevent the escape of critical issues into production dramatically improves systems resilience. It also helps teams be more confident in system automation and changes. |
Delivery to In Use Time | 1 hour | 45 days | Beyond avoiding the lost value of an idle asset, putting systems into use promptly allows teams to better coordinate onboarding and to quickly find and resolve issues with new infrastructure while attention is focused on the delivery. |
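The three DORA metrics in the tables above (Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery) can be computed from simple change and incident records. The following is a minimal sketch, not a definitive implementation: the `Change` and `Incident` record shapes, the function names, and the sample data are hypothetical, and it uses a plain arithmetic mean where some teams prefer a median.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Change:
    committed: datetime  # when the change was committed
    deployed: datetime   # when it reached production
    failed: bool         # True if it caused a failure needing remediation


@dataclass
class Incident:
    started: datetime    # when service was degraded
    restored: datetime   # when service was restored


def lead_time_for_changes(changes: list[Change]) -> timedelta:
    """Mean time from commit to deployment (LTTC)."""
    seconds = mean((c.deployed - c.committed).total_seconds() for c in changes)
    return timedelta(seconds=seconds)


def change_failure_rate(changes: list[Change]) -> float:
    """Fraction of changes that caused a failure (CFR)."""
    return sum(c.failed for c in changes) / len(changes)


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Mean time from incident start to service restoration (MTTR)."""
    seconds = mean((i.restored - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Hypothetical sample records for illustration only.
t0 = datetime(2024, 1, 1)
changes = [
    Change(t0, t0 + timedelta(days=1), failed=False),
    Change(t0, t0 + timedelta(days=3), failed=True),
    Change(t0, t0 + timedelta(days=2), failed=False),
    Change(t0, t0 + timedelta(days=2), failed=False),
]
incidents = [
    Incident(t0, t0 + timedelta(hours=2)),
    Incident(t0, t0 + timedelta(hours=6)),
]

print(lead_time_for_changes(changes))      # 2 days, 0:00:00
print(change_failure_rate(changes))        # 0.25
print(mean_time_to_recovery(incidents))    # 4:00:00
```

In this made-up data set the lead time of 2 days is close to the 1 day target, while the 25% change failure rate matches the "Typical" column rather than the 10% target.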
Customer Journey
Walk through the steps and stages of the integration and use of Digital Rebar.
Case Studies
Read through stories about how, by using Digital Rebar, RackN customers improved target metrics by up to 10x.