Better metrics for data center optimization
By understanding the health, availability and risk of the data center and its underlying infrastructure, IT operations managers can make better decisions about optimization.
Ensuring data center availability has long been a challenge for IT operations because of silos and gaps among IT operations, security operations and facilities. These gaps must be closed to enable more accurate, holistic decision-making -- especially with respect to data center optimization.
The draft Data Center Optimization Initiative, released in November 2018, proposed several new metrics by which federal data center optimization efforts would be measured, including a new metric for data center availability. If mandated, this DCOI data center availability metric may introduce new challenges. Although facility availability can be measured as a single metric, that metric has proven considerably inaccurate and may, in fact, stifle agencies' ability to predict and resolve the issues that threaten the availability of the data center and of any interdependencies critical to the agency mission.
That is why federal agencies could benefit by measuring sub-metrics that represent the health, availability and risk of the data center and its underlying infrastructure. Taking this business service approach -- dynamically grouping components by geography, application type or technology stack -- to data center optimization can position an agency to predict and resolve problems faster to better ensure availability.
With a business service construct, collecting metrics on the health, availability and risk of the service's underlying IT components, along with dynamic, real-time mapping of the infrastructure and applications that enable the service, can give IT managers real-time operational views that support both isolation of root problems and identification of service impacts. It’s possible to abstract devices and "bubble up" the individual device and IT service metrics into composite metrics that represent the overall status of the business service. However, a presentation of sub-metrics can enable an executive- or management-level view of the business service that actually delivers a deeper understanding of the overall state of availability of the data center.
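As a rough illustration of how device-level metrics might "bubble up" into composite ones, here is a minimal Python sketch. The aggregation rules (average health, weakest-link availability, worst-case risk), the `roll_up` function and the component names are all assumptions made for illustration. The article does not prescribe them, and an operations team would define its own.

```python
from statistics import mean


def roll_up(components):
    """Combine per-component sub-metrics (each 0.0 to 1.0) into composite metrics.

    Assumed rules: health is averaged, availability is limited by the
    weakest component, and risk is taken from the riskiest component.
    """
    values = list(components.values())
    return {
        "health": mean(c["health"] for c in values),
        "availability": min(c["availability"] for c in values),
        "risk": max(c["risk"] for c in values),
    }


# Hypothetical business service built from two tiers.
service = {
    "web_tier": {"health": 0.75, "availability": 1.0, "risk": 0.25},
    "database": {"health": 1.0, "availability": 1.0, "risk": 0.0},
}

composite = roll_up(service)
print(composite)
```

Keeping the per-component dictionary alongside the composite is what would let a manager see that the web tier, not the database, is pulling the service's health down.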
Say an agency has four identical servers carrying the entire workload where a single server would function suitably. The three excess servers are essentially back-ups, ready for use in case one of the other systems fails. In this example, if one server fails, the service is still 100% available. The system’s health, however, degrades to 75%, which raises the risk to 25%. These metrics are important because they break down barriers that obstruct executive oversight of the business service. Formerly, data center managers might receive a single alert indicating that a server's CPU utilization level had fallen below a certain threshold. With more granular metrics, a utilization alert can automatically trigger the addition of another server or two to support more traffic, and it can auto-adjust the business service policies to recalculate new health, availability and risk metrics -- all without human intervention. Redundancy and self-healing features should be baked into each layer of the data center.
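The four-server arithmetic can be written out as a short Python sketch. The formulas are one reading that reproduces the article's numbers (health as the share of servers running, risk as its complement, availability as whether enough servers remain). The `ServerPool` class, the `on_utilization_alert` policy and the 0.8 utilization threshold are hypothetical names and values, not part of DCOI or any product.

```python
from dataclasses import dataclass


@dataclass
class ServerPool:
    total: int     # servers provisioned for the service
    required: int  # servers needed to carry the workload
    healthy: int   # servers currently running

    def health(self) -> float:
        # Share of provisioned servers that are running.
        return self.healthy / self.total

    def availability(self) -> float:
        # The service is up as long as enough servers remain.
        return 1.0 if self.healthy >= self.required else 0.0

    def risk(self) -> float:
        # Assumed to be the complement of health, as in the example.
        return 1.0 - self.health()

    def metrics(self) -> dict:
        return {
            "health": self.health(),
            "availability": self.availability(),
            "risk": self.risk(),
        }


def on_utilization_alert(pool, utilization, threshold=0.8):
    """Hypothetical policy: add a server when utilization exceeds the
    threshold, then recalculate the sub-metrics."""
    if utilization > threshold:
        pool.total += 1
        pool.healthy += 1
    return pool.metrics()


pool = ServerPool(total=4, required=1, healthy=4)
pool.healthy -= 1  # one of the four servers fails
after_failure = pool.metrics()
print(after_failure)  # health 0.75, availability 1.0, risk 0.25

after_scale_out = on_utilization_alert(pool, utilization=0.92)
print(after_scale_out)
```

In this sketch, the policy function stands in for the automated response described above: the alert adds capacity and the sub-metrics are recomputed in the same step, with no operator involved.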
When it comes to data center optimization, there is no be-all, end-all definition of health, availability and risk. IT operations teams can define them and create automation and event policies as needed. As more software-defined services, artificial intelligence, machine learning and advanced analytics move into the data center, IT ops teams will have more ways to capture actionable IT insights, understand the interdependencies between infrastructure and applications and automate manual tasks to drive efficiency. A topology mapping approach between business processes and the systems that run them promotes automation -- including remediation, configuration management database enhancement and advanced incident enrichment -- resulting in less management, maintenance and troubleshooting.