Automating the zombie roundup

Part 2: One or two weapons will not be enough. By Andy Lawrence, Vice President of Research for Datacentre Technologies and Eco-Efficient IT at 451 Research.


In the first instalment of this two-part article, which appeared in the November 2014 issue of Data Centre Solutions, we discussed the problem of zombie or comatose servers – equipment that does little work but is supported and managed at some expense – and considered why few software tools exist, or are used, to help solve it. This article describes in more detail the software tools required, and why they are more powerful together, as part of a platform, than as isolated tools.

Identifying and eliminating zombie servers is difficult in the dark, especially if the hunter has only a couple of simple weapons. But once management information, software systems and processes are in place, it can become a relatively straightforward and regular activity that saves datacentres substantial amounts, in both dollars and power. The most successful datacentre operators will be those that invest strategically in datacentre infrastructure management (DCIM) and other integrated tools, reaping multiple benefits.
One thing is clear: it is possible to eliminate zombie servers by adopting good processes and using the right tools. So what software is needed to maintain a zombie-free environment? We have identified the following:
· Server-level power monitoring. If a server is using power, then the IT or datacentre staff – and most importantly, any higher-level management tool – knows that the server exists. Power consumption is a sign of how much a server is actually being used (although this data can be misleading and needs to be analysed), as well as the cooling required. This is a basic function of DCIM monitoring systems, although not all DCIM systems have monitoring capabilities and not all deployments measure consumption down to the server level (which can be expensive).
· An asset management system. Asset management, again part of most DCIM products, shows the server's specifications, physical location in the datacentre, power use and deployment history. It may also show ownership and history; this is essential for identifying whether a server is current and needed, and for managing any future steps.
· An IT monitoring or management tool. To know whether a server is actually doing anything, it is necessary to read processor and system activity, preferably in real time. This is where tools such as Intel's DCM (Data Center Manager), IBM's Systems Director, Dell's Active System Manager and HP's Insight Manager are useful, although it is also possible to capture activity using open source tools such as Nagios. Intel's DCM can provide power data, carry out some power control and report system activity, making it particularly useful in this context.

Leading DCIM suites are now able to map system activity to power consumption and location; these include StruxureWare (Schneider Electric), which incorporates Intel DCM, and Emerson Network Power's Trellis, which is designed to monitor both the IT and physical infrastructure together. Other products are now also able to integrate, including from JouleX (Cisco), FieldView Solutions, iTRACS (CommScope), Nlyte, Modius, Optimum Path, Power Assure, sEnergy Thermal (Nortek) and TSO Logic. (Most of these are DCIM suppliers; TSO is a datacentre software firm that offers a component of DCIM.)
· IT systems management tools integration. Most DCIM systems cannot show which applications are running, who is responsible for them, or what resources each activity is using; ideally, therefore, the physical and power information held in DCIM should be mapped to, or integrated with, an IT asset management system and probably VM orchestration tools as well. This enables applications to be mapped directly to important data, such as who owns and manages equipment and applications, and virtual and physical server use and power consumption.

Integrations between DCIM and IT management software are only just emerging. Companies such as Sentilla with its IT infrastructure intelligence have long been working on this; CA Technologies, which has both CA DCIM and a portfolio of IT management products, is a pioneer in this area; Emerson is also working toward this with a number of tools and partner applications running on its Trellis platform – it also has an alliance with IBM.
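The mapping described above amounts to a join between the DCIM view of an asset and the IT inventory view. A minimal sketch of that join, with hypothetical record shapes and sample data (none of it drawn from any of the products named here):

```python
# Sketch: joining DCIM power/location records with an IT asset inventory so
# that each server's power figure can be traced to an owner and applications.
# The record shapes and sample data are hypothetical illustrations.

dcim_records = {
    # asset_id -> physical and power data held by the DCIM system
    "srv-001": {"rack": "A12", "avg_watts": 310},
    "srv-002": {"rack": "B03", "avg_watts": 95},
}

it_inventory = {
    # asset_id -> ownership and application data from IT asset management
    "srv-001": {"owner": "finance-dept", "apps": ["ledger-db"]},
    "srv-002": {"owner": "unknown", "apps": []},
}

def merge_views(dcim, inventory):
    """Combine each server's physical/power view with its IT ownership view."""
    merged = {}
    for asset_id, physical in dcim.items():
        it_view = inventory.get(asset_id, {"owner": "unmapped", "apps": []})
        merged[asset_id] = {**physical, **it_view}
    return merged

combined = merge_views(dcim_records, it_inventory)
# Servers drawing power with no clear owner or applications stand out at once.
```

Once the two views are merged, a server that draws power but has no known owner or applications is immediately visible – exactly the signal a zombie hunt needs.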
· Zombie-hunting analytics. Using these tools, which are widely available today, it is possible to gather, in real time and historically, data that is useful in identifying zombie servers (or, indeed, other equipment): servers that routinely use the minimum amount of power; that have limited I/O activity; that are associated with obsolete accounts; that are heavily underutilised all the time; or that run an odd combination of applications that would probably not be useful or secure.

But having the data does not mean it is pulled together or presented in a useful manner, or that operators know how to look for it. That is where analytics tools come in. 1E's NightWatchman Server, for example, holds profiles of the typical logs and power-usage patterns, enabling 'working' servers to be separated from the rest. Similarly, Viridity's EnergyManager (now part of StruxureWare) enables the data to be pulled together on one screen.
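The pattern matching these analytics perform can be sketched as a simple scoring function over the indicators listed above. The field names, thresholds and sample data below are illustrative, not drawn from NightWatchman, EnergyManager or any other product:

```python
def zombie_score(server):
    """Count how many zombie indicators a server matches.
    Field names and thresholds are illustrative only."""
    checks = [
        server["avg_watts"] <= 100,          # routinely near-idle power draw
        server["io_ops_per_sec"] < 1.0,      # limited I/O activity
        server["peak_cpu_util"] < 0.05,      # heavily underutilised at all times
        not server["owner_account_active"],  # tied to an obsolete account
    ]
    return sum(checks)

def flag_candidates(servers, min_score=2):
    """Return servers matching at least min_score indicators, for human review."""
    return [name for name, s in servers.items() if zombie_score(s) >= min_score]

# Hypothetical sample data: a busy web server and a forgotten database box.
servers = {
    "web-01": {"avg_watts": 350, "io_ops_per_sec": 900.0,
               "peak_cpu_util": 0.70, "owner_account_active": True},
    "old-db": {"avg_watts": 80, "io_ops_per_sec": 0.2,
               "peak_cpu_util": 0.01, "owner_account_active": False},
}
```

Requiring two or more indicators rather than one reflects the point above: any single signal can mislead, so the analytics combine them before a server is put in front of an operator.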

It is also important for the identification of the zombie patterns to reference the asset management system in order to identify which servers are eligible or suitable for replacement, and to see the age, power and power consumption of the machine. This is an important step because there is a paradox at play: servers that are lightly utilised may look like candidates for replacement, but they may not necessarily be zombies. They may, in fact, be the newest and most powerful servers that are simply underdeployed while older machines look deceptively busy but are burning energy and processor cycles and achieving far less.
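The paradox can be made concrete with a rough work-per-watt comparison; the figures below are illustrative, not measurements:

```python
def useful_work_per_watt(relative_capacity, utilisation, avg_watts):
    """Delivered work per watt: nominal capacity (relative to some baseline
    machine) times utilisation, divided by average power draw.
    All inputs here are illustrative."""
    return relative_capacity * utilisation / avg_watts

# An older server at 60% utilisation can deliver less per watt than a newer,
# far more capable machine idling at 10%: the busy-looking box is the better
# retirement candidate, not the quiet one.
old_server = useful_work_per_watt(relative_capacity=1.0, utilisation=0.60,
                                  avg_watts=400)
new_server = useful_work_per_watt(relative_capacity=8.0, utilisation=0.10,
                                  avg_watts=250)
```

This is why the asset data matters: utilisation alone ranks the new machine as the zombie, while work per watt ranks the old one.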
· A zombie management process? If an enterprise datacentre, in particular, needed to replace zombie servers routinely, it would be possible to build a composite application on top of the various tools identified above, treating the elimination of unused servers as a regular business process. This process would routinely look for servers that could be consolidated or replaced by monitoring power and utilisation, applying analytics to check suitability, verifying ownership and profile against asset management systems, and then managing a process of retirement or consolidation. This might involve contacting the appropriate owners and finance departments, retiring or reusing software licences, and updating DCIM systems. An integration and workflow product such as TDB Fusion's Holistic DCIM, designed to integrate DCIM with other applications and systems, would be suitable for this, although generic platforms, such as IBM Tivoli or Oracle's business process management tools, could also be used.
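Such a composite process can be sketched as a pipeline of steps. In practice each step would call out to a DCIM, analytics or asset management system; here they are stubbed, and all names, fields and sample data are hypothetical:

```python
# Sketch of a routine zombie-retirement cycle as a pipeline of stubbed steps.
# All system interfaces, field names and data are hypothetical illustrations.

def find_low_activity_servers(monitoring):
    """Step 1: pull candidates from power/utilisation monitoring."""
    return [name for name, m in monitoring.items() if m["avg_watts"] < 100]

def confirm_with_analytics(candidates, analytics):
    """Step 2: keep only candidates the analytics classify as zombies."""
    return [name for name in candidates if analytics.get(name) == "zombie"]

def lookup_owners(candidates, assets):
    """Step 3: resolve ownership via the asset management system."""
    return {name: assets[name]["owner"] for name in candidates}

def run_retirement_cycle(monitoring, analytics, assets):
    """One pass of the business process: monitor, analyse, check ownership,
    then emit the retirement tasks (sign-off, licence recovery, DCIM update)."""
    candidates = find_low_activity_servers(monitoring)
    confirmed = confirm_with_analytics(candidates, analytics)
    owners = lookup_owners(confirmed, assets)
    return [{"server": name, "owner": owner,
             "tasks": ["owner sign-off", "reclaim licences", "update DCIM"]}
            for name, owner in owners.items()]

# Hypothetical inputs for a single cycle.
monitoring = {"srv-a": {"avg_watts": 60}, "srv-b": {"avg_watts": 300}}
analytics = {"srv-a": "zombie"}
assets = {"srv-a": {"owner": "hr-dept"}, "srv-b": {"owner": "web-team"}}
tasks = run_retirement_cycle(monitoring, analytics, assets)
```

Running the cycle on a schedule, rather than as a one-off audit, is what turns zombie hunting from a project into a process.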
A DCIM and DCSO (datacentre service optimisation) platform
Why has no one (to our knowledge) done this? As we described in last month's magazine, the reason is that all of this is too difficult and too expensive as a one-off exercise. Software products such as 1E's NightWatchman Server or Viridity's EnergyManager tackle only part of the problem, and produce only one-off results.
But the value proposition is entirely different if the elimination of zombie servers is part of an overall software-driven datacentre management strategy. Once the initial investment is made and process changes enacted, the management of workloads and of server replacement, consolidation and upgrading becomes just one more relatively low-cost opportunity to improve the overall cost profile and management of the datacentre.
Once a platform is in place – and that includes the DCIM systems, the IT monitoring and the ITSM integration – a host of new functions and processes become possible: better capacity planning; dynamic datacentre provisioning; automated load sharing and datacentre failover; energy chargeback; risk-assessed workload and cloud workload deployment; and KPIs for systems and infrastructure.
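One of those functions, energy chargeback, reduces to a simple roll-up once the platform has mapped power consumption to owners. A minimal sketch, with an illustrative tariff and hypothetical data:

```python
def energy_chargeback(kwh_by_server, owner_by_server, price_per_kwh=0.12):
    """Roll monthly per-server energy up into a cost per owning department.
    The tariff and all inputs are illustrative."""
    bills = {}
    for server, kwh in kwh_by_server.items():
        owner = owner_by_server.get(server, "unallocated")
        bills[owner] = bills.get(owner, 0.0) + kwh * price_per_kwh
    return bills

# Hypothetical month: two servers, both belonging to the finance department.
bills = energy_chargeback({"srv-a": 100.0, "srv-b": 200.0},
                          {"srv-a": "finance", "srv-b": "finance"})
```

Any server whose energy lands in the "unallocated" bucket is, of course, a fresh zombie candidate – chargeback and zombie hunting feed each other.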
 
