Why data centre owners must involve operations from the start

To improve uptime, data centre owners and operators have traditionally focused on the physical infrastructure that supports IT, incorporating independent redundancies, monitoring systems, failover schemes and more, writes Andy Lawrence, Research Vice President - Datacenter Technologies (DCT) & Eco-Efficient IT, 451 Research.


ON THE WHOLE, the strategy has worked, yet research by Uptime Institute (an independent division of The 451 Group) and others shows that large-scale outages continue to plague the data centre industry - and that some operators continue to do much better than others. The level of downtime that still occurs might be surprising to some, given the significant economic consequences of service disruption and widespread use of standards, techniques and technologies dedicated to maintaining continuous availability.

A major reason for some of the continuing problems, according to Uptime, is that design alone cannot guarantee data centre efficiency or availability. Operations management (e.g., capacity management, change management, incident management), maintenance strategies, staff and contractor training, and emergency-response procedures all affect availability. There are signs that this message is beginning to sink in: Uptime is reporting that a growing number of data centre owners and multi-tenant data centre (MTDC) clients are requiring third-party validation of operational best practices to ensure optimal facility performance. (An analogy might be that airline owners and passengers don’t just want to know that an aircraft is certified as functional when it leaves the factory - they want to be assured that the crews know how to maintain and fly it safely.)

In its latest Start with the End in Mind initiative, Uptime goes even further. In new data centres, it says, operations holds the key to efficiency (including availability). Operations is the ultimate client in any data centre expansion, Uptime asserts, and as such should be integral to the project from conception. By focusing on the way the data centre will be run from the earliest planning stage, owners increase the efficiency, uptime and ROI of their facilities while reducing cost and risk.

Most data centre outages are caused by human error
Partly as a result of the success of the Tier-classification system and the general adoption of redundancy in data centre designs, overall data centre outages caused by component failures are rare, and attempts to increase uptime solely through improvement of the physical infrastructure are reaching a level of diminishing returns. Less than one-third of the unplanned outages reported in a recent survey conducted by the Ponemon Institute for Emerson Network Power were attributed to equipment failures, and respondents reported that most of those outages were avoidable: almost all were attributed to either human error or an equipment failure that may have been prevented had adequate training, monitoring or maintenance procedures been in place.

These results are not dissimilar to Uptime’s Networks data. Uptime’s Networks have tracked incidents and outages in member facilities for over 25 years, and have compiled a detailed dataset of more than 5,000 incidents in over 400 data centres. Although outages in Networks member facilities are exceedingly rare (approximately one per decade), virtually all can be traced back to human error.

These findings reinforce the importance of a comprehensive management and operations program to ensure data centre availability and maximize efficiency. Organizations that closely align data centre operations with business objectives and use industry best practices as the benchmark for continuous monitoring and improvement optimize data centre performance and realize the most efficient return on their investment possible.

The new focus on training and operations from Uptime opens up further possibilities: while availability is generally good, almost all research suggests that energy efficiency and use of capacity are not. Over time, the focus on training and ongoing operations may offer a new channel for disseminating best practices in capacity management and energy efficiency.

Will independent verification of operational best practices become a requirement?
As the physical infrastructure of the data centre becomes increasingly commodified, operational performance rises in prominence. For multi-tenant data centre (MTDC) operators and other IT service providers that need to meet client-imposed uptime requirements, an objective performance assessment can be a key differentiator.

Most organizations have internal training and review procedures in place, and there are some standards developed for other industries or other purposes (e.g., ISO, ITIL, SSAE 16, SAS 70, EN 50600) that have been adapted to also address data centre facility availability. Historically, however, third-party validation of operational best practices based on a data centre-specific system has not been generally available.

That has now changed, and facility owners are taking note. Uptime Institute, in consultation with industry stakeholders, has developed a single operations standard, delivered through two operations-assessment protocols that were designed specifically for data centres and created by data centre owners: Tier Standard: Operational Sustainability, for Tier-certified facilities, and the M&O (Management and Operations) Stamp of Approval, for data centres that are not Tier certified.

Both these methodologies address the site management behaviors and decisions that impact long-term data centre performance, such as staffing and organization (staffing levels, qualifications and skill mix); training and professional development; preventative maintenance programs and processes; operating conditions and housekeeping; planning management; coordination practices and resources; and more.

Are we entering a new stage in data centres, where operations are certified? Certainly, design and build certification has become increasingly important in recent years. It is now common for owners to include design or constructed facility certification requirements in data centre construction requests for proposals (RFPs) - and for potential tenants to ask for certification from MTDC operators.

Now that credible operational certification is available, an increasing number of owners and tenants are including requirements for operational certifications in their facility management RFPs; some of these RFPs even carry significant penalties if the contractor fails to meet or sustain minimum standards. For example, the Province of Ontario recently included a requirement for operational certification in an RFP with a $1m penalty should its IT service provider fail to comply. (A detailed case study is available on the Johnson Controls Global WorkPlace Solutions website.)

Operations holds the key to reliability
Operational excellence is not just about availability, but also efficiency. Uptime Institute research and field experience indicate that even in new builds, operations holds the key to efficiency.

The design-build phase is typically less than 5% of the data centre’s lifespan, yet the team responsible for 95% of the facility’s life - the operations team - is often not involved until the facility is commissioned. This is a mistake, Uptime states: organizations that view data centre expansion as a ‘design build operate’ process rather than a function of change management put the efficiency, uptime and ROI of their facilities at risk. Uptime reports that data centres where operations staff were integral to the construction process from conception run more reliably and profitably from day one.

And according to Uptime, conception really does mean ‘conception’: in the most efficient and reliable data centres, the people who will operate the facility are brought into the new build, retrofit or expansion process in the preconstruction/planning phase. This ensures that the team that will run the facility on a daily basis is involved in the decisions that will affect how efficiently it can be run.

This observation is the inspiration behind Uptime’s Start with the End in Mind initiative. Led by Lee Kirby, CTO of Uptime Institute and former senior executive at Lee Technologies, Uptime’s new program details how design/build and operations development should occur simultaneously. A typical data centre build, retrofit or expansion process involves five phases: pre-construction, design, construction, commissioning and turnover.

Involving the operations team at each phase of the process will ensure not only that the facility is engineered to optimize maintainability, but also that the operations team can provide continuity for knowledge management and transition to production.

Certifications, if desired, are incorporated as milestones, and review and optimization of operational procedures continue as an iterative process throughout the facility’s lifespan - ensuring that, as Uptime puts it, ‘it doesn’t end in tiers.’

The table on page 9 shows the activities that should occur concurrently to ensure the facility is running optimally on day one.

The 451 take
Most data centre outages are caused by human error. This can never, of course, be eliminated, but the risks can be reduced by systematically and consistently following a program of operational best practices. Data centre operators that want to improve their facility’s reliability may find the operations-assessment protocols offered by Uptime a helpful resource, and their Start with the End in Mind initiative reinforces the role of the operations team in optimizing efficiency.

Although the argument for obtaining third-party verification of operational performance is less obvious for the typical enterprise owner/operator, certification could be a key differentiator for an IT service provider. Uptime reports that an increasing number of clients are including operational-certification requirements in their RFPs. This has implications for facility managers and MTDC providers alike.