Storage capacity growth, and the attendant CAPEX and OPEX, is a justifiable concern for IT managers. In most cases, data growth and the storage capacity needed to house it are the top concerns cited by IT. The rapid adoption of virtualization to gain control of data center operations only exacerbates the issue. With each virtual machine requiring its own independent storage, even a modestly sized virtual server deployment can consume a considerable amount of capacity. Virtualized data sets by their very nature consist of many identical pieces, which aggregate into massive volumes of redundant data. Storing and managing multiple VM images that must be accessible to multiple physical hosts is necessary for functions such as vMotion, virtual desktops, Distributed Resource Scheduler (DRS) and Site Recovery Manager (SRM).
Storage efficiency-enhancing applications like thin provisioning, snapshots, cloning and deduplication all achieve various capacity, cost and management savings. However, deduplication alone, if implemented properly for virtual environments, may be the only solution needed to minimize the impact of data growth on storage consumption.
Deduplication offers space-saving functionality similar to thin provisioning, snapshots and cloning, but it does so as a single service and can deliver storage savings beyond what all of the others achieve combined. Deduplication reduces the amount of storage required by eliminating redundant information, and today's high-performance deduplication engines can do so without negatively impacting performance.
Deduplication divides data into chunks of variable length based on an analysis of its content, then combs those chunks to identify redundant segments. When a duplicate is found, the deduplication software replaces it with a reference, or pointer, to the copy already stored and discards the redundant segment, saving the storage space it would have consumed. Today's high-performance deduplication engines perform this function inline, before the redundant segment is ever written to the storage device (as opposed to post-process deduplication, which stores the data first and identifies duplicates afterwards, an approach more commonly used in backup and archive applications).
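To make the mechanics concrete, here is a minimal Python sketch of inline deduplication with content-defined chunking. It is an illustrative toy, not Permabit's (or any vendor's) engine: the rolling-hash boundary rule, the chunk-size limits and the DedupeStore class are all assumptions made for this example.

```python
# A minimal, illustrative sketch of inline deduplication with content-defined
# chunking. The boundary rule, size limits and class names are assumptions for
# this example, not any vendor's actual implementation.
import hashlib

AVG_BITS = 12                              # boundary when the low 12 bits of the hash are zero (~4 KB average)
MIN_CHUNK, MAX_CHUNK = 1 << 10, 1 << 16    # 1 KB minimum, 64 KB maximum chunk size

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of variable-length, content-defined chunks."""
    start, rolling = 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF   # toy rolling hash over the current chunk
        length = i - start + 1
        at_boundary = (rolling & ((1 << AVG_BITS) - 1)) == 0
        if length >= MAX_CHUNK or (length >= MIN_CHUNK and at_boundary):
            yield start, i + 1
            start, rolling = i + 1, 0
    if start < len(data):
        yield start, len(data)             # trailing partial chunk

class DedupeStore:
    """Keeps one physical copy of each unique chunk; duplicates become pointers."""
    def __init__(self):
        self.chunks = {}                   # fingerprint -> chunk bytes (stored once)

    def write(self, data: bytes):
        """Inline dedupe: duplicates are detected before anything is written."""
        recipe = []                        # ordered fingerprints that reconstruct the stream
        for start, end in chunk_boundaries(data):
            chunk = data[start:end]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:      # unseen data: store the chunk
                self.chunks[fp] = chunk
            recipe.append(fp)              # seen before: keep only the pointer
        return recipe

    def read(self, recipe):
        """Reassemble the original stream from its pointer list."""
        return b"".join(self.chunks[fp] for fp in recipe)
```

In a post-process design, the write path would land the raw data on disk first and a background task would fingerprint and collapse it later, which is why that approach tends to appear in backup and archive targets rather than primary storage.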
Hypervisors that offer thin provisioning, snapshots and cloning for storage efficiency may save capacity, but they do so at the cost of a significant performance hit. Speeding the creation of additional VMs is a valuable benefit, but not when applications pay an I/O penalty. The adverse I/O impact generally creates a new need: additional administration, third-party software, bigger hardware, or another layer of management to tame the I/O randomization created as storage is virtualized.
Hardware platform-based storage services can accomplish some efficiency functions without notable performance impact thanks to dedicated processors, but they come at a much higher price tag, require additional administrative resources, and offer limited configurability since each function must be set up in advance and managed individually. Snapshots and clones in particular can create more problems than they solve when second and third iterations of data sets (snapshots of snapshots, or clones of clones) add management overhead, consume capacity, and inevitably drain performance.
In contrast, not only does deduplication conserve storage capacity, it’s simpler and more efficient to make additional volumes and copies directly from the deduplicated data, and performance actually improves.
Deduplication eliminates the need for snapshots and cloning. When a copy of a VM is made, the deduplication engine combs the data for redundancy, creates reference pointers to segments that already exist, discards the redundant data, and produces a space-efficient replica.
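The short example below, which reuses the toy DedupeStore from the earlier sketch (again an illustrative assumption, not a product API), shows why a deduplicated copy behaves like a clone: the "copy" is just a second list of fingerprints, so it consumes essentially no new space until its contents diverge from the original.

```python
# Continues the toy DedupeStore sketch above; names and data sizes are illustrative.
def clone_image(store, recipe):
    """'Copy' a VM image: no chunk data is duplicated, only the pointer list."""
    return list(recipe)                    # every fingerprint already resolves in store.chunks

def physical_bytes(store):
    return sum(len(chunk) for chunk in store.chunks.values())

store = DedupeStore()
golden = store.write(b"base operating system image" * 100_000)   # stand-in for a template VM
before = physical_bytes(store)
vm_copy = clone_image(store, golden)       # instant, space-free replica
assert store.read(vm_copy) == store.read(golden)
print(f"physical bytes before copy: {before}, after copy: {physical_bytes(store)}")
```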
Thin provisioning of VMs is also largely unnecessary. In most virtualized environments an entire physical volume is allocated to the clustered hosts, which create VMs as new files on that one volume, so all that is being provisioned is zeroed blocks inside the file. Since zeroed blocks are by their very nature identical, the deduplication engine eliminates them.
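As a small illustration (again using the toy DedupeStore, with a 4 MB region chosen purely as an example), every zero-filled block hashes to the same fingerprint, so unwritten space inside a thick-provisioned virtual disk collapses to almost nothing once deduplicated:

```python
# Illustrative only: an unwritten ("zeroed") region of a virtual disk dedupes
# down to a single stored chunk because every zero block is identical.
store = DedupeStore()
zeroed_region = bytes(4 * 1024 * 1024)     # 4 MB of zeroed blocks, standing in for unwritten VMDK space
recipe = store.write(zeroed_region)
logical = len(zeroed_region)
physical = sum(len(store.chunks[fp]) for fp in set(recipe))
print(f"{logical:,} logical bytes stored as {physical:,} physical bytes")
```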
Virtual environments have unique data protection requirements, including backing up an image of each VM, and those images contain a high ratio of duplicate data because the VMs share operating system environments, applications, service packs and so on. It is not uncommon to see deduplication ratios of 35:1 and higher in these scenarios, along with a corresponding reduction in backup times that makes backup windows attainable and recovery more efficient.
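For a sense of scale, the arithmetic below works through a hypothetical example; the 70 TB backup set and the 35:1 ratio are assumed figures chosen only to show how the ratio translates into physical capacity.

```python
# Back-of-the-envelope arithmetic; every input here is an assumed example value.
logical_backup_tb = 70                     # full logical backup of all VM images
dedupe_ratio = 35                          # 35:1, the kind of ratio cited for highly redundant VM data
physical_tb = logical_backup_tb / dedupe_ratio
print(f"{logical_backup_tb} TB logical -> {physical_tb:.0f} TB physically stored")   # 2 TB
```

Backup windows shrink for a related reason: when duplicates are identified inline, most of the stream never has to be written (or, with source-side deduplication, never has to cross the network) in the first place.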
With shared storage capacity consumed at a faster rate within virtualized environments, organizations have begun to implement a variety of storage efficiency technologies within their data centers in an attempt to combat the increased costs of storing and managing multiple virtualized data sets. But deduplication may be the only one they require, since it delivers advantages over the other applications intended to ease the pain of runaway capacity expansion in virtual environments.
As virtualized storage continues to evolve with emerging Software Defined Storage (SDS) implementations as part of broader Software Defined Data Centers (SDDC), data deduplication can and should be an integral part of these ecosystems, delivering greater efficiency and embedding data optimization into the fabric of these approaches. As converged data centers evolve, efficiency capabilities such as deduplication will benefit end users because they are transparent and cost effective, and they will enable businesses to manage their IT capital and operating costs.
Additional information on how deduplication can provide savings across virtualized environments is available at http://permabit.com/resources/