OVER THE PAST FEW YEARS, and particularly since the financial crisis in 2008, there has been an ongoing discussion in the market about the cost and performance of traditional storage versus flash. Adoption of flash is on the rise for several reasons including, of course, the fact that it is fast. But, as this article discusses, significant challenges remain when adopting this technology.
All-flash arrays undoubtedly represent a major evolution in storage technology, and the finger-pointing and name-calling by their vendors is a source of cheap entertainment (likely the only cheap part of the whole story)! But, as with so many past technology evolutions, the vendor focus on device features and functions distracts from the much bigger customer challenge: how to design, deploy, and support compute systems combining multiple types and flavors of devices from multiple vendors, while optimising availability, performance, and cost. The emergence of faster flash arrays doesn’t magically solve this “systems integrator dilemma”; on the contrary, some of the inherent technology characteristics, especially at this relatively early stage of market maturity, make these challenges harder, not easier.
Customers looking to adopt all-flash arrays have to address three primary challenges:
1. Developing accurate cost/benefit analysis
One of the primary benefits of flash storage is performance, and the all-flash array (AFA) vendors love their “vanity” benchmarks that universally tout eye-poppingly high IOPS numbers to prove that benefit. The reality, however (as reflected in many of the vendors’ comments), is that those benchmark numbers are typically generated using exchange sizes much smaller than those seen in real-world workloads, and using the read/write balance, sequentiality, level of data compression, and other factors most favorable to that vendor’s product. (Which, of course, is in no way intended to suggest that a technology vendor would ever “cook” a benchmark!)
Your real-world workload doesn’t look much like the vendor’s benchmark workload. Your mileage not only *may* vary, it’s going to vary – in all likelihood very considerably – as a result. Having an accurate expectation about achievable performance for your specific workload is critical to being an informed consumer with a realistic cost/benefit justification. This is especially critical given the price premium that many vendors are asking for those performance benefits.
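To make the effect of exchange size concrete, the short Python sketch below converts a headline IOPS figure into the throughput it implies, then asks what IOPS ceiling that same throughput would allow at larger I/O sizes. The vendor figure (1,000,000 IOPS at 4 KiB) is hypothetical, and the model assumes the array is limited by bandwidth alone, which is a deliberate simplification; real arrays are also constrained by controller CPU, read/write mix, and data reduction.

```python
def throughput_mib_s(iops, block_size_kib):
    """Throughput in MiB/s implied by an IOPS figure at a given I/O size."""
    return iops * block_size_kib / 1024


def iops_at_block_size(bandwidth_mib_s, block_size_kib):
    """Upper bound on IOPS if bandwidth is the only limit (an assumption)."""
    return bandwidth_mib_s * 1024 / block_size_kib


if __name__ == "__main__":
    # Hypothetical headline benchmark: 1M IOPS measured with 4 KiB I/Os.
    vendor_iops = 1_000_000
    vendor_block_kib = 4

    bandwidth = throughput_mib_s(vendor_iops, vendor_block_kib)
    print(f"Implied throughput: {bandwidth:.2f} MiB/s")

    # The same bandwidth budget supports far fewer I/Os as they get larger.
    for block_kib in (4, 8, 32, 64):
        ceiling = iops_at_block_size(bandwidth, block_kib)
        print(f"{block_kib:>3} KiB I/Os -> at most {ceiling:,.0f} IOPS")
```

Under these assumptions, a workload dominated by 32 KiB I/Os could expect at most one eighth of the headline IOPS figure, before any other workload differences are taken into account.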
2. Validating vendor claims
Given the early state of this market segment and the often-conflicting claims thrown around by the vendors (as amply demonstrated in this article), it’s understandable that many customers want to put those claims to the test in their own labs before committing to purchase.
Gathering benchmark data to support an expensive purchase decision sounds like a great idea in theory. In practice, however, executing a high-IOPS benchmark accurately and repeatably across multiple vendor offerings is a distinctly non-trivial undertaking, involving multiple hardware and software components set up with a whole bunch (to use the correct technical descriptor) of parameters. The test configuration and execution have to be closely monitored to ensure that the load test setup actually generates the desired I/O pattern, that the infrastructure is optimally tuned, that conditions don’t vary from run to run, and of course that the vendor-reported performance numbers are accurate.
Inaccuracy and variability in this process serve no one’s interests. Customers certainly don’t want their decisions guided by flawed data, and vendors want their products to reliably exhibit the best possible performance. Until the market matures, however, to the point where vendors can and will provide defensible performance numbers for the broad mix of real-world workloads, benchmarking will continue to be a prerequisite in many purchase decisions.
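One simple way to check that conditions are not varying from run to run is to repeat each benchmark several times and look at the spread of the results. The Python sketch below uses the coefficient of variation (standard deviation divided by mean) for that purpose. The IOPS samples and the 5% threshold are illustrative assumptions, not measurements or an industry standard; pick a threshold that suits your own tolerance.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation as a fraction of the mean."""
    return stdev(samples) / mean(samples)


def is_repeatable(samples, max_cv=0.05):
    """True if run-to-run spread is within max_cv (assumed threshold: 5%)."""
    return coefficient_of_variation(samples) <= max_cv


if __name__ == "__main__":
    # Hypothetical IOPS results from four repeated runs on two arrays.
    runs = {
        "array A": [98_000, 101_000, 100_500, 99_500],
        "array B": [70_000, 110_000, 95_000, 125_000],
    }
    for name, samples in runs.items():
        cv = coefficient_of_variation(samples)
        verdict = "repeatable" if is_repeatable(samples) else "investigate setup"
        print(f"{name}: mean {mean(samples):,.0f} IOPS, CV {cv:.1%} -> {verdict}")
```

A high spread says nothing about which array is faster; it says the test setup is not yet stable enough to compare them.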
3. Delivering production performance
Bolting an all-flash array into the existing production environment and achieving anything approaching the performance measured in the pristine, isolated lab environment is an even less trivial undertaking. There is a plethora of host and fabric conditions that have to be satisfied to realize high I/O performance across the entire customer-integrated system – literally from the flash cells in the array to RAM on the host.
Monitoring, tuning, and controlling these conditions as the entire system changes around them is fundamental to ensuring consistent system performance. Yet in many of the production environments that we have assessed, these conditions are so imbalanced that the addition of significantly faster storage (which, ironically, is often the storage vendor’s universal prescription for “whatever ails the SAN”) would actually have made the entire system significantly slower. As Fibre Channel has only rudimentary flow control, the SAN as a system works best when request and response rates are well-balanced between hosts and storage; making any one part of the system much faster can badly upset that balance and result in conditions such as buffer-credit starvation and head-of-line blocking that back up across inter-switch links (ISLs) and badly impact overall performance and availability. The only way to comprehensively and deterministically address these challenges is by employing an Infrastructure Performance Management (IPM) solution: a real-time monitoring system for the entire IT infrastructure. Only then can emerging issues be identified before they disrupt business performance and ultimately hit the bottom line. While this is true for a business of any size, identifying performance issues without such a system in larger companies, which may be running thousands of servers, is often nearly impossible.
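The buffer-credit starvation mentioned above can be put in rough numbers. In Fibre Channel, a port may transmit a frame only while it holds a buffer-to-buffer credit, and each credit is returned only once the receiver has freed a buffer. The Python sketch below estimates how many credits a link needs to stay full, using a bandwidth-delay calculation. The figures are assumptions for illustration: a full-size frame of 2148 bytes, roughly 5 microseconds of propagation delay per kilometre of fibre, and nominal data rates of 800 MB/s (8GFC) and 1600 MB/s (16GFC). It ignores switch latency and any delay in the receiver, both of which increase the requirement.

```python
import math


def credits_needed(distance_km, data_rate_mb_s, frame_bytes=2148, us_per_km=5.0):
    """Estimate buffer-to-buffer credits required to keep a link busy.

    Assumes full-size frames and that credits return after one round trip.
    """
    round_trip_us = 2 * distance_km * us_per_km
    # 1 MB/s is 1 byte per microsecond, so bytes / (MB/s) gives microseconds.
    frame_time_us = frame_bytes / data_rate_mb_s
    return math.ceil(round_trip_us / frame_time_us)


if __name__ == "__main__":
    for label, rate in (("8GFC", 800), ("16GFC", 1600)):
        for km in (1, 10, 20):
            print(f"{label}, {km:>2} km: about {credits_needed(km, rate)} credits")
```

The same arithmetic shows why a slow receiver is so damaging: anything that delays the return of credits acts like extra distance on the link, so a device that drains frames slowly can stall a much faster sender and, through shared ISLs, the traffic queued behind it.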
Virtual Instruments has delivered infrastructure optimisation across hundreds of SAN infrastructures, including many that have adopted, or are in the process of adopting, flash storage. Our VirtualWisdom platform can be easily installed on the back of Traffic Access Points (TAPs); it is compatible with all vendor hardware and has won several awards for its non-disruptive, real-time monitoring and analysis capabilities. Most data centres have built-in redundancy to cope with spikes in demand for capacity. By using VirtualWisdom, infrastructure managers can see precisely where latency and performance degradation occur, and they can proactively identify and address any traffic issues before they become major problems. With VirtualWisdom, this granular and comprehensive insight into the IT infrastructure is achieved without impacting application performance or end-user experience, and it doesn’t add any ‘load’ to the system the way polling does. Once implemented, VirtualWisdom reads the Fibre Channel protocol in real time, end-to-end, regardless of vendor equipment.
The only customer outcome worse than buying an expensive array that proves to be much slower than expected would be for that array to impact the performance and/or availability of the entire production environment into which it is deployed.