New flash storage: So why isn’t everything faster?

It’s time to call in the infrastructure performance management professionals, writes Alex D’Anna, Director of Solutions Consulting, Virtual Instruments.


WHEN CUSTOMERS INVEST in the most current storage technologies, whether flash caching or a next-generation All Flash Array (AFA), they expect to see an immediate improvement in operational speed to justify the expense. Most new storage technologies are designed to deliver that improvement through some feature specific to the device itself (the speed of the drives, the amount of cache), but this does not always happen, and we are increasingly finding that customers need infrastructure performance management solutions to identify the root cause of their IT problems.

When we speak to customers or engage in SAN troubleshooting, we find that when new technologies such as storage, SANs or hosts have been introduced, most of the unexpected slow-downs are caused by mis-configurations, which can be found at different points in the infrastructure, from VM or physical server right through to storage. This is why it's so important to think of new technology not as an add-on, or as simply throwing out the old and bringing in the new, but as an integral part of the whole IT infrastructure.

We have found that an audit of the infrastructure, if focused on the right areas rather than just one area such as the storage array itself, brings much greater clarity about what is impacting critical business applications. This end-to-end view, which in storage terminology spans VM, server, SAN, storage and LUNs, often allows the customer to invest wisely, putting money in the right place rather than throwing it at the problem without result. The list of customers who have invested in faster storage only to find that application performance hasn't improved is very long.

Before introducing flash, or even a private cloud, it's important to cover all four layers of the IT infrastructure, which broadly are: host and switch; virtual machine; I/O stack; and storage. Having an end-to-end view of all of these layers is essential for finding and highlighting performance issues.

When we look at what causes slow performance, there are several pain points that come up on a regular basis:

• Storage array configuration – Let's take a common example: your business wants to deploy a new application or upgrade an old one. Customers are doing this all the time. Frequently a database, or another customer-facing application, proves far more popular than expected. One customer recently expected 60,000 users on a new application and ended up with 3 million.

For the business, this is in some respects a good problem to have, but unfortunately it completely overwhelmed the storage, the SAN and the host. Users were not happy, to say the least, and the business and application owners even less so.

So the design is sufficient on day one, and you architect it as best you know how, but once the application is under real load the array isn't always able to handle the strain. Things also change over time: expecting a storage array to perform against all workloads for three to five years is fairly optimistic in my experience. It therefore becomes critical to measure the other components that make up the I/O stack. Furthermore, array report logs are rarely captured at the level of individual I/Os: they are typically sampled at intervals of minutes or hours, while I/O latency must be measured in milliseconds. A transient problem may simply not be captured unless you are looking at every I/O, in real time. It's a common mistake to equate historical data, which you get from polling, with real-time performance data, where you are looking at line-rate information and every individual I/O. In addition, since most array vendors keep only 24 hours of data, it may not be possible to identify a problem or see trends before they take effect.
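The gap between polled averages and per-I/O visibility is easy to demonstrate. The following is an illustrative sketch, not any vendor's product logic, and all the latency figures in it are hypothetical:

```python
# Sketch: why a polled average over a reporting interval can hide an I/O
# latency problem that per-I/O, real-time capture would reveal.

def average(values):
    return sum(values) / len(values)

# Hypothetical per-I/O latencies (ms) over one polling interval: mostly
# healthy sub-millisecond flash reads, plus a brief burst of 80 ms
# outliers caused by a mis-configuration somewhere in the stack.
per_io_latency_ms = [0.5] * 995 + [80.0] * 5

# A per-I/O (line-rate) view reports the worst I/O any user actually saw.
worst_io = max(per_io_latency_ms)            # 80.0 ms - clearly a problem

# An array report log averaged over the interval reports something benign.
polled_average = average(per_io_latency_ms)  # ~0.9 ms - looks healthy

print(f"worst single I/O: {worst_io:.1f} ms")
print(f"polled average:   {polled_average:.2f} ms")
```

The averaged figure looks perfectly healthy even though some users waited 80 ms, which is the article's point about polling intervals versus real-time measurement.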

• Switch issues – moving to the second part of the stack, Fibre Channel switches, there are often performance issues that have little to do with the vendor. Brocade and Cisco make great SAN switches. However, just like the array, a switch is one device in the stack, and it is only as good as what it can see from its own product. Some believe they can get all the performance information they need straight from their SAN switch. Unfortunately, that's not the case. Let's take a light-hearted example: if I can see how busy the motorway is (throughput), I can't necessarily see how long it will take me to get home (latency). And what does my family care about? When I get home. I would argue that users running applications on storage infrastructure care about the same thing. Latency is what I'm after; throughput, while critical, not so much.

And it's clear from the customer feedback we get that measuring throughput at the switch level doesn't actually give a good indication of the I/O experience.
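The motorway analogy can be made concrete with Little's Law (L = λW): throughput alone says nothing about latency unless you also know how many I/Os are queued in flight. A small sketch, with hypothetical figures:

```python
# Sketch of the motorway analogy via Little's Law: L = lambda * W, so the
# average wait W = L / lambda. Two links with identical throughput (lambda)
# can deliver wildly different latency depending on queued I/Os in flight (L).

def latency_ms(iops, ios_in_flight):
    """Average latency W = L / lambda, converted to milliseconds."""
    return ios_in_flight / iops * 1000.0

# Two hypothetical SAN paths, both showing 50,000 IOPS at the switch...
congested_ms = latency_ms(iops=50_000, ios_in_flight=400)  # deep queues
healthy_ms = latency_ms(iops=50_000, ios_in_flight=4)      # shallow queues

print(f"50k IOPS, 400 I/Os in flight: {congested_ms:.2f} ms per I/O")
print(f"50k IOPS,   4 I/Os in flight: {healthy_ms:.2f} ms per I/O")
```

Both paths would look identical on a throughput graph, yet one delivers I/Os a hundred times slower, which is exactly what the switch-level view misses.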

• Physical layer issues – bad connections often result in re-issuing of commands, which leads to a flood of retransmissions that slows down databases and eradicates the benefits of flash storage. It doesn't matter how much flash you buy: if your physical layer is not intact and healthy, you will not see a return on the investment you make.
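A back-of-the-envelope sketch shows why even a tiny rate of re-issued commands swamps flash latency. The error rate and retry penalty below are hypothetical illustrations, not measured values:

```python
# Sketch: how a small rate of physical-layer errors erodes flash latency.
# Each affected I/O is assumed to be re-driven after a fixed penalty.

def effective_latency_ms(base_ms, retry_rate, retry_penalty_ms):
    """Expected average latency when a fraction of I/Os incur one retry."""
    return base_ms + retry_rate * retry_penalty_ms

# Hypothetical flash read latency on a clean link: 0.3 ms.
clean_ms = effective_latency_ms(base_ms=0.3, retry_rate=0.0,
                                retry_penalty_ms=2_000)

# Same flash behind a flaky cable: 0.1% of I/Os hit a 2-second re-drive.
flaky_ms = effective_latency_ms(base_ms=0.3, retry_rate=0.001,
                                retry_penalty_ms=2_000)

print(f"clean link:            {clean_ms:.2f} ms average")
print(f"0.1% retries (2 s penalty): {flaky_ms:.2f} ms average")
```

One bad I/O in a thousand is enough to make the average look like spinning disk, which is why a damaged SFP or cable can nullify a flash upgrade.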

• Queue depth – this is another setting that can cause real slow-downs, largely because it is often set by the server team rather than the storage team. Unfortunately, the larger your environment gets, the harder this is to manage. One server manager may raise the queue depth to improve their own performance, and this impacts the other users (or servers) sharing that path. If the server administrator's HBA tuning effectively connects a single-lane highway to many more highways, it can lead to a performance slow-down of up to 10x.
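The underlying arithmetic is simple fan-in: every host's HBA queue depth adds up at the shared array port. A minimal sketch, with hypothetical sizing figures:

```python
# Sketch: per-server HBA queue depths fan in to a shared array port.
# If the aggregate exceeds what the port can service, every server on
# that path slows down. Limits and counts here are hypothetical.

def fan_in_ratio(servers, queue_depth_each, port_queue_limit):
    """How oversubscribed the shared storage port's command queue is."""
    return servers * queue_depth_each / port_queue_limit

# The storage team sized the port for 2048 outstanding commands:
# 32 servers at a queue depth of 64 fills it exactly.
as_designed = fan_in_ratio(servers=32, queue_depth_each=64,
                           port_queue_limit=2048)

# Then one server team raises its hosts' queue depth to 256.
after_tuning = fan_in_ratio(servers=32, queue_depth_each=256,
                            port_queue_limit=2048)

print(f"as designed:     {as_designed:.1f}x the port's queue")
print(f"after re-tuning: {after_tuning:.1f}x the port's queue")
```

A 4x oversubscribed port queues or rejects commands for every host on the path, not just the one that was "tuned", which is why the change needs to be coordinated between teams.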

• Block size – the read/write size of the application needs to match the block size of the storage; if the system isn't tuned for the right block size, performance suffers. This is highly dependent on the type of application that is running. Faster disk can mitigate the issue, but even so, there is a lot to be said for understanding this aspect.

• CPU configuration – even in a virtualised environment, physical servers still have finite CPU capacity, and what we find is that customers pile on workload without thinking to add CPU and memory to the physical and virtual infrastructure.

More often than not the VMware administrator will allocate too much CPU, or not enough, and where there isn't sufficient CPU this can impact applications no matter what flash drives are in place! For flash to make a difference, there also need to be enough servers and enough CPU.
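A first-order check here is the vCPU-to-physical-core overcommit ratio. The threshold in this sketch is a hypothetical rule of thumb for illustration, not a VMware-documented limit, and real contention also depends on how busy each vCPU actually is:

```python
# Sketch: vCPU overcommit on a virtualisation host. Past some ratio,
# VMs spend time waiting for a physical core ("ready time"), and no
# amount of flash on the storage side will help.

def overcommit_ratio(vcpus_allocated, physical_cores):
    return vcpus_allocated / physical_cores

def likely_cpu_contended(ratio, threshold=4.0):
    """Crude flag: above a (hypothetical) threshold, expect contention."""
    return ratio > threshold

# Hypothetical host: 32 physical cores carrying 160 allocated vCPUs.
ratio = overcommit_ratio(vcpus_allocated=160, physical_cores=32)

print(f"overcommit: {ratio:.1f}x, likely CPU-contended: "
      f"{likely_cpu_contended(ratio)}")
```

When this flag trips, the fix is more hosts or fewer vCPUs, not faster storage, which is the article's point about flash needing enough servers and CPU behind it.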

As more and more vendors enter the market with flash arrays, there will clearly be significant disruption. However, when the infrastructure is not operating to expectations, any manner of mis-configuration can be at fault. From our experience of end-to-end monitoring, we estimate that 75 to 85% of all issues are not the result of storage array problems but of something else in the stack; and the more layers in the stack, and the more densely you virtualise, the worse the issue gets.

The way to pinpoint these issues is through real-time monitoring of the whole IT infrastructure and through proactive performance management, so that any mismatches can be identified before they become real glitches. This especially applies to larger datacentres operating hundreds or even thousands of servers, where identifying a problem can be like finding a needle in a haystack. Customers who rely on their IT infrastructures to support mission-critical activities are finding, at their cost, that before introducing new technologies it's wise to be in control and have a detailed view of the whole IT infrastructure.