Analytics or Bust

By Gilda Foss, SNIA Analytics & Big Data Committee Chair, NetApp.


Big data can be thought of as a characterization of datasets that are too large to be efficiently processed in their entirety by the most powerful standard computational platforms available. A more concise, contemporary definition from Gartner (http://www.gartner.com/technology/home.jsp) describes it as “high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” Most of us have heard of the ‘three V’s’ by now: Volume, Velocity, and Variety. It’s important to understand these components going into any big data analytics implementation, for you must know precisely what you are dealing with.

Volume is, of course, the scale of data. An estimated 43 trillion gigabytes will be created by 2020; that’s 300 times more data than we had in 2005, just 10 years ago. Some 2.3 trillion gigabytes are created every day, and 6 of the 7 billion people in the world have mobile phones. Furthermore, most US companies have 100 terabytes of data stored, and European companies are approaching similar levels.

Velocity pertains to the analysis of streaming data. For example, the New York Stock Exchange captures 1 TB of trade information during each trading session. By next year, there will be almost 19 billion network connections, which equates to roughly 2.5 connections per person on earth. The thought of the car you drive having about 100 sensors on it, all monitoring different things and all creating data, is pretty amazing.

Variety entails all of the different forms of data. For example, there were 150 exabytes of healthcare data stored in the world as of 4 years ago. Every month, Facebook users share 30 billion pieces of content and 4 billion hours of video are watched on YouTube. Consider that 200 million Twitter users send out about 400 million tweets every day, and you will realize just how much data is created from the social media space alone.

Overall, big data sets these days allow for extracting results from a variety of data types, such as emails, log files, social media, business transactions and a host of others. We can consider this variety in a more granular way by breaking it down into the most typical types. First off, we have unstructured data, such as office files, video files, audio files, and so on. Then there is semi-structured data, such as integrated text/media files and Web/XML files, which is relatively easy to generate but difficult to query and optimize. Finally, we have rich media, such as streaming media and Flash video.

Moreover, big data can encompass both structured and unstructured data, existing in high volumes and going through high rates of change. We will consider structured data to be information that is organized and formatted in a known and fixed way. The format and organization are customarily defined in a schema, and the term “structured” data is usually taken to mean data generated and maintained by databases and business applications. The key factor of value with big data is to provide actionable insights, so that organizations can use storage systems with analytics applications to obtain information that would otherwise be undetectable, or impractical to derive using existing approaches.
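To make the distinction concrete, here is a minimal sketch (the table, record, and field names are invented for illustration): a structured record lives in a fixed schema enforced by a database, while a semi-structured record is self-describing and may carry fields the schema never anticipated.

```python
import json
import sqlite3

# Structured: the format is fixed by a schema, enforced by the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'Acme', 99.50)")

# Semi-structured: self-describing and easy to generate, but each record
# may carry different fields, which makes querying and optimizing harder.
record = json.loads('{"id": 1, "customer": "Acme", "notes": ["rush order"]}')

row = conn.execute("SELECT total FROM orders WHERE id = 1").fetchone()
print(row[0])           # query against the fixed schema
print(record["notes"])  # a field the relational schema never defined
```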

Furthermore, big data and the platforms associated with analyzing it have the ability to change how we approach the problems facing the world today. It is critical to make the best choice when building a platform: one that provides the most flexibility while not compromising on the ability to scale out in both capacity and compute. We cannot predict the methods and algorithms that will be implemented; however, without the data, capabilities and outcomes would be limited.

One such approach is to implement an infrastructure and platform built on flexibility and an uncompromising approach to both storage and performance. An optimal implementation is one built around storage directly attached to the computational elements in a manner that allows for non-disruptive growth. Systems that provide storage with the widest variety of connectivity and capability are certainly the optimal choice for a platform. A system that can serve as the backbone of some of the world’s largest supercomputers as well as the highest-volume consumer-facing clusters would be ideal. Another consideration when choosing the right storage system for big data analytics is extremely low latency and high bandwidth.

Considering your partners in this ecosystem of software and integrators is also important. Reliance on deterministic performance at all times is key and the robustness and performance needed here is simply impossible to guarantee in a white-box solution. Another smart consideration is to choose a system that is modular in nature and can start small and grow and evolve dynamically as the solutions and problems change.

We must always keep in mind that applications are only as reliable as the underlying infrastructure. The ecosystem of all the potential analytics plays relies on a predictable system architecture. Period. The ideal solution encompassing all of the pertinent requirements for a successful big data storage platform would certainly include reliable, enterprise-class big data analytics in an open ecosystem, with a validated solution that’s ready to deploy, enabling enterprises to get control of all their data and quickly turn it into insights.

Technical details should ideally include: high availability (five 9’s) for big data analytics clusters, faster recovery from cluster crashes (e.g., Hadoop, Splunk), less replication with better data availability, and higher density and capacity in a smaller footprint. Independent scaling of compute, storage and performance at scale is a consideration, as is consistent performance even when a cluster is failing. In a perfect world, you would add encryption at the drive level, an open implementation with best-of-breed products in the analytics stack (no lock-in, future-proofed), and an NFS connector to enable Hadoop to run natively on storage.
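For context on what “five 9’s” actually buys you, the arithmetic is simple: 99.999% availability permits only about five minutes of downtime per year. A quick sketch:

```python
# Allowed annual downtime for a given number of "nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at 0.999...9 availability."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability

for n in (3, 4, 5):
    print(f"{n} nines -> {downtime_minutes(n):.2f} minutes of downtime/year")
# five nines works out to roughly 5.26 minutes per year
```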

It’s important to examine traditional datacenter architectures and dive into forward-thinking modern design with your analytical software choice (e.g., Hadoop) acting as the center of gravity. Legacy data-related limitations can be largely reduced with the use of open source software and distributed commodity infrastructure. YARN-enabled Hadoop, for example, leverages existing infrastructure investments and “turns data into capital” by providing deep insight into all data sources in addition to a cost-effective footprint. Additionally, learning about new ways to architect enterprise-class storage, and seeing the business impact and real-world use cases of the managed approach, is integral to implementing a robust solution for analytics. To do this, we must find out about new technologies that will radically change storage, as well as best practices in balancing storage, networking and compute as a whole.

Looking at the analytics side, we would be remiss not to also consider how we will protect that data, so that we have assurance that data is not corrupted, is accessible for authorized purposes only, and is in compliance with applicable requirements. First off, we have compression: the process of encoding data to reduce its size.

Lossy compression (i.e., compression using a technique in which a portion of the original information is lost) is acceptable for some forms of data, such as digital images in some applications. For most IT applications, however, lossless compression (i.e., compression using a technique that preserves the entire content of the original data, from which the original can be reconstructed exactly) is required.
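The lossless guarantee is easy to demonstrate with the standard zlib library: the decompressed bytes are identical to the input, which is exactly the guarantee that lossy codecs give up.

```python
import zlib

# Lossless compression: the original bytes can be reconstructed exactly.
original = b"big data analytics " * 1000  # repetitive data compresses well
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed), "bytes")
print("exact round trip:", restored == original)  # True: nothing is lost
```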

Then there is the ever-so-popular data deduplication: the replacement of multiple copies of data, at variable levels of granularity, with references to a shared copy in order to save storage space and/or bandwidth. Protecting your data is paramount.

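A minimal sketch of the idea, using fixed-size blocks for simplicity (production systems often use variable-size, content-defined chunking): duplicate blocks are stored once and referenced by their content hash, and the original data can be reassembled exactly from the references.

```python
import hashlib

def dedupe(data: bytes, block_size: int = 8):
    """Replace duplicate fixed-size blocks with references to one shared copy."""
    store = {}   # content hash -> one unique copy of the block
    refs = []    # the original data as a sequence of block references
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # store each unique block only once
        refs.append(digest)
    return refs, store

def rehydrate(refs, store):
    """Reassemble the original data exactly from the references."""
    return b"".join(store[d] for d in refs)

data = b"ABCDEFGH" * 3 + b"12345678"   # three duplicate blocks plus one unique
refs, store = dedupe(data)
print(len(refs), "references,", len(store), "unique blocks stored")
print(rehydrate(refs, store) == data)  # True: reconstruction is exact
```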
It’s also interesting to think that CIOs were traditionally the keepers of all things “information,” including IT purchasing decisions. In 2015, chief data scientists will most likely hold the purse strings when it comes to key IT buying decisions related to their company’s data. Likewise, in 2015, a growing number of companies will provide individual departments and teams with their own “in-house” data scientists instead of keeping their data science team cloistered in a separate department. The time is now for a breakthrough, as industries across the board begin unlocking their data troves to the public; burgeoning open data will probably end up being the new open source.

According to IDC, big data is expected to grow into a $16 billion industry. While there’s certainly money to be made in big data, I think data will drive the next wave of social consciousness, from providing disaster relief to helping philanthropies fulfill their charitable missions. That’s something near and dear to my heart.

Also, the Internet of Things is not going away, but the real value isn’t the next smart device of the day (I say this as I am glued to my iPhone 6+ most of the time when I am alone). Instead, I think it’s going to be derived from the data aggregated across every single connected device. As a geek, I’ve always loved data management, especially since I’ve worked for the past decade at a company that stores data. But as data becomes more accessible and analytical tools become easier to use and readily available, data science won’t be limited to those in the technology sector. In 2015, anyone with the right tools should be able to draw powerful insights from data, and that’s pretty cool.

SNIA’s Analytics and Big Data Committee is dedicated to fostering the growth and success of the storage industry and the use of data storage resources and services by analytics and big data applications and toolsets. To find out more about the ABDC programs and recent Summit, be sure to visit http://www.snia.org/forums/abdc