Q
Data analytics has been around for quite some time.
Why is there such hype around the subject, headlined
‘Big Data’, right now?
A
Companies around the globe are collecting a surplus of data that has the potential to provide invaluable business insight. The volume of this data is so large that few companies have been able to extract meaningful intelligence from it - until now. New technologies can analyse vast and complex data sets beyond anything we could do before – cloud computing, improved network speeds, advanced storage solutions and innovative analysis techniques have given us a new ability to turn ‘Big Data’ into value.
Q
In other words, what has changed in the IT and business
worlds to focus such attention on data analytics at the
moment?
A
If you look at some of Gartner’s latest predictions, business intelligence [BI] and analytics are going to remain a major focus
for companies and CIOs through to 2017. This so-called ‘fact-based decision making’ is going to have a role to play across all industries before long, and as the cost of acquiring, storing and managing data falls in the coming years, companies are
going to find it even easier to apply BI and analytics solutions to a far wider range of situations. As a result of these reduced costs, it is likely companies will move future investment away from traditional IT reporting solutions focused on historical data and towards data modelling and governance solutions capable of predicting trends and correlations. The value of big data comes from the knowledge gained from it and what you do with it; as such, the promise of big data lies in its ability to make predictions – that’s what gets people excited.
Q
As with most new developments, is the focus on the high end
enterprise, or are the benefits of Big Data available to, and
affordable for, SMEs as well?
A
If you were to believe everything you read, you would be under no illusion that ‘Big Data’ is, well, big. However, Big Data is a relative term and applies as much to SMEs as it does to the high-end enterprise. Every company has a tipping point, and most organisations – regardless of size – will eventually reach a point where the three Vs – volume, variety and velocity – of their data make it difficult to extract business value any longer.
Whilst it’s interesting for those who are technical to focus on size, the real focus should be on business value first. Let’s be honest, not all SMEs will take the time or energy to even gather the data surrounding their company, let alone analyse it. But those that do
go through the process will potentially get a leg up on the
competition and gain insights that others simply don’t have.
The key is to focus their efforts on a few business-critical sets of
data rather than investing in an ‘all singing, all dancing’ big data solution. Companies, in particular SMEs, are not prepared to make speculative investments in expensive solutions just in case there is something to be found, but can instead turn to emerging hybrid technologies which provide a more cost-effective solution.
Q
Are there specific Big Data vendors, or how/where are data
analytics solutions available?
A
The Big Data landscape is dominated by two classes of
technology:
Operational – systems that provide operational capabilities for
real-time, interactive workloads where data is primarily captured
and stored; and
Analytical – systems that provide analytical capabilities for
retrospective, complex analysis that may touch most or all of the
data. These classes of technology are complementary and
frequently deployed together.
Operational and analytical workloads present opposing requirements, and systems have evolved to address their particular demands separately and in very different ways. Each has driven the creation of new technology architectures. Operational systems, such as NoSQL databases, focus on servicing highly concurrent requests with low-latency responses to highly selective queries.
Analytical systems, on the other hand, tend to focus on high throughput; queries can be very complex and touch most if not all of the data in the system at any time. Both classes of system tend to operate over many servers in a cluster, managing tens or hundreds of terabytes of data across billions of records. Technologies such as Hadoop have emerged to address Big Data challenges and to enable companies to combine their operational systems with analytical systems to develop new types of products and services for the business.
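To make the contrast concrete, here is a minimal, single-machine sketch of the two access patterns – an operational-style point lookup versus an analytical-style scan across the whole data set. The records and field names are invented for illustration and do not represent any particular product:

```python
# Minimal sketch contrasting the two workload classes on the same tiny,
# invented data set. Operational systems optimise the first pattern;
# analytical systems optimise the second.
orders = {
    "ord-1001": {"customer": "acme", "amount": 120.0, "region": "uk"},
    "ord-1002": {"customer": "globex", "amount": 75.5, "region": "us"},
    "ord-1003": {"customer": "acme", "amount": 310.0, "region": "uk"},
}

# Operational pattern: a highly selective, low-latency lookup of one record,
# e.g. to render a single customer's order page.
print(orders["ord-1002"])

# Analytical pattern: a retrospective query that touches most or all of the
# data, e.g. total order value per region.
totals = {}
for order in orders.values():
    totals[order["region"]] = totals.get(order["region"], 0.0) + order["amount"]
print(totals)
```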
The latest Apache Hadoop framework consists of the following modules (a small usage sketch follows the list):
Hadoop Common – contains the libraries and utilities needed by the other Hadoop modules;
Hadoop Distributed File System [HDFS] – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications; and
Hadoop MapReduce – a programming model for large-scale data processing.
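As a small illustration of how these modules surface to an administrator or developer, the sketch below stages a local file into HDFS and lists the directory using the standard hdfs dfs command-line tool; it assumes a working Hadoop client configuration, and the paths and file names are hypothetical:

```python
# Minimal sketch: staging data into HDFS via the standard command-line tools.
# Assumes the hdfs client is installed and configured; paths are illustrative.
import subprocess

def run(cmd):
    """Run a command, echoing it for clarity, and raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a target directory in HDFS (no error if it already exists).
run(["hdfs", "dfs", "-mkdir", "-p", "/data/weblogs"])

# Copy a local file into the distributed file system.
run(["hdfs", "dfs", "-put", "-f", "access_log.txt", "/data/weblogs/"])

# List the directory to confirm the file is now stored across the cluster.
run(["hdfs", "dfs", "-ls", "/data/weblogs"])
```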
Hadoop distributions and other Big Data solutions are available from specialist resellers such as S3.
Q
What are the major issues to consider when looking at Big
Data analytics and the potential benefits it offers to an end user
organization?
A
It is essential that you think about the requirements and design of your big data analytics project from the start. As your data grows, so do your IT requirements, and, often, the gap between the business need and the IT infrastructure grows wider. To overcome these challenges, the major issues to consider when looking at big data analytics are:
Skills – getting real knowledge out of data is not really an IT
capability but a skills issue. Big data technologies require much
more of a software development skill set than an IT systems
management one. You certainly don’t need to take a big-bang
approach to implementation; instead, leverage standard
architectural principles and in-house skills to ensure you don’t box
yourself in with either proprietary products or services
What data to collect – data is often collected and persistently
stored, mainly for disaster recovery, but this may not be the most
flexible way to maintain the data so that it retains future value
Data volumes – sampling the data is not appropriate; the whole data
set needs to be available to provide proper insights and
business value
Data structures – data needs to be structured in a way that makes
it accessible for ad-hoc analytics; once structured, it’s very hard to
fix after the fact. Other factors are how you can apply analytics
to determine what to do with your data, which data is
relevant, and how or whether data should be stored
Technology – identify the business challenge or goal and then
align this with the technical approach and solution size. Next,
decide whether batch-mode processing or real-time, dynamic
technologies are the most suitable for your needs.
Similarly, low- versus high-latency technologies need to be considered
Look beyond the hype – new software frameworks, including
Hadoop, whilst great technologies, may not be right for you. Just
because they may not be right doesn’t mean Big Data is irrelevant
altogether. Consider what is best for your company’s growth before
investing purely on the basis of price or hype
Resources – having the right resources, especially people who can
analyse Big Data, is critical to the success or failure of the project
Big Data analytics has the potential to save companies money, grow their revenue and achieve many other objectives, across any vertical. These benefits include:
Building new applications – by collecting real-time data
points on its products, resources or customers, a business can
repackage that data instantaneously to optimize either customer
experience or its own resource utilisation
Improving the effectiveness and lowering the cost of existing
applications – it can help replace highly customized, expensive
legacy systems with a standard solution that runs on commodity
hardware whilst also reducing licensing costs through the use of
open source technologies
Identifying new sources of competitive advantage – it can enable
businesses to act more nimbly, allowing them to adapt to changes
faster than their competitors
Increasing customer loyalty – by increasing the amount of data
shared within the company – and the speed with which it is
updated – a business can more rapidly and accurately respond to
customer demand
Q
What are the issues to consider when sourcing/implementing a
data analytics solution?
A
While some data analytics solutions are mature enough for mission-critical production use cases, many are still in their infancy. Accordingly, the way forward is not always clear. As businesses continue to develop their Big Data strategies, there are a number of dimensions to consider when selecting the right technology partners. These include:
Software license models – there are three general types of license
for software technologies: proprietary, open-source and cloud
service. Proprietary software is owned and controlled by a software
company – the source code is not available to licensees, and
customers typically license the product through a perpetual
license with annual maintenance fees for support and upgrades.
By comparison, open-source software and its source code are freely
available to use, with value-added components sold together
with support services. Cloud services are hosted in a cloud-based
environment outside of a customer’s data centre and are
delivered over the internet; the model is subscription-based or pay-as-you-go
Market adoption – to understand a technology’s adoption, in
particular for open-source products, you must consider:
The number of users
The availability of conferences – how frequently they are held and how
well they are attended
Local community organized events
Online forum activity
Agility – this comprises three primary components: ease of use,
technological flexibility and licensing freedom. A technology that
is easy for users and developers to learn and understand will
enable a project to get started and realize value more quickly.
The easier a technology makes it to change requirements on the fly,
the more adaptable it will be to the needs of the business.
Open-source products are typically easier to adopt, scale and
purchase
Main vendor versus niche provider – as many organisations are
constantly striving to standardise on fewer technologies to reduce
complexity, improve their competency and make vendor
relationships more productive, adopting a main vendor’s
product may help address this goal. However, a niche
technology may be a better fit for the project.
Q
For example, is Hadoop the only game in town?
A
While Hadoop has become synonymous with Big Data for storing and analysing huge sets of information, it is not the only game in town. That said, the open-source distributed file system and computation platform has had a remarkable impact to date.
The technology provides a distributed framework, built around highly scalable clusters of commodity servers for processing, storing and managing data that fuels advanced analytics applications. The reason for its success has been simple – before Hadoop, data storage was expensive. Hadoop lets you store as much data as you want in whatever form you need, simply by adding more servers to a cluster – these can be commodity x86 machines with a relatively low price tag. Each new server adds more storage and more processing power to the overall cluster.
Hadoop also lets companies store data as it comes in – structured or unstructured - so you don’t have to spend money and time configuring data for relational database management systems and their rigid tables – which is a very expensive proposition.
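As a small illustration of that ‘store it as it arrives’ approach, the sketch below appends heterogeneous records to a plain file as JSON lines and defers any schema decisions to analysis time; the record types and field names are invented for illustration:

```python
# Minimal sketch of schema-on-read: land records in whatever shape they
# arrive, and only impose structure when a particular analysis needs it.
# Record types, field names and values are illustrative only.
import json

incoming = [
    {"type": "order", "order_id": 1001, "amount": 49.99},
    {"type": "clickstream", "user": "anon-42", "page": "/pricing"},
    {"type": "sensor", "device": "truck-7", "temp_c": 3.2},
]

# Append the data exactly as it arrives, one JSON document per line.
with open("raw_events.jsonl", "a") as sink:
    for record in incoming:
        sink.write(json.dumps(record) + "\n")

# At analysis time, each job decides which fields it cares about.
with open("raw_events.jsonl") as source:
    events = [json.loads(line) for line in source]
order_total = sum(e["amount"] for e in events if e["type"] == "order")
print(order_total)
```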
However, Hadoop has its limitations and bottlenecks. It was envisioned as a batch-oriented system - its real-time capabilities are still emerging - which has created a gap that fast in-memory NewSQL databases are rushing to fill. NewSQL database vendors, such as MemSQL and VoltDB, are working towards real-time analytics on huge data stores with latencies measured in milliseconds.
Q
Where does MapReduce fit in to the Big Data landscape?
A
For those of you who are not familiar with MapReduce, it is the key algorithm [or framework] that the Hadoop engine uses to filter and distribute work around a cluster. A MapReduce program is composed of two steps:
1. The Map procedure – this performs filtering and sorting of
data, such as sorting students by first name into queues, one
queue for each name; and
2. The Reduce procedure – this performs a summary
operation, such as counting the number of students in each queue,
yielding name frequencies – as illustrated in the sketch below.
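As a concrete illustration of those two steps, here is a minimal, single-process sketch of the student example in Python. It only mimics the programming model – a real Hadoop job would distribute many map and reduce tasks across the cluster, and the input records here are assumed for illustration:

```python
# Minimal single-process sketch of the MapReduce programming model, using the
# student example above. Real Hadoop runs many map and reduce tasks in
# parallel across a cluster; this just shows the shape of the computation.
from collections import defaultdict

# Illustrative input records (assumed data).
students = ["Alice Smith", "Bob Jones", "Alice Brown", "Carol White", "Bob Green"]

def map_phase(records):
    """Map: emit a (key, value) pair for each record - here (first_name, 1)."""
    for full_name in records:
        first_name = full_name.split()[0]
        yield first_name, 1

def shuffle(pairs):
    """Shuffle/sort: group values by key - one 'queue' per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues

def reduce_phase(queues):
    """Reduce: summarise each queue - here, count the students in it."""
    return {name: sum(values) for name, values in queues.items()}

counts = reduce_phase(shuffle(map_phase(students)))
print(counts)  # e.g. {'Alice': 2, 'Bob': 2, 'Carol': 1}
```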
The program marshals the distributed servers, running all the tasks in parallel whilst also managing all communication between the component parts of the system. The key aspect of the MapReduce algorithm is that if every Map and Reduce is independent of all other ongoing Maps and Reduces, then the operations can be run in parallel on different machines and different sets of data. Consequently, on a large cluster of machines you can go one step further and run the Map operations on the servers where the data lives.
Rather than copy the data over the network to the program, you
push out the program to the machines. The output data can be
saved to the distributed file system, and the Reducers run to merge the results.
There are, however, limitations to MapReduce as follows:
For maximum parallelism, the Maps and Reduces need to
be stateless and not depend on data generated by other tasks in the
same MapReduce job
It is very inefficient if you are repeating similar searches again and
again. A database with an index will always be faster than running a
MapReduce job over unindexed data – repeated jobs waste both CPU
time and power
In Hadoop, Reduce operations do not take place until all the Maps
are complete – as such, no output is available until all mapping has
finished
Q
What other Big Data solutions are out there?
A
The world of Big Data solutions and vendors is divided into two camps. There are the pure-play Big Data start-ups, which are bringing innovation and buzz to the marketplace, and then there are the established database/data warehouse vendors, which are moving into the world of Big Data from a position of strength, both in terms of an installed base and a proven product line.
Apache Hadoop, now a ten-year-old platform inspired by Google’s MapReduce work and first adopted by internet giants such as Yahoo and Facebook, led the Big Data revolution. The jury is still out on whether Hadoop will become as indispensable as database management systems [DBMS], although it has proven its utility and cost advantages where volume and variety are extreme. Cloudera introduced commercial support for enterprises in 2008, and MapR and Hortonworks followed in 2009 and 2011 respectively. Among data management incumbents, IBM and EMC spin-out Pivotal have each introduced their own Hadoop distributions. Microsoft and Teradata offer complementary software and first-line support for Hortonworks’ platform. Oracle resells and supports Cloudera, while HP, SAP and others work with multiple Hadoop software providers.
In-memory analysis is gaining steam as Moore’s law brings us faster, more affordable and more memory-rich processors. SAP has been the biggest champion of the in-memory approach with its Hana platform, but Microsoft and Oracle are now poised to introduce in-memory options for their flagship databases too.
Niche vendors include Actian, InfiniDB, HP Vertica, Splunk, Platfora, Infobright and Kognitio, all of which have centered their Big Data stories on database management systems focused entirely on analytics rather than transaction processing.
In addition to the Big Data solution providers mentioned above there are analytics vendors, such as Alpine Data Labs, Revolution Analytics and SAS, which invariably work in conjunction with platforms provided by third-party DBMS vendors and Hadoop distributors, although SAS in particular is blurring this line with growing support for its SAS-managed in-memory data grids and Hadoop environments. Other vendors focused on NoSQL, such as 10gen, Amazon, Couchbase and Neo Technology, and the NewSQL vendors are heavily focused on high-scale transaction processing rather than analytics.
Looking forward, advances in bandwidth, memory and processing power have also improved real-time stream processing and stream-analysis capabilities, but this technology has yet to see wide adoption – it is, however, definitely a space to watch for the future...
Q
Big Data is often talked about in terms of the software solution,
but presumably there’s more to data analytics than installing
an application and off you go?
A
Beyond the applications themselves, the important considerations in establishing whether your Big Data environment is capable of delivering are:
The underlying storage – is it fit for purpose?
What special requirements do the servers have; and
Is your network up to the job?
Q
For example, presumably end users can build their own Big
Data solution, or purchase one, more or less, off-the-shelf?
A
Yes they can, but I would like to think that we have helped them in their decision and steered them towards purchasing the right solutions for them.
Q
In either case, presumably the hardware infrastructure
supporting the Big Data application needs to be ‘fit for purpose’?
A
Big Data analytics may seem to be an IT ‘wonder drug’ that more and more companies believe will bring them success. But as is often the case with new treatments, there’s usually a side effect – in this case, it’s the reality of current storage technology. Traditional storage systems can fall short both for real-time big data applications that need very low latency and for data mining applications that amass huge data warehouses. To keep the Big Data analytics beast fed, storage systems must be fast, scalable and cost-effective.
Q
Digging deeper, what are the requirements for storage in terms
of supporting a data analytics application?
A
Storage for supporting data analytics applications differs depending on whether there are synchronous [real-time] or asynchronous [batch] processing requirements.
In real-time use cases, speed is a critical factor, so the big data storage infrastructure must be designed to minimize latency. Solid-state devices are consequently popular options for handling real-time analytics. Flash storage can be implemented in several ways: as a tier on a traditional disk array, as a NAS system, or in the application server itself. Server-side flash implementation [PCIe cards] has gained popularity as it provides the lowest latency and offers a quick and easy way to get started. Flash arrays connected by InfiniBand, FC or PCIe have significantly greater capacity for those needing a scalable architecture, and offer performance of up to 1 million IOPS and more with latencies as low as a few hundred microseconds. These high-end solutions are available from most of the major storage players, such as XtremIO [EMC], with a number of smaller vendors offering greater variety and a longer track record, including the likes of Tintri, Tegile, Violin Memory, Pure and Whiptail.
The storage challenges for asynchronous big data use cases concern capacity, scalability, predictable performance and cost. The latency of tape-based systems will generally rule them out, and traditional ‘scale-up’ disk storage architectures are generally too expensive. Consequently, the type of storage system required to support these applications will often be a scale-out or clustered NAS product. This is file-access shared storage that can scale out to meet capacity and increased compute requirements; it uses parallel file systems distributed across many storage nodes that can handle billions of files without the kind of performance degradation that happens with ordinary file systems as they grow. For some time, scale-out or clustered NAS was a distinct product category, with specialist suppliers such as Isilon and BlueArc. But a measure of the increasing importance of such systems is that both of these have been bought by big storage vendors over the past few years – EMC and HDS respectively. Others include Dell EqualLogic, HP StoreAll and NetApp clustered mode.
These systems, combined with Hadoop, can enable users to construct their own highly scalable storage systems using low-cost hardware, providing maximum flexibility. But Hadoop, specifically HDFS, requires three copies of the data to be created to support the high-availability environments it was designed for. That’s fine for data sets in the terabytes, but when capacity reaches petabytes HDFS can make storage very expensive. Scale-out storage systems suffer too, as many use RAID to provide data protection at the volume level and replication at the system level. Object-based storage technologies can offer an alternative solution for larger environments that may run into data redundancy problems.
Object-based storage systems greatly enhance the benefits of scale-out storage by replacing the hierarchical storage architecture that many use with flexible data objects and a simple index. This enables almost unlimited scaling and further improves performance. Object-based storage that includes erasure coding doesn’t need to use RAID or replication for data protection, resulting in dramatic increases in storage efficiency. There are many object-based storage systems on the market, including amongst others EMC Atmos, DataDirect Networks, NetApp StorageGRID, Quantum Lattus and Cleversafe.
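To give a rough sense of why this matters at petabyte scale, the sketch below compares the raw capacity needed for a given amount of data under HDFS-style three-way replication and under an illustrative 10+4 erasure-coding layout; the data volume and the 10+4 scheme are assumptions for illustration, not any particular product’s configuration:

```python
# Rough capacity comparison: 3x replication vs an illustrative erasure-coding
# scheme. Figures are for intuition only; real systems add further overheads.

usable_pb = 5.0  # assumed: 5 PB of user data to protect

# HDFS-style replication keeps three full copies of every block.
replication_factor = 3
replicated_raw_pb = usable_pb * replication_factor

# An assumed 10+4 erasure code stores 10 data fragments plus 4 parity
# fragments per stripe (a 1.4x overhead) while tolerating 4 lost fragments.
data_fragments, parity_fragments = 10, 4
ec_overhead = (data_fragments + parity_fragments) / data_fragments
erasure_coded_raw_pb = usable_pb * ec_overhead

print(f"3x replication     : {replicated_raw_pb:.1f} PB raw for {usable_pb} PB of data")
print(f"10+4 erasure coding: {erasure_coded_raw_pb:.1f} PB raw for {usable_pb} PB of data")
```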
Q
Are there any special requirements when it comes to the
servers?
A
Big Data analytics workloads are becoming increasingly compute-intensive. The amount of data and processing involved requires these workloads to use clusters of systems running highly parallel code in order to handle the work at a reasonable cost and within a reasonable timeframe. Enterprise-grade servers that are well suited to Big Data analytics workloads have:
Higher compute intensity [high ratio of operations to I/O]
Increased parallel processing capabilities
Increased VMs per core
Advanced virtualization capabilities
Modular systems design
Elastic scaling capacity
Enhancements for security and compliance and hardware-assisted
encryption
Increased memory and processor utilization
Superior, enterprise-grade servers also offer built-in resiliency that comes from integration and optimization across the full stack of hardware, firmware, hypervisor, operating system, databases and middleware. These systems are often designed, built, tuned and supported together – and are easier to scale and manage.
A completely new computing platform is, however, on the horizon – the Microserver [ARM server] – which will bring serious innovation and a new generation of server computing to the market. The Microserver is a server based on ‘system on a chip’ [SoC] technology – the CPU, memory, system I/O and so on are all on one single chip, not multiple components on a system board. This means SoC servers are small, very energy efficient, reliable, scalable and incredibly well suited to tasks involving large numbers of users, data and applications. They will use about 1/10th of the power, and less than 1/10th of the rack space, of a traditional rack-mounted server at about half the price of a current system.
Q
What about the network – what characteristics does this require
to help optimize the Big Data environment?
A
Big Data environments change the way data flows in the network. Big Data generates far more east-west [server-to-server] traffic than north-south [server-to-client] traffic, and for every client interaction there may be hundreds or thousands of server and data node interactions. Application architecture has evolved correspondingly from a centralized to a distributed model. This runs counter to the traditional client/server network architecture built over the past 20 years.
Pulling data from a variety of sources, big data systems run on server clusters distributed over multiple network nodes. These clusters run tasks in a parallel, scale-out fashion. Traffic patterns can vary considerably and dramatically, from single streams to thousands of streams between nodes, to intermediate storage staging.
Big Data solutions therefore require networks to be deployed on high-performance network equipment to ensure appropriate levels of performance and capacity. In addition, big data services should be logically and physically segmented from the rest of the network environment to improve performance.
Q
Are there any other hardware considerations for Big Data?
A
Whilst not necessarily hardware considerations or requirements, there are three other vendors with interesting products relevant to Big Data environments – covering deduplication, performance and archiving – as follows:
1. If you can reduce the amount of data stored, everything else seems to get better, and this is especially true in Big Data environments. Compression and deduplication are two examples of this strategy, but applying these technologies to databases can be more complicated than with file data. One company that has tackled this problem is RainStor. It has developed big data technology that provides this data reduction in a structured environment. It can deduplicate and store large sections of a database, providing up to a 40:1 reduction in the process, and then allows users to search the compressed database without rehydrating the data.
2. In the area of data performance, GridIron has developed a block-based cache appliance that leverages flash and DRAM to provide application acceleration up to 10 times in high performance environments. Compared with traditional caching methods, which use file system metadata to make caching decisions, GridIron creates a ‘map’ of billions of data blocks on the back-end storage. This enables it to run predictive analysis on the data space and place blocks into cache before they are needed.
3. In Big Data archiving, the challenges can be managing the file system environment and scaling it to accommodate very large numbers of files. Quantum’s StorNext is a heterogeneous SAN file system that provides high-speed shared access among Linux, Mac, Unix and Windows client servers on a SAN. In addition, a SAN gateway server can provide high-performance access to LAN clients. Also part of StorNext is the Storage Manager, a policy-based archive engine that moves files among disk storage tiers and, if implemented, a tape archive.
Q
Moving on to S3, how does the company support
organizations looking to implement some kind of Big Data
project?
A
S3 plays a significant role in the implementation of Big Data technologies - we have insight, a deep level of industry expertise and extensive knowledge of integrating specific vendors’ solutions with other sector-specific tools. This enables us to cater to the individual needs of both SME and Enterprise customers, help them understand the benefits and pitfalls of Big Data and the impact of business intelligence, and eliminate the need for multiple expensive solutions. We interact with our customers on a daily basis to ensure we understand their requirements better than the vendors themselves - we can also provide bespoke vertical solutions, training and other services to differentiate our offering from our competitors.
Q
Does S3 work with specific vendors to provide Big Data stacks,
or will the company put together a best of breed solution
regardless of the vendors involved?
A
S3’s approach is to become your trusted advisor of choice – we do this by recommending what we believe to be the best solution fit based on the constraints and parameters given to us by the customer. We work only with best-of-breed vendors and are fiercely independent in our recommendations – honesty and integrity are the cornerstone of our ethos, creating strong customer trust and, hopefully, loyalty.
Q
How does S3 support a multi-vendor Big Data environment in terms of support/service?
A
We provide a full multi-vendor support service to our customers - this is provided free of charge on a first-line support model, 9-5 throughout the working week. We will manage the fault call with the vendor and, where we can, use our own in-house skills and expertise to resolve the problem – our current statistics show we close 80% of support calls in-house without reference to the vendor. Customers can enhance the support cover to include out-of-hours and 24x7 cover, and second- and third-line support as and when required.
Q
What success has S3 had to date in terms of providing end
users with Big Data solutions?
A
S3 is EMC’s largest provider of Isilon solutions across EMEA. We have installed and manage over 40 petabytes of Big Data systems across a wide variety of verticals across several continents.
We are currently hosting a Big Data analytics event on Tuesday 13 May 2014 at the Brewers Hall, London – if you would like to know more about this event, data analytics or big data generally, please register to attend at www.s3.co.uk/analytics
Q
In conclusion, what are the main points for end users to
consider when evaluating data analytics solutions?
A
For many users the key issues include flexibility, speed, ease of use and cost. It’s not clear whether any single vendor product or service can offer all of these capabilities at the moment and so it is essential that any end user takes appropriate professional advice from an expert in the Big Data field, such as S3.
We are, however, still in the early days of the Big Data analytics movement, and with rapidly emerging technologies tomorrow is another day... and what of the old guard vendors? Sure, some of those big-name companies have been followers.
Some even have software distributions and have added important capabilities to existing products. But are their hearts really in it? In some cases you get the impression they are simply window dressing.
There are vested interests – namely license revenues – in sticking with the status quo, so you don’t see them out there aggressively selling something that might displace their cash cows. However, as we’ve seen many times before, acquisitions can suddenly change these landscapes very quickly...