Big Data is about huge volumes of highly variable data that must be accessible instantly and at high speed. There are many categories of Big Data: many different data types in many different markets, each with its own applications. Think of high-definition videos and images in the media and entertainment industry, genomics data in life sciences, seismic data in the oil and gas industries, and data collected to measure or analyze social behavior. There are hundreds, if not thousands, of sensor-based applications that track consumption patterns, production efficiency or trends, and they bring value to organizations that are able to analyze the data.
Sensor and analytics data instigated a wave of innovation in what is generally referred to as Big Data Analytics. New technologies and methodologies were developed to ingest, store, process, distribute and archive massive data sets, generally referred to as semi-structured data and mostly applied in the HPC and supercomputing spaces. Storage solutions for Big Data Analytics are heavily optimized for high Input/Output Operations Per Second (IOPS), with large amounts of distributed processing power.
But there is a lot more to Big Data than analyzing sensor data or semi-structured research data. Most of the data that will be generated over the next decade will be unstructured data in the form of large files, such as office documents, movies, music streams and high-resolution images. Just as semi-structured data challenged the traditional relational database and led to new solutions such as Hadoop and MapReduce, these large volumes of unstructured data challenge the scalability of traditional file storage. Most file systems were not designed to hold billions of files, and file systems will eventually become obsolete as applications prove a lot more efficient at organizing, managing, archiving and searching unstructured data.
The new storage paradigm that will help us store these massive amounts of unstructured data is Object Storage. Object Storage systems are uniformly scalable pools of storage that are accessible through a Representational State Transfer (REST) interface; put simply, REST is a style of API that uses HTTP. Files, or objects, are ‘dumped’ into a large uniform storage pool, and an identifier is kept to locate each object when it is needed. Applications designed to run on top of object storage use these identifiers through the REST interface to locate an object when it needs to be retrieved. Objects are stored with metadata (information about the object), which enables very rich search capabilities and all sorts of analytics for unstructured data.
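The identifier-plus-metadata model described above can be illustrated with a small in-memory sketch. This is a toy stand-in, not any vendor's API: the class `ObjectStore`, its methods and the metadata keys are hypothetical names chosen for illustration. In a real system, `put`, `get` and `head` would roughly correspond to HTTP PUT, GET and HEAD requests against the REST interface.

```python
import uuid


class ObjectStore:
    """Toy flat object pool (hypothetical): no directories, only identifiers."""

    def __init__(self):
        # identifier -> (object bytes, metadata dict)
        self._objects = {}

    def put(self, data, metadata=None):
        """Store an object and return the identifier needed to find it again."""
        object_id = uuid.uuid4().hex
        self._objects[object_id] = (bytes(data), dict(metadata or {}))
        return object_id

    def get(self, object_id):
        """Retrieve the object's content by identifier."""
        data, _ = self._objects[object_id]
        return data

    def head(self, object_id):
        """Return only the metadata, without the content."""
        _, metadata = self._objects[object_id]
        return dict(metadata)

    def search(self, **criteria):
        """Return identifiers of all objects whose metadata matches every criterion."""
        return [
            object_id
            for object_id, (_, metadata) in self._objects.items()
            if all(metadata.get(key) == value for key, value in criteria.items())
        ]


store = ObjectStore()
scan_id = store.put(b"...image bytes...", {"type": "image", "project": "seismic-survey"})
doc_id = store.put(b"...report text...", {"type": "document", "project": "seismic-survey"})

print(store.get(scan_id))          # content looked up by identifier alone
print(store.search(type="image"))  # metadata search, no directory tree involved
```

The application keeps only the identifiers; where and how the pool places the bytes is left entirely to the storage system, which is what lets the pool scale uniformly.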
Object storage is not new, however. The paradigm has been around for over a decade, and several platforms have been launched, tested and abandoned. But the success of Amazon’s S3 has demonstrated the benefits of architectures in which applications access data directly through a REST interface. Amazon’s S3 now stores well beyond a trillion objects, but its technology consists mainly of in-house development. So how can other organizations build object storage infrastructures as cost-efficient and durable as Amazon’s S3?
The big challenge is of course to keep the storage overhead as low as possible without compromising durability. Lower overhead means less hardware, lower power usage and so on. An increasingly popular technology to protect data with low overhead is erasure coding. This new data protection scheme drastically lowers the storage TCO while increasing data durability. With a proper multi-datacentre architecture, such as featured in DDN’s WOS, any organization can now build exabyte-scale object storage platforms to be used in-house or … to compete with S3.
A lot of effort has been put into optimizing object storage platforms and reducing the cost of object storage to a few pennies per GB. The next step is for ISVs to embrace the technology and support object storage directly. Once enterprise applications support REST and take advantage of the benefits of object storage, organizations will be better able to store, analyze and archive their big unstructured data.