What’s next in data storage – beyond RAID drives…

Data is exploding, growing 10X every five years. In 2008, it was estimated that over 800 exabytes of digital content existed in the world (one exabyte is one million terabytes), and by 2020 that number is projected to grow to over 35,000 exabytes. What’s fueling the growth? Unstructured digital content. Over 90% of all new data created in the next five years will be unstructured digital content, namely video, audio and image objects. The storage, archiving and backup of large numbers of digital content objects is quickly creating demand for multi-petabyte storage systems (one petabyte is one thousand terabytes).

What is one of the suggested solutions? Information Dispersal. First, a little background.

Current data storage systems based on RAID arrays were not designed to scale to this type of data growth. As a result, the cost of RAID-based storage systems increases as the total amount of data storage increases, while data protection degrades, resulting in permanent digital asset loss. With the capacity of storage devices today, RAID-based systems cannot protect data from loss. Most IT organizations using RAID for big data storage incur additional costs to copy their data two or three times to protect it from inevitable data loss.

RAID schemes are based on parity, and at their root, if more than two drives fail simultaneously, data is not recoverable, even with the dual parity of RAID 6. The statistical likelihood of multiple drive failures has not been an issue in the past. However, as drive capacities continue to grow beyond the terabyte range and storage systems continue to grow to hundreds of terabytes and petabytes, multiple drive failures are now a realistic prospect.

Further, drives aren’t perfect, and typical SATA drives have a published bit error rate (BER) of 1 in 10^14, meaning that roughly once in every 100,000,000,000,000 bits read, there will be a bit that is unrecoverable. Doesn’t seem significant? In today’s big data storage systems, it is.

Having one drive fail and then encountering an unrecoverable bit error while rebuilding from the remaining RAID set is highly probable in real-world scenarios. To put this into perspective, when reading 10 terabytes, an unreadable bit is more likely than not (about 55%), and when reading 100 terabytes, it is nearly certain (99.97%).
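The percentages above can be checked with a few lines of Python. This is only a sketch under two assumptions that the post does not state: bit errors are independent, and a terabyte is the decimal 10^12 bytes. The function name p_unrecoverable is made up for illustration.

```python
import math

BITS_PER_TB = 8 * 10**12  # assumption: decimal terabyte (10^12 bytes)

def p_unrecoverable(terabytes: float, ber: float = 1e-14) -> float:
    """Probability of hitting at least one unrecoverable bit while reading."""
    bits = terabytes * BITS_PER_TB
    # 1 - (1 - ber)^bits, written with log1p/expm1 to stay numerically stable
    return -math.expm1(bits * math.log1p(-ber))

for tb in (1, 10, 100):
    print(f"{tb:>4} TB read: {p_unrecoverable(tb):.2%}")
```

Under these assumptions the result is roughly 55% for 10 terabytes and 99.97% for 100 terabytes. A RAID rebuild has to read every surviving drive in the set, which is why the amount of data read matters more than the size of the single failed drive.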

RAID advocates will tout its data protection capabilities based on models using vendor-specified Mean Time To Failure (MTTF) values. In reality, drive failures within a batch of disks are strongly correlated over time, meaning that once one disk in a batch has failed, there is a significant probability that another disk from the same batch will fail soon after.

Having just experienced a dual terabyte RAID drive failure on our media drive, I completely understand the concerns. Now we are looking for an expanded backup solution. While Information Dispersal is not yet ready for smaller operations, it will be available soon.

Information Dispersal, a new approach to the challenges brought on by big data, is cost-effective for digital content storage at petabyte scale and beyond. Further, it provides extraordinary data protection, meaning digital assets are preserved essentially forever.

Information Dispersal Basics – Information Dispersal Algorithms (IDAs) separate data into unrecognizable slices of information, which are then distributed—or dispersed—to storage nodes in disparate storage locations. These locations can be situated in the same city, the same region, the same country or around the world.

Each individual slice does not contain enough information to understand the original data. In the IDA process, the slices are stored with extra bits of data, which enable the system to fully retrieve all of the data from only a pre-defined subset of the slices on the dispersed storage nodes.

Because the data is dispersed across devices, it is resilient against natural disasters or technological failures, like drive failures, system crashes and network failures. Because only a subset of slices is needed to reconstitute the original data, there can be multiple simultaneous failures across a string of hosting devices, servers or networks, and the data can still be accessed in real time.
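To make the "any subset of slices" idea concrete, here is a minimal sketch of a k-of-n dispersal scheme in Python. It is not any vendor's implementation: it uses a textbook polynomial (Reed-Solomon style) erasure code over the integers modulo 257, and the names P, disperse, reconstruct and invert_matrix are made up for illustration. It also does no encryption, so it illustrates the redundancy, not the confidentiality, of dispersal.

```python
P = 257  # smallest prime above 255, so every byte value is a field element

def disperse(data: bytes, n: int, k: int) -> list[list[int]]:
    """Encode data into n slices; any k of them are enough to rebuild it."""
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    slices = [[] for _ in range(n)]
    for start in range(0, len(padded), k):
        coeffs = padded[start:start + k]  # k bytes = polynomial coefficients
        for i in range(n):
            x, y = i + 1, 0
            for c in reversed(coeffs):  # Horner evaluation at x
                y = (y * x + c) % P
            slices[i].append(y)  # values are in 0..256, so not plain bytes
    return slices

def invert_matrix(m: list[list[int]], p: int) -> list[list[int]]:
    """Invert a square matrix modulo the prime p by Gauss-Jordan elimination."""
    size = len(m)
    aug = [row[:] + [int(i == j) for j in range(size)] for i, row in enumerate(m)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] % p)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, p)  # modular inverse (Python 3.8+)
        aug[col] = [v * inv % p for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(a - f * b) % p for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def reconstruct(available: dict[int, list[int]], k: int, length: int) -> bytes:
    """Rebuild the original bytes from any k slices, keyed by slice index."""
    ids = sorted(available)[:k]
    if len(ids) < k:
        raise ValueError("not enough slices to reconstruct the data")
    vander = [[pow(i + 1, j, P) for j in range(k)] for i in ids]
    inv = invert_matrix(vander, P)
    out = bytearray()
    for pos in range(len(available[ids[0]])):
        ys = [available[i][pos] for i in ids]
        for row in inv:  # coefficients = V^-1 * evaluations
            out.append(sum(a * y for a, y in zip(row, ys)) % P)
    return bytes(out[:length])

message = b"Information dispersal stores slices, not copies."
slices = disperse(message, n=16, k=10)
lost = {0, 3, 5, 8, 11, 15}  # simulate six failed storage nodes
survivors = {i: s for i, s in enumerate(slices) if i not in lost}
print(reconstruct(survivors, k=10, length=len(message)) == message)
```

Each slice here is one tenth the size of the padded input, so 16 slices cost 1.6 times the original data, and which six slices go missing makes no difference to the rebuild.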

Suppose an organization needs one petabyte of usable storage, and has the requirement for the system to have six nines of reliability – 99.9999%. Here’s how a system built using Information Dispersal would stack up against one built with RAID and Replication.

To meet the reliability target, the Dispersal system would slice the data into 16 slices and store those slices with a few extra bits of information such that only 10 slices would be needed to perfectly recreate the data – meaning, the system could tolerate six simultaneous outages or failures and still provide seamless access to the data. The raw storage would increase by 1.6 (16/10) times the usable storage, totaling 1.6 petabytes.
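The 16/10 arithmetic, and a rough feel for why it can reach a six-nines target, can be sketched as follows. The per-slice availability of 99% is a purely hypothetical input, slices are assumed to fail independently, and the function names are made up for illustration; a real reliability model would also account for rebuild times and correlated failures.

```python
from math import comb

def expansion_factor(n: int, k: int) -> float:
    """Raw storage needed per unit of usable storage."""
    return n / k

def tolerated_failures(n: int, k: int) -> int:
    """Number of slices that can be lost while data stays readable."""
    return n - k

def p_unreadable(n: int, k: int, p_slice_up: float) -> float:
    """Probability that fewer than k of n independent slices are reachable."""
    q = 1 - p_slice_up
    return sum(comb(n, up) * p_slice_up**up * q**(n - up) for up in range(k))

print(expansion_factor(16, 10))        # raw/usable ratio
print(tolerated_failures(16, 10))      # simultaneous losses tolerated
print(p_unreadable(16, 10, 0.99))      # hypothetical 99% per-slice availability
```

With these assumed inputs, the chance that seven or more of the 16 slices are unreachable at once comes out well below one in a million, which is the six-nines threshold.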

To meet the reliability target for one petabyte with RAID, the data would be stored using RAID 6, and replicated two, three or even four times, possibly using a combination of disk and tape. The raw storage would increase by a factor of 1.33 for the RAID 6 configuration, and then be stored as three copies for a raw storage of four times the usable capacity, totaling four petabytes.

Comparing these two solutions side by side for one petabyte of usable storage, Information Dispersal requires 40% of the raw storage of RAID 6 and replication on disk (1.6 petabytes versus four), which translates to roughly 40% of the cost, a saving of 60%.
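The side-by-side arithmetic can be checked in a few lines. The 6 data + 2 parity RAID 6 layout is an assumption chosen to match the 1.33 overhead quoted above, and the function names are made up for illustration; the comparison also assumes cost scales directly with raw capacity.

```python
from fractions import Fraction

def dispersal_raw(usable_pb: int, n: int, k: int) -> Fraction:
    """Raw petabytes for an n-slice, k-threshold dispersal system."""
    return usable_pb * Fraction(n, k)

def raid_replicated_raw(usable_pb: int, data_drives: int,
                        parity_drives: int, copies: int) -> Fraction:
    """Raw petabytes for parity RAID stored as several full copies."""
    raid_overhead = Fraction(data_drives + parity_drives, data_drives)
    return usable_pb * raid_overhead * copies

dispersal = dispersal_raw(1, n=16, k=10)                  # 8/5 = 1.6 PB
raid = raid_replicated_raw(1, data_drives=6,              # assumed 6+2 layout
                           parity_drives=2, copies=3)     # 4 PB
ratio = dispersal / raid
print(float(dispersal), float(raid), float(ratio))
```

Dispersal comes out at 1.6 petabytes against four for RAID 6 with three copies, so it needs 40% of the raw storage, or 60% less.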

When comparing the raw storage requirements, it is apparent that both RAID 5 and RAID 6 require more raw storage per terabyte as the amount of data increases. The beauty of Information Dispersal is that as storage increases, the cost per unit of storage doesn’t increase while meeting the same reliability target.

Once vendors start addressing the small to medium storage capacity requirements, this will be a viable alternative to RAID. As of now, this is really only offered to organisations using 500 terabytes or more.


About SCB Enterprises
System Solutions and Integration
