Five Things to Know About Big Data Storage

Some companies prefer to use HDFS for their Hadoop data warehouses; others opt to use Amazon S3. What’s the difference?

If you’re just starting to invest in Big Data, you need to consider storage options. Fortunately, we’ve come a long way from room-sized computers and storage banks that look like something out of Star Wars (but not in a cool way).

Most people automatically associate HDFS, or Hadoop Distributed File System, with Hadoop data warehouses. HDFS splits each file into large blocks and spreads those blocks across the machines (nodes) in a cluster. The blocks are stored on onsite physical storage, such as the internal disk drives of each node.

A general rule of HDFS is that every block is kept in three copies by default (a replication factor of three). For example, Node A might store a block, but there will be copies of that block on Node D and Node M. If one node fails, the data is still available from the other copies, so there’s only a small risk of losing it. Also, because information is spread throughout the HDFS cluster, a programming model called MapReduce can schedule processing on the nodes that already hold the data, which keeps processing efficient.
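
To make the replication overhead concrete, here is a minimal sketch of the arithmetic. It assumes the common HDFS defaults of a 128 MB block size and a replication factor of three; the function name `hdfs_footprint` is made up for this illustration and is not part of Hadoop.

```python
import math

# Assumed defaults for illustration; both are configurable in a real cluster.
BLOCK_SIZE_MB = 128   # common default HDFS block size
REPLICATION = 3       # common default replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, raw MB consumed across the cluster).

    A file is split into blocks of at most block_size_mb; the last block
    only occupies as much space as it actually holds, so raw usage is
    simply the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


if __name__ == "__main__":
    blocks, raw_mb = hdfs_footprint(1000)
    print(f"A 1000 MB file: {blocks} blocks, {raw_mb} MB of raw disk")
```

In other words, every terabyte you want to keep in HDFS needs roughly three terabytes of physical disk under these assumptions, which matters when you plan hardware purchases.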

While HDFS is one way to store data, it’s not the only one that will work with a Hadoop warehouse. Amazon Simple Storage Service (hereinafter just “S3”) is another option. Like HDFS, it’s designed to handle massive amounts of information, to be very analytics-friendly, and to scale readily.

On the face of it, HDFS and S3 are quite similar, but there are some notable differences. The biggest one is the first of five facts that we’ll tackle:

#1. HDFS Relies on Physical Storage; S3 Is Cloud-Based.

Both HDFS and S3 are scalable, but they scale in different ways. HDFS is designed around physical storage, so if you need to add more storage down the line, you’ll have to literally add more storage to your systems. This can get complicated (adding more hard drives or even more machines) and expensive.

S3 is cloud-based: all your data is on Amazon’s servers. It scales automatically as you need it; aside from how much you pay per month, you don’t have to worry about changes from massive influxes of data. The AWS S3 page says “S3 stores any amount of content and lets (clients) access it from anywhere”.

#2. Your Risk of Losing Information Is Especially Small with S3.

As we already discussed, HDFS stores data in multiple locations by default. But unless you back it up offsite, all of those copies sit in one physical location. The chance of losing data in HDFS is small on a large cluster (which stores a lot of data) but somewhat bigger when you’re storing a smaller amount of data.

Amazon designs S3 for 99.999999999% (eleven nines) durability of objects over a given year. By Amazon’s own illustration, if you store 10,000,000 objects, you can expect to lose a single object once every 10,000 years on average. This is because Amazon not only duplicates data, it keeps the duplicates on servers in different physical locations.
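
A durability figure is easier to judge as an expected loss rate. The sketch below assumes Amazon's published design target of 99.999999999% annual durability for S3, which corresponds to a yearly loss probability of about 1e-11 per object. The function names are invented for this example, and the result is a statistical expectation, not a guarantee.

```python
# Assumed annual probability of losing any one object, derived from the
# eleven-nines design target (1 - 0.99999999999).
ANNUAL_LOSS_RATE = 1e-11


def expected_annual_losses(num_objects, loss_rate=ANNUAL_LOSS_RATE):
    """Expected number of objects lost per year."""
    return num_objects * loss_rate


def years_per_lost_object(num_objects, loss_rate=ANNUAL_LOSS_RATE):
    """Average number of years between single-object losses."""
    return 1.0 / expected_annual_losses(num_objects, loss_rate)


if __name__ == "__main__":
    objects = 10_000_000
    years = years_per_lost_object(objects)
    print(f"{objects:,} objects: about one loss every {round(years):,} years")
```

For ten million objects, this works out to roughly one lost object every 10,000 years on average.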

#3. S3 Is Cheaper, But …

HDFS is dependent on physical storage and S3 is cloud-based, but both come with associated monthly costs. For HDFS, this is a combination of the price of hardware and the cost of maintenance. While it might seem cheaper to just buy storage, ComputerWeekly says “While raw drive prices are pennies per GB, the actual cost of enterprise storage is much higher.”

S3 is both simpler and cheaper; your cost is based mainly on how much you store, with prices around $0.02 per gigabyte per month (requests and data transfer out of S3 are billed separately).
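
As a rough illustration of how the storage part of the bill scales, here is a small sketch. The price of $0.02 per gigabyte per month is the approximate figure quoted above and is an assumption, not a current quote; real pricing varies by region, storage class, and volume, and the function name `s3_monthly_storage_cost` is hypothetical.

```python
# Assumed price for illustration only; check current pricing before budgeting.
PRICE_PER_GB_MONTH = 0.02


def s3_monthly_storage_cost(stored_gb, price_per_gb=PRICE_PER_GB_MONTH):
    """Estimate the monthly storage charge in dollars.

    Covers storage only; request and data transfer charges are not included.
    """
    if stored_gb < 0:
        raise ValueError("stored_gb must be non-negative")
    return stored_gb * price_per_gb


if __name__ == "__main__":
    ten_tb_in_gb = 10_000  # 10 TB, using 1 TB = 1,000 GB
    cost = s3_monthly_storage_cost(ten_tb_in_gb)
    print(f"10 TB stored: about ${cost:,.2f} per month")
```

Under these assumptions, 10 TB costs about $200 per month to store. Keep in mind that S3 bills for one logical copy, while HDFS needs raw disk for every replica it keeps.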

#4. … HDFS Has Better Performance

Because HDFS stores data on the same machines that do the data processing, jobs can read from local disks instead of pulling data over the network, so performance is generally faster. Local storage has this advantage.

#5. They’re Both Secure

Both HDFS and S3 have built-in security measures, including user authentication and file system permissions. In the case of S3, it’s good to know that data can be encrypted and securely uploaded.

So, here’s the skinny on Big Storage: both HDFS and S3 will work fine as data storage systems. There are differences in cost (S3 is cheaper), scalability (S3 is easier) and speed (HDFS has it here). The deciding factor is whether your data analytics are going to be primarily cloud-based or if you prefer local storage. Many cloud-first data systems use HDFS as a local cache for S3; others use S3 exclusively. What works for you will depend on your organization’s budget, expertise, and setup.

Authored by LK Sharma, Director, Head of Technology, Big Data and Business Intelligence at Absolutdata Analytics and Vrinda Gupta, Analyst at Absolutdata Analytics
