About scale-out storage ARM-ification and $/GB

I came across two interesting news items this week – OpenIO introducing a 96-HDD appliance for its object storage platform and Western Digital launching 12 and 14TB disks!

At first glance, if you put the two together, it sounds crazy: just think about a single 96-slot appliance full of 14TB disks, which means 1.3PB in a 4U box, or 13PB in a datacenter rack. It sounds crazy, but in reality the design is totally different from what you’d expect, and it is absolutely brilliant!
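For those who like to check the math, here is a quick sketch of the density figures above. The ten-boxes-per-rack figure is my assumption (it is what the 13PB number implies), and the function names are purely illustrative.

```python
# Back-of-the-envelope density math for a 96-slot, 4U chassis.
# Decimal units (1 PB = 1000 TB), as drive vendors quote them.

def chassis_capacity_tb(slots: int, disk_tb: float) -> float:
    """Raw capacity of one chassis in TB."""
    return slots * disk_tb

def rack_capacity_pb(chassis_tb: float, chassis_per_rack: int) -> float:
    """Raw capacity of a rack in PB."""
    return chassis_tb * chassis_per_rack / 1000

box_tb = chassis_capacity_tb(slots=96, disk_tb=14)
# Assumption: ten 4U boxes per rack, leaving a little room for switches.
rack_pb = rack_capacity_pb(box_tb, chassis_per_rack=10)

print(f"{box_tb:.0f} TB raw per 4U box")
print(f"{rack_pb:.2f} PB raw per rack")
```

That works out to 1344TB per box and 13.44PB per rack, all of it raw capacity before any data protection.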

Is a 14TB HDD too big?

14TB is a lot (the 12TB is based on PMR technology while the 14TB is based on SMR), and as far as I know HDD vendors are expecting to release 20 and 25TB HDDs in the not too distant future (but I must also admit that some are skeptical about this roadmap).

No matter what the future holds for us, 14TB is a lot for a 3.5″ HDD, and it’s quite unmanageable with all traditional storage architectures. RAID makes no sense at all (whether it’s single, dual or even triple parity!): losing a 14TB disk could easily become a nightmare, with very long rebuilds impacting performance the whole time (and that’s without taking into account that triple parity RAID sucks performance-wise).

Distributed RAID mechanisms or, better yet, erasure coding, could be a solution. Blocks are distributed across a very large number of disks and, thanks to an N:N rebuilding mechanism (many disks read and many disks write in parallel), the impact is limited… but how many disks can you fit in a single system? (For example, IIRC an HPE 3PAR 20800 can have 1920 disks max, but I’m pretty sure this number could be halved for 3.5″ drives… and I’m not too sure you’d buy such a powerful, expensive array just for the capacity!)
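To put rough numbers on the rebuild problem, here is a minimal sketch comparing a traditional rebuild onto a single spare with a distributed rebuild spread across many disks. The throughput figures (150MB/s for a dedicated spare, 20MB/s per disk for a throttled distributed rebuild) and the 100-disk pool are assumptions I picked for illustration, not measurements from any product.

```python
# Rough rebuild-time estimate for a failed 14TB disk.
# Throughput figures are illustrative assumptions, not measured values.

def rebuild_hours(disk_tb: float, mb_per_s: float, writers: int = 1) -> float:
    """Hours needed to rewrite disk_tb of data when `writers` disks
    each absorb an equal share at mb_per_s."""
    total_bytes = disk_tb * 1e12
    bytes_per_second = mb_per_s * 1e6 * writers
    return total_bytes / bytes_per_second / 3600

# Traditional RAID: everything is rebuilt onto one spare disk.
raid_hours = rebuild_hours(disk_tb=14, mb_per_s=150)

# Distributed layout: 100 surviving disks each take a slice,
# throttled to 20 MB/s so foreground I/O keeps most of the bandwidth.
distributed_hours = rebuild_hours(disk_tb=14, mb_per_s=20, writers=100)

print(f"single spare: {raid_hours:.1f} h")
print(f"distributed:  {distributed_hours:.1f} h")
```

With these assumptions the single-spare rebuild takes about 26 hours at full speed (and much longer on a busy array), while the distributed one finishes in about 2 hours. The model ignores parity computation and network limits, so take it as an order-of-magnitude comparison only.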

Go Scale-out then!

Let’s think scale-out then. Easier and cheaper, right? Well… maybe!

Since you can’t think of a 12 or 14TB HDD as a performance device, the lowest $/GB is very likely what you are aiming for. And how many disks can you fit in a modern storage server? Between 60 and 90, depending on a few design compromises you have to accept. But hey! We’re talking about something between 840 and 1260TB in 4U, and this is absolutely huge!

Huge, in this case, is also a synonym for issues. You solve the problem of the single disk failure, but what happens if one of these servers stops? That could easily become a major nightmare! In fact, this solution is unfeasible for small clusters, and in this case small refers only to the number of nodes and not to capacity. 10 nodes, 1 rack, equals 12PB of storage. It’s raw storage, but even if we take into account a 40% capacity loss for data protection, we are still above 7PB! Losing a node in this scenario means losing 1/10th of that, more than 700TB!!! Think about rebuilding data, metadata and hash tables for all of that. What will it take to get your cluster back to full speed? Well, it is true that some storage systems are more clever than others and can rebuild quickly, but it’s still a massive job to do…
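Here is the same scenario as a small sketch, so you can plug in your own numbers. It uses the 90-disk, 14TB, 10-node configuration discussed above; the 10GB/s cluster-wide rebuild rate is a purely hypothetical figure of mine.

```python
# Blast radius of losing one dense storage server.
# Capacity figures follow the scenario in the text; the rebuild rate is an assumption.

DISKS_PER_NODE = 90
DISK_TB = 14
NODES = 10
PROTECTION_OVERHEAD = 0.40   # share of raw capacity spent on data protection

raw_pb = DISKS_PER_NODE * DISK_TB * NODES / 1000
usable_pb = raw_pb * (1 - PROTECTION_OVERHEAD)
per_node_tb = usable_pb * 1000 / NODES   # data to re-protect when a node dies

# Hypothetical aggregate rebuild rate across the surviving nodes, in GB/s.
REBUILD_GB_PER_S = 10
rebuild_hours = per_node_tb * 1000 / REBUILD_GB_PER_S / 3600

print(f"raw: {raw_pb:.2f} PB, usable: {usable_pb:.2f} PB")
print(f"lost with one node: {per_node_tb:.0f} TB")
print(f"time to re-protect at {REBUILD_GB_PER_S} GB/s: {rebuild_hours:.0f} h")
```

Even with a generous 10GB/s of rebuild bandwidth, re-protecting roughly 756TB takes about 21 hours, during which the cluster is both degraded and busy.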

A simple workaround exists of course, but it doesn’t make any sense from the $/GB perspective. Putting fewer disks on more nodes is easy but it simply means more CPUs, servers, data center footprint and power… hence a higher $/GB.

Making nonsense work

Even taking the ability to scale for granted (and I know it’s not always the case), a larger number of nodes introduces a lot of issues and higher costs. More of everything: servers, cables, network equipment, time and so on. In one word, complexity. And, again, not all scale-out storage systems are easy to manage, with easy-to-use GUIs, etc.

I think that OpenIO, with its SLS, has found the right solution. Their box is particularly dense (96 3.5″ HDDs or SSDs!!!) but the box is the least interesting piece of the solution. In fact, density is just a (positive) consequence.

You can think of SLS as a complete scale-out cluster-in-a-box. Each one of the 96 slots can host a nano-node, which is a very small card with the hard disk in the back and equipped with a dual-core ARMv8 CPU, RAM, flash memory and two 2.5Gb/s Ethernet links. The front-end connector, very similar to what you usually find on a SAS drive, is plugged directly into the SLS chassis just as it is for a normal hard disk in a JBOD.

All the 96×2 Ethernet links are connected internally to two high-speed 40Gb/s Ethernet switches. The switches have 6 actual ports that can be used for back-to-back expansion of the chassis or for external connectivity.

The failure domain is one disk, which equals one node. And hardware maintenance can become lazier than before: you can afford to lose many disks before going into the datacenter and swapping all of them in a single (monthly?) operation.
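A small sketch shows why the smaller failure domain matters and why monthly maintenance is plausible. The 2% annualized failure rate (AFR) is an assumption of mine, not a figure from OpenIO or any drive vendor, and the helper names are hypothetical.

```python
# Failure-domain comparison: dense server cluster vs. nano-node chassis.
# The annualized failure rate (AFR) is an assumed figure.

def capacity_share_lost(failure_domains: int) -> float:
    """Fraction of capacity affected when one of N equal domains fails."""
    return 1 / failure_domains

def expected_failures_per_month(disks: int, afr: float) -> float:
    """Average number of disk failures per month for a given AFR."""
    return disks * afr / 12

server_cluster = capacity_share_lost(10)   # 10 dense servers, one fails
sls_chassis = capacity_share_lost(96)      # 96 nano-nodes, one fails
monthly = expected_failures_per_month(disks=96, afr=0.02)

print(f"dense server failure: {server_cluster:.1%} of capacity")
print(f"nano-node failure:    {sls_chassis:.2%} of capacity")
print(f"expected failures per chassis per month: {monthly:.2f}")
```

Under these assumptions a single failure touches about 1% of a chassis instead of 10% of a cluster, and a full chassis loses on average 0.16 disks per month, so batching the swaps looks reasonable.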

Not that this is a new idea. I first heard about it years ago, and projects like Kinetic are going in the same direction, not to mention that a Ceph-based cluster was built on ARM not so long ago. This one is just more polished and refined. It is a product that makes a lot of sense nonetheless and has a lot of potential. And, truth be told, the hardware components are designed by Marvell. But, again, it’s still the software that does all the magic!!!

OpenIO’s object storage platform, SDS, has a very lightweight backend, allowing it to run smoothly on a small ARM-based device. Even more so, SDS has some unique characteristics when it comes to load balancing and data placement, making it scalable and perfectly suited to this kind of infrastructure. A nice web GUI and a lot of automation in cluster management are the other key components to get it right. They briefed me a couple of weeks ago and they were able to get a new 8TB nano-node up and running in less than a minute, with no intervention other than the disk swap! (And as far as I can see from their internal design, 12 or 14TB disks won’t change much.)

They claim $0.008/GB/month (for a 96x8TB SLS-4U96 box with a 3-year support contract) and I think it is incredibly low.
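To see what that price means for a whole box, here is a quick calculation. I am assuming the figure is in US dollars and applies to raw capacity in decimal units; neither point was spelled out in the claim, so treat the result as indicative.

```python
# What the quoted price implies for a full 96x8TB chassis.
# Assumptions: US dollars, price applied to raw capacity, 1 TB = 1000 GB.

PRICE_PER_GB_MONTH = 0.008
DISKS = 96
DISK_TB = 8
CONTRACT_MONTHS = 36   # 3-year support contract

raw_gb = DISKS * DISK_TB * 1000
monthly_cost = raw_gb * PRICE_PER_GB_MONTH
contract_cost = monthly_cost * CONTRACT_MONTHS

print(f"raw capacity: {raw_gb / 1000:.0f} TB")
print(f"per month:    ${monthly_cost:,.0f}")
print(f"over 3 years: ${contract_cost:,.0f}")
```

That is 768TB raw for roughly $6,144 per month, or about $221,000 over the three years of the contract.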

I didn’t get the chance to ask about performance figures, but the first customers have already received their SLS-4U96 and in January I’ll be able to meet one of them. I can’t wait!

Closing the circle

No matter what you think, HDDs or SSDs are going to have similar problems: large capacities are challenging. We are talking about 14TB HDDs now, but vendors are already talking about future 50 and 100TB SSDs, with 32TB size already available! They sound big today, but you’ll be storing more in the future…

Data is growing, and you have to think about putting it somewhere at the lowest cost. The problem is that you can’t trade durability, resiliency and availability for a lower $/GB, especially because you want it cheap, but not really really really really cold! We like it cold-ish or, better, warm-ish, and with the increase of use cases for object storage in the enterprise (backup repositories, storage consolidation, collaboration, big data lakes and so on) you absolutely need something which can give the best $/GB, but without too many compromises or risks.