Thinking In Public
I built my first NAS around 2018, following a DIY guide. I was looking for something compact, and the build used a small Mini-ITX case with an Atom D processor and room for up to eight 2.5” drives. So I bought eight 2TB drives, followed the instructions, did proper burn-in testing of the memory and drives, and got the whole thing running.
It was absolutely wonderful. I had ~10TB of storage space, and I was excited to be able to store things without a bunch of hard drives scattered all over the place. I also wouldn’t have to deal with the annoyance of a dying drive forcing a lengthy recovery from backup. The only problem? The performance of the thing was awful.
I spent years trying to diagnose the problem. Eventually, I just installed more drives in my 2010 Mac Pro, put them on the network, and called it a day. I did figure out part of it: I had a really awful network switch in the path that wasn’t passing traffic through at a reasonable speed; it was either worn out, buggy, or otherwise broken. A couple years ago I looked into this NAS some more. I still had it, and even though I had since built a much larger, much faster NAS, I still wondered what the problems were. Was it FreeNAS? Was it FreeBSD? Was it something else? Eventually I figured out that the drives I had purchased used Shingled Magnetic Recording (SMR), a newer way of storing data on a hard drive. The model we’re all used to is called Conventional Magnetic Recording (CMR).
So what’s the problem? For the most part it comes down to this: SMR drives are absolutely terrible for RAID. In fact, from some of the research I did, people felt these drives were useless for all purposes. I certainly got that impression. I decided it was best to just leave this NAS sitting there, since I didn’t really need it for anything.
A couple years ago, I decided that this little server might actually be more useful than I had thought, and that I would replace all of the SMR drives with some NAS-rated SSDs. But sometimes life just sweeps you up, and a number of years passed without my actually installing those drives. I’ve been doing some recent upgrades to my servers, mostly to prepare for doing more local AI work, and I decided that I should finally put this server back in service. The Atom D processor is quite old, but it still works. The thing has 32GB of RAM, and once I install the 8 drives I can actually get some great throughput out of it. (For various reasons I’m planning on using mirrored vdevs instead of RAIDZ, since I want to optimize for IOPS over redundancy.)
This is all great, but I had never figured out what to do with the SMR drives I had lying around. I actually asked Claude what I should do, and the answer I got was less than ideal: throw them in a drawer and forget about them. Basically, junk them.
This didn’t sit well with me. I had the space to install 6 of them in my servers, and that space wouldn’t be used for anything else. Why throw away 12TB of storage space just because the technology has been deemed “bad”? But also, what about that impression that I (and many others) have of this technology being bad? Dropbox has been able to deploy it with great success. And clearly the drive manufacturers believe in this technology. So why has it been so difficult to adopt?
As I was thinking through this, it felt like a perfect opportunity to employ the Verbund principle: the idea that the by-products of one process should become the inputs to another, minimizing the waste created by the combined processes. These drives were currently a by-product, but could I use them as input to another process?
I haven’t done a ton of research, but the dislike for SMR drives comes from a specific set of circumstances. To understand why, we first need to establish that there are two kinds of SMR drives: host managed and device managed. Wait, let’s back up a bit; first I should explain how these drives actually work. If you already know, just skip the next section.
With a conventional hard drive there are tracks on the surface. Each track is made up of sectors, the individually readable and writable segments of a disk, usually 512 bytes or 4,096 bytes in size. These tracks and sectors do not overlap, so as long as you’re writing an entire sector, there is a 1:1 write ratio: writing 1 sector writes 1 sector. You might be thinking, “Of course it’s a 1:1 ratio! What else would it be?” Well, if you don’t properly align the filesystem to the sectors on disk, the filesystem’s idea of a sector can end up straddling two physical sectors. In that case, writing 1 logical sector actually writes 2 physical sectors. This is called Write Amplification, and it doesn’t just happen with misaligned HDDs; it also happens with SSDs, and it can happen even on properly aligned filesystems if writes are smaller than the sector size. Anyway, the point is that with conventional magnetic recording, the drive can write any sector at any time without needing to touch other sectors.
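To make the alignment point concrete, here’s a small back-of-the-envelope sketch in Python. The 1 MiB partition offset and 4,096-byte sectors are just illustrative numbers, not anything measured from a particular drive:

```python
SECTOR = 4096  # physical sector size in bytes

def sectors_touched(offset, length, sector=SECTOR):
    """How many physical sectors does a write of `length` bytes
    starting at byte `offset` have to touch?"""
    first = offset // sector
    last = (offset + length - 1) // sector
    return last - first + 1

# An aligned 4 KiB filesystem block maps onto exactly one physical sector.
print(sectors_touched(offset=1048576, length=4096))        # -> 1

# Shift the partition start by 512 bytes and the same block now straddles
# two physical sectors: 2x write amplification before the drive even
# does anything clever.
print(sectors_touched(offset=1048576 + 512, length=4096))  # -> 2
```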
This is not the case with Shingled Magnetic Recording, where the tracks are written overlapping one another. So if you want to write one sector, you may actually need to rewrite several, to restore the data on the tracks you partially overwrote. Why do this? It turns out that the size of a write head is pretty much fixed, but read heads can be made much smaller. By shingling the tracks down to the width of the read head, we can pack much more data onto a platter and increase the density of the drives. That’s great: you can get up to 25% more space just by overlapping the tracks. What’s not great is that now you have a write amplification problem.
This is solved by breaking the drive into zones. Within each zone the tracks overlap, but buffer tracks between zones ensure that writes in one zone don’t disturb another. Since writing to the beginning of a zone would require rewriting all subsequent data in that zone, each zone is effectively append-only.
In a way, this makes SMR drives somewhat like solid state drives. With SSDs you can write at the page level, but you must erase at the block level. With SMR drives you can append within a zone, but modifying existing data requires a read-modify-write cycle that relocates the data into a new zone.
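Here’s a toy model of that append-only behavior, just to make the read-modify-write cost tangible. The zone size, in-memory representation, and API are all made up for illustration; real host managed drives expose this through ZBC/ZAC commands rather than anything like this:

```python
class Zone:
    """A toy append-only zone: you can append at the write pointer,
    but you cannot overwrite in place."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []          # data written so far, in order

    def append(self, block):
        if len(self.blocks) >= self.capacity:
            raise IOError("zone full")
        self.blocks.append(block)

def rewrite_block(old_zone, index, new_block, fresh_zone):
    """Modifying one block means copying the zone's live data into a
    fresh zone with that one block replaced -- the read-modify-write
    cycle described above."""
    for i, block in enumerate(old_zone.blocks):
        fresh_zone.append(new_block if i == index else block)
    old_zone.blocks.clear()       # the old zone can now be reset and reused
    return fresh_zone
```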
Alright, that should be enough information about SMR drives to understand the next section.
With a host managed drive, all of the zones are exposed directly to the operating system. The drive won’t behave like an ordinary block device: zones must be written sequentially, so you need a filesystem or translation layer specifically designed to work with zones. That means host managed drives are not drop-in replacements for CMR drives; if you want to use them, you need software written specifically for them.
With a device managed drive, the zones and their management are all handled in firmware on the drive. The drive presents itself as a regular HDD and the operating system shows it as an ordinary block device. This is a lot like how SSDs work, where the firmware implements a translation layer that lets it relocate and move pages as it sees fit. With device managed SMR drives, a similar translation layer handles caching writes, moving data, and performing garbage collection.
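On Linux you can see this split directly: the kernel reports each disk’s zone model through sysfs. A small sketch, assuming a reasonably recent kernel; note that device managed drives hide their zones behind the translation layer, so they report the same thing as a CMR drive:

```python
import glob, os

# The kernel exposes the zone model at /sys/block/<disk>/queue/zoned:
#   "none"         - conventional drive, or a device managed SMR drive
#                    hiding its zones behind the translation layer
#   "host-aware"   - zoned, but tolerates random writes
#   "host-managed" - zoned, sequential writes only; needs zone-aware software
for path in sorted(glob.glob("/sys/block/*/queue/zoned")):
    disk = path.split(os.sep)[3]
    with open(path) as f:
        print(f"{disk}: {f.read().strip()}")
```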
Much like how SSDs don’t handle lots of small modifications and deletes well, SMR drives don’t handle lots of random writes well. Unfortunately, most modern filesystems structure their metadata as a lot of random writes, and for more advanced setups, like ZFS or anything using RAID, the number of small writes explodes. It’s not that SMR drives can’t handle random writes; they can. But they absorb them in a cache, essentially some CMR tracks on the disk, and once that cache fills, the drive is forced to perform synchronous garbage collection, moving the data out of the cache and onto the SMR tracks. So a user sees sustained high performance until the cache fills, and then performance craters while the foreground garbage collection happens.
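A crude model of that cliff, with entirely made-up numbers, shows the shape of the problem: writes run at full speed until the persistent cache is full, and then every additional write pays for garbage collection:

```python
def simulate(random_write_mib, cache_mib=20_000,
             cached_mib_s=150.0, gc_mib_s=15.0):
    """Rough model of a device managed SMR drive under sustained random
    writes: full speed while the CMR cache has room, then throttled to
    garbage-collection speed. All numbers are illustrative, not measured."""
    fast = min(random_write_mib, cache_mib)
    slow = random_write_mib - fast
    seconds = fast / cached_mib_s + slow / gc_mib_s
    return random_write_mib / seconds  # average MiB/s over the whole run

for total in (5_000, 20_000, 100_000):  # MiB written
    print(f"{total/1024:6.1f} GiB of random writes -> "
          f"{simulate(total):6.1f} MiB/s average")
```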
If your filesystem doesn’t know this is happening, it can cause all sorts of problems. It’s not just slow throughput: the drive might be seen as timing out, get kicked out of the array, and trigger resilvers that cause even more problems.
So it’s not that these drives are bad. They just don’t work well with filesystems that think they are CMR drives. The bigger problem was that drive manufacturers started selling SMR drives without telling people that they were SMR drives, which led to these widespread problems and a general dislike for the technology.
So these drives aren’t useless; they just need to be used in a way that works with the translation layer instead of pretending it doesn’t exist. I would prefer a world where this translation happened in the OS kernel, so that the zones of all models of these drives could be exposed, but that is not the world we live in. It’s also just an easier sell to say “these drives are drop-in replacements for what you have.” And for most people, it’s likely the case that these drives work fine.
But this all gives me an opportunity. I’ve always wanted to build a filesystem, but I’ve never really been able to convince myself there was a good reason to do so. The production-ready filesystems work great for nearly all use cases, and I really enjoy having and using ZFS. But these drives presented a novel opportunity. Could I design an analysis suite that extracts enough information from these drives to understand their performance characteristics, and then design a filesystem that runs on them without exceeding those limits?
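The analysis side could start very simply: hammer a drive with writes and watch per-chunk latency over time to find where the cache-fill cliff sits. A rough sketch of that harness, assuming Linux, a scratch drive, and a placeholder device path (this overwrites whatever is on the target):

```python
import mmap, os, time

def write_latency_profile(dev_path, chunk_mib=64, total_gib=8):
    """Write to a raw device in fixed-size chunks, recording how long
    each chunk takes. On a device managed SMR drive the per-chunk time
    should jump once the CMR cache fills.
    WARNING: this destroys data on dev_path."""
    chunk = chunk_mib * 1024 * 1024
    buf = mmap.mmap(-1, chunk)            # page-aligned, as O_DIRECT requires
    buf.write(os.urandom(chunk))
    fd = os.open(dev_path, os.O_WRONLY | os.O_DIRECT | os.O_SYNC)
    samples = []
    try:
        for i in range((total_gib * 1024) // chunk_mib):
            start = time.monotonic()
            os.write(fd, buf)
            samples.append((i * chunk_mib, time.monotonic() - start))
    finally:
        os.close(fd)
    return samples  # (MiB written so far, seconds for that chunk)

# e.g. samples = write_latency_profile("/dev/sdX")  # scratch drive only!
```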
Something as simple as pairing the drives with an SSD for metadata could result in rather good sustained performance over time. Bundling them together in a RAID fashion, but designing the RAID software to be log structured or append only, means each drive could be managed properly. Each of these drives will begin doing garbage collection after some idle period. If you have many of them installed in a system acting as a single pool, the write load can be spread across them so that each drive gets enough idle time to clear out its cache and take more sustained writes. At least in theory. Layer the parity information on top of this, and it could be possible to design a RAID system that actually works with SMR drives instead of against them; a sketch of the scheduling idea is below. Down the road, maybe a new vdev type could be created for ZFS specifically for host managed and device managed SMR drives.

I would start by building a suite to analyze the drives, then build a filesystem that works with them, and go from there. The possibilities here are really exciting. This filesystem could work with both host managed and device managed drives, and it gives me an opportunity to build something I’ve always wanted to build. This is the Verbund ideology in action: what was junk is now the input I needed for a process I’ve wanted to run for a while. This is using by-products properly. I knew I needed to push back against Claude, and by doing so I’ve found my way to an excellent conclusion. Now I just have to plan out the project.
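As a first approximation, the scheduler could track how much un-collected write load each drive has accumulated and steer new stripes toward drives that have had idle time. A toy round-robin-with-rest sketch; the cache sizes and drain rates are invented placeholders that a real version would pull from the analysis suite above:

```python
import time

class SmrDrive:
    """Toy model of one drive: writes consume cache headroom, idle time
    (background GC) restores it. Numbers are placeholders."""
    def __init__(self, name, cache_mib=20_000, drain_mib_s=30.0):
        self.name = name
        self.cache_mib = cache_mib
        self.drain_mib_s = drain_mib_s
        self.dirty_mib = 0.0
        self.last_seen = time.monotonic()

    def headroom(self):
        # Credit back whatever background GC could have drained while idle.
        now = time.monotonic()
        self.dirty_mib = max(0.0, self.dirty_mib -
                             (now - self.last_seen) * self.drain_mib_s)
        self.last_seen = now
        return self.cache_mib - self.dirty_mib

    def write(self, mib):
        self.dirty_mib += mib

def place_stripe(drives, stripe_mib, copies=2):
    """Pick the `copies` drives with the most cache headroom for the next
    mirrored stripe, so heavily written drives get a rest to run GC."""
    targets = sorted(drives, key=lambda d: d.headroom(), reverse=True)[:copies]
    for d in targets:
        d.write(stripe_mib)
    return [d.name for d in targets]
```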
Verbund is a powerful concept. It allows us to gain much more efficiency out of the processes in our lives. I’m constantly looking for opportunities to reuse old things. Sometimes it works and sometimes it doesn’t, but having it as a default means I can stumble across cool projects like this one.
If I’m successful with my endeavor and actually build this filesystem, it’s possible that in the future SMR drives will no longer be derided as awful and useless. It might even wind up being the case that using them in a ZFS pool becomes ideal. In theory, they’ll be cheaper per TB than CMR drives, because the same platters can store more data.
Either way, I’m excited to see where this goes. And I'm definitely not throwing away my SMR drives.