Thinking In Public
I've built a lot of software in my career, but there have been a few aspects of most modern tech stacks that I've come to dislike. One of the most significant is the mismatch between layers of the stack. The technologies we use make sense within a single layer, but when one looks at the stack as a whole it's rather easy to find duplicated effort, or even layers that attempt to undo and then redo work that is already being handled at another layer. The result is a lot of software fighting against itself within the same application.
It's not necessarily bad to have different layers in tension with each other. A lot of the time it's a fact of life. Too much of the time, however, it is unnecessary, driven by the abstractions that our dependencies expose to us. Sometimes this tension is caused by breaking what should be a single layer into multiple layers. This happens most often with data storage.
There are many reasons why we split application functionality from data storage, but one that's always seemed the most plausible is that most of the web is written in PHP. With PHP there is no resident memory: each request starts from scratch. Any persistent storage needed to be pushed to another service that was running, like a database process or a cache process. This separation was never ideal, but it was necessary.
With programming languages like Go, Rust, JavaScript, and plenty of others, this separation is unnecessary. Applications built in these languages benefit from having a long running process, and therefore can efficiently store data themselves instead of needing to rely on a separate process for database or cache functionality. Still, though, the database and cache technologies developed over time are robust and straightforward to implement and operate.
The issue is that the majority of database and cache services were designed for a single-process world. That is, they were not designed with high levels of concurrency or parallelism in mind. Sure, within a single process (or really, a single shared memory space) these applications can embrace concurrency or parallelism. But once one needs to step beyond what can be done with shared memory, they become more burdensome. Their disadvantages begin to outweigh their advantages.
Because we haven't spent years building these services into our running applications, there is little in the way of knowledge or experience building database or cache functionality directly into our software. In recent years, we've doubled down on this architecture, developing databases and cache systems that can do concurrency and parallelism outside a single shared memory space. These have worked, but they are quite complicated, and for the most part they attempt to emulate the previous shared memory systems.
I have grown frustrated with this status quo. The research and knowledge we had in the 90s and early 2000s is much different from what we know now. Even some of the technologies we've developed in the 2010s have started showing their age as the hardware world becomes increasingly concurrent and increasingly parallel.
For a long time now, linearizability has been the desired consistency model for data storage systems. This is despite the fact that effectively no software requires linearizability. Much of the time when software developers believe they need it, it's because they have not thought through their system sufficiently. Or it is because the linearized model is easier for software engineers to grapple with. Either way it has led to massive complexity and scaling issues. Sure, we have Paxos and Raft, but the latter is incredibly difficult to scale, sometimes requiring things like MultiRaft. The former has been notoriously difficult to implement and operate.
But these issues come from the need to be linearizable. If we design our data and our software in a slightly different way, the need for linearizability mostly disappears. Unfortunately, so does any reason to use Raft, because Raft produces a strictly ordered log and the consensus system is capable of neither concurrency nor parallelism (that's the entire point of it). Paxos, however, is capable of both.
In fact, when combined with another recent technology, we can begin to create data storage systems that assume concurrency and parallelism first, adding in linearizability only in the specific areas where we need it and when we need it. That other technology is CRDTs.
While many people's view of CRDTs is stuck in 2014, the research has advanced considerably over the last decade. Instead of state-based merging, many modern approaches to CRDTs lean heavily on the idea of a log of operations. Using one of a variety of logical clocks, we can causally order the operations in the log. Furthermore, I have a theory that one can implement Paxos on top of a CRDT, removing several of the common implementation issues with Paxos.
To that end, I've decided that I want to start building applications where the data storage and application logic are part of the same software layer. No external database server, and each application instance is also a storage instance. By using a CRDT for the bulk of data, what were issues of trying to achieve synchronization now become broader distributed systems problems. We can design our data and application to handle a state where full synchronization has not occurred. Furthermore, we can shift from having the server be the point of coordination to the end-user being the point of coordination.
I've been working over the last year toward a database design that uses CRDTs at its core. I've gone through several iterations and I think I've prototyped and explored enough to know what I want to build at this point. And it all starts with a clock.
I've designed a CRDT clock. It is 10 bytes in size, which means it serializes cleanly to a 16-character string using base32 encoding. The clock consists of three components: a prefix, a tick value, and a suffix. The prefix allows for differentiation of clocks, so clock values (called ticks) from a development environment that accidentally wind up in production can be easily identified. The tick value is a 56-bit monotonically increasing value, which should provide enough values for the majority of applications. Finally, the suffix provides identifiers for different server instances (or threads within those instances).
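A minimal sketch in Go of how such a clock might be laid out and serialized. The 56-bit tick and 16-character base32 string come from the description above; the exact split of the remaining 24 bits between prefix and suffix is an assumption here (1 byte of prefix, 2 bytes of suffix), as are all names:

```go
package main

import (
	"encoding/base32"
	"fmt"
)

// Clock is a 10-byte (80-bit) CRDT clock tick. Field sizes for Prefix
// and Suffix are assumptions; only the 56-bit tick is from the design.
type Clock struct {
	Prefix uint8  // environment discriminator (e.g. dev vs prod)
	Tick   uint64 // 56-bit monotonically increasing value
	Suffix uint16 // server instance (or thread) identifier
}

// Unpadded base32: 10 bytes encode to exactly 16 characters.
var enc = base32.StdEncoding.WithPadding(base32.NoPadding)

// Bytes packs the clock big-endian so byte order follows tick order.
func (c Clock) Bytes() [10]byte {
	var b [10]byte
	b[0] = c.Prefix
	t := c.Tick & ((1 << 56) - 1) // keep only the low 56 bits
	for i := 0; i < 7; i++ {
		b[1+i] = byte(t >> (8 * (6 - i)))
	}
	b[8] = byte(c.Suffix >> 8)
	b[9] = byte(c.Suffix)
	return b
}

// String serializes the clock to a 16-character base32 string.
func (c Clock) String() string {
	b := c.Bytes()
	return enc.EncodeToString(b[:])
}

func main() {
	c := Clock{Prefix: 1, Tick: 42, Suffix: 7}
	s := c.String()
	fmt.Println(len(s), s) // always 16 characters
}
```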
Each CRDT operation (which I've called an instruction) is assigned a unique clock tick. The suffix means that we can have independently incrementing clock instances. When synchronizing, all clocks set their tick value to the highest tick value among them. For logical simplicity, any skipped clock ticks are considered no-op instructions. This yields a CRDT that can be totally ordered.
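The synchronization and ordering rules above can be sketched as follows. `Tick`, `Less`, and `Sync` are illustrative names, not part of the actual design; the suffix-as-tiebreaker rule is my reading of how the total order would fall out:

```go
package main

import "fmt"

// Tick pairs a tick value with the instance suffix that produced it.
type Tick struct {
	Value  uint64 // 56-bit tick value
	Suffix uint16 // instance identifier
}

// Less defines a total order over ticks: compare tick values first,
// then break ties by suffix so every instance agrees on one order.
func (a Tick) Less(b Tick) bool {
	if a.Value != b.Value {
		return a.Value < b.Value
	}
	return a.Suffix < b.Suffix
}

// Sync returns the highest tick value among a group of instances;
// on synchronization, every clock jumps to this value.
func Sync(ticks []uint64) uint64 {
	max := uint64(0)
	for _, t := range ticks {
		if t > max {
			max = t
		}
	}
	return max
}

func main() {
	fmt.Println(Sync([]uint64{3, 9, 5}))     // all instances jump to 9
	fmt.Println(Tick{9, 1}.Less(Tick{9, 2})) // suffix breaks the tie: true
}
```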
One of the benefits of this design is that consensus can be achieved with some clever design. If consensus is needed, all the instances synchronize, and then the log can be analyzed to determine which of the potential choices was inserted into the log first. This is a simple form of consensus that is not fault tolerant, well, at least not quickly fault tolerant. The protocol could allow for the declaration of a process as faulty, in which case any clock ticks less than a specific value would be rejected by all other processes. This would allow for consensus even in the presence of failures, assuming the faulty declaration can occur in some stable way.
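A sketch of the "first into the log wins" rule, assuming log entries carry the tick-plus-suffix ordering described earlier; `Instruction` and `Decide` are hypothetical names:

```go
package main

import (
	"fmt"
	"sort"
)

// Instruction is a log entry: a clock tick plus a proposed value.
type Instruction struct {
	Tick   uint64
	Suffix uint16
	Value  string
}

// Decide sorts the merged log into its total order and returns the
// earliest proposal; every synchronized instance computes the same
// order, so every instance agrees on the same winner.
func Decide(log []Instruction) string {
	sort.Slice(log, func(i, j int) bool {
		if log[i].Tick != log[j].Tick {
			return log[i].Tick < log[j].Tick
		}
		return log[i].Suffix < log[j].Suffix
	})
	return log[0].Value
}

func main() {
	log := []Instruction{
		{Tick: 7, Suffix: 2, Value: "b"},
		{Tick: 5, Suffix: 1, Value: "a"},
		{Tick: 7, Suffix: 1, Value: "c"},
	}
	fmt.Println(Decide(log)) // "a" was inserted first
}
```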
But consensus is not usually what people want. What they do want is consistency, in the general sense. That is, people generally want to see a consistent view of the world, and when that view changes they want it to happen in a logical manner. A combination of Strong Eventual Consistency and a causal token users provide on each request allows for an always forward-moving view of the world. The key difference between Eventual Consistency and Strong Eventual Consistency is that the strong variant requires that processes with the same state produce the same results. By using a causal token, the user can place a floor on how old or stale a view of the world they are willing to accept.
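The causal-token floor can be sketched as a single guard, assuming the token is simply the highest tick the client has previously observed; `CanServe` is a hypothetical name:

```go
package main

import "fmt"

// CanServe reports whether a replica may answer a request: the client
// presents the highest tick it has observed as a causal token, and the
// replica only answers once its own clock has caught up to that floor.
// Otherwise it would either wait, or redirect to a fresher replica.
func CanServe(localTick, causalToken uint64) bool {
	return localTick >= causalToken
}

func main() {
	fmt.Println(CanServe(100, 90)) // replica is fresh enough: true
	fmt.Println(CanServe(80, 90))  // replica is too stale: false
}
```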
A clock is just the first part of a long journey toward a data storage solution that is both embedded within an application and capable of being both concurrent and parallel.