Thinking In Public
Distributed systems is a misnomer. Like many things in software engineering, it was given a name before we actually understood what it was, and that name therefore embodies an overly restricted view of the concept. Even those who believe they have a strong understanding of distributed systems often find that their knowledge, however expansive, covers a relatively small corner of the space.
The problem of scope doesn’t apply just to distributed systems; it also applies to how we’ve named several concepts within distributed systems. Take Eventual Consistency as an example. When most people hear that term, they think of a system with a set of traits that sets it apart from systems with Strong Consistency. These two models of consistency are often defined in terms of the CAP theorem, which states that in the presence of a network partition a system can either be available (and accept writes) or consistent (and maintain safety).1
The problem with this view of consistency, whether eventual or strong, is that it conflates different concepts. This is similar to the conflation between static and strong typing. Much like how there are two different spectrums of typing, static to dynamic and strong to weak, there are two different spectrums at play when talking about Eventual and Strong Consistency. Eventual Consistency is about the convergence of state and what happens in the presence of latency, while Strong Consistency is about the observation of histories. Put another way, Eventual Consistency is about the state of a system while Strong Consistency is about the observation of that state. More simply: Eventual Consistency is about writes and Strong Consistency is about reads. Just as it would be wrong to label a dynamically typed system as “not typed”, it is wrong to label an Eventually Consistent system as “not consistent”. In actuality, Eventual Consistency is more about conflicts: whether they are possible, and how readily they can be observed.
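To make the “convergence of state” framing concrete, here is a minimal sketch using a grow-only counter, one of the simplest conflict-free replicated data types. The `GCounter` class and replica names are illustrative assumptions, not anything from the text; the point is that replicas accept writes independently and still converge, no matter the order in which they exchange state.

```python
class GCounter:
    """Grow-only counter: each replica increments its own slot;
    merging takes the per-slot maximum."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other):
        # Merge is commutative, associative, and idempotent, so
        # replicas converge regardless of delivery order or repeats.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())

# Two replicas accept writes independently...
a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()
b.increment()

# ...and converge once they exchange state, in either order.
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 3
```

Nothing here says anything about what a reader observes mid-flight; the guarantee is purely about where the state ends up, which is exactly the writes-side framing above.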
This level of nuance extends to the entirety of Distributed Systems. A series of introductory books on distributed systems frames it as a field defined by a single concern: much like how parallel computing is about efficiency, and real-time computing is about on-time computing, distributed systems (or distributed computing) is about uncertainty.
One way to deal with uncertainty is to remove it. This is the corner of distributed systems where most of the research is focused, and what most software engineers seek to achieve. The difference between Eventual Consistency and Strong Consistency therefore reduces to whether one wants to design software that copes with uncertainty, thereby reducing system complexity, or software that eliminates uncertainty, thereby increasing system complexity.
The guarantees that Strong Consistency provides are statements about what state can possibly be observed. This ensures that certain assumptions can be made: for example, that once a value is observed, previous versions of that value will not be. The problem is that in eliminating this uncertainty we also eliminate desirable qualities like concurrency and parallelism.
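The “previous versions will not be observed” guarantee can be sketched as a client-side check. This is an illustrative toy, not a real protocol: the version numbers, the `MonotonicReader` class, and the replicas-as-functions are all assumptions made for the example.

```python
class MonotonicReader:
    """Client-side guard that refuses to let reads travel backwards."""

    def __init__(self):
        self.last_seen = 0  # highest version observed so far

    def read(self, replica):
        version, value = replica()  # replica returns (version, value)
        if version < self.last_seen:
            # A stale replica answered; surface the problem rather
            # than let the client observe an older version.
            raise RuntimeError("stale read: would observe an old version")
        self.last_seen = version
        return value

reader = MonotonicReader()
fresh = lambda: (2, "new")
stale = lambda: (1, "old")

assert reader.read(fresh) == "new"
try:
    reader.read(stale)  # older than what was already observed
    saw_old_version = True
except RuntimeError:
    saw_old_version = False
assert saw_old_version is False
```

Note that the guard constrains reads, not writes: the stale replica still holds its old value; the system simply refuses to let that history be observed.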
But eliminating uncertainty does not necessarily require eliminating concurrency and parallelism. While a system is simpler to understand when all three are eliminated, doing so effectively creates a virtual system, since distributed systems inherently have concurrency and parallelism (even if that concurrency is time based instead of space based). This makes such a system tough to implement well and even tougher to implement correctly.
Raft is a good example of an algorithm that does this. Within Raft itself there is no concurrency and no parallelism; that is the entire point. From the algorithm’s perspective, only one thing happens at a time. This eliminates many of the benefits of a distributed system, but it does maintain the quality of redundancy, which is often what people want.
Paxos, on the other hand, is an algorithm (well, a family of algorithms) that is closer to Eventual Consistency than Raft is. Even the base Paxos algorithm allows for a high degree of concurrency and parallelism: there is no strict ordering requirement. While implementing Paxos has been notoriously challenging, many of those challenges aren’t the fault of Paxos itself but of the surrounding algorithms required to build a full system. Paxos is a much smaller-scope consensus system than Raft, but that also makes it far more flexible. For example, most of the operations within a software system do not conflict, and of those that do, a small change in data modeling can eliminate most conflicts. Therefore, running everything through a heavyweight consensus system like Raft doesn’t actually provide benefit. Much of the desired outcome, even in terms of observable state, can be handled with other distributed systems concepts, like values or identifiers that encode causal history.
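One well-known identifier that encodes causal history is the version vector. The sketch below is a minimal, assumed implementation (the `compare` function and its return labels are mine, not from the text): comparing two vectors tells us whether one write causally preceded the other, or whether neither saw the other and we have a genuine conflict.

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dicts of replica_id -> counter).
    Returns 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(k, 0) <= vv_b.get(k, 0) for k in keys)
    b_le_a = all(vv_b.get(k, 0) <= vv_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"   # a's write is in b's causal past
    if b_le_a:
        return "after"
    return "concurrent"   # neither saw the other: a real conflict

assert compare({"a": 1}, {"a": 2}) == "before"
assert compare({"a": 2, "b": 1}, {"a": 2, "b": 1}) == "equal"
assert compare({"a": 1}, {"b": 1}) == "concurrent"
```

Causally ordered writes need no consensus at all; only the `"concurrent"` case represents a conflict that some machinery (merge function, or consensus) must resolve.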
With Paxos, one can blend both of these worlds together. A more advanced Paxos algorithm invokes the more expensive consensus machinery only when there is an actual conflict. As long as there are no conflicts between values, Paxos operates more like a synchronization protocol than a consensus protocol. When a conflicting value is proposed, however, the system automatically switches to doing full consensus. This blending means it is far easier to build a system that uses Paxos and maintains efficiency through concurrency and parallelism than it is to build a system that uses Raft and maintains efficiency (which often requires something like MultiRaft).
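The shape of that blending can be shown schematically, in the spirit of the fast-path Paxos variants. This is not a protocol implementation; `conflicts`, `submit`, and the same-key conflict rule are placeholder assumptions standing in for real conflict detection and a real consensus round.

```python
def conflicts(cmd_a, cmd_b):
    # Assumption for illustration: commands conflict when they
    # touch the same key. Real systems use richer conflict rules.
    return cmd_a["key"] == cmd_b["key"]

def submit(command, in_flight, consensus_rounds):
    """Fast path when nothing in flight conflicts; otherwise fall
    back to a full consensus round to order the command."""
    if any(conflicts(command, other) for other in in_flight):
        consensus_rounds.append(command)  # slow path: order via consensus
    in_flight.append(command)             # non-conflicting: just synchronize

in_flight, rounds = [], []
submit({"key": "x", "op": "set"}, in_flight, rounds)
submit({"key": "y", "op": "set"}, in_flight, rounds)  # disjoint: fast path
submit({"key": "x", "op": "set"}, in_flight, rounds)  # conflict: consensus
assert len(rounds) == 1  # only one command paid the consensus cost
```

The takeaway matches the text: if most operations are like the first two, the system spends most of its time synchronizing rather than agreeing, and concurrency survives.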
Distributed Systems, Eventual Consistency, and Strong Consistency are all poorly named concepts. Assumptions about what those terms mean have led us to design algorithms and systems that don’t actually serve our needs or that overly constrain us. When not framed through the perspective of uncertainty, and of designing software that embraces uncertainty, distributed systems wind up as overly complex, bug-prone systems that unnecessarily give up benefits. While it is unlikely that we can fix the names, we should advance our understanding of the concepts so we can build better software.
Even this view is partially wrong, because consistency is more about reads than writes. The restriction is not on whether one can write, but on whether subsequent reads can observe the writes that have occurred. ↩︎