Distributed Systems and the End of the API

This is a written (expanded) narrative of the content from a talk I first gave at PhillyETE on April 23rd, 2014. It mostly follows the flow of the presentation given then, but with a level of detail that I hope enhances clarity of the ideas therein. The talk’s original slides are available, though the key illustrations and bullet points contained therein are replicated (and somewhat enhanced) below. When audio/video of the talk is published, I will update this page to link to it. Discussion about this piece has taken place on Hacker News, reddit, and Lobsters.

I have two claims of which I would like to convince you today:

  1. The notion of the networked application API is an unsalvageable anachronism that fails to account for the necessary complexities of distributed systems.
  2. There exist a set of formalisms that do account for these complexities, but which are effectively absent from modern programming practice.

The outline of this piece is as follows:

Let’s start with some definitions, just because “distributed systems” and “API” are broad terms that have many connotations to different people. I’m talking about a particular (large) subset of the many different aspects of these things.

Distributed Systems

There is a tongue-in-cheek saying that captures this generalization and defines what a distributed system is:

A distributed system is one where a machine I’ve never heard of can cause my program to fail.
Leslie Lamport

This is something of a meme among those that study and work within fields related to distributed systems. (We’ll return soon to Mr. Lamport and his contributions beyond this catchy phrase, shown slightly tweaked above.) There are many formal definitions of “distributed system” all over the internet, but my personal riff is:

A distributed system is one that is comprised of multiple processes that must communicate to perform work.

The bottom line is that, given the ambient nature of the networks that surround us and the dependence we have upon those networks for so many of the tasks our programs, clients, customers, and users take for granted, nearly every system we build is a distributed system. Unless your software runs in a totally isolated environment — e.g. on an air-gapped computer — you are building a distributed system.

This is problematic in that distributed systems exhibit a set of uniformly unintuitive behaviours related to causality, consistency, and availability. These behaviours are largely emergent, and spring from the equally unintuitive semantics of the non-locality of the parts of those distributed systems and the networks that connect them. None of these behaviours or semantics are related at all to those which we — as programmers and engineers — are typically trained and acclimated to expect and reason about.

Note that even if you are doing something small, or “normal”, or common, you are not immune to these challenges. Even the most vanilla web application is definitionally a distributed system. By sending data from one computer (e.g. a server) to another (e.g. your customer’s web browser), you end up having to contemplate and address all sorts of problems that simply don’t exist when you run a program in a single process on a single machine that doesn’t touch the network: consistency, coping with non-availability (i.e. latency, services being down, timing-related bugs caused by long-running computations or things as simple as garbage collection), dealing with repeated messages from clients with spotty connections, and more. If you’ve not been bitten by these things, that is evidence of luck (or, of your not having noticed the problems yet!), not of your being immune, or otherwise that what you’ve built is somehow not a distributed system and so isn’t subject to these challenges.

Application Programming Interface (API)

Again, there are lots of definitions of the “API” term; this is mine:

An API is the set of names we interact with in our programming languages and libraries.

This concept — naming the data, types, objects, methods, modules, namespaces, and so on in our programming practice — has been around almost as long as people have been programming computers. More concretely, its origins lie in the context of imperative programming languages and libraries. Common examples include the standard C library, Win32, POSIX, and all the names of things in your favourite Javadoc, Rubydoc, library or function man pages, and so on.

When people started hooking up computers over networks, it was natural to want to carry along this notion of using language as a way of naming things we interact with programmatically.  Of course, assigning names is not an issue; doing so is essential to being able to talk about them at all. The problem is that APIs are fundamentally only nominal descriptions. We assign names to the data and operations and objects our programs manipulate, but there is nothing in such a shorthand that talks about the semantics or limits or capabilities of those things. To abuse a Perlisism, the name of a thing is a perfect vehicle for hiding information.

Thus, when we are doing some network programming:

client.send(data);

There is nothing about the API we’re touching that informs us of the semantics of the operation we’re attempting to perform. What are the bounds on how long it will take? What are its preconditions? What it should do (or what we should do additionally) in the case of a timeout, or failure, or partial success? The undefinededness is unlimited. While the lack of clearly identifiable, defined semantics is problematic even in the case of non-networked, non-distributed, single-process context programming, it is near-fatal when you are working on a networked, distributed system: you must take on the job of identifying and understanding those semantics yourself, and situating your use of APIs such that they account for the failure modes implied by the network and other aspects of the distributed system you’re building.

The API Problem

So, using names as the sole source of conceptual leverage while programming is problematic, and particularly so when it comes to network-related APIs. This practice is likely so deeply woven into how we have cast the instruction of computers as a language problem that any change would be a monumental conceptual shift, far beyond what I want to address here. Rather, I’d like to take a look at other aspects of APIs in the distributed systems context that I believe are fatal.

Here is a rough, temporal progression of the major network API technologies that have seen wide use:

RPC → #{DCOM, CORBA} → RMI → XML-RPC → SOAP → REST → #{‘REST’, Thrift}

As far as addressing the problems of distributed systems go, these are all fundamentally equivalent. It’s possible to quibble about this on the edges, insofar as e.g. RPC mechanisms obviously have particular implementation details that are not shared by the large umbrella of HTTP APIs (which I denote above as “‘REST'”), but I claim that they all share and provide the same set of fundamental semantics:

  • a request/response lifecycle
  • always synchronous
  • presume a point-to-point communication topology (i.e. two party, client/server communication)
  • operations provided by these mechanisms are nearly always imperative, implying mutable data models and side-effecting operations
  • few constraints placed on acceptable data models or representations are imposed

These characteristics are shared by the programming language heritage that originally defined the notion of “an API”: one caller invoking operations synchronously on one callee, providing a request (arguments) in exchange for a response (a return value, sometimes), where the data involved can effectively be anything — very often mutable structures that are modified in-place — all while causing some side effect.  Unfortunately, these characteristics are a large part of why APIs are a fatally bad mismatch for the job of supporting the communications between actors in a distributed system, which are frequently not best characterized as request/response, more commonly asynchronous than not, are often not ideally point-to-point, and which suffer as much or more from mutable data models and side-effecting operations as “regular”, single-process, non-networked programs.

(At this point, you might object, as some of the API technologies I called out above do provide limited support for e.g. asynchrony.  For example, an HTTP API could reasonably respond to a request that will be processed asynchronously with a 202 Accepted status, plus a URL of a “status” resource that would eventually redirect to the final result of processing the original request. However, allowances like this amount to nothing more than idioms and “best practices”, not substantive solutions to the limitations described above. Implying otherwise would be analogous to claiming that a less capable programming language A might be argued to be equivalent to a more capable language B, simply by dint of both being Turing complete.)

Tweet text: "People hear 'RPC', and giggle, smugly shaking their head while pounding out REST integrations."

I pick on REST because it is ostensibly the pinnacle of modern API technologies, but you can replace its mention in this tweet with any other request/response, synchronous, point-to-point API mechanism that is fundamentally imperative and typically side-effecting.

APIs: Sisyphean programmer convenience

While APIs provide a particularly narrow portal to the outside world, they have done wonders for programmer convenience. Here I am talking about one particular convenience that programmers implicitly value very highly when working with network APIs, and that’s the maintenance of an isomorphism between function or method calls as found in typical imperative programming languages, and the various manifestations of network APIs, regardless of the particular technology in use.

For example, here are three common ways one might see usage of a single “create” API call exemplified:

api.create(arg1, arg2);

POST http://site.org/resource/create
[arg1, arg2]

PUT http://site.org/resource
[arg1, arg2]

The first could be either an in-process method call that doesn’t touch the network, or an RPC call that uses programming language-native stubs or a client library and which implicitly reaches across a network. The second is typical of a strictly imperative HTTP API construction. The third is the “proper” REST corollary that uses URIs, HTTP verbs, and resource representations in particular ways to achieve the specific semantics defined by REST.

In each of these cases, nominal concerns are paramount, but network and other operational semantics, failure modes, notions of causality and consistency are entirely unaccounted-for by the different API mechanisms. These unstated costs are implicitly retained by the programmer that happily uses these mechanisms, and must be balanced by either manual accommodations, error handling, and deep considerations of application-specific causality and consistency invariants…or user-visible and business-tangible failures.

One particular irony is that, while more modern API technologies have been built with full knowledge of the accepted failings of RPC, many providers of e.g. HTTP APIs provide and specifically encourage programmers to use “client libraries” for one’s preferred language, effectively restoring the classic RPC programming experience, where it’s difficult to impossible know which calls will result in strictly local computation, and which will incur one or many network operations and all of the complications that that entails.

stripe-http-api

A snapshot of the structural isomorphism in the Stripe “create plan” HTTP API and the corresponding method in its Ruby client library. BTW/FWIW, I love Stripe; it’s actually the high quality of their online documentation that enabled such a convenient juxtaposition of the current “state of the art”.

It is as if we are supposed to carry on calling what look like regular functions and methods, and that everything will just work, like magic!

This is convenient, and feels very familiar, but does not address the problem space that we are working in.

The API: an anachronism

“The API” as the fundamental point of integration between parts of a distributed system is an anachronism, a hold-over from other, simpler programming contexts that predate “distributed systems” as a discrete concept:

  • APIs necessitate an intense coupling between actors in many ways, principally:
    • By admitting only two-party client/server architectures, despite the actual myriad of application and network topologies that exist; anything else needs to be constructed out of this point-to-point primitive, or pushed into silos that do provide for more complex topologies (e.g. queues, system busses, etc).
    • By allowing arbitrary data representations to be used: in order to talk to your API, I need to reconcile how my client represents data with what your server expects and produces. It is impossible to later talk to another equivalent API without repeating this process.1
  • Common computational tasks necessitate asynchrony.  Though some patterns and common workarounds exist, APIs are fundamentally always synchronous.
  • APIs, as a class of technology, disavow the fundamental complexities of distributed systems:
    • Coping with network failure modes (latency, disconnection, offline contexts)
    • Making consistency choices relevant to our applications, and being aware of the choices being made for us by our underlying technology
    • Being aware of the impact that our consistency choices have on availability
    • How our data models and data representations influence and sometimes determine what is and is not possible in terms of concurrent activity by different actors

To be precise, there are seven basic network (and therefore, system) topologies, and an infinite number of permutations combining them:

500px-NetworkTopologies.svgWhen you consider the systems you build, you’ll recognize these topologies within whatever diagrams you draw. Meanwhile, APIs provide no way to naturally talk about anything other than two-party client/server communication. Just as any limited programming language can lean on Turing completeness to claim capability, we can lean on APIs and say that we can build any topology we like with them; while factually correct, this is an flaw, not a feature.

Acknowledge the network or fail

What distinguishes a network API from a “regular”, in-process, single-node API is…the network. That’s a tautology, but one worth making given the apparent primal desire of programmers to ignore the network in their modern practice (viz. RPC and client libraries for network APIs that obfuscate their true nature).

When you try to get computers to work across a network, the network is an integral part of the resulting larger system. If you do not account for its inherent nature, your larger system will fail along with that network, in ways large and small. The first step in addressing this is to understand the ways in which networks can fail, and how those failures present to your programs.

I’m by no means an expert in network failure modes (keeping up with Kyle Kingsbury’s work with Jepsen is a good first step, if you are coming from a programming-with-databases background and are looking for something accessible), but the basic network failure modes include:

  • Partitions
  • All the vibrant colours of latency
    • Variable latency, which can be caused by everything from overloaded switches to garbage collection stopping the world on a remote machine
    • Complete and permanent loss of interconnect
    • Offline operation
  • Reordered messages
  • Repeated messages

Many of these types of failures have been most well-discussed in the context of distributed databases, but they apply just as forcefully to the entirety of your systems. A partition among your application servers can cause clustered, in-memory sessions to diverge, leading to invariants being violated both server-side and in users’ web browsers as they are fed data from different divergent copies of their session. Increased latency (whatever the cause) can trip timeouts between different parts of a system, leading to spiking reconnect attempts and thus cascading, catastrophic latency. When a machine becomes totally unreachable, what parts of your system can carry on, and can the data it was serving be recovered elsewhere? What are your users’ expectations when they have no connectivity at all? When a client resends a message because it thought your service didn’t receive its first attempt due to a timeout (except it did), what do you do?

In short:

  • Your network’s problems are your system’s problems. The corollary of this is:
  • My network’s problems are your system’s problems.

That is, the networks used to reach the people and devices and vendors at the edges of your system are almost never owned and operated by you, yet your system is subject to their failures as well.

Different technologies and different data models will provide greater or lesser inherent protection from the capriciousness of networks. Of course, network API mechanisms provide exactly none, and the data representations that are commonly shipped over those mechanisms (e.g. JSON, XML, YAML, plain text, and so on) are equally of no help.

Consistency decisions affect everything

Consistency and consensus are huge topics, and so my treatment of them here will be wanting. Much like networks and their failure modes, you cannot design or build a robust distributed system without being aware of them. However, unlike networks — the problems of which are effectively a force of nature — you have within your power the ability to choose what your system’s consistency guarantees will be, presumably based on the needs of your customers, users, and organization.choose-wiselyThere are many types of consistency, and there’s no way that I could enumerate all or even some of them. Further, there is not a spectrum of consistency; the set of available options forms something more like a multidimensional space, as there are a lot of considerations that go into choosing a set of consistency guarantees. But, the most common cases as far as I can tell include:

  • Strict linearizability, where all actors in a system synchronously acknowledge each write to any shared data, yielding a global total order of all operations. Systems with this characteristic act as if their entire state is held within a single atomic reference, as found in many programming languages with such a concurrency primitive. This is incredibly expensive in terms of computational and communication overhead, but perhaps corresponds most closely with our intuitions about what happens when, and where.
  • Causal consistency, where logically temporal relationships between dependent changes to shared data are tracked, yielding a partial order of all operations. Causal consistency requires much less consensus overhead for any given request to proceed (compared to strict linearizability), but ensures many desirable properties within a system with such a guarantee. One of these is the ability to read your own writes; without some mechanism to enforce causal consistency, it is possible for e.g. a web client to write a value through an HTTP API, and then see the previous value when it performs a read or query some time afterwards.
  • Eventual consistency, where concurrent writes converge (perhaps with conflicts) such that different readers will all eventually see the same result at some point in the future when all prior writes have been applied. This implies no order at all to operations within a system; in effect, the only guarantee of eventual consistency is that of liveness, a term used in distributed systems literature to imply that all actors within a system will continue to propagate writes until all actors have seen every write.

The choices you make with regard to consistency have a compensating impact on the availability characteristics of your system. This relationship is known more formally as the CAP theorem (as introduced, and then proven), but the basic dynamic of the tradeoff should be obvious given a couple of moments thinking about the relationship between consistency and availability:

  • Insofar as every actor in your system has to synchronously acknowledge each write made by others, then those actors (a.k.a. servers, or services) cannot accept any new work until that acknowledgement process completes. During this time, your system will appear to be down (unavailable) to external parties.
  • If no consensus is required among different actors (as in the various eventual consistency models), then every actor can always accept new work. Barring resource exhaustion or exigent factors (such as a network lapse), your system will always be up (available) and responsive to external parties.

Note that different parts of a system may have different consensus requirements (and thus its consistency guarantees), and those requirements and guarantees may change over time (potentially rapidly and regularly in order to accommodate user expectations and coherence demands of different kinds of data).

Just as with networks, different technologies and data models provide more or less leverage and flexibility in accommodating or enabling different degrees of consensus, consistency, and thus availability. Unsurprisingly, network API mechanisms again provide exactly nothing: they presume nothing, except that an open socket is available for use.  It is up to you to consider these issues properly, and either prevent or resolve any inconsistencies within your system.

Likewise, while plain text does offer some meagre ways to address concurrent inconsistent changes (e.g. two actors modifying the same text document concurrently can often be resolved sanely via a diff mechanism), the “richer” data representations that are favoured by most API services and clients (again, JSON, XML, etc) are fundamentally opaque and in general make reconciling independent changes impossible in a consistent way without special, often domain-specific intervention. Further, typical implementations of data structures in our programming languages provide no ways to represent or reconcile concurrent changes at all. Together, this means that concurrent actors moving state around or representing operations using these data structures and representations have no generally applicable way of resolving conflicting concurrent changes. This forces programmers to regularly re-implement such resolution mechanisms; or, more commonly, rely completely upon centralized backend databases to allow concurrency and enforce consistency, semantics that become less reliable further away from those centralized authorities.

What do we want?

Okay, I’ve just rambled on for a long time about how distributed systems have some challenges above and beyond “regular” programming, and that APIs as we know them today do not meet those challenges and may even exacerbate them by nudging us towards glossing over their existence and scope. It’s a pretty bleak picture, and you may walk away at this point and consider me a crank.

I’m at peace with that, but I defy anyone to claim that programming networked services and applications is easy. Or that the interactions between e.g. disparate databases, application servers, caching services, load balancers, client browsers, a couple of message queues, four vendor services APIs, and that nutty Hadoop job are understandable. Or that errors stemming from the incidental complexity of the technologies involved in building such a system are rare. Or that when something does go wrong in that easy-to-build, understandable, rarely-failing distributed system, it’s easy to diagnose the problem.

None of these things are true. Of course, I’m not saying that building distributed systems will ever be as easy or as simple as other types of programming and engineering. Further, I’m not even saying that APIs are themselves the root of all of this difficulty and complexity (though they play their part); APIs just happen to be a convenient, familiar, and obvious pain point that exemplifies the primitive nature of the raw materials we’ve imported from other programming contexts to build distributed systems.

Building and reasoning about distributed systems should be easier and simpler than it is today.2 I’d like to suggest that, in order to achieve this, we need different tools than the ones we have at our disposal now.

Let’s step back. What do we want? In any distributed system we build, we want two things:

  • Communication — the ability to share data among the various actors participating in our system
  • Computation — the ability to consume and transform that data, producing new data as a result that is perhaps itself communicated

Everything else about building distributed systems is incidental. We need building blocks that:

  • allow us to control and apply these capabilities however our particular needs demand
  • actively prevent us from committing obvious errors in reasoning about the irreducible complexity of distributed systems
  • even better, are architected such that committing such errors are fundamentally impossible

That’s a tall order, and not one that can be fulfilled completely; hell, we haven’t even fulfilled one of them — computation — in the much simpler context of non-distributed, non-networked programming. But, when looking for better tools to work with, we need to keep looking towards these two primitive capabilities, lest we steer elsewhere.

We’ve been here before

The history of programming languages has seen many analogous transitions.

Decades ago, writing software required working with extremely low-level programming languages, such as assembly and C. Relative to the higher-level languages developed since, those languages are very difficult to use effectively, have been and continue to be associated with much higher rates of error, and often prevent even contemplating building software systems approaching scales that are commonplace today. I think the same kind of progression can and must happen within the context of designing and building distributed systems: we simply cannot keep dragging along this notion that one can continue to work at or near the level of spitting data out of sockets and expect to build robust and understandable distributed systems on top of such primitive primitives.

Put more simply, going back to picking on APIs, what will complete this analogy?


assembly/C : Java/Python/Clojure :: APIs : ???

Just as our predecessors identified problems with machine code and assembly and constructed abstractions in higher-level languages, we must rise above the metal of sockets, RPC, APIs, and so on.  I believe that much of this work is in identifying what not to do and what not to offer, just as higher-level programming languages are largely characterized by their constraints compared to what came before. In both contexts, what we find to be confusing and error prone — e.g. userland goto, manual memory allocation, and in-place mutation of objects in the case of programming languages, and things like accommodating network failure modes and implementing appropriate consensus mechanisms and consistency levels in distributed systems — are exactly the things that we need to abstract away from in day-to-day concerns in order to make progress.

My appeal to authority

Thankfully, many people smarter than I have been thinking about these sorts of problems in the context of distributed systems for some time. Perhaps not as long as people have been thinking about programming language problems, but close: soon after networks “happened”, people became aware of the difficulties of having multiple computers agree on things and compute manipulations over the data that they share.

In many ways, Leslie Lamport’s 1978 paper Time, Clocks, and the Ordering of Events in a Distributed System was the starting shot for serious, principled research into the challenges of distributed systems. (And the beginning of a body of work which contributed significantly to Mr. Lamport’s receipt of the Turing Award earlier this year.) What I appreciate most about this paper is that he talks about the difficulties of communication among concurrent, distributed actors as a physics problem. Contemplating the challenges of a distributed system reveals this as more than a convenient analogy, and closer to actual fact: the essential qualities of the distance between separate actors and the progression of time that paces their communications define what is and is not possible.

Mr. Lamport talks at some length about this perspective in a conversation he had with Erik Meijer (@ 15m30s):

Most people view concurrency as a programming problem or a language problem. I regard it as a physics problem.

Mr. Lamport’s work and this inclination to reach for provable approaches to problems grounds modern research in distributed systems. While much of what we do in software might be charitably characterized as engineering (nevermind Science, or even “science”), there is some comfort to be had that a thoughtful approach to distributed systems problems should flow from a more reliable basis than even the most well-designed library, framework, or language. The emphasis being on provable formalisms over clever engineering and implementation details gives me hope that solutions to these problems, once found, may be as reliable as the mathematics and understanding of physics upon which they are built.

Sound approaches nearing practicality

Over the last ten years or so, a number of sound approaches to some of the fundamental problems of distributed systems have been developed. I’ll focus on two strongly-related threads of progress:

Both of these approaches constrain the types of operations that your system can perform in order to ensure convergence over time of changes to data shared by uncoordinated, concurrent actors, and to eliminate network failure modes as a source of error. Achieving this vastly simplifies the challenge of building robust distributed systems, just as certain advances in language design simplified the problems of programming by constraining what we could express in our programs.

As I foreshadowed earlier, what’s essential about both of these approaches is not novel engineering, but the provable formalisms upon which they are premised: CALM has its roots in temporal logic, while CRDTs depend upon the algebra of semilattices.  You should understand these formalisms, at least in passing, if you are to leverage their concrete implementations with confidence and to good effect.

(Before you think that this is all in the realm of purely academic research, consider that, for example, CRDTs are already in production, and are included as a banner feature in the impending next release of Riak. These are not strictly high-minded proposals, but actionable progress in developing an understandable basis for building robust distributed systems.)

I happen to be far more familiar with CRDTs and semilattices than I am with CALM’s temporal logic, and so that will be my focus for the next section. (Much of it is also true for CALM and implementations of it.) The good news is that the math in question is really quite easy: really little more than secondary school-level algebra.

(If you’re reading this prior to May 15th, 2014, you may be interested in a talk I am delivering on that date at Papers We Love in New York discussing CRDTs using the paper that originated the ‘CRDT’ term as a vehicle.)

(Bounded) join semilattices

A lattice is a partially-ordered set where, for any two members of the set, there exists both a least upper bound (a value that is “greater than” both members, however that relation is defined for the types in question) and a greatest lower bound (a value that is “less than” both members). These relations are called the join and the meet, and if a partially-ordered set admits only one of them (i.e. it is partially-ordered in only one direction), then it is a semilattice. Finally, if a lattice has an absolute maximum or absolute minimum (often called top and bottom), it is said to be bounded.

CRDTs are premised upon join semilattices: partially-ordered sets of values for which a “greater than” relation holds for every member in the set. (Join and meet semilattices are mathematically equivalent; I believe it is largely just a matter of convention that join semilattices are used to describe and illustrate CRDTs, but the choice is not essential to their construction.) Finally, many CRDTs (based on those I am most aware of from the literature) are bounded.

a set semilattice

a set semilattice

Contemplating bounded-join semilattices is reasonably easy once you have seen a couple of the simplest examples.  One would be the set of natural numbers, where the join operation is max. Another common example is a set semilattice, pictured here, where the join operation is set union.

This depiction of semilattices is very common, perhaps because it illustrates both the set-theoretic effect of the join operation (e.g. taking the maximum of two natural numbers yields the higher of the two, and taking the union of two sets yields a set containing all of the members of both sets), and hints at the intuition you should develop about how a semilattice shared among multiple parties might progress upwards through its possible states over time (logically and otherwise). A set semilattice with a join relation of union will never lose information: the value of each member will always increase in size as joins are performed over time, across participants sharing the semilattice. Said another way, this is critical within the context of distributed systems because it means that you can have concurrent modifications being made to a structure like this that is logically shared by many different actors, and those modifications will always yield the same value when joined.

Algebraically, join and meet operations for any semilattice must satisfy three axiomatic properties (shown here with algebraic as well as more programming-oriented formalisms):

  • Associativity, the ability to batch inputs to an operation in any way without affecting the result
    (a \cdot \!\, b) \cdot \!\, c = a \cdot \!\, (b \cdot \!\, c)              f(f(a, b), c) == f(a, f(b, c))
  • Commutativity, the ability to change the order of inputs to an operation without affecting the result
    a \cdot \!\, b = b \cdot \!\, a                         f(a, b) == f(b, a)
  • Idempotence, the ability to apply an operation multiple times without affecting the result
    a \cdot \!\, a = a                            f(f(a)) == f(a)

These are really useful characteristics even in our single-node, single-process programs that, when leveraged, allow us to reason much more easily and accurately about our programs’ behaviour in the face of concurrency and parallelism.  They provide treble benefits in a distributed context though. As long as the primitive operations over an implemented data structure maintain these invariants, it is a semilattice, and is sheltered from some of the most problematic network failure modes: commutativity ensures that reordering of network messages has no ill effect, and idempotence does the same if network messages are repeated.

Further, as long as liveness is preserved (the most minimal guarantee of eventual consistency), then semilattice semantics ensure that convergence of modifications and operations initiated concurrently occurs without conflict. Consider: there is no way in which set union (the join operation of a set semilattice) could not reconcile sets modified concurrently by different parties. The result will simply be the union of those sets, and always will be.

The astute observer might point out that not all useful operations are associative, commutative, and idempotent; removing items from a set is one easy example, but anything that might be characterized as causing a loss of information over time (removals, deletions, subtractions, etc) would qualify. The key here is that only the primitive operations over semilattices and CRDTs must satisfy those axioms; other operations can be implemented in terms of those primitives, thus yielding the key characteristics we seek. The good news is that there are good ways to do this. The details of how this is done I’ll leave for another day, or for you to discover.

Data models are everything

Despite the algebraic invariants that semilattices dictate, CRDTs can and have been built representing a wide array of data types, including:

  • counters
  • registers
  • sets
  • (multi)maps
  • dense and sparse lists/vectors
  • partially-ordered sequences
  • trees
  • graphs

This looks like a totally reasonable set of data types, all of which you would expect to have available to you in any modern programming environment. Recall though that all of these data types are Replicated (the ‘R’ in “CRDT”): they presume a network transport (with certain characteristics that I won’t detail here) that will push state or changes to it from one actor to another.

This means that, when using CRDTs to tie your system together, you don’t need to resort to using impoverished representations that simply never come anywhere near the representational power of the data structures you use in your programs at runtime. Instead of working out a way to distill your model data into JSON or XML so it can be shared with other actors in your system, you simply add the data in question to a CRDT, and let its replication implementation carry it abroad, with exactly zero loss of representational fidelity.

A further bit of good news for those of us that appreciate functional programming is that semilattices naturally encourage immutability. Insofar as each element within the set that forms a semilattice always “grows” relative to its predecessors and inputs, the most naive approach to implementing CRDTs (and a reasonable mental model regardless of actual implementations) is to maintain an immutable log of operations being applied or state being added. This means things like histories, rollbacks, and consistent snapshot — features and operations that are typically considered “advanced” today because of the incidental complexity of implementing them using the data and communication substrates that are generally in use — all come effectively for free.

Finally, let’s consider again the API, our original subject of ire. Use cases that might otherwise be addressed via an API can be entirely supersetted by using a CRDT. The transformation is a simple one: change imperative, side-effecting calls like this one:

api.setName(personId, "Chas");

into reified data that you add to a CRDT, which is thus replicated from a “client” to other actors (perhaps a single “server”, if you so choose):

{:person-id person-id :name "Chas"}

The change is subtle, but has a tectonic effects. “Operations”, when cast as data, become computable: you can copy them, route them, reorder them freely, manipulate them and apply programs to them, at any level of your system. People familiar with message queues will think this is very natural: after all, producers don’t invoke operations on or connect directly to the consumers of a queue. Rather, the whole point of a queue is to decouple producer and consumer, so “operations” are characterized as messages, and thus become as pliable as any other data. All of the same leverage applies when you replace APIs with CRDTs.

Pick a programming model

The great thing about CRDTs is that they do not require you to shift your programming practice wholesale. CRDT implementations have generally materialized as libraries, not specialized runtimes or languages, so you can readily use whatever programming language you like with a CRDT. I personally think this is a very good thing, as different languages and runtimes offer different capabilities and excel in different domains. While we’re talking about distributed systems here, much of the computation that we want to do is essentially local even when the inputs and results may be pulled from and pushed to a shared medium. Thus, having the freedom to choose particular programming technologies and tie them together via CRDTs feels like having the best of both worlds.

On the other hand, there are certain approaches to distributed systems that, in attempting to address their unique challenges, propose a total and holistic shift in programming practice. Bloom — the programming language that has been used to research and prototype the CALM theorem — is an effort in this direction. The vast majority of data structures that you manipulate as a matter of course in Bloom are lattices or other monotonic constructs. The tradeoff here is that while Bloom can statically identify which parts of your program are non-monotonic — in join semilattice terms, where your program state descends in the partial order, and thus is exposed to e.g. common network failure modes — Bloom is the only place where you can reap the benefits of that analysis.

Languages aside, I think there are a number of programming models that are particularly well-suited to the sorts of computation and domains to which the largest distributed systems are often applied:

  • Event sourcing
  • Stream-based computation (e.g. Storm)
  • I’m particularly interested in tuple spaces (typified by Linda, of which many implementations exist for modern programming languages) and other sorts of blackboard systems, which I think provide a very useful metaphor for characterizing scale-invariant concurrent distributed computation.
  • Reactive patterns of all sorts have grown in popularity and caché of late, such as FRP. Many of these reactive approaches are quite at home in a distributed context, where you might characterize computational services in general as operations that are triggered in response to data arriving in a local CRDT that matches a pattern or satisfies a query over the data being replicated in from other actors, similar to how triggers in relational databases are fired when a new result is found for a SQL query.

The bottom line is that it’s not at all clear yet which programming models are best suited for use with CRDTs…though I have to say, I do like my options.

With constraints come costs

Recalling the comparison I made earlier to programming languages enforcing progressively more constraints to minimize error and maximize ease of understanding, it’s worth asking what constraints CRDTs, semilattices, and other related approaches necessitate in order to yield their benefits. There are two, as far as I can tell:

It is imperative that CRDT implementations ensure that their operations adhere to the algebraic axioms of semilattices. (I think it is no accident that many CRDT implementors also happen to be very interested in things like property-based testing [e.g. QuickCheck, QuickCheck, test.check, double-check, and so on] and theorem provers like Coq, both of which are good tools for defining and verifying strict invariants like these.) This is definitional, but further means that operations and data types that are not naturally associative, commutative, and idempotent need to be recharacterized in terms of primitives that are. There are costs here (as with any abstraction, this boils down to a certain amount of computational, storage, and transmission overhead), but I’ll submit that the benefits are well worth it.

Secondly, if you want to benefit from the advantages offered by e.g. CRDTs, then you must use them to mediate all interactions between all actors in your distributed system. This implies a certain kind of boil-the-ocean sort of posture that can appear to be very costly, especially if you have a large investment in existing technologies that don’t yield similar benefits. Given my background with and habituation to functional programming and immutable data structures, I think of this constraint as being similar: once convinced of their utility, you generally try to maximize the footprint of the “sane” world that uses pure functions operating over immutable data structures, and minimize the surface area of its interaction with the often quite incomprehensible world of effects and in-place mutation. Likewise, the parts of a system that connect with and share state using CRDTs will be much easier to reason about and debug, so it is natural to want to maximize their use relative to the parts of your system that are tainted by things like usage of imperative APIs and less capable data representations. Both cases have a dynamic that favours those building new systems as opposed to those that maintain or steward large existing systems.

Totally aside from the actual constraints imposed by these approaches, a final word of caution might be that all of this is incredibly new. Even though they offer formalisations of mechanisms that have been employed in the past (very rarely, and never in principled, general-purpose ways), both the CALM theorem and CRDTs have only existed as discrete named concepts for the last few years. The first practical implementations of them are even newer than that. Significant bits of theory and implementation details are still up for grabs (e.g. CRDT “garbage collection” is a bit of a hot topic right now as these things go), so act accordingly. This is powerful stuff, and its potential is immense, but there will be rough waters between now and when CALM implementations and CRDTs are thought of as indispensable and obvious choices.

What do we want?

Earlier in this piece, I laid out what we want when we build a distributed system: communication and computation. I believe that CRDTs (or something like them), with their sound formal basis, give us a way towards building systems that hew close to those essential capabilities, avoiding the incidental complexity of APIs and much of the rest of the incidental complexity that comes along with modern typical distributed programming practice.

I believe these two essential capabilities can be sustained by services (and people!) reactively manipulating a shared substrate of replicated data. CRDTs appear to be an ideal, practical implementation of this substrate that connects disparate actors using reliable replication mechansisms, and thus allow for arbitrary computational models and flexible application and network topologies.

Resources

In addition to the links-as-citations that I have included throughout this piece, I could append here a thorough set of references. However, in doing so, I would effectively duplicate much of Chris Meiklejohn’s great work in compiling his Readings in Distributed Systems. I encourage you to absorb as much as you can from the resources referenced there if you have interest in these topics.

A harder alternative path would be to read about the well-grounded approaches I mentioned earlier; look up unfamiliar terminology, follow their bibliographies, and shout out on Twitter when you get stuck (there’s a good community of researchers and practitioners sharing references and experiences there):

Finally, keep an eye on the Quilt Project, the manifestation of my work in this and other areas. CRDTs are an essential part of its foundation.

Footnotes

  1. This dynamic alone significantly affects the social equation and economy between providers and consumers of network APIs…something I’m eager to discuss at length at some point, but not here, now. back
  2. Anyone claiming otherwise is implying that the status quo is acceptable, where the only parties that can reasonably build large, robust distributed systems are large organizations that have the resources (generally, many millions of dollars yearly) to spend on the engineering talent and bodies that appear to be necessary given the technologies currently in widespread use. back
About these ads