Update Succession in Replicated Key-Value Stores
In an earlier blog we looked at the use of vector clocks for keeping track of temporal relations between events in an asynchronous event system. To recap:
In an asynchronous event system each event is marked by a node-clock pair; intra-node temporal relations between events are based on clock values at a given node; and inter-node temporal relations between events rely on message transmittals being earlier than corresponding message receipts.
And we saw earlier that in such a system, an event E2 is a temporal successor of another event E1 if and only if the vector clock of E2 dominates the vector clock of E1. [The vector clock of an event is a map from the nodes of a system to the latest clock values of those nodes known to have occurred before (or at the same time as) the event.]
In this blog we’ll see how to extend the use of vector clocks to keep track of update succession in update-anywhere replicated key-value stores. This type of store is exemplified by Amazon’s Dynamo, and by the open source system Voldemort. In the literature I have seen so far on these systems, it is assumed without elaboration that vector clocks can be used to represent update succession. But as we’ll see shortly, this assumption is not immediately evident. And to prove it requires some constraints on asynchronous updates.
My aim in this blog is to outline the difference between temporal succession and update succession, and to show what this difference means for the use of vector clocks in update-anywhere data stores.
General Asynchronous Events
In order to demonstrate the use of vector clocks for update succession, I need to digress first to generalize the model of message passing between asynchronous events. The first generalization is to allow events to be both transmitters and receivers of multiple messages. The second generalization is to allow loopback messages from a node to itself.
Figure 1 depicts this more general model.
Figure 1. General Asynchronous Event System
[In Figure 1, the notation E(node, clock) designates an event that occurred at the given node and the given clock value at that node.]
It is easy to extend earlier arguments about the equivalence of temporal succession and vector clock dominance to this more general model. The main difference between the two models is that in propagating vector clocks we may now have to include the vector clocks from multiple message sources in our maximal merge computation.
Asynchronous Updates and Distributed Versions of Data Items
A replicated data store includes a set of key-value pairs, called data items, each of which is replicated to a number of nodes. For high write availability, an update to the value of a data item is allowed to be written at any available node.
Then independent and asynchronous updates of the value associated with a given key may have to be written to different nodes, so that multiple versions of a data item may coexist in the store as a whole. Each such differing version of a data item may, in its own right, carry useful information. Therefore, in general, these versions are not allowed to blindly overwrite each other. For maximum flexibility, coexisting versions of a data item are resolved (merged) by application code specific to the use of each instance of a data store.
This scenario leads to a model of the evolution of a data item in which:
- A read by the application may cause a number of different versions of the data item for a given key to be read from the data store.
- The application creates a single updated version of the data item based on all the versions read.
- The new version is then written and it obsoletes and replaces all the versions read in this particular update operation.
The updated version is then an update successor of each version read. And the versions read for the update are update precursors of the updated version. Of course, the update successor and the update precursor relations are transitive. And we define them to be reflexive as well.
We know that an updated version of a data item should obsolete and replace all of its [proper] precursors. But when the new version of the data item is first written, some of its precursors may not be present at the node of this initial write. And even for those precursors that are present at this initial write node, there are replicate copies at other nodes that also need to be purged. While this new version will be replicated to all replicate nodes, replications may have to take place asynchronously to the original write of this new version. Therefore, the new version of the data item needs to carry with it information about its precursors, so that they can be purged once its replicate copies reach other nodes.
Writes of Data Items as Asynchronous Events
Conceptually, we may consider an entire update operation – including all its reads, their resolution, and the subsequent write of a new version – as an event in a general asynchronous event system. And for the purpose of tracking update succession, we may consider this event as occurring at the node in which the new update version is first written and at the clock value of the write at that node. Looked at in this manner, the corresponding reads can be thought of as messages sent from earlier write events (earlier versions of the data item) to the new update event (the new version of the data item).
The upshot is that if we identify versions of a data item with update (or initial write) events, we have here a system of events that is similar to our earlier general asynchronous event system.
Figure 2 depicts the update succession of versions in this scenario.
Figure 2. Update Succession with Asynchronous Versions
In Figure 2, a version of a data item initially written to node n at clock value c of that node is depicted as V(n, c). Note in particular, in Figure 2, that a version may be the immediate update precursor to two asynchronous versions – as in V(1, 1) being asynchronously updated to V(2, 3) and to V(3, 2) – and that a version may be the immediate update successor of two asynchronous versions – as in V(1, 7) succeeding both V(2, 3) and V(3, 4).
Update Succession versus Temporal Succession
The similarity of our update/version event system and our earlier asynchronous event system depicted in Figure 1, leads us to associate vector clocks with write events (and corresponding versions of a data item) and to try to use them to determine the update succession of versions, and thereby to cause the obsolescence and purge of updated versions.
But before we can make the leap between the two event systems, there is another crucial property of asynchronous event systems that we have yet to establish for write events in a replicated data store: the linear temporal succession of events within each node according to their clock values.
Is there, in fact, a linear order of update succession for a data item within each node according to clock value in an update-anywhere replicated data store? Well, not by default. Following is a trivial counter-example.
Consider two different clients reading the same version of a data item, and proceeding to update it independently at the same node. If the system blindly writes both update versions to the data store, then one can occur at a clock time later than the other. But the second update version is not an update successor of the first: it was not created by reading the first and updating it, and it does not obsolete the first. This is a crucial difference between the event system of update-anywhere replicated data stores, and the general asynchronous event system we saw earlier.
The Case for Read Validation
The only way I know to remove this difference is to assume that in an update, reads are validated within the update transaction at its primary write node for those versions of the data item that were created at that node. If further versions of the data item – later versions than those that were read by the update operation – were created at the node where the update is first written, the update would be rejected and possibly retried.
Read validation specialized to the the primary node of an update in this manner would imply that the versions of a data item created at a given node are totally ordered in time via the local clock value at the node, and that this total ordering entails update succession: each version of a data item created at a node is an update successor of the version immediately before it. I’ll call this condition totally ordered local update succession.
Totally ordered local update succession: Within the sequence of versions of a data item created at a given node, ordered by their clock values at that node, each version is the result of an update whose read set included the immediately preceding version.
At this point, we have the sought-after similarity in the structure of the predecessor relation and its relation to clocks for general asynchronous events, and the structure of the precursor relation and its relation to clocks for write events of a given data item (and for corresponding versions) in an update-anywhere replicated data store. But as have seen, temporal succession defined through the predecessor relation for asynchronous events is equivalent to vector clock dominance. Therefore, update succession defined through the precursor relation for write events of a given data item (and for corresponding versions) must be equivalent to vector clock dominance as well.
Propagating Vector Clocks to New Versions
To maintain vector clocks for versions of data items, we need to perform the maximal merge of the immediate precursors of a version plus the node-clock of the version itself. The precursors are the versions read by the update operation. So reads need to piggy-back vector clocks with each version of a data item read. And, of course, writes need to store the new (maximally merged) vector clock with each update version of a data item. All versions of the same data item residing at the node of the write and dominated by the new vector clock then become obsolete and may be purged.
But What about Replicate Writes?
Replicate writes were excluded from our event system of writes/versions because replicate writes do not in fact create new versions of data items, nor new vector clocks. A replicate write simply copies a version and its vector clock intact from one node to another. Whether an update operation reads a version of a data item from its initial write node or from a replicate node is immaterial to the relationship between that version and the update version.
Of course, upon reaching a replicate node, a replica’s vector clock obsoletes any versions of the data item whose vector clocks it dominates, and allows them to be purged from that replicate node.
Acknowledgments
Thanks to the members of the Silicon Valley Patterns Group and in particular to Wayne Vucenic and Chris Tucker for useful discussions on distributed key-value stores. A special thanks to Jay Kreps, the creator of Voldemort, for participating in our group discussions.

