Thinking about Syncing, Part 4: keeping track

Publish date: 2018-05-31

In Part 3 we argued that the concerns of application code differ from those of synchronization code. In this part we will take a moment to explore the latter in more depth.

The needs of synchronization

In Part 1 we discussed at a high level what a sync system has to do: move information around between its clients to allow them to reach agreement on the state of the world.

More concretely, syncing storage systems need to consider management of identifiers, equivalence, management of change, and detection of conflict.

Management of identifiers

Most systems need some way to identify things — entities in the domain of discourse. These might be numeric identifiers, globally-unique random identifiers (UUIDs/GUIDs), or salted hashes of some key. Sometimes a system will use one kind of identifier internally and another to sync. The allocation, rewriting, or in-out mapping of these identifiers is a concern of sync systems.
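As an illustration of in-out mapping, here is a minimal Python sketch of a table that translates between local numeric identifiers and the random identifiers exchanged during sync. The class name `IdentifierMap` and its methods are hypothetical, and the in-memory dictionaries stand in for whatever persistent storage a real client would use.

```python
import uuid


class IdentifierMap:
    """Two-way mapping between local numeric ids and sync GUIDs (hypothetical)."""

    def __init__(self):
        self._to_guid = {}   # local id -> GUID used on the wire
        self._to_local = {}  # GUID -> local id used internally
        self._next_local = 1

    def _bind(self, guid):
        # Allocate the next local id and record the pairing in both directions.
        local_id = self._next_local
        self._next_local += 1
        self._to_guid[local_id] = guid
        self._to_local[guid] = local_id
        return local_id

    def new_local(self):
        """Create a local id for a locally created entity, with a fresh GUID."""
        return self._bind(str(uuid.uuid4()))

    def guid_for(self, local_id):
        """Outgoing direction: which GUID do we send for this local id?"""
        return self._to_guid[local_id]

    def local_for(self, guid):
        """Incoming direction: map a GUID to a local id, allocating if unseen."""
        if guid not in self._to_local:
            return self._bind(guid)
        return self._to_local[guid]
```

Application code in this sketch only ever sees the local ids; the GUIDs appear at the sync boundary.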

Equivalence

When multiple timelines can exist, it’s possible for the same conceptual entity to be identified by more than one name. Only systems in which every entity is given a stable identifier predictably derived from its unique attributes (such as a URL), or systems in which identity allocation is a centralized function, can avoid this.

When more than one identifier exists for the same entity, they can be ‘smushed’ — one is replaced with the other — or mappings can be built.
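Both options can be sketched in a few lines of Python. The `AliasTable` class and `rewrite` function below are hypothetical names, and facts are assumed to be simple (entity, attribute, value) tuples: the table records the mapping from a dropped identifier to the one that survives, and `rewrite` performs the smush by replacing dropped identifiers wherever they appear.

```python
class AliasTable:
    """Records which identifiers were found to name the same entity."""

    def __init__(self):
        self._alias = {}  # dropped identifier -> surviving identifier

    def resolve(self, ident):
        # Follow the chain of aliases until we reach a surviving identifier.
        while ident in self._alias:
            ident = self._alias[ident]
        return ident

    def smush(self, keep, drop):
        """Declare that `drop` names the same entity as `keep`."""
        keep, drop = self.resolve(keep), self.resolve(drop)
        if keep != drop:  # already merged: nothing to do, and no cycle is created
            self._alias[drop] = keep


def rewrite(facts, table):
    """Replace every dropped identifier in a set of (entity, attribute, value) facts."""
    return {(table.resolve(e), a, v) for (e, a, v) in facts}
```

Keeping the alias table around after rewriting lets a client translate late-arriving references to the dropped identifier.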

Management of change

‘Change’ is a blanket term. For our purposes, it consists of:

Addition: the introduction of a new fact into the system by one or more of its clients.
Update: the assertion by a client that an old fact has been replaced by a new fact.
Retraction: the assertion by a client that an old fact is no longer true.
Deletion: the forced un-stating of a fact so that it’s no longer part of any state, including any historical record.
Expiration: the pruning of old state, either on the individual client or system-wide, due to irrelevance. This is typically done to reduce storage footprint and improve speed. Expiration can result in clients diverging or introducing duplicate identifiers.
Schema change: the expansion or alteration of the ontology or schema of the system itself.
Configuration change: a change in the set of syncing devices or the expected behavior of the system (e.g., turning a feature on or off).

Not only does the system need ways to model these things, but it also needs to keep track of ordering and progress: the same change shouldn’t be applied more than once, no matter the topology of the system.
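One way to picture the "no more than once" requirement is to give every change its own identifier and have each replica remember which ones it has applied. The Python sketch below assumes that design; the `Change` and `Replica` names, the (entity, attribute, value) fact shape, and the three change kinds covered are illustrative choices, not a description of any particular system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Change:
    change_id: str            # assumed to be unique across all clients
    kind: str                 # "add", "update", or "retract"
    entity: str
    attribute: str
    value: object = None
    old_value: object = None  # only used by "update"


@dataclass
class Replica:
    facts: set = field(default_factory=set)
    applied: set = field(default_factory=set)  # ids of changes already applied

    def apply(self, change):
        """Apply a change at most once; return False if it was already seen."""
        if change.change_id in self.applied:
            return False
        new_fact = (change.entity, change.attribute, change.value)
        if change.kind == "add":
            self.facts.add(new_fact)
        elif change.kind == "update":
            self.facts.discard((change.entity, change.attribute, change.old_value))
            self.facts.add(new_fact)
        elif change.kind == "retract":
            self.facts.discard(new_fact)
        else:
            raise ValueError(f"unknown change kind: {change.kind}")
        self.applied.add(change.change_id)
        return True
```

Because `apply` is idempotent per change identifier, the same change can arrive by several routes without being applied twice. Ordering, deletion, and expiration are deliberately left out of this sketch.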

Detection of conflict

When two changes conflict, the conflict should be detected and some resolution should be reached without data loss.
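A very small Python sketch of detection follows, under two assumptions that are not from the text above: both lists hold the (entity, attribute, value) writes each side has made since their last common state, and each attribute holds a single value. The function name `find_conflicts` is hypothetical.

```python
def find_conflicts(local_changes, remote_changes):
    """Return ((entity, attribute), local_value, remote_value) for divergent writes.

    Both arguments are iterables of (entity, attribute, value) tuples
    describing writes made since the two sides last agreed.
    """
    remote_by_key = {(e, a): v for (e, a, v) in remote_changes}
    conflicts = []
    for (e, a, v) in local_changes:
        key = (e, a)
        # Same entity and attribute written on both sides with different values.
        if key in remote_by_key and remote_by_key[key] != v:
            conflicts.append((key, v, remote_by_key[key]))
    return conflicts
```

The function reports both values rather than picking a winner, so whatever resolves the conflict still has everything it needs to avoid losing data.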

Beyond these concerns, we can list some requirements that we would like a real-world, practical sync system to meet.

Consistency (in the database sense)

The state that the system stores and syncs should remain consistent with some expected properties — e.g., required fields should be present and of the correct types. A change that breaks consistency must fail.
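As a minimal Python sketch of this rule, the code below checks a record against a hypothetical schema before it is stored, so that an inconsistent change fails and leaves the store untouched. The field names in `SCHEMA` and the names `ConsistencyError`, `validate`, and `apply_change` are assumptions made for illustration.

```python
# Hypothetical schema: required field name -> expected Python type.
SCHEMA = {"url": str, "title": str, "created": int}


class ConsistencyError(ValueError):
    """Raised when a change would leave the store in an inconsistent state."""


def validate(record, schema=SCHEMA):
    for name, expected in schema.items():
        if name not in record:
            raise ConsistencyError(f"missing required field: {name}")
        if not isinstance(record[name], expected):
            raise ConsistencyError(f"field {name} should be {expected.__name__}")
    return record


def apply_change(store, key, record):
    """Validate first, so a failing change never modifies the store."""
    validate(record)
    store[key] = record
```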

Eventual consistency (in the distributed system sense)

All clients in the system should converge on the same end state.

Quiescence

When no client is introducing new changes, no significant activity should occur: we expect the system to rapidly stop exchanging data once conflicts have been resolved.

Atomicity

It should not be possible for other clients to see only part of another client’s changes, because partial visibility makes maintaining correctness more difficult.

Incrementality

Adding new facts to existing entities (e.g., adding creation dates to all of your bookmarks) should not require disproportionate work to be done by clients.

Adding new entities should not require disproportionate work, even if those entities are related to other entities. (E.g., adding a new history visit should not require re-uploading the title and URL and previous visits of the history item.)

Adding new clients to the set should not require disproportionate work from other clients.

Continuation of service

Ordinary changes — data additions, schema extensions, client additions, etc. — should be routine and low-impact. We don’t want to lock out clients, force upgrades, or lose data as a result of a minor change; doing so harms the user experience and adds friction to engineering. When things are working they should continue to work.

A modest proposal

Applications must design for syncability when considering identifiers. We cannot avoid thinking about uniqueness, what constitutes an identifier, when identifiers can change, how we refer to entities (which is, after all, the point of having an identifier!), and when two entities should be considered the same… and what we should do when that occurs. INTEGER PRIMARY KEY AUTOINCREMENT will end in tears.
If the application must support disconnected operation and the other constraints discussed earlier, then it should use a log-structured store to allow for automatic conflict-detecting sync. Application code should be prepared to resolve detected conflicts.
If the application’s data is relational — and choosing to model it in a non-relational tool doesn’t change this fact! — then we must think about how identification and constraints work in our domain. If a non-relational store is used, a relational layer will accrete on top, and abstractions leak. (Sometimes they are leaky by design.)
Use events to record data if you can. Derive more narrow representations from the event-shaped, log-structured data. But even event-shaped data has a schema, and storage should make that easy to evolve.
Use event-shaped APIs to separate storage representation and concerns from the application layer. setBookmarked(url, title, true) is worse than addBookmark(url, title, timestamp). didClickStar(page, context) might be the right abstraction to aim for.
A data model that can correctly represent the domain, support syncing, and extend to future needs will often not be a good fit for the querying requirements of the rest of the application. Resolve this tension by maintaining two (or more) representations, not by expecting a single representation to meet all needs.
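The last few points can be sketched together in Python. In the sketch below, `add_bookmark` mirrors the addBookmark(url, title, timestamp) call mentioned above; everything else (the event classes, `BookmarkLog`, `remove_bookmark`, and `current_bookmarks`) is a hypothetical illustration. The append-only log is the representation that would be synced, and the dictionary derived from it is the second, narrower representation that the rest of the application queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookmarkAdded:
    url: str
    title: str
    timestamp: int


@dataclass(frozen=True)
class BookmarkRemoved:
    url: str
    timestamp: int


class BookmarkLog:
    """Append-only record of what happened, rather than of current state."""

    def __init__(self):
        self.events = []

    def add_bookmark(self, url, title, timestamp):
        self.events.append(BookmarkAdded(url, title, timestamp))

    def remove_bookmark(self, url, timestamp):
        self.events.append(BookmarkRemoved(url, timestamp))


def current_bookmarks(events):
    """Fold the log, in timestamp order, into a url -> title view for querying."""
    view = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if isinstance(event, BookmarkAdded):
            view[event.url] = event.title
        else:
            view.pop(event.url, None)
    return view
```

The derived view can be thrown away and rebuilt from the log at any time, which is what makes it safe to shape it purely around query needs.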