[b] 2024 Oct 10
[m] 2024 Dec 16
[l] en

Distributed Systems

What is a system?

A sequence of discrete steps without infinite loops is a script. It starts, does some things and exits.

Once you introduce an infinite loop which does things indefinitely - you have a service (or "server", "app", etc.).

Once you define some abstractions within the implementation of your service and pass data into them - to construct them, or for them to operate on - your service becomes a system (of abstractions).

What does it mean for a system to be "distributed"?

The distinction is actually a continuum.

The less certainty you have that data passed from one abstraction to another is delivered and processed - the more distributed your system is.

e.g. ordered from more to less certainty:

  1. bind a value to a name
  2. apply a function
  3. read data behind a mutex
  4. write data to X, expecting another Y to read it, where (X, Y):
    1. (channel, thread)
    2. (TCP socket, thread)
    3. (TCP socket, process)
    4. (TCP socket, virtual host)
    5. (TCP socket, physical host)
    6. (UDP socket, *)

and so on ...
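A hedged Python sketch of a few rungs of that ladder, from full certainty down to fire-and-forget (the names, the queue-as-channel stand-in, and the port number are illustrative, not from the original):

```python
import queue
import socket
import threading

# 1. bind a value to a name: certain, it's just local memory
x = 42

# 2. apply a function: still certain, same thread, same stack
def double(n: int) -> int:
    return n * 2

result = double(x)

# 4.1 (channel, thread): a queue between threads; delivery is very
# likely, but the consumer could be slow, blocked, or dead
q: "queue.Queue[int]" = queue.Queue()
received = []
consumer = threading.Thread(target=lambda: received.append(q.get()))
consumer.start()
q.put(result)
consumer.join()

# 4.6 (UDP socket, *): fire-and-forget; the datagram may be dropped,
# duplicated, or reordered, and no error will tell you which happened
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"84", ("127.0.0.1", 9999))  # nobody may be listening
sock.close()
```

Note that the code looks almost the same at every rung; what changes is only how much you can trust the hand-off.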

In practice people usually draw the line at machine-to-machine communication: once your abstractions (i.e. system components) are placed on different machines - they (and thus your system) are distributed (across hosts/machines).

Upon a bit more thought you realize that the starting point of uncertainty is leaving a shared memory space, so process-to-process communication has to deal with some of the same problems as machine-to-machine communication.

A bit more and ... thread-to-thread doesn't look so certain either.

What problems does increased uncertainty introduce?

Time

Once you have to deal with results coming from another machine, you have a deep existential realization - there is no "time"!

there is no spoon

Time is a conflation of 2 ideas:

  1. order
  2. state

But ... The order of what? The state of what? Exactly!

These are the essential questions you must answer before your system is defined.

When you ask "what time is it", you can only answer it by observing a state of an object (i.e. a clock). This object has no magical backchannel to the rest of the universe, so it can only tell you about itself. The concept of a universal time may or may not exist, but you can never know whether it does, so for all practical purposes - it doesn't.
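Since there is no universal clock, the practical move is to manufacture order directly. A minimal sketch of a Lamport (logical) clock - the class name and API here are illustrative, not a standard library:

```python
class LamportClock:
    """Logical clock: counts local events, merges on message receipt."""

    def __init__(self) -> None:
        self.t = 0

    def tick(self) -> int:
        # any local event advances local "time"
        self.t += 1
        return self.t

    def send(self) -> int:
        # stamp an outgoing message with the current tick
        return self.tick()

    def recv(self, msg_t: int) -> int:
        # receiving is also an event: jump past the sender's stamp
        self.t = max(self.t, msg_t) + 1
        return self.t

a, b = LamportClock(), LamportClock()
stamp = a.send()            # a is now at 1
b.tick()                    # b has a concurrent local event: 1
after_recv = b.recv(stamp)  # b jumps to max(1, 1) + 1 = 2
```

This buys only a partial order consistent with causality: events that never communicated may carry equal or arbitrary stamps, which is exactly the "order of what?" question made explicit.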

Observability of Failure

You sent some data X to host F (hoping for it to tell you what Y = F(X) is), and you need the reply before you can go on. You're waiting, and waiting, and waiting ... Why?

You have no idea! That is the essential problem.

In practice, the only usable definition of failure is a timeout - an arbitrary line in the sand: once a chosen period of time passes without a reply, you consider F(X) to have failed.
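A minimal sketch of that line in the sand, assuming the remote call can be stood in for by a worker thread (the helper name and timeout values are illustrative):

```python
import queue
import threading
import time

def call_with_timeout(f, x, timeout_s):
    """Call f(x) in a worker thread; treat no reply within timeout_s as failure."""
    replies: "queue.Queue" = queue.Queue()
    worker = threading.Thread(target=lambda: replies.put(f(x)), daemon=True)
    worker.start()
    try:
        return ("ok", replies.get(timeout=timeout_s))
    except queue.Empty:
        # f(x) may have crashed, may be slow, may still reply later -
        # from here we cannot tell the difference, we just declare failure
        return ("failed", None)

fast = call_with_timeout(lambda n: n * 2, 21, timeout_s=1.0)
slow = call_with_timeout(lambda n: time.sleep(5) or n, 21, timeout_s=0.1)
```

Note that the "failed" branch tells you nothing about F - only that you stopped waiting.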

State

What is the state of the system as a whole right now?

An impossible-to-answer question, since you have neither time (what is "now"?) nor observability!

Practical solutions here once again draw lines in the sand using timeouts.

What is the state of the system as a whole given the known states of its components?

This is possible to answer if the state is represented as a CRDT (a conflict-free replicated data type):

The simplest example of a CRDT is a set that only grows, merged by set union. So with distributed component states of e.g. {foo}, {foo, bar} and {baz},

the total state of the system is {foo, bar, baz}.

This is already useful (you can implement a counter!), but obviously limiting, since you can neither remove a member, nor determine the order they were inserted in. Although with some refinement, limited removals are possible - see 2P and OR sets in the above-linked articles.
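A grow-only set is small enough to sketch in full; a hedged Python version (the class name and API are illustrative):

```python
class GSet:
    """Grow-only set CRDT: the only operations are add and merge (set union)."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def add(self, x):
        return GSet(self.items | {x})

    def merge(self, other):
        # union is commutative, associative and idempotent, so replicas
        # can merge in any order, any number of times, and still converge
        return GSet(self.items | other.items)

# two replicas diverge independently ...
a = GSet({"foo"}).add("bar")
b = GSet({"baz"})

# ... and converge on merge, regardless of direction
total = a.merge(b)
```

The commutative/associative/idempotent merge is the whole trick: it lets each component compute the total state locally, from whatever partial states have reached it, without ever asking "what is the state right now?".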