Daniel Tisdall


concurrency | correctness | sw eng

Why does Jepsen Work?

Why does the highly regarded testing tool Jepsen, which randomly partitions networks in distributed systems, find so many bugs?

The 2018 paper ‘Why Is Random Testing Effective for Partition Tolerance Bugs?’ by Rupak Majumdar, Filip Niksic shows that several notions of high test coverage can be achieved with a number of tests logarithmic in the number of cases to be tested. Moreover, the tests are chosen randomly. We explain the most general construction here. The paper contains more.

A lower bound

Consider a finite set \(M\) and a set \(T\) whose elements are subsets of \(M\). We want to find a subset \(\mathcal{F}\) of \(T\) which is small and where for each \(x \in M :\exists t \in F\) st \(x \in t\) . That is, we want to find a small set of tests from \(T\) which cover all the cases in \(M\).

If we know a positive lower bound probability \(p\) over all fixed elements \(x \in M\), for which a random \(t \in T\) contains \(x\), then we can express the probability that a candidate \(\mathcal{F}_{candidate}\) satisfies the properties of \(\mathcal{F}\) in terms of \(\vert \mathcal{F}_{candidate} \vert \) and a factor logarithmic in \(\vert M \vert\).

To show this we need to remind ourselves of high school logarithms, and some straightforward probability. We’ll say a set \(S\) covers \(x\) if \(x \in S\), and we’ll abuse this language.

Proof:

Fix a subset \(\mathcal{F} \subseteq T\) chosen uniformly randomly. We know that a an element \(t \in \mathcal{F}\) covers any fixed element \(x \in M \) with probability at least \(p\) . The probability that \(x\) is not covered is therefore \(\leq 1-p\). When taken over all elements of \(\mathcal{F}\) we have \(\mathcal{P}(\mathcal{F} \text{ does not cover } x) \leq (1-p)^{\vert \mathcal{F} \vert}\).

The probability of \(\mathcal{F}\) not covering an element of \(M\) cannot be greater than the sum of probabilities of missing a single given element (union bound). So we have

\[ \mathcal{P}(\mathcal{F} \text{ does not cover } M) \leq \vert M \vert (1-p)^{\vert \mathcal{F} \vert} \]

We want to bound this

\[ \vert M \vert (1-p)^{\vert \mathcal{F} \vert} < \epsilon \]

in terms of \(\vert \mathcal{F} \vert \).

We have

\[ (1-p)^{\vert \mathcal{F} \vert} < \frac{\epsilon}{\vert M \vert} \]

and we can take logarithm of both sides base \(1-p\), which is a decreasing function so we flip the inequality.

\[ \log_{(1-p)}(\frac{\epsilon}{\vert M \vert}) < \vert \mathcal{F} \vert \]

Using the fact that for any basis \(b\) and number \(a\) we have \(e^{\log(b)\log_{b}(a)} = e^{log(a)}\) we can get back an expression using the natural logarithm

\[ \frac{\log(\epsilon) - \log(\vert M \vert)}{\log(1-p)} < \vert \mathcal{F} \vert \]

If \(p\) is sufficiently small then \(- \log(1-p) < p\) and we have the convenient

\[ \frac{\log(\vert M \vert) - \log(\epsilon)}{p} < \vert \mathcal{F} \vert \]

We obtain a relation between our desired confidence \(\epsilon\) and \(\vert \mathcal{F} \vert \), and this scales with \(\vert M \vert\) .

One way Jepsen can find bugs is by testing cases where a particular subset of network nodes of size \(k\) is distributed evenly in a \(k\)-partition of the whole network. In this case \(M\) contains all subsets of nodes with size \(k\), and \(T\) contains all \(k\)-partitions. An element of \(t \in T\) maps to an element of \(x \in M\) if all nodes in \(x\) are in a different block (partition element) of \(t\).

Takeaway

The paper says that it’s difficult to derive bounds for more ‘dynamic’ notions of testing and coverage, those involving timing and so on. For now, we probably want to follows Ashish Darbari’s advice and keep striving for better correctness than what we get from just leaving everything ‘in the hands of random seeds’.