MCMC State Synchronisation Protocol for BOINC Parallel Tempering¶
This document specifies a complete protocol for running parallel tempering MCMC across distributed BOINC volunteer machines for the OpenAstro TTV n-body inference problem. It covers the algorithm, the async swap mechanism, failure modes, convergence diagnostics, and the server-side data schema.
1. Parallel Tempering Fundamentals¶
1.1 The Standard PT Algorithm¶
Parallel tempering (replica exchange MCMC) runs $K$ Markov chains simultaneously, each sampling a tempered version of the target posterior. Chain $k$ samples from:
$$\pi_k(\theta) \propto \left[\mathcal{L}(\theta)\right]^{1/T_k} \cdot \pi_0(\theta)$$
where $T_1 = 1 < T_2 < \cdots < T_K$ is the temperature ladder, $\mathcal{L}(\theta)$ is the likelihood, and $\pi_0(\theta)$ is the prior. The $T_1 = 1$ chain samples the true posterior. Higher-temperature chains flatten the likelihood surface, enabling exploration across modes separated by deep likelihood valleys.
[Source: Swendsen & Wang (1986), Phys. Rev. Lett. 57, 2607 (replica exchange); Earl & Deem (2005), Phys. Chem. Chem. Phys. 7, 3910 (parallel tempering review)]
Within-temperature moves: Each chain $k$ independently performs standard Metropolis-Hastings (or ensemble) moves targeting $\pi_k(\theta)$. For a proposal $\theta' \sim q(\theta' | \theta_k)$, the acceptance probability is:
$$A_\text{MH} = \min\left(1,\; \frac{\mathcal{L}(\theta')^{1/T_k}\, \pi_0(\theta')\, q(\theta_k | \theta')}{\mathcal{L}(\theta_k)^{1/T_k}\, \pi_0(\theta_k)\, q(\theta' | \theta_k)}\right)$$
Inter-chain swaps: Periodically, adjacent chains $i$ and $j = i+1$ propose to exchange their states $\theta_i \leftrightarrow \theta_j$. The swap acceptance probability preserves detailed balance of the joint chain:
$$A(i \leftrightarrow j) = \min\left(1,\; \frac{\mathcal{L}(\theta_j)^{1/T_i} \cdot \mathcal{L}(\theta_i)^{1/T_j}}{\mathcal{L}(\theta_i)^{1/T_i} \cdot \mathcal{L}(\theta_j)^{1/T_j}}\right)$$
Equivalently in log form (numerically stable):
$$\ln A(i \leftrightarrow j) = \min\left(0,\; \left(\frac{1}{T_i} - \frac{1}{T_j}\right) \cdot \left(\ln \mathcal{L}(\theta_j) - \ln \mathcal{L}(\theta_i)\right)\right)$$
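The log-form rule translates directly into code. A minimal sketch, assuming chain states are held as dicts with illustrative keys `theta`, `log_L`, and `T` (not part of any fixed API):

```python
import math
import random

def attempt_swap(state_i, state_j, rng=None):
    """Metropolis swap move between two tempered chain states.

    Each state is a dict with (illustrative) keys 'theta', 'log_L'
    and 'T'. On acceptance the states are exchanged in place.
    """
    rng = rng or random.Random()
    # ln A = min(0, (1/T_i - 1/T_j) * (ln L_j - ln L_i))
    delta_beta = 1.0 / state_i["T"] - 1.0 / state_j["T"]
    log_alpha = min(0.0, delta_beta * (state_j["log_L"] - state_i["log_L"]))
    accepted = log_alpha >= 0.0 or rng.random() < math.exp(log_alpha)
    if accepted:
        for key in ("theta", "log_L"):
            state_i[key], state_j[key] = state_j[key], state_i[key]
    return accepted
```

Note that when the hotter chain holds the better likelihood, $\ln A = 0$ and the swap is accepted deterministically; the random draw only matters for unfavourable proposals.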
1.2 Required Swap Frequency¶
For PT to function, swaps must occur frequently enough that favourable configurations discovered by hot chains propagate down to the cold chain before the hot chain drifts away. The standard guideline:
- Attempt swaps every $S = 10$--$100$ within-chain steps. Too infrequent ($S > 500$): chains decouple, the hot chain's mode discoveries never reach the cold chain. Too frequent ($S < 5$): computational overhead from swap logic dominates, and chains don't move far enough between swaps for the swap to be informative.
- For the TTV problem with $\tau_\text{local} \approx 20$--$50$ steps (autocorrelation within a single mode), $S \approx 50$ is a reasonable starting point: one swap attempt per local decorrelation time.
1.3 The Synchronisation Requirement¶
Theoretical requirement: The swap step requires both chains $i$ and $j$ to be paused simultaneously. The acceptance probability $A(i \leftrightarrow j)$ is computed from the states $(\theta_i, \theta_j)$ at the same logical time. If chain $i$ has advanced $n$ steps past the point where its state was read while chain $j$ is evaluated at the current point, the swap no longer satisfies detailed balance for the joint process. The Markov property of the joint $(K$-chain$)$ system requires that the swap decision is made on the current joint state, not a stale snapshot.
Why this is fundamentally incompatible with BOINC:
BOINC operates on a fire-and-forget work unit model. A work unit is dispatched to a volunteer machine, runs for minutes to hours, and returns a result. There is:
- No mechanism for synchronous pausing. You cannot send a message to a running BOINC work unit saying "pause at step 2,500 and report your state." The work unit runs to completion (or checkpoints on its own schedule) and reports back. There is no bidirectional communication during execution.
- No guarantee of simultaneous completion. Two work units dispatched at the same time to two volunteer machines may complete hours or days apart, depending on the volunteer's CPU speed, uptime, and competing workloads.
- No guarantee of completion at all. A volunteer machine may shut down, lose network connectivity, or have the BOINC client paused indefinitely. The server must tolerate chains that simply vanish.
Therefore, a direct implementation of synchronous PT over BOINC is impossible. What follows is an asynchronous approximation that preserves correctness of individual chains while allowing swaps to occur opportunistically.
2. The Async Swap Protocol¶
2.1 Design Principles¶
The protocol is designed around three invariants:
Invariant 1 (Chain correctness): Each individual chain $k$, considered in isolation, is a valid MCMC chain targeting $\pi_k(\theta)$. Swaps are state injections that, when they occur, are accepted or rejected according to the correct Metropolis-Hastings criterion. When swaps do not occur (due to timing failures), the chain simply continues its own random walk. Missing a swap is equivalent to proposing a swap and rejecting it; both result in the chain continuing from its current state.
Invariant 2 (Graceful degradation): If all volunteer machines disappear except one, that one chain continues to produce valid (though poorly mixing) samples from its tempered distribution. If only the cold chain survives, it produces valid posterior samples. The system never enters an invalid state due to missing chains.
Invariant 3 (At-most-once swaps): A chain's state is never used in two simultaneous swaps. The server enforces mutual exclusion on swap operations.
2.2 Server-Side State¶
For each chain $k \in \{1, \ldots, K\}$, the server maintains:
ChainState[k] = {
theta_k: float[D] # Current parameter vector (D dimensions)
log_L_k: float # log-likelihood at theta_k
log_prior_k: float # log-prior at theta_k
T_k: float # Temperature assignment
n_k: int # Total step counter (cumulative across epochs)
epoch_k: int # Epoch counter (number of completed epochs)
last_checkpoint_ts: datetime # Timestamp of last checkpoint submission
swap_status: enum # One of: IDLE, SWAP_PENDING, SWAP_EXECUTING
swap_partner: int or null # Chain index of pending swap partner
swap_pending_since: datetime # When swap_status became SWAP_PENDING
assigned_volunteer: string # BOINC host ID currently running this chain
work_unit_id: string # Current BOINC work unit ID
}
Global server state:
SwapWindow: W seconds # Maximum time to wait for a swap partner
EpochLength: E steps # Steps per work unit (adjustable per chain)
SwapLock: mutex # Global lock for swap operations
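As a sketch, the per-chain record above might be represented server-side as a Python dataclass; field names mirror the pseudocode and are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum, auto
from typing import Optional

class SwapStatus(Enum):
    IDLE = auto()
    SWAP_PENDING = auto()
    SWAP_EXECUTING = auto()

@dataclass
class ChainState:
    """Server-side record for one tempered chain (names illustrative)."""
    chain_id: int
    temperature: float                          # T_k
    theta: list                                 # current parameter vector (length D)
    log_L: float                                # log-likelihood at theta
    log_prior: float                            # log-prior at theta
    step_count: int = 0                         # cumulative n_k
    epoch_count: int = 0                        # completed epochs
    last_checkpoint_ts: Optional[datetime] = None
    swap_status: SwapStatus = SwapStatus.IDLE
    swap_partner: Optional[int] = None          # pending partner chain, if any
    swap_pending_since: Optional[datetime] = None
    assigned_volunteer: Optional[str] = None    # BOINC host ID
    work_unit_id: Optional[str] = None          # current BOINC work unit
```

The enum makes illegal swap states unrepresentable, which simplifies the mutual-exclusion reasoning in Invariant 3.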
2.3 Chain Execution (Volunteer Side)¶
A BOINC work unit for chain $k$ contains:
Input:
{
"chain_id": k,
"temperature": T_k,
"initial_state": theta_k,
"log_likelihood": log_L_k,
"log_prior": log_prior_k,
"epoch_length": E,
"transit_data": { "times": [...], "errors": [...] },
"prior_bounds": { ... },
"ttvfast_config": { "dt": 0.05, "total_time": 1500 },
"random_seed": seed_k_epoch
}
Execution on volunteer machine:
1. Load initial state theta from input
2. For step = 1 to E:
a. Propose theta' from proposal distribution q(theta' | theta)
b. Compute log_L' = log_likelihood(theta', transit_data) via TTVFast
c. Compute log_prior' = log_prior(theta')
d. Compute log_alpha = (1/T_k) * (log_L' - log_L) + (log_prior' - log_prior)
+ log q(theta | theta') - log q(theta' | theta)
e. If log(U) < log_alpha (U ~ Uniform(0,1)):
theta <- theta', log_L <- log_L', log_prior <- log_prior'
f. Record sample: store (theta, log_L, log_prior) every thin_interval steps
g. Checkpoint to disk every checkpoint_interval steps (for BOINC restart)
3. Submit result to server:
{
"chain_id": k,
"final_state": theta,
"final_log_L": log_L,
"final_log_prior": log_prior,
"step_count": E,
"samples": thinned_chain_array,
"acceptance_rate": n_accepted / E,
"autocorrelation_estimate": tau_hat
}
The volunteer machine has no knowledge of swaps. It runs a standard tempered MCMC chain and reports back. All swap logic is server-side.
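The volunteer-side epoch loop can be sketched in executable Python. A symmetric Gaussian random-walk proposal is assumed (so the $q$-ratio in step (d) cancels), and the TTVFast-based likelihood is a stand-in callable; all names and defaults here are illustrative:

```python
import math
import random

def run_epoch(theta, log_L, log_prior_fn, log_like_fn, T, E,
              step_size=0.1, thin=10, rng=None):
    """One volunteer-side epoch of tempered Metropolis-Hastings.

    Uses a symmetric Gaussian random-walk proposal, so the proposal
    ratio cancels. log_like_fn stands in for the TTVFast likelihood.
    Returns the final state, thinned samples, and the acceptance rate.
    """
    rng = rng or random.Random()
    log_prior = log_prior_fn(theta)
    samples, n_accepted = [], 0
    for step in range(1, E + 1):
        theta_p = [x + rng.gauss(0.0, step_size) for x in theta]
        log_prior_p = log_prior_fn(theta_p)
        if log_prior_p > -math.inf:            # skip likelihood outside prior support
            log_L_p = log_like_fn(theta_p)
            log_alpha = (log_L_p - log_L) / T + (log_prior_p - log_prior)
            if math.log(1.0 - rng.random()) < log_alpha:
                theta, log_L, log_prior = theta_p, log_L_p, log_prior_p
                n_accepted += 1
        if step % thin == 0:                   # record thinned sample
            samples.append((list(theta), log_L))
    return theta, log_L, samples, n_accepted / E
```

The checkpoint-to-disk step is omitted; in a real BOINC app it would serialize `(step, theta, log_L, log_prior, rng state)` at `checkpoint_interval` boundaries.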
2.4 Server-Side Swap Logic¶
When the server receives a checkpoint from chain $i$:
PROCEDURE HandleCheckpoint(chain_i, checkpoint_data):
ACQUIRE SwapLock
# Step 1: Update chain i's state
ChainState[i].theta_i <- checkpoint_data.final_state
ChainState[i].log_L_i <- checkpoint_data.final_log_L
ChainState[i].log_prior_i <- checkpoint_data.final_log_prior
ChainState[i].n_i += checkpoint_data.step_count
ChainState[i].epoch_i += 1
ChainState[i].last_checkpoint_ts <- NOW()
store checkpoint_data.samples in SampleStore[i]
# Step 2: Identify swap partner
# Adjacent chains in the temperature ladder: j = i-1 or j = i+1
# Choose the partner that has SWAP_PENDING status for chain i
# If neither is pending, choose the adjacent chain with the most
# recent checkpoint (prefer the chain that has been waiting longest)
j <- NULL
FOR each adjacent chain index adj IN {i-1, i+1} (if valid):
IF ChainState[adj].swap_status == SWAP_PENDING
AND ChainState[adj].swap_partner == i:
j <- adj
BREAK
# Step 3: Execute swap or mark as pending
IF j IS NOT NULL:
# Partner j was waiting for us. Attempt the swap.
ExecuteSwap(i, j)
ELSE:
# No partner ready. Check if any adjacent chain is IDLE and
# its last checkpoint is within the swap window.
FOR each adjacent chain index adj IN {i-1, i+1} (if valid):
IF ChainState[adj].swap_status == IDLE
AND (NOW() - ChainState[adj].last_checkpoint_ts) < W seconds:
# The adjacent chain recently checked in but isn't waiting.
# We cannot swap (its state may be stale or it is mid-epoch).
# Mark ourselves as pending for this partner.
ChainState[i].swap_status <- SWAP_PENDING
ChainState[i].swap_partner <- adj
ChainState[i].swap_pending_since <- NOW()
RELEASE SwapLock
RETURN response: "WAIT" (hold response for up to W seconds)
# No adjacent chain is available. Continue without swap.
ChainState[i].swap_status <- IDLE
ChainState[i].swap_partner <- NULL
RELEASE SwapLock
DispatchNextEpoch(i, ChainState[i].theta_i)
RETURN
RELEASE SwapLock
RETURN
PROCEDURE ExecuteSwap(i, j):
# Both chains i and j have current states on the server.
# Compute acceptance probability.
delta_beta <- (1.0 / ChainState[i].T_i) - (1.0 / ChainState[j].T_j)
delta_logL <- ChainState[j].log_L_j - ChainState[i].log_L_i
log_alpha <- delta_beta * delta_logL
U <- random_uniform(0, 1)
IF log(U) < log_alpha:
# Accept swap: exchange states
SWAP(ChainState[i].theta, ChainState[j].theta)
SWAP(ChainState[i].log_L, ChainState[j].log_L)
SWAP(ChainState[i].log_prior, ChainState[j].log_prior)
record_swap(i, j, accepted=True)
ELSE:
# Reject swap: both chains keep their own states
record_swap(i, j, accepted=False)
# Reset swap status for both chains
ChainState[i].swap_status <- IDLE
ChainState[i].swap_partner <- NULL
ChainState[j].swap_status <- IDLE
ChainState[j].swap_partner <- NULL
# Dispatch next epochs for both chains
DispatchNextEpoch(i, ChainState[i].theta_i)
DispatchNextEpoch(j, ChainState[j].theta_j)
PROCEDURE SwapTimeoutDaemon():
# Runs every W/2 seconds on the server
ACQUIRE SwapLock
FOR each chain k:
IF ChainState[k].swap_status == SWAP_PENDING
AND (NOW() - ChainState[k].swap_pending_since) > W:
# Swap window expired. Clear pending status.
ChainState[k].swap_status <- IDLE
ChainState[k].swap_partner <- NULL
DispatchNextEpoch(k, ChainState[k].theta_k)
RELEASE SwapLock
2.5 Correctness Argument¶
Claim 1: Individual chain correctness is preserved.
Each chain $k$ runs a standard Metropolis-Hastings algorithm targeting $\pi_k(\theta) \propto \mathcal{L}(\theta)^{1/T_k} \pi_0(\theta)$ during each epoch. The within-epoch transitions satisfy detailed balance by construction (standard MH). Between epochs, one of two things happens:
- (a) No swap: The chain continues from its current state $\theta_k$. This is identical to simply continuing the Markov chain β no violation of the Markov property.
- (b) Swap attempted: The server replaces $\theta_k$ with $\theta_j$ (from chain $j$) with probability $A(k \leftrightarrow j)$, or retains $\theta_k$ with probability $1 - A(k \leftrightarrow j)$. This is a Metropolis-Hastings move on the joint state space with proposal "take the other chain's state." The acceptance probability $A(k \leftrightarrow j)$ ensures that detailed balance holds for the joint distribution $\pi_k(\theta_k) \pi_j(\theta_j)$ under the swap move, which in turn ensures that the marginal chain $k$ still targets $\pi_k$.
Missing a swap is statistically equivalent to rejecting one, provided the decision to skip is driven by timing alone. In synchronous PT, a swap is proposed and either accepted or rejected. In our async protocol, when timing prevents a swap, it is as if the swap was proposed and rejected. Because the timing events (volunteer speed, network latency, checkpoint arrival order) are independent of the chain states, this amounts to a state-independent thinning of swap attempts, which leaves each chain's stationary distribution unaffected. A skip rule that depended on the states themselves would not enjoy this guarantee.
The only subtlety is that the state $\theta_j$ used in the swap acceptance computation may be from an earlier wall-clock time than $\theta_k$ (chain $j$ checked in and has been held pending; chain $k$ just reported now). This is safe because a chain in SWAP_PENDING has no work unit in flight: its checkpoint state is its definitive current state, and the server dispatches its next epoch only after the swap resolves or the window times out. Between checkpoints, the volunteer machine's internal MCMC trajectory is self-consistent, and the server never modifies a running work unit's state.
Claim 2: The stationary distribution converges to the target as swap frequency increases.
In the limit where $E \to 1$ (single-step epochs) and $W \to \infty$ (infinite swap window, guaranteeing all partners check in), the protocol reduces to synchronous PT with swap attempts every step. Standard PT theory guarantees convergence to the product distribution $\prod_k \pi_k(\theta_k)$, and the $T_1 = 1$ marginal converges to the true posterior.
As $E$ increases (longer epochs, fewer swap opportunities), the mixing between temperature levels degrades, but each chain individually remains correct. The system interpolates smoothly between "perfect PT" ($E=1$) and "independent chains at fixed temperatures" ($E = \infty$). Even in the worst case, the cold chain is a valid (if slowly mixing) MCMC sampler for the posterior.
Claim 3: Graceful degradation under volunteer failure.
If a volunteer machine disappears mid-epoch, the BOINC transitioner will eventually mark the work unit as timed out (after the BOINC deadline, typically 24-48 hours). The server then:
- Marks the chain as STALLED (see Section 4).
- Reassigns the chain to a new volunteer, starting from the last checkpoint state.
- The temperature ladder has $K-1$ active chains in the interim. Swap attempts involving the missing chain are simply skipped (equivalent to rejected swaps).
- If the chain is permanently lost, the server removes it from the ladder and reindexes adjacent swap pairs. The remaining $K-1$ chains continue correctly, with a gap in the temperature schedule that reduces swap acceptance between the chains on either side of the gap.
No chain's correctness is compromised by the loss of any other chain.
3. Temperature Ladder Design¶
3.1 The Multi-Modal Structure of the TTV Posterior¶
The TTV posterior for a 2-planet system exhibits multiple known degeneracies (detailed in [[Reverse N-Body and Astrometric Parallax]], Section 3):
- Period ratio ambiguity: Multiple mean-motion resonances ($j:(j-k)$) produce the same TTV super-period, creating isolated posterior modes in $P_c$.
- Phase ambiguity: For incomplete TTV phase coverage, bimodality in the perturber's conjunction time $t_{0,c}$.
- Mass-eccentricity degeneracy: $m_c$ and $e_c$ trade off to produce similar TTV amplitudes, creating elongated ridges in parameter space.
For a typical system near the 2:1 resonance, there are 2-6 distinct posterior modes. The likelihood ratio between the global mode and secondary modes can be $\Delta \ln \mathcal{L} \sim 5$--$30$.
3.2 Geometric Temperature Spacing¶
The standard approach for setting temperatures is geometric spacing:
$$T_k = T_1 \cdot r^{k-1}, \quad k = 1, \ldots, K$$
where $r > 1$ is the geometric ratio. This gives:
$$T_K = T_1 \cdot r^{K-1}$$
Solving for $r$:
$$r = \left(\frac{T_K}{T_1}\right)^{1/(K-1)}$$
With $T_1 = 1$:
$$r = T_K^{1/(K-1)}$$
3.3 Swap Acceptance Rate Criterion¶
The expected swap acceptance rate between adjacent chains $k$ and $k+1$ in the geometric schedule, for a $D$-dimensional Gaussian posterior approximation, is approximately:
$$\langle A(k \leftrightarrow k+1) \rangle \approx \text{erfc}\left(\sqrt{\frac{D}{4} \cdot \frac{(r-1)^2}{r}}\right)$$
[Source: Kofke (2002), J. Chem. Phys. 117, 6911 β optimal temperature spacing for replica exchange; Predescu et al. (2004), J. Chem. Phys. 120, 6238]
For efficient PT, the target swap acceptance rate is 20--40%. Below 20%, configurations propagate too slowly through the ladder. Above 40%, too many temperatures are wasted (you could use fewer chains).
For $D = 9$ (our 2-planet TTV problem) and target acceptance rate of 30%:
$$0.30 \approx \text{erfc}\left(\sqrt{\frac{9}{4} \cdot \frac{(r-1)^2}{r}}\right)$$
The argument of erfc for 30% acceptance: $\text{erfc}^{-1}(0.30) \approx 0.733$.
$$0.733^2 = 0.537 = \frac{9}{4} \cdot \frac{(r-1)^2}{r}$$
$$\frac{(r-1)^2}{r} = \frac{4 \times 0.537}{9} = 0.239$$
$$(r-1)^2 = 0.239 r$$
$$r^2 - 2r + 1 = 0.239r$$
$$r^2 - 2.239r + 1 = 0$$
$$r = \frac{2.239 \pm \sqrt{2.239^2 - 4}}{2} = \frac{2.239 \pm \sqrt{5.013 - 4}}{2} = \frac{2.239 \pm 1.007}{2}$$
Taking the larger root: $r \approx 1.623$.
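The derivation above can be reproduced numerically with the standard library alone, inverting erfc by bisection. A sketch; `geometric_ratio` and `erfc_inv` are hypothetical helper names:

```python
import math

def erfc_inv(y, lo=0.0, hi=10.0, iters=80):
    """Invert the complementary error function by bisection
    (erfc is monotone decreasing on [0, inf))."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if math.erfc(mid) > y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def geometric_ratio(D, target_acceptance):
    """Solve erfc(sqrt(D/4 * (r-1)^2 / r)) = target for r > 1."""
    c = erfc_inv(target_acceptance) ** 2 * 4.0 / D   # c = (r-1)^2 / r
    # (r-1)^2 = c r  =>  r^2 - (2 + c) r + 1 = 0, take the larger root
    return 0.5 * ((2.0 + c) + math.sqrt((2.0 + c) ** 2 - 4.0))
```

For $D = 9$ and a 30% target this reproduces $r \approx 1.62$, matching the hand calculation.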
3.4 Setting the Maximum Temperature $T_K$¶
The hottest chain must be hot enough that $\pi_K(\theta) = \mathcal{L}(\theta)^{1/T_K} \pi_0(\theta)$ is nearly uniform over the prior support. This means the tempered likelihood must be sufficiently flat that it does not confine the chain to any single mode.
Criterion: $T_K$ should be large enough that the dynamic range of the tempered log-likelihood across the prior volume is $\lesssim 1$ nat:
$$\frac{\Delta \ln \mathcal{L}_\text{max}}{T_K} \lesssim 1$$
where $\Delta \ln \mathcal{L}_\text{max} = \ln \mathcal{L}_\text{peak} - \ln \mathcal{L}_\text{prior-edge}$ is the total log-likelihood dynamic range across the prior support.
For the TTV problem:
- The peak log-likelihood (good fit): $\ln \mathcal{L} \approx -\frac{1}{2}\chi^2_\text{min} \approx -\frac{N}{2}$ for $N$ transit times, assuming $\chi^2_\nu \approx 1$.
- The prior-edge log-likelihood (random parameters): $\ln \mathcal{L} \approx -\frac{1}{2} \sum_n (t_n - \hat{t}_n)^2 / \sigma_n^2$. For random parameters, the predicted transit times are essentially random relative to observations, giving $\chi^2 \sim N \cdot (\Delta t / \sigma)^2$, where $\Delta t$ is the typical TTV amplitude and $\sigma$ is the timing precision.
Typical values: For $N = 50$ transits, $\sigma = 30$ s, TTV amplitude $\sim 5$ min:
$$\Delta \ln \mathcal{L}_\text{max} \approx \frac{1}{2} \times 50 \times \left(\frac{300}{30}\right)^2 = \frac{1}{2} \times 50 \times 100 = 2500$$
Therefore: $T_K \gtrsim 2500$.
3.5 Putting It Together¶
With $T_1 = 1$, $T_K = 2500$, $r = 1.623$:
$$K = 1 + \frac{\ln T_K}{\ln r} = 1 + \frac{\ln 2500}{\ln 1.623} = 1 + \frac{7.824}{0.484} \approx 1 + 16.2 = 17.2$$
Use $K = 18$ chains with geometric ratio $r = 2500^{1/17} \approx 1.584$.
The temperature ladder:
| Chain | $T_k$ (rounded) |
|---|---|
| 1 | 1.00 |
| 2 | 1.58 |
| 3 | 2.51 |
| 4 | 3.98 |
| 5 | 6.30 |
| 6 | 9.99 |
| 7 | 15.8 |
| 8 | 25.1 |
| 9 | 39.7 |
| 10 | 62.9 |
| 11 | 99.7 |
| 12 | 158 |
| 13 | 250 |
| 14 | 397 |
| 15 | 628 |
| 16 | 996 |
| 17 | 1578 |
| 18 | 2500 |
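The ladder can be generated directly from the formulas in Section 3.2. A sketch; `temperature_ladder` is a hypothetical helper:

```python
import math

def temperature_ladder(T_max, r_target):
    """Geometric temperature ladder covering [1, T_max].

    Picks the smallest integer K whose geometric spacing is at least
    as fine as r_target, then recomputes the exact ratio
    r = T_max^(1/(K-1)) so the ladder ends exactly at T_max.
    """
    K = 1 + math.ceil(math.log(T_max) / math.log(r_target))
    r = T_max ** (1.0 / (K - 1))
    return [r ** k for k in range(K)], r

ladder, r = temperature_ladder(2500.0, 1.623)
```

Rounding $K$ up (rather than down) slightly tightens the spacing, which raises swap acceptance above the design target; the adaptive refinement below can then prune any surplus temperature.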
Each chain is one BOINC work unit per epoch. With 18 chains and $E = 500$ steps per epoch, each epoch is $\sim 500 \times 2\,\text{ms} = 1\,\text{s}$ of CPU time (for TTVFast). This is far too short for a BOINC work unit. Therefore, increase $E$ substantially; see Section 7 for adaptive epoch length. A practical starting point is $E = 50{,}000$ steps ($\sim 100$ s of CPU time), with swap attempts only at epoch boundaries.
Adaptive ladder refinement: After the first 10 epochs, compute the observed swap acceptance rates between all adjacent pairs. If any pair has acceptance $< 15\%$, insert an intermediate temperature. If any pair has acceptance $> 50\%$, remove the higher-temperature chain. This requires reassigning BOINC work units, which is straightforward at epoch boundaries.
4. Detecting Stalled Chains¶
4.1 Definition of "Stalled"¶
A chain is stalled if it has been trapped in a single posterior mode for $> M$ steps without transitioning to another mode. In a multi-modal posterior, a well-mixing chain should visit multiple modes over the course of a run (facilitated by PT swaps injecting configurations from hotter chains).
Operationally, a chain in a single mode will have parameter values confined to a compact region, and the autocorrelation time $\tau$ of the chain will be short (reflecting local mixing within the mode) but the chain will never make the large jumps between modes.
4.2 Practical Diagnostic¶
Primary diagnostic: Autocorrelation time divergence.
Compute the integrated autocorrelation time $\hat{\tau}_k$ for chain $k$ using the method of Sokal (1997) or the automated windowing of Goodman & Weare (2010):
$$\hat{\tau} = 1 + 2 \sum_{t=1}^{W} \hat{\rho}(t)$$
where $\hat{\rho}(t)$ is the estimated autocorrelation at lag $t$ and $W$ is chosen by the automated windowing criterion ($W$ is the smallest value where $W \geq C \hat{\tau}(W)$ for a tuning constant $C \approx 5$).
Stall criterion: Chain $k$ is stalled if:
$$\hat{\tau}_k > M / 2$$
where $M$ is the number of steps since the last successful swap injection. The interpretation: if the autocorrelation time exceeds half the available steps, the chain cannot have produced more than $\sim 2$ independent samples in that window. For a chain that should be exploring multiple modes, this indicates it is stuck.
Secondary diagnostic: Mode occupancy monitoring.
Cluster the chain samples using a simple criterion (e.g., k-means on the $\ln P_c$ dimension, which is the primary mode separator in TTV posteriors). If chain $k$ has occupied only a single cluster for $> M$ consecutive steps while other chains (at similar or lower temperatures) have visited multiple clusters, chain $k$ is stalled.
4.3 Recovery Strategy¶
Option A: Restart from a random prior draw.
Draw $\theta_k^\text{new} \sim \pi_0(\theta)$ and restart the chain from this point. This guarantees exploration of new regions but may waste many steps if the prior is broad and the new point lands in a low-likelihood desert. The chain must "find" a mode from scratch.
Option B: Restart from a sample from a hotter chain.
Take the most recent checkpoint state $\theta_j$ from the adjacent hotter chain $j = k+1$ and inject it into chain $k$. This is equivalent to forcing a swap. To maintain correctness, this injection must be accepted with the standard swap probability $A(k \leftrightarrow j)$ β but if chain $k$ is stalled in a deep mode, the swap acceptance may be low (the hot chain's state may be in a different, shallower mode from chain $k$'s perspective).
Recommendation for the TTV problem: Use Option B with a fallback to Option A.
Argument: The TTV posterior modes are well-separated in $P_c$ but the likelihood values at different modes are not wildly different (within $\Delta \ln \mathcal{L} \sim 5$--$30$). A hot chain's state is likely to be in or near a mode that the stalled chain has not visited. The swap acceptance probability is:
$$\ln A = \left(\frac{1}{T_k} - \frac{1}{T_{k+1}}\right)(\ln \mathcal{L}(\theta_{k+1}) - \ln \mathcal{L}(\theta_k))$$
If the hot chain is in a secondary mode with $\ln \mathcal{L}$ only $\sim 10$ below the primary mode, and the temperature difference is modest, the swap has reasonable acceptance probability. If the forced swap is rejected 3 consecutive times, fall back to Option A (prior restart).
Implementation:
PROCEDURE RecoverStalledChain(k):
attempts <- 0
WHILE attempts < 3:
j <- k + 1 # Next hotter chain
IF ChainState[j] exists AND ChainState[j].swap_status == IDLE:
# Attempt forced swap
ExecuteSwap(k, j)
IF swap was accepted:
RETURN # Chain k now has a new state from chain j
attempts += 1
# Fallback: restart from prior
theta_new <- sample_from_prior()
log_L_new <- compute_log_likelihood(theta_new)
log_prior_new <- compute_log_prior(theta_new)
ChainState[k].theta <- theta_new
ChainState[k].log_L <- log_L_new
ChainState[k].log_prior <- log_prior_new
record_event("chain_restart_from_prior", chain=k)
DispatchNextEpoch(k, theta_new)
5. Convergence Diagnostics¶
5.1 The Problem with Standard R-hat¶
The Gelman-Rubin $\hat{R}$ statistic compares between-chain variance $B$ to within-chain variance $W$ across $M$ parallel chains of equal length $N$:
$$\hat{R} = \sqrt{\frac{(N-1)/N \cdot W + B/N}{W}}$$
This requires all chains to have the same length and to be complete. In our async protocol, chains have different step counts (some volunteers are faster), some chains may be mid-epoch, and chains are at different temperatures (only the $T=1$ chain targets the true posterior).
5.2 Sequential Stopping Rule¶
At each server-side checkpoint of the cold chain ($T_1 = 1$), compute the running posterior statistics:
Running mean and variance: After epoch $e$ of the cold chain, with cumulative samples $\{\theta^{(1)}, \ldots, \theta^{(N_e)}\}$:
$$\hat{\mu}_e^{(d)} = \frac{1}{N_e} \sum_{n=1}^{N_e} \theta_n^{(d)} \quad \text{for each parameter dimension } d$$
$$\hat{\sigma}_e^{(d)} = \sqrt{\frac{1}{N_e - 1} \sum_{n=1}^{N_e} \left(\theta_n^{(d)} - \hat{\mu}_e^{(d)}\right)^2}$$
Stopping criterion: Define a forgetting window of the last $F$ checkpoints (e.g., $F = 20$). At each checkpoint, compute:
$$\Delta \hat{\mu}^{(d)} = \max_{e' \in [e-F,\, e]} \hat{\mu}_{e'}^{(d)} - \min_{e' \in [e-F,\, e]} \hat{\mu}_{e'}^{(d)}$$
Stop when, for all parameters $d$:
$$\frac{\Delta \hat{\mu}^{(d)}}{\hat{\sigma}_e^{(d)}} < \varepsilon$$
where $\varepsilon$ is a tolerance parameter. Recommended: $\varepsilon = 0.05$. This means the running mean has drifted by less than 5% of the posterior width over the last $F$ checkpoints; the posterior estimate has stabilized.
Additionally, require that the total effective sample size (ESS) of the cold chain exceeds a minimum:
$$\text{ESS} = \frac{N_e}{\hat{\tau}} > N_\text{min}$$
Recommended: $N_\text{min} = 1000$ effective samples for publication-quality posteriors, $N_\text{min} = 200$ for preliminary estimates.
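The stopping rule can be sketched as a pure function over the checkpoint history; `posterior_converged` and its arguments are illustrative names:

```python
def posterior_converged(mu_history, sigma, eps=0.05, F=20):
    """Sequential stopping rule from Section 5.2.

    mu_history: list of per-checkpoint running means, one list of D
                floats per checkpoint (oldest first).
    sigma:      current running std for each of the D parameters.
    Returns True when, over the last F checkpoints, every parameter's
    running mean has drifted by less than eps * its posterior std.
    """
    if len(mu_history) < F:
        return False                       # not enough checkpoints yet
    recent = mu_history[-F:]
    for d, sd in enumerate(sigma):
        vals = [mu[d] for mu in recent]
        if max(vals) - min(vals) >= eps * sd:
            return False                   # parameter d still drifting
    return True
```

The ESS gate from the text would be checked alongside this, so a chain cannot "converge" merely by being stuck (near-zero drift but tiny effective sample size).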
5.3 Rank-Normalised R-hat on Partial Chains (Vehtari et al. 2021)¶
The rank-normalised split-$\hat{R}$ (Vehtari et al. 2021, Bayesian Analysis 16, 667) improves on standard $\hat{R}$ by:
- Rank-normalising all samples (replacing values with their rank, then applying an inverse-normal transform). This makes $\hat{R}$ sensitive to differences in distribution shape, not just location and scale.
- Splitting each chain in half and treating the halves as separate chains. This detects non-stationarity within a single chain.
Modifications for async chains:
- Unequal chain lengths: Truncate all chains to the length of the shortest chain before computing $\hat{R}$. Alternatively, use importance-weighted $\hat{R}$ where chains contribute to the statistic proportional to their ESS.
- Temperature correction: Only compute $\hat{R}$ across chains at $T_1 = 1$. If there is only one cold chain, split it into 4 segments and compute split-$\hat{R}$ across segments.
- Multiple cold-chain replicas: Run $R \geq 4$ independent cold chains (each as a separate BOINC work unit stream) to enable meaningful $\hat{R}$ computation. This costs 4x the cold-chain BOINC resources but provides a robust convergence diagnostic.
Convergence threshold: $\hat{R} < 1.01$ for all parameters (rank-normalised, split). Additionally, require that the bulk ESS and tail ESS (as defined by Vehtari et al.) both exceed $N_\text{min}$.
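A self-contained numpy sketch of rank-normalised split-$\hat{R}$ follows, using the Blom rank offset and the stdlib inverse normal CDF; for production use a vetted implementation (e.g. ArviZ's `rhat` with the rank method) is preferable:

```python
import numpy as np
from statistics import NormalDist

def rank_normalised_split_rhat(chains):
    """Rank-normalised split R-hat for an (M, N) array of M chains.

    Each chain is split in half, all draws are replaced by their
    pooled average rank and mapped through the inverse normal CDF
    (Blom offset), then the classic Gelman-Rubin statistic is
    computed on the transformed draws.
    """
    x = np.asarray(chains, dtype=float)
    m, n = x.shape
    half = n // 2
    x = x[:, : 2 * half].reshape(2 * m, half)        # split each chain in two
    flat = x.ravel()
    ranks = np.empty_like(flat)
    ranks[np.argsort(flat, kind="stable")] = np.arange(1, flat.size + 1)
    nd = NormalDist()
    z = np.array([nd.inv_cdf((r - 0.375) / (flat.size + 0.25)) for r in ranks])
    z = z.reshape(x.shape)
    # classic split R-hat on the rank-normalised draws
    w = z.var(axis=1, ddof=1).mean()                 # within-chain variance W
    b = half * z.mean(axis=1).var(ddof=1)            # between-chain variance B
    var_plus = (half - 1) / half * w + b / half
    return float(np.sqrt(var_plus / w))
```

Because the draws are rank-transformed, the statistic saturates for completely separated chains rather than growing without bound, so thresholds near 1.01 remain meaningful.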
5.4 KS Test Between Adjacent Temperature Chains¶
To verify that the temperature ladder is providing adequate communication between temperatures, compare the marginal distributions of adjacent chains. After accumulating at least 500 samples from each chain:
For each parameter $d$ and each adjacent pair $(k, k+1)$:
$$\text{KS}_{k,k+1}^{(d)} = \sup_x \left| F_k^{(d)}(x) - F_{k+1}^{(d)}(x) \right|$$
with $p$-value from the two-sample KS test.
Interpretation:
- $p > 0.05$ for all parameters between adjacent chains: The tempered distributions overlap sufficiently. The temperature spacing is adequate.
- $p < 0.01$ for any parameter between adjacent chains: The distributions are too different. The temperature gap is too large. Insert an intermediate temperature.
- $0.01 < p < 0.05$: Borderline. Monitor for several more epochs.
Important caveat: Adjacent chains at different temperatures are expected to have different distributions (the tempered posterior broadens with temperature). The KS test here is not checking for identity; it is checking that the distributions overlap enough for swaps to be accepted. A more direct diagnostic is the observed swap acceptance rate itself (Section 3.5). The KS test is a secondary diagnostic used to understand why swap rates are low when they are.
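If SciPy is available server-side, `scipy.stats.ks_2samp` does this directly. Otherwise the statistic and the asymptotic Kolmogorov p-value can be sketched in numpy (the small-sample correction below is the common Numerical-Recipes-style approximation, so p-values for small samples are approximate):

```python
import numpy as np

def ks_2sample(a, b):
    """Two-sample KS statistic with the asymptotic Kolmogorov p-value."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    n, m = len(a), len(b)
    grid = np.concatenate([a, b])
    # empirical CDFs of both samples evaluated on the pooled grid
    d = np.abs(np.searchsorted(a, grid, side="right") / n
               - np.searchsorted(b, grid, side="right") / m).max()
    en = np.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 1e-3:
        return float(d), 1.0          # identical CDFs: no evidence of difference
    # Kolmogorov asymptotic series Q(lam) = 2 * sum (-1)^(k-1) exp(-2 k^2 lam^2)
    p = 2.0 * sum((-1.0) ** (k - 1) * np.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return float(d), float(min(max(p, 0.0), 1.0))
```

With 500+ samples per chain (as required above), the asymptotic p-value is adequate for the coarse thresholds used here.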
6. Server-Side Checkpoint Schema¶
6.1 SQLite Database Schema¶
-- Temperature ladder configuration for a given inference run
CREATE TABLE pt_runs (
run_id TEXT PRIMARY KEY, -- e.g., "ttv_kepler17_20260321"
target_name TEXT NOT NULL, -- e.g., "Kepler-17"
n_chains INTEGER NOT NULL, -- K
n_params INTEGER NOT NULL, -- D (dimensionality)
param_names TEXT NOT NULL, -- JSON array of parameter names
prior_config TEXT NOT NULL, -- JSON: prior bounds and types
ttvfast_config TEXT NOT NULL, -- JSON: dt, total_time, etc.
transit_data TEXT NOT NULL, -- JSON: times, errors arrays
status TEXT NOT NULL DEFAULT 'RUNNING', -- RUNNING, CONVERGED, FAILED
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
converged_at TIMESTAMP
);
-- Current state of each chain (one row per chain, updated in place)
CREATE TABLE chain_states (
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
chain_id INTEGER NOT NULL, -- k = 1..K
temperature REAL NOT NULL, -- T_k
theta BLOB NOT NULL, -- Current parameter vector (D floats, little-endian float64)
log_likelihood REAL NOT NULL,
log_prior REAL NOT NULL,
step_count INTEGER NOT NULL DEFAULT 0, -- Cumulative n_k
epoch_count INTEGER NOT NULL DEFAULT 0, -- Number of completed epochs
current_epoch_len INTEGER NOT NULL, -- E for current/next epoch
last_checkpoint_ts TIMESTAMP,
swap_status TEXT NOT NULL DEFAULT 'IDLE', -- IDLE, SWAP_PENDING, SWAP_EXECUTING
swap_partner INTEGER, -- chain_id of pending partner, or NULL
swap_pending_since TIMESTAMP,
assigned_host TEXT, -- BOINC host_id
current_wu_id TEXT, -- BOINC work unit name
acceptance_rate REAL, -- Rolling MH acceptance rate (last epoch)
autocorr_estimate REAL, -- Rolling tau estimate
is_stalled INTEGER NOT NULL DEFAULT 0, -- Boolean flag
PRIMARY KEY (run_id, chain_id)
);
-- Append-only log of all checkpoints received
CREATE TABLE checkpoints (
checkpoint_id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
chain_id INTEGER NOT NULL,
epoch INTEGER NOT NULL,
theta BLOB NOT NULL, -- Parameter vector at checkpoint
log_likelihood REAL NOT NULL,
log_prior REAL NOT NULL,
step_count INTEGER NOT NULL, -- Steps completed in this epoch
acceptance_rate REAL NOT NULL,
autocorr_estimate REAL,
received_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
boinc_host_id TEXT,
boinc_wu_id TEXT,
wall_time_s REAL -- Wall-clock time for this epoch
);
-- Thinned posterior samples from each epoch (append-only)
CREATE TABLE samples (
sample_id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
chain_id INTEGER NOT NULL,
epoch INTEGER NOT NULL,
step_in_epoch INTEGER NOT NULL, -- Which step within the epoch
theta BLOB NOT NULL, -- D floats
log_likelihood REAL NOT NULL,
log_prior REAL NOT NULL
);
-- Index for fast retrieval of cold-chain samples for convergence diagnostics
CREATE INDEX idx_samples_cold ON samples(run_id, chain_id, epoch)
WHERE chain_id = 1;
-- Swap attempt log (append-only)
CREATE TABLE swap_log (
swap_id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
chain_i INTEGER NOT NULL,
chain_j INTEGER NOT NULL,
epoch_i INTEGER NOT NULL, -- Epoch of chain i at swap time
epoch_j INTEGER NOT NULL, -- Epoch of chain j at swap time
log_L_i REAL NOT NULL,
log_L_j REAL NOT NULL,
temp_i REAL NOT NULL,
temp_j REAL NOT NULL,
log_alpha REAL NOT NULL, -- Log acceptance probability
accepted INTEGER NOT NULL, -- 0 or 1
attempted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Convergence diagnostics computed at each cold-chain checkpoint
CREATE TABLE convergence_log (
log_id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
epoch INTEGER NOT NULL,
param_index INTEGER NOT NULL, -- Which parameter (0..D-1)
running_mean REAL NOT NULL,
running_std REAL NOT NULL,
delta_mu_ratio REAL NOT NULL, -- |Delta mu| / sigma over forgetting window
ess REAL NOT NULL, -- Effective sample size
r_hat REAL, -- Rank-normalised split R-hat (if available)
computed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Index for convergence monitoring queries
CREATE INDEX idx_convergence ON convergence_log(run_id, epoch, param_index);
-- Chain lifecycle events (restarts, stall detections, ladder adjustments)
CREATE TABLE chain_events (
event_id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
chain_id INTEGER, -- NULL for run-level events
event_type TEXT NOT NULL, -- STALL_DETECTED, RESTART_FROM_HOT, RESTART_FROM_PRIOR,
-- TEMP_INSERTED, TEMP_REMOVED, CHAIN_LOST, CHAIN_REASSIGNED
details TEXT, -- JSON with event-specific metadata
occurred_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
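The `theta BLOB` convention above (D contiguous little-endian float64 values) is worth pinning down in code so that checkpoints written by heterogeneous volunteer hardware deserialise identically on the server. A minimal sketch assuming NumPy; `pack_theta`/`unpack_theta` are illustrative names, not part of the schema:

```python
import numpy as np

def pack_theta(theta):
    """Serialise a parameter vector to the chain_states.theta BLOB format:
    D contiguous little-endian float64 values."""
    return np.asarray(theta, dtype="<f8").tobytes()

def unpack_theta(blob, D):
    """Recover a length-D float64 vector from a theta BLOB."""
    theta = np.frombuffer(blob, dtype="<f8")
    if theta.shape != (D,):
        raise ValueError("BLOB length inconsistent with D")
    return theta
```

Fixing the dtype and byte order in one place avoids silent corruption when a big-endian or 32-bit volunteer host writes a checkpoint.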
6.2 Storage Estimates¶
For a single TTV run with $K = 18$ chains, $D = 9$ parameters, $E = 50{,}000$ steps per epoch, thinning factor 50, and 100 epochs:
- chain_states: 18 rows, ~1 KB each = ~18 KB (updated in place)
- checkpoints: 18 chains x 100 epochs = 1,800 rows, ~200 B each = ~360 KB
- samples: 18 chains x 100 epochs x 1,000 thinned samples = 1.8M rows, ~100 B each = ~180 MB
- swap_log: ~100 epochs x 17 adjacent pairs = 1,700 rows, ~100 B each = ~170 KB
- convergence_log: 100 epochs x 9 params = 900 rows, ~100 B each = ~90 KB
Total per run: ~180 MB, dominated by the sample store. For runs accumulating on a server with dozens of target systems, this is manageable on a standard VPS with 100 GB of disk.
For bulk sample storage, consider storing the samples.theta column as compressed numpy arrays in Backblaze B2 and keeping only summary statistics in SQLite.
7. Epoch Length Optimisation¶
7.1 The Tradeoffs¶
The epoch length $E$ controls the granularity of server-volunteer interaction:
- $E$ too small ($E < 1000$ for the TTV problem): The BOINC work unit completes in seconds. BOINC overhead (scheduling, file transfer, result processing) dominates. Volunteer machines spend more time downloading and uploading than computing. Swap opportunities are frequent, but the overhead is unacceptable.
- $E$ too large ($E > 500{,}000$): The work unit runs for hours. If the chain is stalled or the temperature ladder needs adjustment, the server cannot intervene until the epoch completes. Volunteer dropout mid-epoch wastes significant compute. Swap opportunities are too rare for effective mode communication.
- $E$ optimal: The chain moves meaningfully within an epoch (explores the local mode thoroughly), the wall-clock time is 10-30 minutes (a good BOINC work unit duration), and swap opportunities occur frequently enough for inter-mode communication.
7.2 Relationship to Local Autocorrelation Time¶
Within a single mode, the chain has a local autocorrelation time $\tau_\text{local}$ — the number of steps needed to produce one effectively independent sample. For the epoch to be "meaningful," it must contain many independent samples:
$$E \gg \tau_\text{local}$$
A good rule of thumb: $E \geq 20 \cdot \tau_\text{local}$. This ensures the chain produces $\geq 20$ independent samples per epoch, providing a reliable estimate of the local posterior structure and a meaningful state to report at checkpoint time.
7.3 Adaptive Epoch Length¶
The server adjusts $E$ for each chain based on the rolling autocorrelation estimate reported by the volunteer:
PROCEDURE UpdateEpochLength(k, reported_tau):
# Update rolling autocorrelation estimate (exponential moving average)
alpha = 0.3 # Smoothing factor
ChainState[k].autocorr_estimate =
alpha * reported_tau + (1 - alpha) * ChainState[k].autocorr_estimate
tau_est = ChainState[k].autocorr_estimate
# Target: 20 autocorrelation times per epoch
E_target = max(20 * tau_est, 5000) # Floor at 5000 steps
# Ceiling: work unit should not exceed 30 minutes wall time
# Estimate wall time: E * t_per_step (from last epoch's wall_time / step_count)
t_per_step = last_wall_time / last_step_count
E_max = floor(30 * 60 / t_per_step) # 30 minutes in steps
# Floor: work unit should be at least 5 minutes (BOINC overhead)
E_min = max(floor(5 * 60 / t_per_step), 5000)
ChainState[k].current_epoch_len = clamp(E_target, E_min, E_max)
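The pseudocode above translates directly into Python. A sketch assuming `state` is a dict mirroring the relevant `chain_states` columns; the function and argument names are illustrative:

```python
def update_epoch_length(state, reported_tau, last_wall_time, last_step_count,
                        alpha=0.3, taus_per_epoch=20, floor_steps=5000,
                        min_wall_s=5 * 60, max_wall_s=30 * 60):
    """Adaptive epoch length (Section 7.3). `state` holds
    'autocorr_estimate' and 'current_epoch_len'."""
    # Exponential moving average of the autocorrelation estimate
    prev = state.get("autocorr_estimate") or reported_tau
    tau_est = alpha * reported_tau + (1 - alpha) * prev
    state["autocorr_estimate"] = tau_est

    # Target: taus_per_epoch autocorrelation times per epoch
    E_target = max(taus_per_epoch * tau_est, floor_steps)

    # Wall-time bounds from the last epoch's measured step cost
    t_per_step = last_wall_time / last_step_count
    E_max = int(max_wall_s / t_per_step)
    E_min = max(int(min_wall_s / t_per_step), floor_steps)

    state["current_epoch_len"] = int(min(max(E_target, E_min), E_max))
    return state["current_epoch_len"]
```

At the nominal 2 ms/step, a chain reporting $\hat\tau = 1000$ gets clamped up to the 5-minute floor (150,000 steps) rather than the 20,000-step target.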
7.4 Estimating $\tau_\text{local}$ on the Volunteer Machine¶
The volunteer machine estimates the autocorrelation time from its own epoch's samples using the automated windowing estimator. This is computed at the end of the epoch before submitting the checkpoint:
import numpy as np

def estimate_autocorrelation(chain_samples):
    """
    Estimate the integrated autocorrelation time using Sokal's
    automatic windowing procedure.

    chain_samples: array of shape (E, D) — one row per step, D parameters
    Returns: tau_hat (scalar, maximum over all parameters)
    """
    tau_per_param = []
    for d in range(chain_samples.shape[1]):
        x = chain_samples[:, d]
        x = x - x.mean()
        n = len(x)
        # FFT-based autocorrelation (zero-padded to avoid circular wraparound)
        f = np.fft.fft(x, n=2 * n)
        acf = np.fft.ifft(f * np.conj(f))[:n].real
        acf /= acf[0]
        # Sokal windowing: truncate at the first W with W >= C * cumulative_tau(W)
        C = 5.0
        cumtau = 0.5  # Lag-0 contribution
        for t in range(1, n):
            cumtau += acf[t]
            if t >= C * cumtau:
                break
        tau_per_param.append(2 * cumtau)  # tau = 1 + 2 * sum of positive lags
    return max(tau_per_param)
The maximum over parameters is reported because the slowest-mixing parameter dominates the effective epoch quality. This estimate is included in the checkpoint submission.
7.5 Initial Epoch Length (Before Any $\tau$ Estimate)¶
For the first epoch of a new run, use a conservative default:
$$E_\text{init} = 50{,}000$$
At $\sim 2$ ms per TTVFast call, this is $\sim 100$ s of CPU time — deliberately shorter than the 10-30 minute target, so the first $\tau$ estimate comes back quickly. After the first epoch returns with a $\tau$ estimate, the adaptive procedure takes over.
8. Full System Diagram¶
8.1 Mermaid Sequence Diagram¶
sequenceDiagram
participant V1 as Volunteer Machine (Chain i)
participant V2 as Volunteer Machine (Chain j)
participant BS as BOINC Scheduler
participant SS as Swap Server (Assimilator)
participant DB as SQLite Checkpoint Store
Note over SS,DB: Initialisation: Create run, seed K chain states in DB
SS->>BS: Create work unit WU_i (chain i, epoch 1, state θ_i)
SS->>BS: Create work unit WU_j (chain j, epoch 1, state θ_j)
BS->>V1: Dispatch WU_i (input: θ_i, T_i, transit data)
BS->>V2: Dispatch WU_j (input: θ_j, T_j, transit data)
Note over V1: Run E steps of tempered MCMC at T_i
Note over V2: Run E steps of tempered MCMC at T_j
V1->>BS: Submit result (θ_i', log L_i', samples, τ̂)
BS->>SS: Assimilator receives result for chain i
SS->>DB: Update ChainState[i] with θ_i', log L_i'
SS->>DB: Store checkpoint and samples
SS->>DB: Check: Is chain j SWAP_PENDING for chain i?
alt Chain j has not yet reported
SS->>DB: Set ChainState[i].swap_status = SWAP_PENDING
Note over SS: Wait up to W seconds for chain j
V2->>BS: Submit result (θ_j', log L_j', samples, τ̂)
BS->>SS: Assimilator receives result for chain j
SS->>DB: Update ChainState[j] with θ_j', log L_j'
SS->>DB: Chain i is SWAP_PENDING → execute swap logic
SS->>SS: Compute log α = (1/T_i - 1/T_j)(log L_j' - log L_i')
SS->>SS: Draw U ~ Uniform(0,1)
alt Swap accepted (log U < log α)
SS->>DB: Swap θ between chains i and j
SS->>DB: Log swap (accepted=True)
else Swap rejected
SS->>DB: Log swap (accepted=False)
end
SS->>DB: Reset swap_status for both chains to IDLE
SS->>BS: Create WU_i (epoch 2, state θ_i or swapped θ_j)
SS->>BS: Create WU_j (epoch 2, state θ_j or swapped θ_i)
else Swap window W expires before chain j reports
SS->>DB: Clear SWAP_PENDING for chain i
SS->>BS: Create WU_i (epoch 2, state θ_i' — no swap)
Note over SS: Chain j will submit later, gets dispatched then
end
BS->>V1: Dispatch next epoch work unit
BS->>V2: Dispatch next epoch work unit
Note over V1,V2: Cycle repeats until convergence criteria met
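The swap decision at the centre of this sequence reduces to a few lines. A minimal sketch of the log-form acceptance rule from Section 1.1; the function names are ours, not part of any BOINC API:

```python
import math
import random

def swap_log_alpha(log_L_i, log_L_j, T_i, T_j):
    """Log acceptance probability for exchanging states between the chains
    at temperatures T_i and T_j (log form of the PT swap rule)."""
    return (1.0 / T_i - 1.0 / T_j) * (log_L_j - log_L_i)

def accept_swap(log_L_i, log_L_j, T_i, T_j, rng=random):
    """Metropolis accept/reject for the proposed swap."""
    return math.log(rng.random()) < swap_log_alpha(log_L_i, log_L_j, T_i, T_j)
```

Note the rule is symmetric in the pair: if the hotter chain holds the better state, `log_alpha > 0` and the swap is always accepted.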
8.2 ASCII State Machine for a Single Chain¶
                ┌──────────────────┐
                │   INITIALISED    │
                │  (seeded state)  │
                └────────┬─────────┘
                         │
                         │ DispatchNextEpoch()
                         ▼
                ┌──────────────────┐
                │    DISPATCHED    │
                │   (WU sent to    │
                │   BOINC client)  │
                └────────┬─────────┘
                         │
            ┌────────────┼────────────┐
            │            │            │
       Volunteer     Volunteer  BOINC deadline
       completes      crashes      expires
            │            │            │
            ▼            │            ▼
     ┌──────────────┐    │   ┌──────────────────┐
     │  CHECKPOINT  │    │   │    TIMED_OUT     │
     │   RECEIVED   │    │   │   (chain lost)   │
     └──────┬───────┘    │   └────────┬─────────┘
            │            │            │
            │            │   Reassign to new
            │            │   volunteer from
            │            │   last checkpoint
            │            │            │
            │            └────────────┘
            │
     ┌──────┴────────────┐
     │                   │
Adjacent chain   No adjacent chain
 also recent         available
     │                   │
     ▼                   ▼
┌──────────────┐ ┌──────────────┐
│ SWAP_PENDING │ │     IDLE     │
│ (wait ≤ W s) │ │  (dispatch   │
└──────┬───────┘ │  next epoch) │
       │         └──────────────┘
   ┌───┴──────────┐
   │              │
Partner        Timeout
arrives      (W expires)
   │              │
   ▼              ▼
┌──────────┐  ┌──────────────┐
│   SWAP   │  │     IDLE     │
│ EXECUTE  │  │  (dispatch   │
│ (accept/ │  │  next epoch, │
│  reject) │  │   no swap)   │
└────┬─────┘  └──────────────┘
     │
     ▼
┌──────────────┐
│     IDLE     │
│  (dispatch   │
│  next epoch  │
│  with new or │
│  same state) │
└──────────────┘
8.3 Complete Message Flow Across All Components¶
┌─────────────┐    ┌──────────────┐    ┌──────────────────┐    ┌──────────┐
│  Volunteer  │    │    BOINC     │    │   Swap Server    │    │  SQLite  │
│   Machine   │    │  Scheduler   │    │  (Assimilator)   │    │    DB    │
└──────┬──────┘    └──────┬───────┘    └────────┬─────────┘    └────┬─────┘
       │                  │                     │                   │
       │ Poll for work    │                     │                   │
       │ ────────────────>│                     │                   │
       │                  │                     │                   │
       │ WU (input files) │                     │                   │
       │ <────────────────│                     │                   │
       │                  │                     │                   │
       │ [Run E MCMC      │                     │                   │
       │  steps locally]  │                     │                   │
       │                  │                     │                   │
       │ Upload result    │                     │                   │
       │ ────────────────>│                     │                   │
       │                  │ Validated result    │                   │
       │                  │ ───────────────────>│                   │
       │                  │                     │                   │
       │                  │                     │ UPDATE chain_state│
       │                  │                     │ ─────────────────>│
       │                  │                     │                   │
       │                  │                     │ INSERT checkpoint │
       │                  │                     │ ─────────────────>│
       │                  │                     │                   │
       │                  │                     │ INSERT samples    │
       │                  │                     │ ─────────────────>│
       │                  │                     │                   │
       │                  │                     │ CHECK swap partner│
       │                  │                     │ ─────────────────>│
       │                  │                     │ partner status    │
       │                  │                     │ <─────────────────│
       │                  │                     │                   │
       │                  │                     │ [Swap logic:      │
       │                  │                     │  accept/reject/   │
       │                  │                     │  pend/skip]       │
       │                  │                     │                   │
       │                  │                     │ INSERT swap_log   │
       │                  │                     │ ─────────────────>│
       │                  │                     │                   │
       │                  │                     │ Compute           │
       │                  │                     │ convergence diag. │
       │                  │                     │                   │
       │                  │                     │ INSERT convergence│
       │                  │                     │ ─────────────────>│
       │                  │                     │                   │
       │                  │ create_work         │                   │
       │                  │ <───────────────────│                   │
       │                  │ (next epoch WU)     │                   │
       │                  │                     │                   │
9. Failure Mode Specification¶
9.1 Volunteer Machine Disappears Mid-Epoch¶
Scenario: A volunteer running chain $k$'s epoch shuts down their computer, loses internet, or pauses the BOINC client.
Detection: BOINC's transitioner marks the work unit as timed out after the deadline (configured per work unit, typically $2 \times$ expected wall time + 24 hours buffer).
Server response:
- The chain's `chain_states` row retains the last checkpoint state (from the end of the previous epoch). No state is lost except the in-progress epoch's samples.
- The work unit is marked as ERROR in BOINC. If `target_nresults > 1`, BOINC automatically resends to another volunteer.
- If `target_nresults = 1` (our default for non-validation runs), the Swap Server detects the timeout and creates a new work unit from the last checkpoint state.
- Any pending swaps involving chain $k$ are cleared after $W$ seconds (by the SwapTimeoutDaemon).
- The partner chain continues normally — it simply missed a swap opportunity.
Impact: Loss of $E$ steps of compute (one epoch). No corruption of any chain state. The temperature ladder operates with $K-1$ chains until the new work unit is dispatched and picked up.
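BOINC's transitioner is the source of truth for work-unit timeouts, but the Swap Server can run its own sweep against the Section 6 schema as a backstop. A sketch; the function name and the caller-supplied threshold are illustrative assumptions:

```python
import sqlite3

def find_timed_out_chains(conn, run_id, deadline_s):
    """Return (chain_id, current_wu_id) pairs whose last checkpoint is older
    than deadline_s seconds, skipping chains mid-swap."""
    return conn.execute(
        """
        SELECT chain_id, current_wu_id
        FROM chain_states
        WHERE run_id = ?
          AND swap_status != 'SWAP_EXECUTING'
          AND last_checkpoint_ts IS NOT NULL
          AND (JULIANDAY('now') - JULIANDAY(last_checkpoint_ts)) * 86400 > ?
        """,
        (run_id, deadline_s),
    ).fetchall()
```

Each chain returned would get a fresh work unit created from its last checkpoint state, exactly as in the server response above.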
9.2 Server Goes Down During Swap Execution¶
Scenario: The Swap Server crashes after computing the swap acceptance but before writing the result to the database.
Detection: On restart, the server finds chains with swap_status = SWAP_EXECUTING in the database.
Recovery procedure:
- For any chain with `swap_status = 'SWAP_EXECUTING'`: reset to `IDLE` and dispatch from the pre-swap state (which is still in the database, since the swap write was not committed).
- Both chains involved in the interrupted swap restart from their pre-swap states. This is equivalent to the swap being rejected — both chains continue from their own states.
- The atomic unit of the swap operation is the database transaction that writes both chains' new states. Use an explicit SQLite transaction with `BEGIN IMMEDIATE` to ensure the swap is all-or-nothing:
BEGIN IMMEDIATE;
UPDATE chain_states SET theta = ?, log_likelihood = ?, log_prior = ?, swap_status = 'IDLE'
WHERE run_id = ? AND chain_id = ?; -- chain i receives chain j's state
UPDATE chain_states SET theta = ?, log_likelihood = ?, log_prior = ?, swap_status = 'IDLE'
WHERE run_id = ? AND chain_id = ?; -- chain j receives chain i's state
INSERT INTO swap_log (...) VALUES (...);
COMMIT;
If the server crashes before COMMIT, the transaction is rolled back and both chains retain their pre-swap states.
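Driven from Python's `sqlite3`, the same all-or-nothing guarantee looks like the sketch below (written against the Section 6 schema; `execute_swap` is an illustrative name, and the `swap_log` insert is elided with a comment):

```python
import sqlite3

def execute_swap(conn, run_id, i, j):
    """Atomically exchange the states of chains i and j. If anything fails
    before COMMIT, SQLite rolls the whole swap back."""
    conn.isolation_level = None  # manage transactions explicitly
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")
    try:
        rows = {c: cur.execute(
            "SELECT theta, log_likelihood, log_prior FROM chain_states "
            "WHERE run_id = ? AND chain_id = ?", (run_id, c)).fetchone()
            for c in (i, j)}
        # Each chain receives its partner's theta, log L, and log prior
        for dst, src in ((i, j), (j, i)):
            cur.execute(
                "UPDATE chain_states SET theta = ?, log_likelihood = ?, "
                "log_prior = ?, swap_status = 'IDLE', swap_partner = NULL "
                "WHERE run_id = ? AND chain_id = ?",
                (*rows[src], run_id, dst))
        # (INSERT INTO swap_log ... goes here, inside the same transaction)
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
```

`BEGIN IMMEDIATE` takes the write lock up front, so a concurrent checkpoint handler blocks until the swap transaction finishes (Section 9.3).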
9.3 Two Chains Arrive Simultaneously for Swap¶
Scenario: Chains $i$ and $j$ (adjacent in the temperature ladder) both submit checkpoints at the same instant, and both are processed by the server concurrently.
Prevention: The SwapLock mutex serialises all swap operations. Only one HandleCheckpoint procedure executes the swap-decision logic at a time. The first checkpoint to acquire the lock is processed first; the second acquires the lock after the first releases it, finds the partner's status already updated, and proceeds accordingly.
In practice (SQLite): SQLite's write lock provides natural serialisation for the swap logic. Wrap the entire HandleCheckpoint procedure in a single database transaction. The second concurrent call blocks on the write lock until the first completes.
9.4 Swap Window Expires for a Pending Chain While Its Partner Is In Transit¶
Scenario: Chain $i$ submits a checkpoint and enters SWAP_PENDING. The swap window $W$ expires. Chain $i$ is dispatched without a swap. Moments later, chain $j$ submits its checkpoint.
Handling: When chain $j$'s checkpoint arrives, the server checks chain $i$'s status and finds it IDLE (already dispatched). Chain $j$ enters SWAP_PENDING if another adjacent chain is available, or is dispatched immediately without a swap.
Impact: One missed swap opportunity. This is the expected behaviour and is by design. The protocol tolerates missed swaps gracefully.
9.5 Volunteer Returns Corrupted or Adversarial Results¶
Scenario: A volunteer's hardware has a floating-point error, or an adversarial volunteer submits fabricated results.
Detection:
- Sanity checks on checkpoint: The server validates that $\log \mathcal{L}$ is finite and within a plausible range (e.g., $> -10^6$). The parameter vector $\theta$ must be within prior bounds. If these checks fail, reject the result and redispatch.
- BOINC redundancy: For critical runs (publication-quality inference), set `target_nresults = 2` and use the KS-test validation described in [[BOINC Integration Plan]], Section 3. Two independent volunteers must produce statistically consistent chains.
- Statistical anomaly detection: If a chain's log-likelihood suddenly jumps by $> 100$ nats between consecutive checkpoints (more than any plausible MCMC move), flag the result for manual review.
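The first and third checks are mechanical. A sketch of a checkpoint validator with the thresholds from this section; the constants and function name are assumptions to tune per run:

```python
import math

LOG_L_FLOOR = -1e6      # plausible log-likelihood floor (Section 9.5)
MAX_JUMP_NATS = 100.0   # flag-for-review threshold

def validate_checkpoint(theta, log_L, prior_lo, prior_hi, prev_log_L=None):
    """Return (ok, reason). A failed check means reject-and-redispatch;
    a large jump only flags for manual review rather than rejecting."""
    if not math.isfinite(log_L) or log_L < LOG_L_FLOOR:
        return False, "implausible log-likelihood"
    if any(not (lo <= t <= hi)
           for t, lo, hi in zip(theta, prior_lo, prior_hi)):
        return False, "theta outside prior bounds"
    if prev_log_L is not None and abs(log_L - prev_log_L) > MAX_JUMP_NATS:
        return True, "flag: log-likelihood jump > 100 nats"
    return True, "ok"
```

Running this before the `chain_states` update means a corrupted result never becomes the chain's canonical state.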
9.6 All Volunteers Disappear (Total Fleet Loss)¶
Scenario: Every active volunteer machine goes offline (e.g., a BOINC client update causes a mass disconnection).
Server response: All chains stall. The server's SwapTimeoutDaemon clears all pending swaps. All chain states are preserved in the database. When volunteers return, the server re-dispatches all chains from their last checkpoint states.
Impact: The inference run pauses. No state is lost. This is the normal operating mode of BOINC β projects routinely experience periods of low volunteer availability.
10. Protocol Parameter Summary¶
| Parameter | Symbol | Recommended Value | Rationale |
|---|---|---|---|
| Number of chains | $K$ | 18 | Geometric ladder from $T=1$ to $T=2500$ with 30% swap rate |
| Cold temperature | $T_1$ | 1.0 | Targets true posterior |
| Hot temperature | $T_K$ | 2500 | Flattens likelihood across prior volume for TTV problem |
| Geometric ratio | $r$ | 1.60 | Yields ~30% swap acceptance for $D=9$ |
| Epoch length (initial) | $E_\text{init}$ | 50,000 steps | ~100 s wall time at 2 ms/step |
| Epoch length (adaptive range) | $E$ | 5,000 -- 900,000 | 5-30 min wall time, with an absolute floor of 5,000 steps |
| Swap window | $W$ | 300 s (5 min) | Tolerance for volunteer arrival jitter |
| Stall threshold | $M$ | 500,000 steps | 10 epochs at default $E$ |
| Convergence tolerance | $\varepsilon$ | 0.05 | $\Delta\mu / \sigma < 5\%$ over forgetting window |
| Forgetting window | $F$ | 20 checkpoints | Look-back for stability assessment |
| Minimum ESS | $N_\text{min}$ | 1,000 | For publication-quality posteriors |
| R-hat threshold | $\hat{R}$ | < 1.01 | Rank-normalised split R-hat |
| KS p-value threshold | | > 0.05 | Adjacent chain distribution overlap |
| Thinning factor | | 50 | Store every 50th sample |
| BOINC WU deadline | | $4 \times$ expected wall time | Buffer for slow volunteers |
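For deployment, the table can be mirrored as a single frozen config object (values taken from the table; the class and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PTProtocolConfig:
    n_chains: int = 18               # K
    T_cold: float = 1.0              # T_1
    T_hot: float = 2500.0            # T_K
    geometric_ratio: float = 1.60    # r
    epoch_init: int = 50_000         # E_init
    epoch_min: int = 5_000
    epoch_max: int = 900_000
    swap_window_s: float = 300.0     # W
    stall_threshold: int = 500_000   # M
    convergence_tol: float = 0.05    # epsilon
    forgetting_window: int = 20      # F (checkpoints)
    min_ess: float = 1_000.0
    r_hat_max: float = 1.01
    ks_p_min: float = 0.05
    thinning: int = 50
    wu_deadline_factor: float = 4.0
```

As a consistency check on the table itself, $T_1 \cdot r^{K-1} = 1.6^{17} \approx 2950$, which lands near the quoted ladder top $T_K = 2500$ to within ladder rounding.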
11. Relationship to Existing Vault Notes¶
This protocol operationalises the parallel tempering strategy described as "Option D (best for Stage 1)" in [[BOINC Integration Plan]], Section 2.1, which proposed: "Each temperature chain is an independent work unit. After all complete, implement the PTSampler swap step server-side."
The original proposal assumed synchronous completion ("after all complete"), which is unrealistic for BOINC. This document replaces that assumption with an asynchronous swap protocol that tolerates arbitrary volunteer timing while preserving the statistical guarantees of PT-MCMC.
The MCMC implementation details (TTVFast integration, parameter space, priors) follow [[Reverse N-Body and Astrometric Parallax]], Sections 1-4. The compute budget estimates from that document ($\sim 2$ ms/call, $\sim 15$ min per 2-planet system on 1 core) inform the epoch length and temperature ladder sizing here.
The broader compute architecture context is in [[Distributed Computing Strategy]], which positions BOINC as a Stage 2-3 tool. This protocol is designed for Stage 2 deployment, when the number of target systems exceeds what can be handled on a single VPS.
See also: [[BOINC Integration Plan]] for server setup and work unit packaging. See also: [[Distributed Computing Strategy]] for the full compute architecture. See also: [[Reverse N-Body and Astrometric Parallax]] for the TTV posterior geometry.