MCMC State Synchronisation Protocol for BOINC Parallel Tempering¶
This document specifies a complete protocol for running parallel tempering MCMC across distributed BOINC volunteer machines for the OpenAstro TTV n-body inference problem. It covers the algorithm, the async swap mechanism, failure modes, convergence diagnostics, and the server-side data schema.
1. Parallel Tempering Fundamentals¶
1.1 The Standard PT Algorithm¶
Parallel tempering (replica exchange MCMC) runs $K$ Markov chains simultaneously, each sampling a tempered version of the target posterior. Chain $k$ samples from:
$$\pi_k(\theta) \propto \left[\mathcal{L}(\theta)\right]^{1/T_k} \cdot \pi_0(\theta)$$
where $T_1 = 1 < T_2 < \cdots < T_K$ is the temperature ladder, $\mathcal{L}(\theta)$ is the likelihood, and $\pi_0(\theta)$ is the prior. The $T_1 = 1$ chain samples the true posterior. Higher-temperature chains flatten the likelihood surface, enabling exploration across modes separated by deep likelihood valleys.
[Source: Swendsen & Wang (1986), Phys. Rev. Lett. 57, 2607 (replica exchange); Earl & Deem (2005), Phys. Chem. Chem. Phys. 7, 3910 (parallel tempering review)]
Within-temperature moves: Each chain $k$ independently performs standard Metropolis-Hastings (or ensemble) moves targeting $\pi_k(\theta)$. For a proposal $\theta' \sim q(\theta' | \theta_k)$, the acceptance probability is:
$$A_\text{MH} = \min\left(1,\; \frac{\mathcal{L}(\theta')^{1/T_k}\, \pi_0(\theta')\, q(\theta_k | \theta')}{\mathcal{L}(\theta_k)^{1/T_k}\, \pi_0(\theta_k)\, q(\theta' | \theta_k)}\right)$$
Inter-chain swaps: Periodically, adjacent chains $i$ and $j = i+1$ propose to exchange their states $\theta_i \leftrightarrow \theta_j$. The swap acceptance probability preserves detailed balance of the joint chain:
$$A(i \leftrightarrow j) = \min\left(1,\; \frac{\mathcal{L}(\theta_j)^{1/T_i} \cdot \mathcal{L}(\theta_i)^{1/T_j}}{\mathcal{L}(\theta_i)^{1/T_i} \cdot \mathcal{L}(\theta_j)^{1/T_j}}\right)$$
Equivalently in log form (numerically stable):
$$\ln A(i \leftrightarrow j) = \min\left(0,\; \left(\frac{1}{T_i} - \frac{1}{T_j}\right) \cdot \left(\ln \mathcal{L}(\theta_j) - \ln \mathcal{L}(\theta_i)\right)\right)$$
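The log-form rule translates directly into code. A minimal sketch, assuming chain states are held as dicts with illustrative keys `theta`, `log_L`, and `T` (not part of any fixed API):

```python
import math
import random

def attempt_swap(state_i, state_j, rng=None):
    """Metropolis swap move between two tempered chain states.

    Each state is a dict with (illustrative) keys 'theta', 'log_L'
    and 'T'. On acceptance the states are exchanged in place.
    """
    rng = rng or random.Random()
    # ln A = min(0, (1/T_i - 1/T_j) * (ln L_j - ln L_i))
    delta_beta = 1.0 / state_i["T"] - 1.0 / state_j["T"]
    log_alpha = min(0.0, delta_beta * (state_j["log_L"] - state_i["log_L"]))
    accepted = log_alpha >= 0.0 or rng.random() < math.exp(log_alpha)
    if accepted:
        for key in ("theta", "log_L"):
            state_i[key], state_j[key] = state_j[key], state_i[key]
    return accepted
```

Note that when the hotter chain holds the better likelihood, $\ln A = 0$ and the swap is accepted deterministically; the random draw only matters for unfavourable proposals.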
1.2 Required Swap Frequency¶
For PT to function, swaps must occur frequently enough that favourable configurations discovered by hot chains propagate down to the cold chain before the hot chain drifts away. The standard guideline:
- Attempt swaps every $S = 10$--$100$ within-chain steps. Too infrequent ($S > 500$): chains decouple, the hot chain's mode discoveries never reach the cold chain. Too frequent ($S < 5$): computational overhead from swap logic dominates, and chains don't move far enough between swaps for the swap to be informative.
- For the TTV problem with $\tau_\text{local} \approx 20$--$50$ steps (autocorrelation within a single mode), $S \approx 50$ is a reasonable starting point: one swap attempt per local decorrelation time.
1.3 The Synchronisation Requirement¶
Theoretical requirement: The swap step requires both chains $i$ and $j$ to be paused simultaneously. The acceptance probability $A(i \leftrightarrow j)$ is computed from the states $(\theta_i, \theta_j)$ at the same logical time. If chain $i$ has advanced $n$ steps past the point where its state was read while chain $j$ is evaluated at the current point, the swap no longer satisfies detailed balance for the joint process. The Markov property of the joint $(K$-chain$)$ system requires that the swap decision is made on the current joint state, not a stale snapshot.
Why this is fundamentally incompatible with BOINC:
BOINC operates on a fire-and-forget work unit model. A work unit is dispatched to a volunteer machine, runs for minutes to hours, and returns a result. There is:
- No mechanism for synchronous pausing. You cannot send a message to a running BOINC work unit saying "pause at step 2,500 and report your state." The work unit runs to completion (or checkpoints on its own schedule) and reports back. There is no bidirectional communication during execution.
- No guarantee of simultaneous completion. Two work units dispatched at the same time to two volunteer machines may complete hours or days apart, depending on the volunteer's CPU speed, uptime, and competing workloads.
- No guarantee of completion at all. A volunteer machine may shut down, lose network connectivity, or have the BOINC client paused indefinitely. The server must tolerate chains that simply vanish.
Therefore, a direct implementation of synchronous PT over BOINC is impossible. What follows is an asynchronous approximation that preserves correctness of individual chains while allowing swaps to occur opportunistically.
2. The Async Swap Protocol¶
2.1 Design Principles¶
The protocol is designed around three invariants:
Invariant 1 (Chain correctness): Each individual chain $k$, considered in isolation, is a valid MCMC chain targeting $\pi_k(\theta)$. Swaps are state injections that, when they occur, are accepted or rejected according to the correct Metropolis-Hastings criterion. When swaps do not occur (due to timing failures), the chain simply continues its own random walk. Missing a swap is equivalent to proposing a swap and rejecting it; both result in the chain continuing from its current state.
Invariant 2 (Graceful degradation): If all volunteer machines disappear except one, that one chain continues to produce valid (though poorly mixing) samples from its tempered distribution. If only the cold chain survives, it produces valid posterior samples. The system never enters an invalid state due to missing chains.
Invariant 3 (At-most-once swaps): A chain's state is never used in two simultaneous swaps. The server enforces mutual exclusion on swap operations.
2.2 Server-Side State¶
For each chain $k \in \{1, \ldots, K\}$, the server maintains:
ChainState[k] = {
theta_k: float[D] # Current parameter vector (D dimensions)
log_L_k: float # log-likelihood at theta_k
log_prior_k: float # log-prior at theta_k
T_k: float # Temperature assignment
n_k: int # Total step counter (cumulative across epochs)
epoch_k: int # Epoch counter (number of completed epochs)
last_checkpoint_ts: datetime # Timestamp of last checkpoint submission
swap_status: enum # One of: IDLE, SWAP_PENDING, SWAP_EXECUTING
swap_partner: int or null # Chain index of pending swap partner
swap_pending_since: datetime # When swap_status became SWAP_PENDING
assigned_volunteer: string # BOINC host ID currently running this chain
work_unit_id: string # Current BOINC work unit ID
}
Global server state:
SwapWindow: W seconds # Maximum time to wait for a swap partner
EpochLength: E steps # Steps per work unit (adjustable per chain)
SwapLock: mutex # Global lock for swap operations
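As a sketch, the per-chain record above might be represented server-side as a Python dataclass; field names mirror the pseudocode and are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum, auto
from typing import Optional

class SwapStatus(Enum):
    IDLE = auto()
    SWAP_PENDING = auto()
    SWAP_EXECUTING = auto()

@dataclass
class ChainState:
    """Server-side record for one tempered chain (names illustrative)."""
    chain_id: int
    temperature: float                          # T_k
    theta: list                                 # current parameter vector (length D)
    log_L: float                                # log-likelihood at theta
    log_prior: float                            # log-prior at theta
    step_count: int = 0                         # cumulative n_k
    epoch_count: int = 0                        # completed epochs
    last_checkpoint_ts: Optional[datetime] = None
    swap_status: SwapStatus = SwapStatus.IDLE
    swap_partner: Optional[int] = None          # pending partner chain, if any
    swap_pending_since: Optional[datetime] = None
    assigned_volunteer: Optional[str] = None    # BOINC host ID
    work_unit_id: Optional[str] = None          # current BOINC work unit
```

The enum makes illegal swap states unrepresentable, which simplifies the mutual-exclusion reasoning in Invariant 3.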
2.3 Chain Execution (Volunteer Side)¶
A BOINC work unit for chain $k$ contains:
Input:
{
"chain_id": k,
"temperature": T_k,
"initial_state": theta_k,
"log_likelihood": log_L_k,
"log_prior": log_prior_k,
"epoch_length": E,
"transit_data": { "times": [...], "errors": [...] },
"prior_bounds": { ... },
"ttvfast_config": { "dt": 0.05, "total_time": 1500 },
"random_seed": seed_k_epoch
}
Execution on volunteer machine:
1. Load initial state theta from input
2. For step = 1 to E:
a. Propose theta' from proposal distribution q(theta' | theta)
b. Compute log_L' = log_likelihood(theta', transit_data) via TTVFast
c. Compute log_prior' = log_prior(theta')
d. Compute log_alpha = (1/T_k) * (log_L' - log_L) + (log_prior' - log_prior)
+ log q(theta | theta') - log q(theta' | theta)
e. If log(U) < log_alpha (U ~ Uniform(0,1)):
theta <- theta', log_L <- log_L', log_prior <- log_prior'
f. Record sample: store (theta, log_L, log_prior) every thin_interval steps
g. Checkpoint to disk every checkpoint_interval steps (for BOINC restart)
3. Submit result to server:
{
"chain_id": k,
"final_state": theta,
"final_log_L": log_L,
"final_log_prior": log_prior,
"step_count": E,
"samples": thinned_chain_array,
"acceptance_rate": n_accepted / E,
"autocorrelation_estimate": tau_hat
}
The volunteer machine has no knowledge of swaps. It runs a standard tempered MCMC chain and reports back. All swap logic is server-side.
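The volunteer-side epoch loop can be sketched in executable Python. A symmetric Gaussian random-walk proposal is assumed (so the $q$-ratio in step (d) cancels), and the TTVFast-based likelihood is a stand-in callable; all names and defaults here are illustrative:

```python
import math
import random

def run_epoch(theta, log_L, log_prior_fn, log_like_fn, T, E,
              step_size=0.1, thin=10, rng=None):
    """One volunteer-side epoch of tempered Metropolis-Hastings.

    Uses a symmetric Gaussian random-walk proposal, so the proposal
    ratio cancels. log_like_fn stands in for the TTVFast likelihood.
    Returns the final state, thinned samples, and the acceptance rate.
    """
    rng = rng or random.Random()
    log_prior = log_prior_fn(theta)
    samples, n_accepted = [], 0
    for step in range(1, E + 1):
        theta_p = [x + rng.gauss(0.0, step_size) for x in theta]
        log_prior_p = log_prior_fn(theta_p)
        if log_prior_p > -math.inf:            # skip likelihood outside prior support
            log_L_p = log_like_fn(theta_p)
            log_alpha = (log_L_p - log_L) / T + (log_prior_p - log_prior)
            if math.log(1.0 - rng.random()) < log_alpha:
                theta, log_L, log_prior = theta_p, log_L_p, log_prior_p
                n_accepted += 1
        if step % thin == 0:                   # record thinned sample
            samples.append((list(theta), log_L))
    return theta, log_L, samples, n_accepted / E
```

The checkpoint-to-disk step is omitted; in a real BOINC app it would serialize `(step, theta, log_L, log_prior, rng state)` at `checkpoint_interval` boundaries.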
2.4 Server-Side Swap Logic¶
When the server receives a checkpoint from chain $i$:
PROCEDURE HandleCheckpoint(chain_i, checkpoint_data):
ACQUIRE SwapLock
# Step 1: Update chain i's state
ChainState[i].theta_i <- checkpoint_data.final_state
ChainState[i].log_L_i <- checkpoint_data.final_log_L
ChainState[i].log_prior_i <- checkpoint_data.final_log_prior
ChainState[i].n_i += checkpoint_data.step_count
ChainState[i].epoch_i += 1
ChainState[i].last_checkpoint_ts <- NOW()
store checkpoint_data.samples in SampleStore[i]
# Step 2: Identify swap partner
# Adjacent chains in the temperature ladder: j = i-1 or j = i+1
# Choose the partner that has SWAP_PENDING status for chain i
# If neither is pending, choose the adjacent chain with the most
# recent checkpoint (prefer the chain that has been waiting longest)
j <- NULL
FOR each adjacent chain index adj IN {i-1, i+1} (if valid):
IF ChainState[adj].swap_status == SWAP_PENDING
AND ChainState[adj].swap_partner == i:
j <- adj
BREAK
# Step 3: Execute swap or mark as pending
IF j IS NOT NULL:
# Partner j was waiting for us. Attempt the swap.
ExecuteSwap(i, j)
ELSE:
# No partner ready. Check if any adjacent chain is IDLE and
# its last checkpoint is within the swap window.
FOR each adjacent chain index adj IN {i-1, i+1} (if valid):
IF ChainState[adj].swap_status == IDLE
AND (NOW() - ChainState[adj].last_checkpoint_ts) < W seconds:
# The adjacent chain recently checked in but isn't waiting.
# We cannot swap (its state may be stale or it is mid-epoch).
# Mark ourselves as pending for this partner.
ChainState[i].swap_status <- SWAP_PENDING
ChainState[i].swap_partner <- adj
ChainState[i].swap_pending_since <- NOW()
RELEASE SwapLock
RETURN response: "WAIT" (hold response for up to W seconds)
# No adjacent chain is available. Continue without swap.
ChainState[i].swap_status <- IDLE
ChainState[i].swap_partner <- NULL
RELEASE SwapLock
DispatchNextEpoch(i, ChainState[i].theta_i)
RETURN
RELEASE SwapLock
RETURN
PROCEDURE ExecuteSwap(i, j):
# Both chains i and j have current states on the server.
# Compute acceptance probability.
delta_beta <- (1.0 / ChainState[i].T_i) - (1.0 / ChainState[j].T_j)
delta_logL <- ChainState[j].log_L_j - ChainState[i].log_L_i
log_alpha <- delta_beta * delta_logL
U <- random_uniform(0, 1)
IF log(U) < log_alpha:
# Accept swap: exchange states
SWAP(ChainState[i].theta, ChainState[j].theta)
SWAP(ChainState[i].log_L, ChainState[j].log_L)
SWAP(ChainState[i].log_prior, ChainState[j].log_prior)
record_swap(i, j, accepted=True)
ELSE:
# Reject swap: both chains keep their own states
record_swap(i, j, accepted=False)
# Reset swap status for both chains
ChainState[i].swap_status <- IDLE
ChainState[i].swap_partner <- NULL
ChainState[j].swap_status <- IDLE
ChainState[j].swap_partner <- NULL
# Dispatch next epochs for both chains
DispatchNextEpoch(i, ChainState[i].theta_i)
DispatchNextEpoch(j, ChainState[j].theta_j)
PROCEDURE SwapTimeoutDaemon():
# Runs every W/2 seconds on the server
ACQUIRE SwapLock
FOR each chain k:
IF ChainState[k].swap_status == SWAP_PENDING
AND (NOW() - ChainState[k].swap_pending_since) > W:
# Swap window expired. Clear pending status.
ChainState[k].swap_status <- IDLE
ChainState[k].swap_partner <- NULL
DispatchNextEpoch(k, ChainState[k].theta_k)
RELEASE SwapLock
2.5 Correctness Argument¶
Claim 1: Individual chain correctness is preserved.
Each chain $k$ runs a standard Metropolis-Hastings algorithm targeting $\pi_k(\theta) \propto \mathcal{L}(\theta)^{1/T_k} \pi_0(\theta)$ during each epoch. The within-epoch transitions satisfy detailed balance by construction (standard MH). Between epochs, one of two things happens:
- (a) No swap: The chain continues from its current state $\theta_k$. This is identical to simply continuing the Markov chain β no violation of the Markov property.
- (b) Swap attempted: The server replaces $\theta_k$ with $\theta_j$ (from chain $j$) with probability $A(k \leftrightarrow j)$, or retains $\theta_k$ with probability $1 - A(k \leftrightarrow j)$. This is a Metropolis-Hastings move on the joint state space with proposal "take the other chain's state." The acceptance probability $A(k \leftrightarrow j)$ ensures that detailed balance holds for the joint distribution $\pi_k(\theta_k) \pi_j(\theta_j)$ under the swap move, which in turn ensures that the marginal chain $k$ still targets $\pi_k$.
Missing a swap is statistically equivalent to rejecting one, provided the decision to skip is driven by timing alone. In synchronous PT, a swap is proposed and either accepted or rejected. In our async protocol, when timing prevents a swap, it is as if the swap was proposed and rejected. Because the timing events (volunteer speed, network latency, checkpoint arrival order) are independent of the chain states, this amounts to a state-independent thinning of swap attempts, which leaves each chain's stationary distribution unaffected. A skip rule that depended on the states themselves would not enjoy this guarantee.
The only subtlety is that the state $\theta_j$ used in the swap acceptance computation may be from an earlier wall-clock time than $\theta_k$ (chain $j$ checked in and has been held pending; chain $k$ just reported now). This is safe because a chain in SWAP_PENDING has no work unit in flight: its checkpoint state is its definitive current state, and the server dispatches its next epoch only after the swap resolves or the window times out. Between checkpoints, the volunteer machine's internal MCMC trajectory is self-consistent, and the server never modifies a running work unit's state.
Claim 2: The stationary distribution converges to the target as swap frequency increases.
In the limit where $E \to 1$ (single-step epochs) and $W \to \infty$ (infinite swap window, guaranteeing all partners check in), the protocol reduces to synchronous PT with swap attempts every step. Standard PT theory guarantees convergence to the product distribution $\prod_k \pi_k(\theta_k)$, and the $T_1 = 1$ marginal converges to the true posterior.
As $E$ increases (longer epochs, fewer swap opportunities), the mixing between temperature levels degrades, but each chain individually remains correct. The system interpolates smoothly between "perfect PT" ($E=1$) and "independent chains at fixed temperatures" ($E = \infty$). Even in the worst case, the cold chain is a valid (if slowly mixing) MCMC sampler for the posterior.
Claim 3: Graceful degradation under volunteer failure.
If a volunteer machine disappears mid-epoch, the BOINC transitioner will eventually mark the work unit as timed out (after the BOINC deadline, typically 24-48 hours). The server then:
- Marks the chain as STALLED (see Section 4).
- Reassigns the chain to a new volunteer, starting from the last checkpoint state.
- The temperature ladder has $K-1$ active chains in the interim. Swap attempts involving the missing chain are simply skipped (equivalent to rejected swaps).
- If the chain is permanently lost, the server removes it from the ladder and reindexes adjacent swap pairs. The remaining $K-1$ chains continue correctly, with a gap in the temperature schedule that reduces swap acceptance between the chains on either side of the gap.
No chain's correctness is compromised by the loss of any other chain.
3. Temperature Ladder Design¶
3.1 The Multi-Modal Structure of the TTV Posterior¶
The TTV posterior for a 2-planet system exhibits multiple known degeneracies (detailed in [[Reverse N-Body and Astrometric Parallax]], Section 3):
- Period ratio ambiguity: Multiple mean-motion resonances ($j:(j-k)$) produce the same TTV super-period, creating isolated posterior modes in $P_c$.
- Phase ambiguity: For incomplete TTV phase coverage, bimodality in the perturber's conjunction time $t_{0,c}$.
- Mass-eccentricity degeneracy: $m_c$ and $e_c$ trade off to produce similar TTV amplitudes, creating elongated ridges in parameter space.
For a typical system near the 2:1 resonance, there are 2-6 distinct posterior modes. The likelihood ratio between the global mode and secondary modes can be $\Delta \ln \mathcal{L} \sim 5$--$30$.
3.2 Geometric Temperature Spacing¶
The standard approach for setting temperatures is geometric spacing:
$$T_k = T_1 \cdot r^{k-1}, \quad k = 1, \ldots, K$$
where $r > 1$ is the geometric ratio. This gives:
$$T_K = T_1 \cdot r^{K-1}$$
Solving for $r$:
$$r = \left(\frac{T_K}{T_1}\right)^{1/(K-1)}$$
With $T_1 = 1$:
$$r = T_K^{1/(K-1)}$$
3.3 Swap Acceptance Rate Criterion¶
The expected swap acceptance rate between adjacent chains $k$ and $k+1$ in the geometric schedule, for a $D$-dimensional Gaussian posterior approximation, is approximately:
$$\langle A(k \leftrightarrow k+1) \rangle \approx \text{erfc}\left(\sqrt{\frac{D}{4} \cdot \frac{(r-1)^2}{r}}\right)$$
[Source: Kofke (2002), J. Chem. Phys. 117, 6911 β optimal temperature spacing for replica exchange; Predescu et al. (2004), J. Chem. Phys. 120, 6238]
For efficient PT, the target swap acceptance rate is 20--40%. Below 20%, configurations propagate too slowly through the ladder. Above 40%, too many temperatures are wasted (you could use fewer chains).
For $D = 9$ (our 2-planet TTV problem) and target acceptance rate of 30%:
$$0.30 \approx \text{erfc}\left(\sqrt{\frac{9}{4} \cdot \frac{(r-1)^2}{r}}\right)$$
The argument of erfc for 30% acceptance: $\text{erfc}^{-1}(0.30) \approx 0.733$.
$$0.733^2 = 0.537 = \frac{9}{4} \cdot \frac{(r-1)^2}{r}$$
$$\frac{(r-1)^2}{r} = \frac{4 \times 0.537}{9} = 0.239$$
$$(r-1)^2 = 0.239 r$$
$$r^2 - 2r + 1 = 0.239r$$
$$r^2 - 2.239r + 1 = 0$$
$$r = \frac{2.239 \pm \sqrt{2.239^2 - 4}}{2} = \frac{2.239 \pm \sqrt{5.013 - 4}}{2} = \frac{2.239 \pm 1.007}{2}$$
Taking the larger root: $r \approx 1.623$.
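The derivation above can be reproduced numerically with the standard library alone, inverting erfc by bisection. A sketch; `geometric_ratio` and `erfc_inv` are hypothetical helper names:

```python
import math

def erfc_inv(y, lo=0.0, hi=10.0, iters=80):
    """Invert the complementary error function by bisection
    (erfc is monotone decreasing on [0, inf))."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if math.erfc(mid) > y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def geometric_ratio(D, target_acceptance):
    """Solve erfc(sqrt(D/4 * (r-1)^2 / r)) = target for r > 1."""
    c = erfc_inv(target_acceptance) ** 2 * 4.0 / D   # c = (r-1)^2 / r
    # (r-1)^2 = c r  =>  r^2 - (2 + c) r + 1 = 0, take the larger root
    return 0.5 * ((2.0 + c) + math.sqrt((2.0 + c) ** 2 - 4.0))
```

For $D = 9$ and a 30% target this reproduces $r \approx 1.62$, matching the hand calculation.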
3.4 Setting the Maximum Temperature $T_K$¶
The hottest chain must be hot enough that $\pi_K(\theta) = \mathcal{L}(\theta)^{1/T_K} \pi_0(\theta)$ is nearly uniform over the prior support. This means the tempered likelihood must be sufficiently flat that it does not confine the chain to any single mode.
Criterion: $T_K$ should be large enough that the dynamic range of the tempered log-likelihood across the prior volume is $\lesssim 1$ nat:
$$\frac{\Delta \ln \mathcal{L}_\text{max}}{T_K} \lesssim 1$$
where $\Delta \ln \mathcal{L}_\text{max} = \ln \mathcal{L}_\text{peak} - \ln \mathcal{L}_\text{prior-edge}$ is the total log-likelihood dynamic range across the prior support.
For the TTV problem:
- The peak log-likelihood (good fit): $\ln \mathcal{L} \approx -\frac{1}{2}\chi^2_\text{min} \approx -\frac{N}{2}$ for $N$ transit times, assuming $\chi^2_\nu \approx 1$.
- The prior-edge log-likelihood (random parameters): $\ln \mathcal{L} \approx -\frac{1}{2} \sum_n (t_n - \hat{t}_n)^2 / \sigma_n^2$. For random parameters, the predicted transit times are essentially random relative to observations, giving $\chi^2 \sim N \cdot (\Delta t / \sigma)^2$, where $\Delta t$ is the typical TTV amplitude and $\sigma$ is the timing precision.
Typical values: For $N = 50$ transits, $\sigma = 30$ s, TTV amplitude $\sim 5$ min:
$$\Delta \ln \mathcal{L}_\text{max} \approx \frac{1}{2} \times 50 \times \left(\frac{300}{30}\right)^2 = \frac{1}{2} \times 50 \times 100 = 2500$$
Therefore: $T_K \gtrsim 2500$.
3.5 Putting It Together¶
With $T_1 = 1$, $T_K = 2500$, $r = 1.623$:
$$K = 1 + \frac{\ln T_K}{\ln r} = 1 + \frac{\ln 2500}{\ln 1.623} = 1 + \frac{7.824}{0.484} \approx 1 + 16.2 = 17.2$$
Use $K = 18$ chains with geometric ratio $r = 2500^{1/17} \approx 1.584$.
The temperature ladder:
| Chain | $T_k$ (rounded) |
|---|---|
| 1 | 1.00 |
| 2 | 1.58 |
| 3 | 2.51 |
| 4 | 3.98 |
| 5 | 6.30 |
| 6 | 9.99 |
| 7 | 15.8 |
| 8 | 25.1 |
| 9 | 39.7 |
| 10 | 62.9 |
| 11 | 99.7 |
| 12 | 158 |
| 13 | 250 |
| 14 | 397 |
| 15 | 628 |
| 16 | 996 |
| 17 | 1578 |
| 18 | 2500 |
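The ladder can be generated directly from the formulas in Section 3.2. A sketch; `temperature_ladder` is a hypothetical helper:

```python
import math

def temperature_ladder(T_max, r_target):
    """Geometric temperature ladder covering [1, T_max].

    Picks the smallest integer K whose geometric spacing is at least
    as fine as r_target, then recomputes the exact ratio
    r = T_max^(1/(K-1)) so the ladder ends exactly at T_max.
    """
    K = 1 + math.ceil(math.log(T_max) / math.log(r_target))
    r = T_max ** (1.0 / (K - 1))
    return [r ** k for k in range(K)], r

ladder, r = temperature_ladder(2500.0, 1.623)
```

Rounding $K$ up (rather than down) slightly tightens the spacing, which raises swap acceptance above the design target; the adaptive refinement below can then prune any surplus temperature.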
Each chain is one BOINC work unit per epoch. With 18 chains and $E = 500$ steps per epoch, each epoch is $\sim 500 \times 2\,\text{ms} = 1\,\text{s}$ of CPU time (for TTVFast). This is far too short for a BOINC work unit. Therefore, increase $E$ substantially; see Section 7 for adaptive epoch length. A practical starting point is $E = 50{,}000$ steps ($\sim 100$ s of CPU time), with swap attempts only at epoch boundaries.
Adaptive ladder refinement: After the first 10 epochs, compute the observed swap acceptance rates between all adjacent pairs. If any pair has acceptance $< 15\%$, insert an intermediate temperature. If any pair has acceptance $> 50\%$, remove the higher-temperature chain. This requires reassigning BOINC work units, which is straightforward at epoch boundaries.
4. Detecting Stalled Chains¶
4.1 Definition of "Stalled"¶
A chain is stalled if it has been trapped in a single posterior mode for $> M$ steps without transitioning to another mode. In a multi-modal posterior, a well-mixing chain should visit multiple modes over the course of a run (facilitated by PT swaps injecting configurations from hotter chains).
Operationally, a chain in a single mode will have parameter values confined to a compact region, and the autocorrelation time $\tau$ of the chain will be short (reflecting local mixing within the mode) but the chain will never make the large jumps between modes.
4.2 Practical Diagnostic¶
Primary diagnostic: Autocorrelation time divergence.
Compute the integrated autocorrelation time $\hat{\tau}_k$ for chain $k$ using the method of Sokal (1997) or the automated windowing of Goodman & Weare (2010):
$$\hat{\tau} = 1 + 2 \sum_{t=1}^{W} \hat{\rho}(t)$$
where $\hat{\rho}(t)$ is the estimated autocorrelation at lag $t$ and $W$ is chosen by the automated windowing criterion ($W$ is the smallest value where $W \geq C \hat{\tau}(W)$ for a tuning constant $C \approx 5$).
Stall criterion: Chain $k$ is stalled if:
$$\hat{\tau}_k > M / 2$$
where $M$ is the number of steps since the last successful swap injection. The interpretation: if the autocorrelation time exceeds half the available steps, the chain cannot have produced more than $\sim 2$ independent samples in that window. For a chain that should be exploring multiple modes, this indicates it is stuck.
Secondary diagnostic: Mode occupancy monitoring.
Cluster the chain samples using a simple criterion (e.g., k-means on the $\ln P_c$ dimension, which is the primary mode separator in TTV posteriors). If chain $k$ has occupied only a single cluster for $> M$ consecutive steps while other chains (at similar or lower temperatures) have visited multiple clusters, chain $k$ is stalled.
4.3 Recovery Strategy¶
Option A: Restart from a random prior draw.
Draw $\theta_k^\text{new} \sim \pi_0(\theta)$ and restart the chain from this point. This guarantees exploration of new regions but may waste many steps if the prior is broad and the new point lands in a low-likelihood desert. The chain must "find" a mode from scratch.
Option B: Restart from a sample from a hotter chain.
Take the most recent checkpoint state $\theta_j$ from the adjacent hotter chain $j = k+1$ and inject it into chain $k$. This is equivalent to forcing a swap. To maintain correctness, this injection must be accepted with the standard swap probability $A(k \leftrightarrow j)$ β but if chain $k$ is stalled in a deep mode, the swap acceptance may be low (the hot chain's state may be in a different, shallower mode from chain $k$'s perspective).
Recommendation for the TTV problem: Use Option B with a fallback to Option A.
Argument: The TTV posterior modes are well-separated in $P_c$ but the likelihood values at different modes are not wildly different (within $\Delta \ln \mathcal{L} \sim 5$--$30$). A hot chain's state is likely to be in or near a mode that the stalled chain has not visited. The swap acceptance probability is:
$$\ln A = \left(\frac{1}{T_k} - \frac{1}{T_{k+1}}\right)(\ln \mathcal{L}(\theta_{k+1}) - \ln \mathcal{L}(\theta_k))$$
If the hot chain is in a secondary mode with $\ln \mathcal{L}$ only $\sim 10$ below the primary mode, and the temperature difference is modest, the swap has reasonable acceptance probability. If the forced swap is rejected 3 consecutive times, fall back to Option A (prior restart).
Implementation:
PROCEDURE RecoverStalledChain(k):
attempts <- 0
WHILE attempts < 3:
j <- k + 1 # Next hotter chain
IF ChainState[j] exists AND ChainState[j].swap_status == IDLE:
# Attempt forced swap
ExecuteSwap(k, j)
IF swap was accepted:
RETURN # Chain k now has a new state from chain j
attempts += 1
# Fallback: restart from prior
theta_new <- sample_from_prior()
log_L_new <- compute_log_likelihood(theta_new)
log_prior_new <- compute_log_prior(theta_new)
ChainState[k].theta <- theta_new
ChainState[k].log_L <- log_L_new
ChainState[k].log_prior <- log_prior_new
record_event("chain_restart_from_prior", chain=k)
DispatchNextEpoch(k, theta_new)
5. Convergence Diagnostics¶
5.1 The Problem with Standard R-hat¶
The Gelman-Rubin $\hat{R}$ statistic compares between-chain variance $B$ to within-chain variance $W$ across $M$ parallel chains of equal length $N$:
$$\hat{R} = \sqrt{\frac{(N-1)/N \cdot W + B/N}{W}}$$
This requires all chains to have the same length and to be complete. In our async protocol, chains have different step counts (some volunteers are faster), some chains may be mid-epoch, and chains are at different temperatures (only the $T=1$ chain targets the true posterior).
5.2 Sequential Stopping Rule¶
At each server-side checkpoint of the cold chain ($T_1 = 1$), compute the running posterior statistics:
Running mean and variance: After epoch $e$ of the cold chain, with cumulative samples $\{\theta^{(1)}, \ldots, \theta^{(N_e)}\}$:
$$\hat{\mu}_e^{(d)} = \frac{1}{N_e} \sum_{n=1}^{N_e} \theta_n^{(d)} \quad \text{for each parameter dimension } d$$
$$\hat{\sigma}_e^{(d)} = \sqrt{\frac{1}{N_e - 1} \sum_{n=1}^{N_e} \left(\theta_n^{(d)} - \hat{\mu}_e^{(d)}\right)^2}$$
Stopping criterion: Define a forgetting window of the last $F$ checkpoints (e.g., $F = 20$). At each checkpoint, compute:
$$\Delta \hat{\mu}^{(d)} = \max_{e' \in [e-F,\, e]} \hat{\mu}_{e'}^{(d)} - \min_{e' \in [e-F,\, e]} \hat{\mu}_{e'}^{(d)}$$
Stop when, for all parameters $d$:
$$\frac{\Delta \hat{\mu}^{(d)}}{\hat{\sigma}_e^{(d)}} < \varepsilon$$
where $\varepsilon$ is a tolerance parameter. Recommended: $\varepsilon = 0.05$. This means the running mean has drifted by less than 5% of the posterior width over the last $F$ checkpoints; the posterior estimate has stabilized.
Additionally, require that the total effective sample size (ESS) of the cold chain exceeds a minimum:
$$\text{ESS} = \frac{N_e}{\hat{\tau}} > N_\text{min}$$
Recommended: $N_\text{min} = 1000$ effective samples for publication-quality posteriors, $N_\text{min} = 200$ for preliminary estimates.
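The stopping rule can be sketched as a pure function over the checkpoint history; `posterior_converged` and its arguments are illustrative names:

```python
def posterior_converged(mu_history, sigma, eps=0.05, F=20):
    """Sequential stopping rule from Section 5.2.

    mu_history: list of per-checkpoint running means, one list of D
                floats per checkpoint (oldest first).
    sigma:      current running std for each of the D parameters.
    Returns True when, over the last F checkpoints, every parameter's
    running mean has drifted by less than eps * its posterior std.
    """
    if len(mu_history) < F:
        return False                       # not enough checkpoints yet
    recent = mu_history[-F:]
    for d, sd in enumerate(sigma):
        vals = [mu[d] for mu in recent]
        if max(vals) - min(vals) >= eps * sd:
            return False                   # parameter d still drifting
    return True
```

The ESS gate from the text would be checked alongside this, so a chain cannot "converge" merely by being stuck (near-zero drift but tiny effective sample size).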
5.3 Rank-Normalised R-hat on Partial Chains (Vehtari et al. 2021)¶
The rank-normalised split-$\hat{R}$ (Vehtari et al. 2021, Bayesian Analysis 16, 667) improves on standard $\hat{R}$ by:
- Rank-normalising all samples (replacing values with their rank, then applying an inverse-normal transform). This makes $\hat{R}$ sensitive to differences in distribution shape, not just location and scale.
- Splitting each chain in half and treating the halves as separate chains. This detects non-stationarity within a single chain.
Modifications for async chains:
- Unequal chain lengths: Truncate all chains to the length of the shortest chain before computing $\hat{R}$. Alternatively, use importance-weighted $\hat{R}$ where chains contribute to the statistic proportional to their ESS.
- Temperature correction: Only compute $\hat{R}$ across chains at $T_1 = 1$. If there is only one cold chain, split it into 4 segments and compute split-$\hat{R}$ across segments.
- Multiple cold-chain replicas: Run $R \geq 4$ independent cold chains (each as a separate BOINC work unit stream) to enable meaningful $\hat{R}$ computation. This costs 4x the cold-chain BOINC resources but provides a robust convergence diagnostic.
Convergence threshold: $\hat{R} < 1.01$ for all parameters (rank-normalised, split). Additionally, require that the bulk ESS and tail ESS (as defined by Vehtari et al.) both exceed $N_\text{min}$.
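A self-contained numpy sketch of rank-normalised split-$\hat{R}$ follows, using the Blom rank offset and the stdlib inverse normal CDF; for production use a vetted implementation (e.g. ArviZ's `rhat` with the rank method) is preferable:

```python
import numpy as np
from statistics import NormalDist

def rank_normalised_split_rhat(chains):
    """Rank-normalised split R-hat for an (M, N) array of M chains.

    Each chain is split in half, all draws are replaced by their
    pooled average rank and mapped through the inverse normal CDF
    (Blom offset), then the classic Gelman-Rubin statistic is
    computed on the transformed draws.
    """
    x = np.asarray(chains, dtype=float)
    m, n = x.shape
    half = n // 2
    x = x[:, : 2 * half].reshape(2 * m, half)        # split each chain in two
    flat = x.ravel()
    ranks = np.empty_like(flat)
    ranks[np.argsort(flat, kind="stable")] = np.arange(1, flat.size + 1)
    nd = NormalDist()
    z = np.array([nd.inv_cdf((r - 0.375) / (flat.size + 0.25)) for r in ranks])
    z = z.reshape(x.shape)
    # classic split R-hat on the rank-normalised draws
    w = z.var(axis=1, ddof=1).mean()                 # within-chain variance W
    b = half * z.mean(axis=1).var(ddof=1)            # between-chain variance B
    var_plus = (half - 1) / half * w + b / half
    return float(np.sqrt(var_plus / w))
```

Because the draws are rank-transformed, the statistic saturates for completely separated chains rather than growing without bound, so thresholds near 1.01 remain meaningful.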
5.4 KS Test Between Adjacent Temperature Chains¶
To verify that the temperature ladder is providing adequate communication between temperatures, compare the marginal distributions of adjacent chains. After accumulating at least 500 samples from each chain:
For each parameter $d$ and each adjacent pair $(k, k+1)$:
$$\text{KS}_{k,k+1}^{(d)} = \sup_x \left| F_k^{(d)}(x) - F_{k+1}^{(d)}(x) \right|$$
with $p$-value from the two-sample KS test.
Interpretation:
- $p > 0.05$ for all parameters between adjacent chains: The tempered distributions overlap sufficiently. The temperature spacing is adequate.
- $p < 0.01$ for any parameter between adjacent chains: The distributions are too different. The temperature gap is too large. Insert an intermediate temperature.
- $0.01 < p < 0.05$: Borderline. Monitor for several more epochs.
Important caveat: Adjacent chains at different temperatures are expected to have different distributions (the tempered posterior broadens with temperature). The KS test here is not checking for identity; it is checking that the distributions overlap enough for swaps to be accepted. A more direct diagnostic is the observed swap acceptance rate itself (Section 3.5). The KS test is a secondary diagnostic used to understand why swap rates are low when they are.
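If SciPy is available server-side, `scipy.stats.ks_2samp` does this directly. Otherwise the statistic and the asymptotic Kolmogorov p-value can be sketched in numpy (the small-sample correction below is the common Numerical-Recipes-style approximation, so p-values for small samples are approximate):

```python
import numpy as np

def ks_2sample(a, b):
    """Two-sample KS statistic with the asymptotic Kolmogorov p-value."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    n, m = len(a), len(b)
    grid = np.concatenate([a, b])
    # empirical CDFs of both samples evaluated on the pooled grid
    d = np.abs(np.searchsorted(a, grid, side="right") / n
               - np.searchsorted(b, grid, side="right") / m).max()
    en = np.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 1e-3:
        return float(d), 1.0          # identical CDFs: no evidence of difference
    # Kolmogorov asymptotic series Q(lam) = 2 * sum (-1)^(k-1) exp(-2 k^2 lam^2)
    p = 2.0 * sum((-1.0) ** (k - 1) * np.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return float(d), float(min(max(p, 0.0), 1.0))
```

With 500+ samples per chain (as required above), the asymptotic p-value is adequate for the coarse thresholds used here.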
6. Server-Side Checkpoint Schema¶
6.1 SQLite Database Schema¶
-- Temperature ladder configuration for a given inference run
CREATE TABLE pt_runs (
run_id TEXT PRIMARY KEY, -- e.g., "ttv_kepler17_20260321"
target_name TEXT NOT NULL, -- e.g., "Kepler-17"
n_chains INTEGER NOT NULL, -- K
n_params INTEGER NOT NULL, -- D (dimensionality)
param_names TEXT NOT NULL, -- JSON array of parameter names
prior_config TEXT NOT NULL, -- JSON: prior bounds and types
ttvfast_config TEXT NOT NULL, -- JSON: dt, total_time, etc.
transit_data TEXT NOT NULL, -- JSON: times, errors arrays
status TEXT NOT NULL DEFAULT 'RUNNING', -- RUNNING, CONVERGED, FAILED
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
converged_at TIMESTAMP
);
-- Current state of each chain (one row per chain, updated in place)
CREATE TABLE chain_states (
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
chain_id INTEGER NOT NULL, -- k = 1..K
temperature REAL NOT NULL, -- T_k
theta BLOB NOT NULL, -- Current parameter vector (D floats, little-endian float64)
log_likelihood REAL NOT NULL,
log_prior REAL NOT NULL,
step_count INTEGER NOT NULL DEFAULT 0, -- Cumulative n_k
epoch_count INTEGER NOT NULL DEFAULT 0, -- Number of completed epochs
current_epoch_len INTEGER NOT NULL, -- E for current/next epoch
last_checkpoint_ts TIMESTAMP,
swap_status TEXT NOT NULL DEFAULT 'IDLE', -- IDLE, SWAP_PENDING, SWAP_EXECUTING
swap_partner INTEGER, -- chain_id of pending partner, or NULL
swap_pending_since TIMESTAMP,
assigned_host TEXT, -- BOINC host_id
current_wu_id TEXT, -- BOINC work unit name
acceptance_rate REAL, -- Rolling MH acceptance rate (last epoch)
autocorr_estimate REAL, -- Rolling tau estimate
is_stalled INTEGER NOT NULL DEFAULT 0, -- Boolean flag
PRIMARY KEY (run_id, chain_id)
);
-- Append-only log of all checkpoints received
CREATE TABLE checkpoints (
checkpoint_id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
chain_id INTEGER NOT NULL,
epoch INTEGER NOT NULL,
theta BLOB NOT NULL, -- Parameter vector at checkpoint
log_likelihood REAL NOT NULL,
log_prior REAL NOT NULL,
step_count INTEGER NOT NULL, -- Steps completed in this epoch
acceptance_rate REAL NOT NULL,
autocorr_estimate REAL,
received_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
boinc_host_id TEXT,
boinc_wu_id TEXT,
wall_time_s REAL -- Wall-clock time for this epoch
);
-- Thinned posterior samples from each epoch (append-only)
CREATE TABLE samples (
sample_id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
chain_id INTEGER NOT NULL,
epoch INTEGER NOT NULL,
step_in_epoch INTEGER NOT NULL, -- Which step within the epoch
theta BLOB NOT NULL, -- D floats
log_likelihood REAL NOT NULL,
log_prior REAL NOT NULL
);
-- Index for fast retrieval of cold-chain samples for convergence diagnostics
CREATE INDEX idx_samples_cold ON samples(run_id, chain_id, epoch)
WHERE chain_id = 1;
-- Swap attempt log (append-only)
CREATE TABLE swap_log (
swap_id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
chain_i INTEGER NOT NULL,
chain_j INTEGER NOT NULL,
epoch_i INTEGER NOT NULL, -- Epoch of chain i at swap time
epoch_j INTEGER NOT NULL, -- Epoch of chain j at swap time
log_L_i REAL NOT NULL,
log_L_j REAL NOT NULL,
temp_i REAL NOT NULL,
temp_j REAL NOT NULL,
log_alpha REAL NOT NULL, -- Log acceptance probability
accepted INTEGER NOT NULL, -- 0 or 1
attempted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Convergence diagnostics computed at each cold-chain checkpoint
CREATE TABLE convergence_log (
log_id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
epoch INTEGER NOT NULL,
param_index INTEGER NOT NULL, -- Which parameter (0..D-1)
running_mean REAL NOT NULL,
running_std REAL NOT NULL,
delta_mu_ratio REAL NOT NULL, -- |Delta mu| / sigma over forgetting window
ess REAL NOT NULL, -- Effective sample size
r_hat REAL, -- Rank-normalised split R-hat (if available)
computed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Index for convergence monitoring queries
CREATE INDEX idx_convergence ON convergence_log(run_id, epoch, param_index);
-- Chain lifecycle events (restarts, stall detections, ladder adjustments)
CREATE TABLE chain_events (
event_id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL REFERENCES pt_runs(run_id),
chain_id INTEGER, -- NULL for run-level events
event_type TEXT NOT NULL, -- STALL_DETECTED, RESTART_FROM_HOT, RESTART_FROM_PRIOR,
-- TEMP_INSERTED, TEMP_REMOVED, CHAIN_LOST, CHAIN_REASSIGNED
details TEXT, -- JSON with event-specific metadata
occurred_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
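The `theta BLOB` convention above (D contiguous little-endian float64 values) is worth pinning down in code so that checkpoints written by heterogeneous volunteer hardware deserialise identically on the server. A minimal sketch assuming NumPy; `pack_theta`/`unpack_theta` are illustrative names, not part of the schema:

```python
import numpy as np

def pack_theta(theta):
    """Serialise a parameter vector to the chain_states.theta BLOB format:
    D contiguous little-endian float64 values."""
    return np.asarray(theta, dtype="<f8").tobytes()

def unpack_theta(blob, D):
    """Recover a length-D float64 vector from a theta BLOB."""
    theta = np.frombuffer(blob, dtype="<f8")
    if theta.shape != (D,):
        raise ValueError("BLOB length inconsistent with D")
    return theta
```

Fixing the dtype and byte order in one place avoids silent corruption when a big-endian or 32-bit volunteer host writes a checkpoint.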
6.2 Storage Estimates¶
For a single TTV run with $K = 18$ chains, $D = 9$ parameters, $E = 50{,}000$ steps per epoch, thinning factor 50, and 100 epochs:
- chain_states: 18 rows, ~1 KB each = ~18 KB (updated in place)
- checkpoints: 18 chains x 100 epochs = 1,800 rows, ~200 B each = ~360 KB
- samples: 18 chains x 100 epochs x 1,000 thinned samples = 1.8M rows, ~100 B each = ~180 MB
- swap_log: ~100 epochs x 17 adjacent pairs = 1,700 rows, ~100 B each = ~170 KB
- convergence_log: 100 epochs x 9 params = 900 rows, ~100 B each = ~90 KB
Total per run: ~180 MB, dominated by the sample store. For runs accumulating on a server with dozens of target systems, this is manageable on a standard VPS with 100 GB of disk.
For bulk sample storage, consider storing the samples.theta column as compressed numpy arrays in Backblaze B2 and keeping only summary statistics in SQLite.
7. Epoch Length Optimisation¶
7.1 The Tradeoffs¶
The epoch length $E$ controls the granularity of server-volunteer interaction:
- $E$ too small ($E < 1000$ for the TTV problem): The BOINC work unit completes in seconds. BOINC overhead (scheduling, file transfer, result processing) dominates. Volunteer machines spend more time downloading and uploading than computing. Swap opportunities are frequent, but the overhead is unacceptable.
- $E$ too large ($E > 500{,}000$): The work unit runs for hours. If the chain is stalled or the temperature ladder needs adjustment, the server cannot intervene until the epoch completes. Volunteer dropout mid-epoch wastes significant compute. Swap opportunities are too rare for effective mode communication.
- $E$ optimal: The chain moves meaningfully within an epoch (explores the local mode thoroughly), the wall-clock time is 10-30 minutes (a good BOINC work unit duration), and swap opportunities occur frequently enough for inter-mode communication.
7.2 Relationship to Local Autocorrelation Time¶
Within a single mode, the chain has a local autocorrelation time $\tau_\text{local}$ — the number of steps needed to produce one effectively independent sample. For the epoch to be "meaningful," it must contain many independent samples:
$$E \gg \tau_\text{local}$$
A good rule of thumb: $E \geq 20 \cdot \tau_\text{local}$. This ensures the chain produces $\geq 20$ independent samples per epoch, providing a reliable estimate of the local posterior structure and a meaningful state to report at checkpoint time.
7.3 Adaptive Epoch Length¶
The server adjusts $E$ for each chain based on the rolling autocorrelation estimate reported by the volunteer:
PROCEDURE UpdateEpochLength(k, reported_tau):
# Update rolling autocorrelation estimate (exponential moving average)
alpha = 0.3 # Smoothing factor
ChainState[k].autocorr_estimate =
alpha * reported_tau + (1 - alpha) * ChainState[k].autocorr_estimate
tau_est = ChainState[k].autocorr_estimate
# Target: 20 autocorrelation times per epoch
E_target = max(20 * tau_est, 5000) # Floor at 5000 steps
# Ceiling: work unit should not exceed 30 minutes wall time
# Estimate wall time: E * t_per_step (from last epoch's wall_time / step_count)
t_per_step = last_wall_time / last_step_count
E_max = floor(30 * 60 / t_per_step) # 30 minutes in steps
# Floor: work unit should be at least 5 minutes (BOINC overhead)
E_min = max(floor(5 * 60 / t_per_step), 5000)
ChainState[k].current_epoch_len = clamp(E_target, E_min, E_max)
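The pseudocode above translates directly into Python. A sketch assuming `state` is a dict mirroring the relevant `chain_states` columns; the function and argument names are illustrative:

```python
def update_epoch_length(state, reported_tau, last_wall_time, last_step_count,
                        alpha=0.3, taus_per_epoch=20, floor_steps=5000,
                        min_wall_s=5 * 60, max_wall_s=30 * 60):
    """Adaptive epoch length (Section 7.3). `state` holds
    'autocorr_estimate' and 'current_epoch_len'."""
    # Exponential moving average of the autocorrelation estimate
    prev = state.get("autocorr_estimate") or reported_tau
    tau_est = alpha * reported_tau + (1 - alpha) * prev
    state["autocorr_estimate"] = tau_est

    # Target: taus_per_epoch autocorrelation times per epoch
    E_target = max(taus_per_epoch * tau_est, floor_steps)

    # Wall-time bounds from the last epoch's measured step cost
    t_per_step = last_wall_time / last_step_count
    E_max = int(max_wall_s / t_per_step)
    E_min = max(int(min_wall_s / t_per_step), floor_steps)

    state["current_epoch_len"] = int(min(max(E_target, E_min), E_max))
    return state["current_epoch_len"]
```

At the nominal 2 ms/step, a chain reporting $\hat\tau = 1000$ gets clamped up to the 5-minute floor (150,000 steps) rather than the 20,000-step target.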
7.4 Estimating $\tau_\text{local}$ on the Volunteer Machine¶
The volunteer machine estimates the autocorrelation time from its own epoch's samples using the automated windowing estimator. This is computed at the end of the epoch before submitting the checkpoint:
import numpy as np

def estimate_autocorrelation(chain_samples):
    """
    Estimate the integrated autocorrelation time using Sokal's
    automatic windowing procedure.

    chain_samples: array of shape (E, D) — one row per step, D parameters
    Returns: tau_hat (scalar, maximum over all parameters)
    """
    tau_per_param = []
    for d in range(chain_samples.shape[1]):
        x = chain_samples[:, d]
        x = x - x.mean()
        n = len(x)
        # FFT-based autocorrelation (zero-padded to avoid circular wraparound)
        f = np.fft.fft(x, n=2 * n)
        acf = np.fft.ifft(f * np.conj(f))[:n].real
        acf /= acf[0]
        # Sokal windowing: truncate at the first W with W >= C * cumulative_tau(W)
        C = 5.0
        cumtau = 0.5  # Lag-0 contribution
        for t in range(1, n):
            cumtau += acf[t]
            if t >= C * cumtau:
                break
        tau_per_param.append(2 * cumtau)  # tau = 1 + 2 * sum of positive lags
    return max(tau_per_param)
The maximum over parameters is reported because the slowest-mixing parameter dominates the effective epoch quality. This estimate is included in the checkpoint submission.
7.5 Initial Epoch Length (Before Any $\tau$ Estimate)¶
For the first epoch of a new run, use a conservative default:
$$E_\text{init} = 50{,}000$$
At $\sim 2$ ms per TTVFast call, this is $\sim 100$ s of CPU time — deliberately shorter than the 10-30 minute target, so the first $\tau$ estimate comes back quickly. After the first epoch returns with a $\tau$ estimate, the adaptive procedure takes over.
8. Full System Diagram¶
8.1 Mermaid Sequence Diagram¶
sequenceDiagram
participant V1 as Volunteer Machine (Chain i)
participant V2 as Volunteer Machine (Chain j)
participant BS as BOINC Scheduler
participant SS as Swap Server (Assimilator)
participant DB as SQLite Checkpoint Store
Note over SS,DB: Initialisation: Create run, seed K chain states in DB
SS->>BS: Create work unit WU_i (chain i, epoch 1, state θ_i)
SS->>BS: Create work unit WU_j (chain j, epoch 1, state θ_j)
BS->>V1: Dispatch WU_i (input: θ_i, T_i, transit data)
BS->>V2: Dispatch WU_j (input: θ_j, T_j, transit data)
Note over V1: Run E steps of tempered MCMC at T_i
Note over V2: Run E steps of tempered MCMC at T_j
V1->>BS: Submit result (θ_i', log L_i', samples, τ̂)
BS->>SS: Assimilator receives result for chain i
SS->>DB: Update ChainState[i] with θ_i', log L_i'
SS->>DB: Store checkpoint and samples
SS->>DB: Check: Is chain j SWAP_PENDING for chain i?
alt Chain j has not yet reported
SS->>DB: Set ChainState[i].swap_status = SWAP_PENDING
Note over SS: Wait up to W seconds for chain j
V2->>BS: Submit result (θ_j', log L_j', samples, τ̂)
BS->>SS: Assimilator receives result for chain j
SS->>DB: Update ChainState[j] with θ_j', log L_j'
SS->>DB: Chain i is SWAP_PENDING → execute swap logic
SS->>SS: Compute log α = (1/T_i - 1/T_j)(log L_j' - log L_i')
SS->>SS: Draw U ~ Uniform(0,1)
alt Swap accepted (log U < log α)
SS->>DB: Swap θ between chains i and j
SS->>DB: Log swap (accepted=True)
else Swap rejected
SS->>DB: Log swap (accepted=False)
end
SS->>DB: Reset swap_status for both chains to IDLE
SS->>BS: Create WU_i (epoch 2, state θ_i or swapped θ_j)
SS->>BS: Create WU_j (epoch 2, state θ_j or swapped θ_i)
else Swap window W expires before chain j reports
SS->>DB: Clear SWAP_PENDING for chain i
SS->>BS: Create WU_i (epoch 2, state θ_i' — no swap)
Note over SS: Chain j will submit later, gets dispatched then
end
BS->>V1: Dispatch next epoch work unit
BS->>V2: Dispatch next epoch work unit
Note over V1,V2: Cycle repeats until convergence criteria met
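The swap decision at the centre of this sequence reduces to a few lines. A minimal sketch of the log-form acceptance rule from Section 1.1; the function names are ours, not part of any BOINC API:

```python
import math
import random

def swap_log_alpha(log_L_i, log_L_j, T_i, T_j):
    """Log acceptance probability for exchanging states between the chains
    at temperatures T_i and T_j (log form of the PT swap rule)."""
    return (1.0 / T_i - 1.0 / T_j) * (log_L_j - log_L_i)

def accept_swap(log_L_i, log_L_j, T_i, T_j, rng=random):
    """Metropolis accept/reject for the proposed swap."""
    return math.log(rng.random()) < swap_log_alpha(log_L_i, log_L_j, T_i, T_j)
```

Note the rule is symmetric in the pair: if the hotter chain holds the better state, `log_alpha > 0` and the swap is always accepted.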
8.2 ASCII State Machine for a Single Chain¶
                ┌──────────────────┐
                │   INITIALISED    │
                │  (seeded state)  │
                └────────┬─────────┘
                         │
                         │ DispatchNextEpoch()
                         ▼
                ┌──────────────────┐
                │    DISPATCHED    │
                │   (WU sent to    │
                │   BOINC client)  │
                └────────┬─────────┘
                         │
            ┌────────────┼────────────┐
            │            │            │
       Volunteer     Volunteer  BOINC deadline
       completes      crashes      expires
            │            │            │
            ▼            │            ▼
     ┌──────────────┐    │   ┌──────────────────┐
     │  CHECKPOINT  │    │   │    TIMED_OUT     │
     │   RECEIVED   │    │   │   (chain lost)   │
     └──────┬───────┘    │   └────────┬─────────┘
            │            │            │
            │            │   Reassign to new
            │            │   volunteer from
            │            │   last checkpoint
            │            │            │
            │            └────────────┘
            │
     ┌──────┴────────────┐
     │                   │
Adjacent chain   No adjacent chain
 also recent         available
     │                   │
     ▼                   ▼
┌──────────────┐ ┌──────────────┐
│ SWAP_PENDING │ │     IDLE     │
│ (wait ≤ W s) │ │  (dispatch   │
└──────┬───────┘ │  next epoch) │
       │         └──────────────┘
   ┌───┴──────────┐
   │              │
Partner        Timeout
arrives      (W expires)
   │              │
   ▼              ▼
┌──────────┐  ┌──────────────┐
│   SWAP   │  │     IDLE     │
│ EXECUTE  │  │  (dispatch   │
│ (accept/ │  │  next epoch, │
│  reject) │  │   no swap)   │
└────┬─────┘  └──────────────┘
     │
     ▼
┌──────────────┐
│     IDLE     │
│  (dispatch   │
│  next epoch  │
│  with new or │
│  same state) │
└──────────────┘
8.3 Complete Message Flow Across All Components¶
┌─────────────┐    ┌──────────────┐    ┌──────────────────┐    ┌──────────┐
│  Volunteer  │    │    BOINC     │    │   Swap Server    │    │  SQLite  │
│   Machine   │    │  Scheduler   │    │  (Assimilator)   │    │    DB    │
└──────┬──────┘    └──────┬───────┘    └────────┬─────────┘    └────┬─────┘
       │                  │                     │                   │
       │ Poll for work    │                     │                   │
       │ ────────────────>│                     │                   │
       │                  │                     │                   │
       │ WU (input files) │                     │                   │
       │ <────────────────│                     │                   │
       │                  │                     │                   │
       │ [Run E MCMC      │                     │                   │
       │  steps locally]  │                     │                   │
       │                  │                     │                   │
       │ Upload result    │                     │                   │
       │ ────────────────>│                     │                   │
       │                  │ Validated result    │                   │
       │                  │ ───────────────────>│                   │
       │                  │                     │                   │
       │                  │                     │ UPDATE chain_state│
       │                  │                     │ ─────────────────>│
       │                  │                     │                   │
       │                  │                     │ INSERT checkpoint │
       │                  │                     │ ─────────────────>│
       │                  │                     │                   │
       │                  │                     │ INSERT samples    │
       │                  │                     │ ─────────────────>│
       │                  │                     │                   │
       │                  │                     │ CHECK swap partner│
       │                  │                     │ ─────────────────>│
       │                  │                     │ partner status    │
       │                  │                     │ <─────────────────│
       │                  │                     │                   │
       │                  │                     │ [Swap logic:      │
       │                  │                     │  accept/reject/   │
       │                  │                     │  pend/skip]       │
       │                  │                     │                   │
       │                  │                     │ INSERT swap_log   │
       │                  │                     │ ─────────────────>│
       │                  │                     │                   │
       │                  │                     │ Compute           │
       │                  │                     │ convergence diag. │
       │                  │                     │                   │
       │                  │                     │ INSERT convergence│
       │                  │                     │ ─────────────────>│
       │                  │                     │                   │
       │                  │ create_work         │                   │
       │                  │ <───────────────────│                   │
       │                  │ (next epoch WU)     │                   │
       │                  │                     │                   │
9. Failure Mode Specification¶
9.1 Volunteer Machine Disappears Mid-Epoch¶
Scenario: A volunteer running chain $k$'s epoch shuts down their computer, loses internet, or pauses the BOINC client.
Detection: BOINC's transitioner marks the work unit as timed out after the deadline (configured per work unit, typically $2 \times$ expected wall time + 24 hours buffer).
Server response:
- The chain's `chain_states` row retains the last checkpoint state (from the end of the previous epoch). No state is lost except the in-progress epoch's samples.
- The work unit is marked as ERROR in BOINC. If `target_nresults > 1`, BOINC automatically resends to another volunteer.
- If `target_nresults = 1` (our default for non-validation runs), the Swap Server detects the timeout and creates a new work unit from the last checkpoint state.
- Any pending swaps involving chain $k$ are cleared after $W$ seconds (by the SwapTimeoutDaemon).
- The partner chain continues normally — it simply missed a swap opportunity.
Impact: Loss of $E$ steps of compute (one epoch). No corruption of any chain state. The temperature ladder operates with $K-1$ chains until the new work unit is dispatched and picked up.
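BOINC's transitioner is the source of truth for work-unit timeouts, but the Swap Server can run its own sweep against the Section 6 schema as a backstop. A sketch; the function name and the caller-supplied threshold are illustrative assumptions:

```python
import sqlite3

def find_timed_out_chains(conn, run_id, deadline_s):
    """Return (chain_id, current_wu_id) pairs whose last checkpoint is older
    than deadline_s seconds, skipping chains mid-swap."""
    return conn.execute(
        """
        SELECT chain_id, current_wu_id
        FROM chain_states
        WHERE run_id = ?
          AND swap_status != 'SWAP_EXECUTING'
          AND last_checkpoint_ts IS NOT NULL
          AND (JULIANDAY('now') - JULIANDAY(last_checkpoint_ts)) * 86400 > ?
        """,
        (run_id, deadline_s),
    ).fetchall()
```

Each chain returned would get a fresh work unit created from its last checkpoint state, exactly as in the server response above.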
9.2 Server Goes Down During Swap Execution¶
Scenario: The Swap Server crashes after computing the swap acceptance but before writing the result to the database.
Detection: On restart, the server finds chains with swap_status = SWAP_EXECUTING in the database.
Recovery procedure:
- For any chain with `swap_status = 'SWAP_EXECUTING'`: reset to `IDLE` and dispatch from the pre-swap state (which is still in the database, since the swap write was not committed).
- Both chains involved in the interrupted swap restart from their pre-swap states. This is equivalent to the swap being rejected — both chains continue from their own states.
- The atomic unit of the swap operation is the database transaction that writes both chains' new states. Use an explicit SQLite transaction with `BEGIN IMMEDIATE` to ensure the swap is all-or-nothing:
BEGIN IMMEDIATE;
UPDATE chain_states SET theta = ?, log_likelihood = ?, log_prior = ?, swap_status = 'IDLE'
WHERE run_id = ? AND chain_id = ?; -- chain i receives chain j's state
UPDATE chain_states SET theta = ?, log_likelihood = ?, log_prior = ?, swap_status = 'IDLE'
WHERE run_id = ? AND chain_id = ?; -- chain j receives chain i's state
INSERT INTO swap_log (...) VALUES (...);
COMMIT;
If the server crashes before COMMIT, the transaction is rolled back and both chains retain their pre-swap states.
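Driven from Python's `sqlite3`, the same all-or-nothing guarantee looks like the sketch below (written against the Section 6 schema; `execute_swap` is an illustrative name, and the `swap_log` insert is elided with a comment):

```python
import sqlite3

def execute_swap(conn, run_id, i, j):
    """Atomically exchange the states of chains i and j. If anything fails
    before COMMIT, SQLite rolls the whole swap back."""
    conn.isolation_level = None  # manage transactions explicitly
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")
    try:
        rows = {c: cur.execute(
            "SELECT theta, log_likelihood, log_prior FROM chain_states "
            "WHERE run_id = ? AND chain_id = ?", (run_id, c)).fetchone()
            for c in (i, j)}
        # Each chain receives its partner's theta, log L, and log prior
        for dst, src in ((i, j), (j, i)):
            cur.execute(
                "UPDATE chain_states SET theta = ?, log_likelihood = ?, "
                "log_prior = ?, swap_status = 'IDLE', swap_partner = NULL "
                "WHERE run_id = ? AND chain_id = ?",
                (*rows[src], run_id, dst))
        # (INSERT INTO swap_log ... goes here, inside the same transaction)
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
```

`BEGIN IMMEDIATE` takes the write lock up front, so a concurrent checkpoint handler blocks until the swap transaction finishes (Section 9.3).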
9.3 Two Chains Arrive Simultaneously for Swap¶
Scenario: Chains $i$ and $j$ (adjacent in the temperature ladder) both submit checkpoints at the same instant, and both are processed by the server concurrently.
Prevention: The SwapLock mutex serialises all swap operations. Only one HandleCheckpoint procedure executes the swap-decision logic at a time. The first checkpoint to acquire the lock is processed first; the second acquires the lock after the first releases it, finds the partner's status already updated, and proceeds accordingly.
In practice (SQLite): SQLite's write lock provides natural serialisation for the swap logic. Wrap the entire HandleCheckpoint procedure in a single database transaction. The second concurrent call blocks on the write lock until the first completes.
9.4 Swap Window Expires for a Pending Chain While Its Partner Is In Transit¶
Scenario: Chain $i$ submits a checkpoint and enters SWAP_PENDING. The swap window $W$ expires. Chain $i$ is dispatched without a swap. Moments later, chain $j$ submits its checkpoint.
Handling: When chain $j$'s checkpoint arrives, the server checks chain $i$'s status and finds it IDLE (already dispatched). Chain $j$ enters SWAP_PENDING if another adjacent chain is available, or is dispatched immediately without a swap.
Impact: One missed swap opportunity. This is the expected behaviour and is by design. The protocol tolerates missed swaps gracefully.
9.5 Volunteer Returns Corrupted or Adversarial Results¶
Scenario: A volunteer's hardware has a floating-point error, or an adversarial volunteer submits fabricated results.
Detection:
- Sanity checks on checkpoint: The server validates that $\log \mathcal{L}$ is finite and within a plausible range (e.g., $> -10^6$). The parameter vector $\theta$ must be within prior bounds. If these checks fail, reject the result and redispatch.
- BOINC redundancy: For critical runs (publication-quality inference), set `target_nresults = 2` and use the KS-test validation described in [[BOINC Integration Plan]], Section 3. Two independent volunteers must produce statistically consistent chains.
- Statistical anomaly detection: If a chain's log-likelihood suddenly jumps by $> 100$ nats between consecutive checkpoints (more than any plausible MCMC move), flag the result for manual review.
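The first and third checks are mechanical. A sketch of a checkpoint validator with the thresholds from this section; the constants and function name are assumptions to tune per run:

```python
import math

LOG_L_FLOOR = -1e6      # plausible log-likelihood floor (Section 9.5)
MAX_JUMP_NATS = 100.0   # flag-for-review threshold

def validate_checkpoint(theta, log_L, prior_lo, prior_hi, prev_log_L=None):
    """Return (ok, reason). A failed check means reject-and-redispatch;
    a large jump only flags for manual review rather than rejecting."""
    if not math.isfinite(log_L) or log_L < LOG_L_FLOOR:
        return False, "implausible log-likelihood"
    if any(not (lo <= t <= hi)
           for t, lo, hi in zip(theta, prior_lo, prior_hi)):
        return False, "theta outside prior bounds"
    if prev_log_L is not None and abs(log_L - prev_log_L) > MAX_JUMP_NATS:
        return True, "flag: log-likelihood jump > 100 nats"
    return True, "ok"
```

Running this before the `chain_states` update means a corrupted result never becomes the chain's canonical state.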
9.6 All Volunteers Disappear (Total Fleet Loss)¶
Scenario: Every active volunteer machine goes offline (e.g., a BOINC client update causes a mass disconnection).
Server response: All chains stall. The server's SwapTimeoutDaemon clears all pending swaps. All chain states are preserved in the database. When volunteers return, the server re-dispatches all chains from their last checkpoint states.
Impact: The inference run pauses. No state is lost. This is the normal operating mode of BOINC β projects routinely experience periods of low volunteer availability.
10. Protocol Parameter Summary¶
| Parameter | Symbol | Recommended Value | Rationale |
|---|---|---|---|
| Number of chains | $K$ | 18 | Geometric ladder from $T=1$ to $T=2500$ with 30% swap rate |
| Cold temperature | $T_1$ | 1.0 | Targets true posterior |
| Hot temperature | $T_K$ | 2500 | Flattens likelihood across prior volume for TTV problem |
| Geometric ratio | $r$ | 1.60 | Yields ~30% swap acceptance for $D=9$ |
| Epoch length (initial) | $E_\text{init}$ | 50,000 steps | ~100 s wall time at 2 ms/step |
| Epoch length (adaptive range) | $E$ | 5,000 -- 900,000 | 5-30 min wall time, with an absolute floor of 5,000 steps |
| Swap window | $W$ | 300 s (5 min) | Tolerance for volunteer arrival jitter |
| Stall threshold | $M$ | 500,000 steps | 10 epochs at default $E$ |
| Convergence tolerance | $\varepsilon$ | 0.05 | $\Delta\mu / \sigma < 5\%$ over forgetting window |
| Forgetting window | $F$ | 20 checkpoints | Look-back for stability assessment |
| Minimum ESS | $N_\text{min}$ | 1,000 | For publication-quality posteriors |
| R-hat threshold | $\hat{R}$ | < 1.01 | Rank-normalised split R-hat |
| KS p-value threshold | | > 0.05 | Adjacent chain distribution overlap |
| Thinning factor | | 50 | Store every 50th sample |
| BOINC WU deadline | | $4 \times$ expected wall time | Buffer for slow volunteers |
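For deployment, the table can be mirrored as a single frozen config object (values taken from the table; the class and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PTProtocolConfig:
    n_chains: int = 18               # K
    T_cold: float = 1.0              # T_1
    T_hot: float = 2500.0            # T_K
    geometric_ratio: float = 1.60    # r
    epoch_init: int = 50_000         # E_init
    epoch_min: int = 5_000
    epoch_max: int = 900_000
    swap_window_s: float = 300.0     # W
    stall_threshold: int = 500_000   # M
    convergence_tol: float = 0.05    # epsilon
    forgetting_window: int = 20      # F (checkpoints)
    min_ess: float = 1_000.0
    r_hat_max: float = 1.01
    ks_p_min: float = 0.05
    thinning: int = 50
    wu_deadline_factor: float = 4.0
```

As a consistency check on the table itself, $T_1 \cdot r^{K-1} = 1.6^{17} \approx 2950$, which lands near the quoted ladder top $T_K = 2500$ to within ladder rounding.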
11. Relationship to Existing Vault Notes¶
This protocol operationalises the parallel tempering strategy described as "Option D (best for Stage 1)" in [[BOINC Integration Plan]], Section 2.1, which proposed: "Each temperature chain is an independent work unit. After all complete, implement the PTSampler swap step server-side."
The original proposal assumed synchronous completion ("after all complete"), which is unrealistic for BOINC. This document replaces that assumption with an asynchronous swap protocol that tolerates arbitrary volunteer timing while preserving the statistical guarantees of PT-MCMC.
The MCMC implementation details (TTVFast integration, parameter space, priors) follow [[Reverse N-Body and Astrometric Parallax]], Sections 1-4. The compute budget estimates from that document ($\sim 2$ ms/call, $\sim 15$ min per 2-planet system on 1 core) inform the epoch length and temperature ladder sizing here.
The broader compute architecture context is in [[Distributed Computing Strategy]], which positions BOINC as a Stage 2-3 tool. This protocol is designed for Stage 2 deployment, when the number of target systems exceeds what can be handled on a single VPS.
See also: [[BOINC Integration Plan]] for server setup and work unit packaging. See also: [[Distributed Computing Strategy]] for the full compute architecture. See also: [[Reverse N-Body and Astrometric Parallax]] for the TTV posterior geometry.