A mathematical flaw in the Dutch Book Argument for P(H)=1/3 in the Sleeping Beauty Thought Experiment

_The common frequentist argument for the thirder solution to the Sleeping Beauty problem claims that assigning P(H)=1/3 maximizes correct guesses over repeated trials. This relies on averaging correct awakenings across repeated trials of the experiment. If we measure Beauty's "success" in purely binary terms, the advantage disappears. Thus, the frequentist argument does not support P(H)=1/3._

Beauty agrees to participate in the following experiment. On Sunday, she is put to sleep. A fair coin is then tossed. If the coin lands Heads, she is awakened on Monday only. If the coin lands Tails, she is awakened on both Monday and Tuesday. After each awakening, her memory of previous awakenings is erased. Whenever Beauty awakens, she knows the protocol of the experiment but does not know which day it is or how the coin landed.

The central question is: what probability should Beauty assign to Heads upon awakening? One answer is $P(H)=1/2$. Since the coin is fair and Beauty is guaranteed to be awakened regardless of the outcome, awakening appears to provide no information about the coin toss. Another answer is $P(H)=1/3$. Since repeated trials produce one Heads-awakening for every two Tails-awakenings, only one-third of awakenings occur in Heads worlds. Beauty's uncertainty about her present awakening therefore seems to favor the thirder position.

The appeal of the thirder solution is strengthened by a frequentist consideration. Suppose Beauty is required to guess the outcome of the coin toss whenever she awakens. Over many repetitions of the experiment, always guessing Tails results in more correct guesses than always guessing Heads. Since Tails produces two awakenings while Heads produces only one, the strategy of guessing Tails appears to maximize success. This observation forms the basis of a standard Dutch-book-style argument for assigning $P(H)=1/3$.
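The awakening count behind this observation can be reproduced with a short simulation. The following is a minimal sketch, not part of the original argument: the function name `correct_awakening_guesses`, the seed, and the number of trials are illustrative assumptions, and Beauty is assumed to repeat the same fixed guess at every awakening.

```python
import random

def correct_awakening_guesses(guess, n_trials, seed=0):
    """Count awakenings on which a fixed guess ('H' or 'T') matches the coin.

    Hypothetical helper for illustration; assumes the same guess is
    given at every awakening.
    """
    rng = random.Random(seed)  # seeded so both strategies see the same tosses
    correct = 0
    for _ in range(n_trials):
        coin = rng.choice("HT")               # fair coin toss
        awakenings = 1 if coin == "H" else 2  # Heads: Monday only; Tails: Monday and Tuesday
        if guess == coin:
            correct += awakenings             # one correct guess per awakening
    return correct

n = 10_000  # number of complete runs of the experiment (an arbitrary choice)
heads_score = correct_awakening_guesses("H", n)
tails_score = correct_awakening_guesses("T", n)
print("always Heads:", heads_score, "always Tails:", tails_score)
```

Because each Tails toss is counted twice, the always-Tails total comes out at roughly twice the always-Heads total. Note that this tally counts awakenings, not runs of the experiment.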

The argument can be stated informally as follows. If probabilities are interpreted in terms of long-run frequencies, then Beauty's credences should track the outcomes that lead to the greatest frequency of correct guesses. Because Tails-awakenings occur twice as often as Heads-awakenings, a credence of 1/3 in Heads appears to align with the frequencies generated by repeated trials of the experiment. The remainder of this paper argues that this reasoning depends on an unjustified measure of success and that, once success is defined in terms of genuine trial outcomes, the argument loses its force.

Before evaluating the frequentist argument, we must clarify what constitutes a trial of the Sleeping Beauty experiment. We shall use the term trial to mean one complete execution of the experimental protocol, beginning with the coin toss and ending after all awakenings associated with that toss have occurred. Thus, each performance of the Sleeping Beauty experiment, taken from start to finish, counts as exactly one trial, regardless of whether it contains one awakening or two.

We shall make one assumption. If there is a correct way for Beauty to guess, then after a trial has been completed it must be possible to answer whether she guessed correctly. The answer must be either "yes" or "no". Otherwise, it is unclear in what sense the guess can be said to be correct. Therefore, each trial must have a binary answer as to whether Beauty guessed correctly.

This assumption agrees with the standard frequentist view. Let $S$ be the sample space of a repeatable experiment and $E \subseteq S$ an event. The frequentist probability of $E$ is defined as the limiting relative frequency with which $E$ occurs across repeated trials. In each trial, $E$ either occurs or does not occur [1]. Thus, probability is defined in terms of a binary event evaluated once per trial, not an average of quantities within a trial. The frequentist case for thirding therefore depends on identifying the relevant event and how often it occurs.
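For concreteness, the limiting relative frequency can be written out as follows. The symbols $n$ (number of trials), $s_i$ (outcome drawn on trial $i$), and the indicator $\mathbf{1}[\cdot]$ are notation assumed here for illustration and do not appear in the original text.

$$
P(E) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[s_i \in E],
\qquad
\mathbf{1}[s_i \in E] =
\begin{cases}
1 & \text{if } s_i \in E, \\
0 & \text{if } s_i \notin E.
\end{cases}
$$

Each trial contributes either 0 or 1 to the sum, so no single trial can add more than one unit to the count.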

The Dutch book argument does not follow this framework. Instead of assigning each trial a binary value according to whether the relevant event occurred, it assigns a numerical score equal to the number of correct awakenings. A Tails trial may therefore contribute two units of success while a Heads trial contributes only one. Success is no longer evaluated on a trial-by-trial basis. Consequently, additional success accumulated within one trial is allowed to carry over into the long-run average. The Dutch book argument is therefore not counting successful trials; it is counting successful awakenings.
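The difference between the two ways of counting can be made explicit by computing both scores exactly. The sketch below is illustrative only: the names `expected_awakening_score` and `expected_binary_score` are hypothetical, and the binary score assumes that a trial counts as a success exactly when Beauty's fixed guess matches the coin.

```python
from fractions import Fraction

P_COIN = {"H": Fraction(1, 2), "T": Fraction(1, 2)}  # fair coin
AWAKENINGS = {"H": 1, "T": 2}                         # awakenings per trial

def expected_awakening_score(guess):
    """Expected number of correct awakenings per trial for a fixed guess."""
    return sum(p * AWAKENINGS[coin] for coin, p in P_COIN.items() if coin == guess)

def expected_binary_score(guess):
    """Probability that a trial is a success (assumed: guess matches the coin)."""
    return sum(p for coin, p in P_COIN.items() if coin == guess)

for guess in "HT":
    print(guess, expected_awakening_score(guess), expected_binary_score(guess))
```

Under these assumptions, always guessing Tails scores 1 correct awakening per trial against 1/2 for always guessing Heads, while both strategies succeed on exactly 1/2 of trials under the binary measure.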

[1] Say on each trial you draw $s \in S$. Either $s \in E$ or $s \notin E$.