At INR3 in Kamloops I spoke on applying Bayesian logic to the study of Jesus along with the same principles we apply to dead religions (so as to avoid the “don’t offend the Christians” reaction to controversial claims…claims that would not be controversial if Jesus was not the object of worship of billions of loud, influential people). In Q&A philosopher Louise Antony challenged my application of Bayes’ Theorem to historical reasoning with a series of technical complaints, especially two fallacies commonly voiced by opponents of Bayesianism. I was running out of time (and there was one more questioner to get to) so I explained that I answer all her stated objections in my book Proving History (and I do, at considerable length).
But I thought it might be worth talking about those two fallacies specifically here, in case others run into the same arguments and need to know what’s fishy about them. The first was the claim that prior probabilities (the first pair of premises in Bayesian logic) are wholly arbitrary and therefore Bayesian arguments are useless because you can just put any number in that you want for the priors and therefore get any conclusion you want. The second was the claim that the probabilities of the evidence (the second pair of premises in Bayesian logic) are always 1 (i.e. 100%) and therefore Bayes’ Theorem cannot determine anything at all as to which competing hypothesis is true.
Never mind that if Antony was right about these points, then all Bayesian arguments and all Bayesian conclusions in scientific and industrial and public policy research would be bogus, not just the theorem’s application to history, and thousands of scientists and mathematicians would have to be engaged in a conspiracy to conceal this from the public, like some cabal of astrologers. Put that obvious objection aside. Because, though also true, it doesn’t educate. It’s important to understand why thousands of scientists and mathematicians reject claims like hers. Because such claims are mathematically illiterate. And philosophers should not want to be mathematically illiterate…especially if they want to issue critiques of mathematics.
And here, neither of the two claims just summarized is true.
1. The Fallacy of Arbitrary Priors
As to the first claim, no, you can’t put just any prior in. You have to justify your priors. You might only be able to justify priors with uncomfortably wide margins of error, or you might not be able to justify any prior better than 50/50, for example, but these are not always the case, and in any event are still not arbitrary. They are compelled by the state of the background evidence, or b. Because all the premises in Bayes’ Theorem (all the terms in the equation) are conditional on b. This should be evident from the mathematical notation:
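P(h|e.b) = [ P(h|b) x P(e|h.b) ] / ( [ P(h|b) x P(e|h.b) ] + [ P(~h|b) x P(e|~h.b) ] )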
See how the letter “b” is in every single term in the equation? The vertical bar separating the two halves of each term within the parentheses is the mathematical symbol for conditional probability. Thus P(h|b), which is the prior probability of the hypothesis being true (as opposed to P(~h|b), which is the prior probability of the hypothesis being false), means the probability (P) of the hypothesis (h) conditional on our background knowledge (b). To insert just any number here would be to simply disregard all the information in b. Which is cheating, not honest Bayesian reasoning. The mathematical notation requires your input to reflect b. That is why b is there. Your use of the equation is invalid if you ignore it.
I discuss how b constrains your input at the stage of assigning priors in Proving History, pp. 70-71, 79-85, 110-14, 229-80. But there is an even more fundamental point here, which is that b is simply always a previous e (hence often b is called old evidence and e is called new evidence), and therefore every prior probability is in fact the posterior probability of a previous run of the equation (whether actually or hypothetically). This is why e and b can be demarcated any way you want (as long as they don’t overlap), and the equation will always come out the same (if it doesn’t, you’ve done something wrong).
So you could hypothetically start at a state of zero empirical knowledge, where b contains only a priori knowledge (e.g. logic and mathematics) and your priors are therefore 50/50 (when there is literally no evidence yet to favor h over ~h, or vice versa, that logically entails P(h|b) = P(~h|b) = 0.5), and then start adding evidence into e one bit at a time. First you add one item, run the equation, and see how the prior (starting at 0.5) gets adjusted by that item of evidence (if at all). The posterior probability that results then becomes the prior probability in the next run of the equation, when you add one more item of evidence into e (the previous item having now been folded into b by the run you just completed, so what was e is now part of b, and in the new run e contains only the new item of evidence you are adding). And so on, until you’ve entered all known evidence (all relevant human knowledge). It’s just tedious to do it this way and thus usually unnecessary. But you could do it this way. And your result will be no different (and again, if it is, you’ve done something wrong).
This is called the method of iteration in Proving History (check the index, “iteration, method of”). But the point is that priors are not mysterious values pulled out of thin air: they are simply the posterior probabilities that result from prior ratios of likelihoods. And that means they are constrained. They are constrained by those ratios of likelihoods (the likelihood ratios of all prior evidence). You thus can’t just input any number you want. Your input must reflect what this iteration from zero empirical knowledge would produce (if you had the inordinate time to sit down and complete it). It therefore must reflect background evidence. Starting with a ballpark prior (ideally from some reference class that gives you a defensible ratio) is just an easier way to do it. But in no way should this conceal the fact that this ballpark estimate must be derivable from an iterated run of all prior evidence from an initial neutral prior of 50%. Thus, b constrains P(h|b).
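To make the method of iteration concrete, here is a minimal sketch in Python (the evidence items and likelihood values are hypothetical, invented purely for illustration): start from the neutral 0.5 prior, fold in one item of evidence per run, and let each posterior become the next run’s prior.

```python
def update(prior, p_e_given_h, p_e_given_not_h):
    """One run of Bayes' Theorem: returns P(h|e.b) from the prior P(h|b)
    and the likelihoods P(e|h.b) and P(e|~h.b)."""
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1 - prior) * p_e_given_not_h)

# Hypothetical items of evidence, each as (P(e|h.b), P(e|~h.b)):
evidence = [(0.9, 0.3), (0.6, 0.6), (0.2, 0.8)]

prior = 0.5  # neutral starting point: no empirical evidence yet in b
for p_e_h, p_e_not_h in evidence:
    prior = update(prior, p_e_h, p_e_not_h)  # the posterior becomes the new prior
    print(round(prior, 3))
```

Note how the second item, being equally likely on h and on ~h, leaves the probability unchanged: only differential likelihoods move the prior, which is exactly why the final result is constrained by (and only by) the evidence folded in.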
Priors are therefore not arbitrary–at least in the hands of any honest Bayesian. Abuses of logic by the unscrupulous or the incompetent do not serve to challenge the validity of logic, and neither do abuses of Bayes’ Theorem challenge its validity. Priors may still be subjective, but that’s not the same thing as arbitrary–and perhaps the fallacy here derives from confusing the two. On why they should not be confused, check the index of Proving History, “subjective priors, problem of.”
2. The Fallacy of Foregone Probability
I can only assume what Antony meant when she said the probability of the evidence is always 1 is this weird notion I’ve heard from several anti-Bayesians who don’t understand Bayes’ Theorem or even the concept of epistemic probability at all: that the probability of an event after it has occurred is 100% (as otherwise it wouldn’t have happened), therefore the probability of all events is ultimately 100%, and therefore always 1. This is easier to understand on an assumption of causal determinism, where there was always only ever one outcome, and our assigning it a probability is just a consequence of our inevitable ignorance of all the conditions. But often it’s framed as “since you are observing e, the existence of e is well nigh 100% certain, regardless of what caused it,” therefore P(e|h) and P(e|~h) are always 1, because e is observed and therefore certainly exists, whether h is true or not.
(I have heard and read these arguments several times before, but cannot at present locate an example online; if anyone knows of one, do please post its URL in comments.)
Antony might have meant, instead, that hypotheses can always be gerrymandered so that the probability of the evidence is 1. But that would not mean P(e|h) is always 1, only that it can always be forced to equal 1 with some elaboration. And even then, such a tactic cannot ignore the mathematical consequences of such gerrymandering for the prior probability. The more you gerrymander a theory to raise its consequent probability, the more you lower that theory’s prior probability, often to no net gain. This is the basic logic of Ockham’s Razor, and I discuss the principle of gerrymandering underlying it in Proving History, pp. 80-81 (see also in the index, “gerrymandering (a theory)”). In essence, you can’t gerrymander P(e|h) to equal 1 without “paying for it” with a reduction in P(h|b). So you would just be moving probabilities around in the equation and not actually getting any better result. Hopefully Antony was aware of that and thus not making this argument.
But if Antony meant the first argument (or some variant of it), then that would mean she does not understand the conditional nature of Bayesian probabilities. She would also be confusing a physical with an epistemic probability. Bayes’ Theorem can operate with physical probabilities without running afoul of the “foregone probability” conundrum, since even though (e.g.) on causal determinism the outcome of a die roll is foregone, it still produces frequencies of outcome, and that is what physical probability measures (so even then the alleged problem does not arise, except for omniscient beings perhaps, but they would have no need of Bayes’ Theorem because they already know everything: Proving History, p. 332, n. 43).
But Bayes’ Theorem is usually employed with epistemic probabilities, i.e. the probability that some belief about the world is true (which can always be any value less than 1, even when the physical probability is 1). See Stephen Law’s remarks on the distinction between epistemic and “objective” probability, or what I call physical probability in Proving History, pp. 23-26 (I have concluded that calling the latter an “objective” probability is confusing and should be abandoned: ibid., p. 297, n. 4; I also argue all epistemic probabilities are estimates of physical probabilities, but only through the logic of Bayesian reasoning itself: ibid., pp. 265-80; BTW, a better term for those uncomfortable with presuming physicalism is “existential” probability: see David Hawkins, “Existential and Epistemic Probability,” Philosophy of Science 10.4 [Oct. 1943]: 255-61).
But the key element to understand here is that the probabilities in a Bayesian equation are all conditional probabilities. That is why the term for evidential likelihood reads P(e|h.b) and not P(e). Even in short forms of the equation, where you find P(e) in the denominator (which is really P(e|b); mathematicians often drop elements that appear in every term, like b, since they already know those elements are there and don’t need to be reminded of it, although laymen often won’t know that, so I tend to avoid that kind of abbreviated notation myself), P(e) is only standing in for the long-form denominator of [P(h) x P(e|h)] + [P(~h) x P(e|~h)], in accordance with the principle of total probability. So it’s still a conditional probability: what is being measured is the probability of the evidence (e) conditional on the hypothesis being true (h) in one case and conditional on the hypothesis being false (~h) in the other case. The notation for the one is P(e|h); for the other, P(e|~h); for both combined, P(e).
Thus, the probabilities being asked for at this stage (the probabilities you must enter as premises) are the probability of the evidence if the hypothesis is true and the probability of the evidence if the hypothesis is false. As conditional probabilities, these are probabilities conditional on h being true or false. The difference between those two probabilities is precisely the extent to which the evidence e supports one hypothesis over another (if such it does). The actual existential probability of the evidence is completely irrelevant to this–except insofar as what we know about the existential probabilities informs our epistemic probabilities (see Proving History, pp. 265-80), but even then it cannot be the case that the existential probability is always 1 (it certainly can be 1, but not always), because if h is true and ~h is false (existentially, i.e. unknown to us), then the existential probability of the observed evidence on ~h is not at all likely to be the same as the existential probability on h, as if every possible past would have resulted in exactly the same future–a notion that would make a mockery of the whole of physics. Imagine mixing any two chemicals in a glass and no matter what two chemicals you mix you always only ever end up with a glass of water; now imagine everything worked that way, such that causation was wholly irrelevant to the course of history. That’s what it would mean if the probability of the evidence was always 1.
Hence we are not asking what the probability is of history having turned out as it did (which on causal determinism is always 1, i.e. 100%) or what the probability is of e existing when we observe e (which is typically quite close to 100%). We are asking what the probability is of history having turned out as it did if certain causes were in place (the causes we are hypothesizing), and what that probability would be if those causes were not in place (and some other set of causes were in place instead). One of these is necessarily a counterfactual. Since either h or ~h must be true, not both, the probability of the evidence on one of those two possibilities cannot be the probability of history having turned out as it actually did, because one of them didn’t even happen, and thus was not even involved in history having turned out as it actually did (which, incidentally, is what we are trying to find out with Bayes’ Theorem: whether h or ~h is what actually happened).
What goes into Bayes’ Theorem, therefore, is not the probability that an event e occurred given that we observe e, which would be P(e|o) where o = an observation of e, a probability that is usually close to 100% (barring hallucination and such). Rather, what goes into Bayes’ Theorem is the probability of e occurring given the occurrence of h, hence P(e|h) where h = a hypothesized system of events prior to e, a probability that is often not even close to 100%. And we must also put into the equation P(e|~h) where ~h = any other possible system of events prior to e except h, a probability that will only be the same as P(e|h) when e is equally expected regardless of whether h is true or ~h is true instead. In that case, e is not evidence for either h or ~h, because it’s just as likely to have appeared on either possibility. But if, say, e rarely results from a system of causes like h (say, only 1 in 100 times), yet often results from some other system of causes (say, on any other system of causes, e will result 80 in 100 times), then P(e|h) = 0.01 and P(e|~h) = 0.80, neither of which is 1.
There are other occasions where a consequent probability can fall out as 1 that don’t relate to this fallacy. For example, a consequent probability can become 1 after factoring out coefficients of contingency, mathematically reducing the remaining consequent to 1 (Proving History, pp. 215-19, with pp. 77-79; and check the index for “coefficient of contingency”), and often a value of simply 1 will be substituted for approximations to 1 merely out of convenience, since the difference between a precise number and an approximate number is so small it won’t even show up in the math at the resolution you are working with (e.g., Proving History, pp. 85-88, 221-22, etc.), especially in history, where the margins of error are often so wide that they wash out any such small differences in estimates, rendering those differences moot to the outcome.
Concluding with an Example
Medical diagnosis affords many examples of how conditional probability works here, and likewise the background-evidence dependency of prior probability. Suppose we have a test for cancer that detects cancer 95% of the time (meaning 5% of the time, when there is cancer, it misses it, producing a negative result) and that gives a false positive 1% of the time (meaning that 1 out of every 100 times that the test is taken by someone without cancer, the test falsely reports they have cancer anyway). And suppose we know from abundant data that 2% of the population has cancer. We take the test, and it reports a positive result. What is the probability that we have cancer?
Here “positive result” (POS) is the evidence (e) and “we have cancer” is the hypothesis (h). The frequency data (those 95% and 1% and 2% figures just cited) derive from our background knowledge (b), e.g. millions of prior tests, and prior data on cancer rates in the population we belong to. Bayes’ Theorem then tells us (I will here leave b out of the notation, but remember it is in every term):
P(h|POS) =
[P(h) x P(POS|h)] / ( [P(h) x P(POS|h)] + [P(~h) x P(POS|~h)] ) =
[0.02 x 0.95] / ( [0.02 x 0.95] + [0.98 x 0.01] ) =
0.019 / (0.019 + 0.0098) =
0.019 / 0.0288 =
0.66 (rounded) =
66%
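The arithmetic above can be checked with a few lines of code (a minimal sketch; the 2%, 95%, and 1% figures are just the ones stipulated in the example):

```python
p_cancer = 0.02          # prior P(h|b): base rate of cancer in the population
p_pos_if_cancer = 0.95   # P(e|h.b): true-positive rate (1 minus the false-negative rate)
p_pos_if_healthy = 0.01  # P(e|~h.b): false-positive rate

numerator = p_cancer * p_pos_if_cancer
posterior = numerator / (numerator + (1 - p_cancer) * p_pos_if_healthy)
print(round(posterior, 2))  # ~0.66: about a 2 in 3 chance of cancer given a positive result
```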
A positive result from this test thus gives us a mere 2 in 3 chance of actually having cancer (despite this test being “95% accurate” or even “99% accurate,” depending on whether you are referring to its false-negative rate or its false-positive rate). Note that it would make no sense to say that the probability of the evidence (observing a positive result from the test) is always 100% no matter what. That would amount to saying that the test always gives a positive result whether there is cancer or not. Which would mean the test was completely useless. You may as well test for cancer using tea leaves or astrology.
So obviously it makes no sense to say that the probability of the evidence is always 100% in Bayesian reasoning. Certainly it is not. What we want to know in this case, for example, is the probability that a positive result is caused by something other than cancer (and is thus a false positive), and a test that does that 1 in 100 times gives us that probability: it’s 0.01 (or 1%). That is not 1. Even if causal determinism is true, it’s still not 1. Because this is not the probability that the test came out positive given that we observe the test having come out positive; it’s the probability that the test would come out positive if we do not have cancer. So, too, for the other consequent probability, P(e|h), which is the probability of observing e (a positive result) if we do indeed have cancer…which from prior observations we know is 0.95 (or 95%). That is also not 1.
Likewise, notice that we can’t just insert any prior probability we want here, either. What we insert has to derive from prior evidence, namely prior observations of the test giving false negatives and false positives and prior observations of how many people tend to have cancer at any given time. All those prior observations constitute background knowledge (b) that constrains the prior probability, in this case to a very narrow range (if we have a lot of data, that will be a tiny margin of error around a value of 2%).
It is here that one can question whether we can use this tool on much scarcer data, as in the case of history. In Proving History I prove that we can, as long as we accept correspondingly large margins of error and degrees of uncertainty.
In the case of priors, we might not have thousands of data points to work with–maybe only ten, let’s say–but there are ways to mathematically work with that. Likewise if we have millions of data points and could never systematically enumerate them (e.g. general knowledge of human behavior) and thus have to make a best guess from what we can observe: there is a way to do this, mathematically, that accounts for the degree of error it introduces. And it might be less clear what reference class we should start from, and how we estimate data frequencies in that class might often be subjective, but the latter only introduces uncertainties that we can again define mathematically, and the former ultimately won’t matter as long as we leave no evidence out of the sum of b and e. For by iteration, no matter what we start with in b (no matter what reference class we draw a ratio from for our initial priors), we will always end up with the same final result once we’ve put everything else in e (as we are obligated to do).
In the case of consequents, too little or too much data can be dealt with mathematically in the same way as for priors. Likewise, the expectation of outcomes given our hypothesis, and then given other alternative explanations of the same evidence, might be subjective, but that too only introduces uncertainties that we can define mathematically. Historians can then debate the basis for whatever value you assign to either of the two consequent probabilities (also known as the likelihoods). If it has no basis, then the assignment is unwarranted. Otherwise, it has whatever warrant its basis affords. And when we argue a fortiori, using a fortiori premises (Proving History, pp. 85-88), we can easily reach agreement among all honest and competent observers on an a fortiori result (Proving History, pp. 88-93, 208-14).
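To illustrate how arguing a fortiori with wide margins of error can work in practice (a sketch of the general idea only, not the specific procedure laid out in Proving History; all the interval values are invented), one can feed in the bounds least favorable to the hypothesis and see whether the conclusion still follows; every more favorable estimate can then only strengthen the result.

```python
def posterior(prior, p_e_h, p_e_not_h):
    """P(h|e.b) from a prior and the two likelihoods."""
    return (prior * p_e_h) / (prior * p_e_h + (1 - prior) * p_e_not_h)

# Hypothetical margins of error for each premise:
prior_range     = (0.3, 0.6)   # P(h|b)
p_e_h_range     = (0.7, 0.9)   # P(e|h.b)
p_e_not_h_range = (0.05, 0.2)  # P(e|~h.b)

# A fortiori case: lowest prior, lowest P(e|h), highest P(e|~h) -- everything against h
worst_case = posterior(prior_range[0], p_e_h_range[0], p_e_not_h_range[1])
best_case  = posterior(prior_range[1], p_e_h_range[1], p_e_not_h_range[0])
print(round(worst_case, 3), round(best_case, 3))  # here even the worst case stays above 0.5
```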
And that is actually what historians already do anyway. They just aren’t aware of the mathematical formalism that justifies what they are doing, or that exposes when they are doing it wrong (i.e. illogically or fallaciously).
Thus, it’s time to retire the fallacies of arbitrary priors and foregone likelihoods. They do not constitute a valid critique of Bayesianism in general or of Bayesianism in history.
R. Joseph Hoffmann argued something like this in a post of his last year:
It looks like it comes from equating probability with prediction. You can’t predict that it will rain after it’s already raining. Meaning that probability can’t apply to past events since all past events have a probability/”pre”diction of 1.
What do you think about Yudkowsky’s argument that epistemic probabilities have to be less than 1 and greater than 0 unless you have infinite certainty?
First: Yudkowsky is right (priors of 1 and 0 apply only in analytical logic, i.e. they represent logical necessity and impossibility; and yet even in a priori logic we can never be epistemically certain: see my remarks on this very point in Proving History, pp. 24-25…and that’s even before introducing Gödel).
Second: That quote from Hoffmann is a gem of unintelligibility. Thanks for that. I had read that before and forgot about it here. Mostly because it was ignorant gibberish. Amusingly, he thinks B is a probability when in fact it’s an item of evidence, the probability of which is P(B), not B. And in Bayes’ Theorem, such a term is always conditional on A or ~A (and thus not at all what he seems to think, which sounds like a value conditional on “having happened” or some such nonsense, which does indeed look like the fallacy of foregone probability, although it’s really hard to figure out what he thinks he is saying there).
To draw on my article, his remark that “we tend to think of conditional probability as an event that has not happened” makes no sense when we look at the cancer test example I used, betraying his ignorance of what Bayes’ Theorem is saying. The cancer test has happened. Yet conditional probability obviously still intelligibly applies. So, too, every other valid application of BT. Conclusion: Hoffmann doesn’t know what he’s talking about.
Indeed, “how can it be useful in determining whether events ”actually” transpired in the past, that is, when the sample field itself consists of what has already occurred (or not occurred)” suggests he must reject all evidence-based reasoning whatsoever, since it is always the case that we have to infer what “actually” transpired in the past with only evidence that exists in the present (that which we can directly confirm “has already occurred”). It seems like he can’t tell the difference between e and h in the equation. (Because otherwise I’m pretty sure he is not a postmodernist and thus is not skeptically rejecting all history that can’t be directly observed–although on such nonsense see my Reply to Rosenberg on History.)
The Hoffmann quote you include also contains a covert attack on reference classes: all historical events are “sui generis” and therefore beyond background-evidence-based prediction, a claim that if true would render false all known science (which applies past findings to new and novel cases), and literally everything we ever conclude about how to act in the future based on what we’ve seen happen in the past. He really can’t think things through, can he?
I remember now I had long ago found that whole Hoffmann post opaque and clueless. But that wasn’t new. I gave up on him years ago.
Richard, I believe you mean that an epistemic probability of 1 represents logical necessity, not logical possibility. (For example, it is logically possible for a cat to be green, or to be black, but it is logically necessary for a cat to be a cat.)
On topic, I can’t imagine anyone who’s studied either logic or probability seriously would make the mistake Hoffmann makes in that quote. I took an intro to logic course as a college freshman that made it perfectly clear that there’s no reason inference has to happen forward in time. We can reason A -> B even if B happened first, like “there’s dirty dishes in the sink now, so my roommate must have been home earlier.” And I’m a pretty lousy poker player, but when I see new cards during a hand I can adjust the probabilities of my “predictions” of what cards my opponents have previously been dealt. “Clueless” is exactly right, he has no idea what he’s talking about.
I also like how he thinks he can use what “we tend to think” as representative of how things really work.
Right, I meant logical necessity and impossibility, not possibility and impossibility. I fixed it so no one would get confused. Thanks!
That Hoffmann article is ridiculous. It’s like he’s saying that Bayesianism is this opaque complicated math thing that only works in certain cases. Uh… no it isn’t. It’s a basic rule. When you get new evidence, your belief about something changes (at least under the Bayesian interpretation). The only thing more ridiculous than Bayesian Reasoning is non-Bayesian Reasoning. I don’t understand why people have so much trouble with this. Yeah, actually doing the updates can be annoying and mathy, but he’s going after the basic idea of Bayesian Inference.
His parodies are confusing and nonsensical. The only reason giving “The moon is made of cheese” a 50/50 prior initially seems absurd is that we have lots of evidence that it isn’t made of cheese. That’s literally the only reason why that’s absurd. If you remove all the evidence we have that it isn’t made of cheese, which you have to do if you’re talking about the very first prior probability, then it’s not absurd at all. That makes total sense.
Just note that IMO Hoffmann is insane. I mean literally. Obviously I’m not a medical doctor so I can only voice this as a personal lay opinion, but it’s based on evidence. I think he writes mostly out of some paranoid vendetta against me from a belief that I’m part of some cabalistic paid conspiracy to…I don’t know, the guy’s crazy. Point being, we shouldn’t generalize from his bizarre writing to anti-Bayesians in general. They would be unfairly insulted by the comparison.
But yes, regarding the moon thing, you’re spot on. Except it wouldn’t be 50/50 if we had none of that evidence, as if every other object we encountered in the universe is made of cheese. But then, that effect of background evidence is more generally your point.
Hoffmann is one of those academics who gets so caught up in his rhetoric, it seems, that he doesn’t bother to wonder or worry whether what he’s saying is correct–as long as it sounds clever and can be aimed at whatever he’s criticizing.
This comes out in a special way on the comments section of his blog. Like you, Richard, he vets posts before making them public. Unlike you, he doesn’t post critical comments–unless he has something biting and witty to reply with. I once posted correcting his assessment of your credentials. I was brief, and courteous. It didn’t make it through his filter. That pretty much lost him my respect, right there.
I don’t think she was going for that angle. It’s tough to tell from a single sentence, but I think she was arguing against an apparent contradiction: Bayesian inference asserts all evidence must have some level of uncertainty, yet BI itself must be taken with absolute certainty. Otherwise, how else would you invoke it?
As BI has become very important to my anti-theist arguments, I’ve put some thought into answering this. I can think of three routes:
1. BI itself can be falsified, and should that happen you just switch to another epistemology. This makes it the perfect starting choice. I used to love this argument, but the more I’ve thought about it the more it seems to beg the question. How do you falsify BI? Through applying BI. Whoops! All my efforts to dodge that have failed, and all other attempts I’ve read haven’t been convincing.
2. BI is the epistemology that makes the fewest assumptions of all epistemologies that handle uncertainty. The basic inspiration for this comes from zero-knowledge proofs, which I suspect have close ties to BI. This one needs more thought, and requires a value judgment (fewer is better), but it’s my current favorite.
3. BI is a logical consequence of the Law of Non-Contradiction and the property known as “true.” I only thought of this one today, so I’m not very confident in it.
I assume by BI you mean Bayesian Inference.
You would falsify BI by falsifying BT by using the same mathematics that proved BT, i.e. the theorem could have been proved false (or unprovable a la Gödel), but instead it just so happens Bayes’ Theorem was proven formally valid. So anyone who accepts any common axiomatic mathematics must accept BT. That could still turn out to be a mistake, but it’s extremely unlikely at this point (that all logical certainties could turn out only mistakenly so is a point I make in Proving History, pp. 24-25, and here that point entails falsifiability).
These are the kinds of arguments that get repeated over and over. I think a far stronger argument for Bayesianism is that it works and has been shown to be useful in everything from actuarial science, to stock markets, to search parties, to guessing presidential elections. Especially in a time of Big Data and powerful computation, Bayesian Statistics is way too powerful to ignore.
Besides, what’s the alternative? p-values? Really? That’s not arbitrary?
I unfortunately wasn’t able to attend INR3, so can only guess at what Louise Antony would have been asking about. But there are two standard objections to Bayesianism in the philosophical literature that seem to somewhat fit the description, and I don’t think they are as easy to dismiss as you seem to think.
The issue of arbitrary priors is central to the debate between objective and subjective bayesians. Of course, at each stage the background evidence is fixed by the stage before it, by the standard bayesian method. But at some point (at least in the highly idealized picture in which we are perfect bayesian reasoners), you get back to the point of zero evidence. More directly, we can use Bayes theorem to show that P(h/b) is a function of P(h). The claim of arbitrary priors is simply that, no matter what the background evidence, b, we can set P(h/b) equal to any probability we like by choosing P(h) appropriately. And the subjective bayesian claims that there is no a priori ground for setting the prior probability P(h) to any specific value. The argument for this is essentially just a generalization of Hume’s argument against the rational foundation of induction.
Now the objective bayesian rejects this claim, and says that we can have a priori prior probabilities, and it appears that you fall into this camp. But, first, you owe an argument for this, since it is a live philosophical debate. And second, it is far more complicated to specify a priori probabilities than to just say that, in the absence of evidence, P(h)=P(~h)=0.5. This is a very crude invocation of the principle of indifference, and it is inappropriate in most cases. (e.g., if there are multiple, mutually exclusive hypotheses, then you can’t set the prior probability of each of them to 0.5).
The second objection is harder to pinpoint, but it sounds to me like the common complaint that, in bayesian conditionalization, the probability of evidence e, once accepted, is one. This problem often appears as the problem of old evidence: if e is a part of the background b, then P(e/b) is one, and P(h/e.b) reduces to P(h/b), and evidence e no longer ‘confirms’ hypothesis h. But it is also a problem if the evidence is itself in doubt – the standard bayesian methods do not allow for conditionalization on uncertain evidence (Jeffrey conditionalization allows this, but isn’t widely accepted). With historical claims, I imagine there are lots of cases where different experts disagree on what counts as evidence in the first place, and the standard bayesian machinery has a hard time dealing with this.
As I said, I wasn’t at the talk (nor have I read your book), so I have no idea if or how these points affect the claims you were making. But if I have identified the issues Antony raised, they are not simply ignorant fallacies raised by those who don’t understand bayesianism.
Not in any useful way. The only way you can set P(h|b) to .999… for example (and still respect background evidence) is if you define h so vaguely that it is predictive of nothing (e.g. h = “Hercules is a name in some book somewhere” vs. “Hercules existed as a real historical person and conquered the Peloponnesus in the year 1803 B.C.E.”). All non-trivial theories cannot be dinked this way. So I’m not sure what you are talking about here.
Bayesian reasoning actually solves Humean induction by removing the necessity of time from the equation. The laws of probability hold for sets of evidence even atemporally, therefore one never has to infer from past to future to get the same results, and since the same laws apply across both axes (space and time), Hume’s problem is avoided as long as you only make proper statistical statements about the future and not statements of objective certainty.
Of course, you could walk this all the way back to Cartesian skepticism (“maybe we’ve all been deluded about everything!”) but that’s a defect of all epistemologies, and therefore is not any greater a defect of Bayesian epistemology.
As for the setting of priors to 50/50 at zero empirical knowledge for simple binary hypotheses (hence you are correct that for non-binary problems ramification becomes more problematic, but “x exists” is almost always binary), if subjective Bayesians deny this as you claim (I’m not sure they do; you may be over-generalizing from certain cases to all, e.g. from non-binary to binary hypotheses), then they are refuted on this point by the arguments in Proving History, pp. 83-88, 110-14. Be aware that we can only be talking about artificial a priori priors, i.e. we analytically create their a priori status by subtracting information (thus we can manipulate the information however we need to develop them). It’s not like there’s any such thing as an actual a priori prior (maybe for Hal 9000 the second he is first turned on, but that’s certainly not us).
Anyway, I am fairly certain none of this was what Antony was thinking.
I treat this issue several times in Proving History as well, most pertinently I usually exclude all significantly uncertain existential claims from being evidence, classifying them as hypotheses instead, unless we can condition on the uncertainty, e.g. we can calculate “P conditional on e having a 2% chance of being true,” if we really wanted to. In probability theory that’s not difficult at all. The math is just tedious. I advise historians to avoid that and only put in e and b what all parties can agree should go there (all parties, that is, who agree to certain axioms defined in chapter two, and actually abide by them…thus lunatics can be excluded from “all,” for example, as can irrational dogmatists, liars, and so on; in practice, it amounts to determining what the Bayesian conclusion is within a population that is rational, neither insane nor unyieldingly dogmatic, and sincerely committed to certain basic axioms that non-controversially define history as a knowledge-seeking profession, since we don’t care what the conclusion is in other populations, e.g. liars or lunatics).
There seems to be possible confusion here. P(e|b) = [P(h|b) x P(e|h)] + [P(~h|b) x P(e|~h)] and thus is always conditional on h and ~h. So it can never be 1 simply by being subsumed under b, because P(e|b) isn’t the probability that the evidence exists, it’s the probability that it would be produced by h and by ~h. Those latter probabilities never change. And the effect of them is represented in the revised prior, hence the probabilities are calculated into the effect e has on the prior now that it’s in b.
Your subsequent comment (which I addressed above) suggests that you are here confusing the term P(e|b) in BT with what I identify as P(e|o) in my article. Which gets us back to exactly what’s wrong with Antony’s objection. If, again, that’s what she was saying.
Richard,
I think you’re not getting Iain’s points. Let me see if I can make the second one clearer, since if Antony meant to be raising the problem of old evidence, it really is a somewhat serious problem.
Bayesians of a pretty minimal stripe typically adopt two constraints on rational belief. First, they claim that rational degrees of belief (at any given time) must obey the axioms of probability. And second, they claim that rational degrees of belief are to be updated, given evidence, using Bayes’ Rule (not to be confused with Bayes’ Theorem) of Conditionalization, which says:
Pr_new(h) = Pr_old(h | e)
In words: When I update my credences on the basis of some evidence e, the new credence that I assign to each hypothesis h in my hypothesis space must be equal to my old conditional credence in h given e. That is, I’m moving from one distribution over the hypothesis space to a new one, and Bayes’ Rule tells me how to make the move.
(Note that whereas Bayes’ Rule is controversial among epistemologists, Bayes’ Theorem is not controversial. Even benighted frequentists accept Bayes’ Theorem. Everyone who accepts the usual probability axioms, the usual definition of conditional probability, and classical logic has to accept Bayes’ Theorem!)
Now, consider what your credence for e should be after learning that e is the case. Obviously it is one. We’re just updating: Pr_new(e) = Pr_old(e | e) = 1. In other words, Bayes’ Rule assumes that we are certain about our evidence.
We’re not quite to the problem yet. We need an historical example to motivate the problem. The example that Clark Glymour used when he introduced the problem of old evidence is Einstein’s use of the theory of general relativity to explain the anomalous advance of the perihelion of Mercury. At the time that Einstein was working out the details of his general theory, it was well-known (and had been well-known for a long time) that Newton’s theory did not correctly predict Mercury’s orbit. Moreover, it was well-known what Mercury’s orbit looked like.
And that is a serious problem for Bayesian accounts of confirmation. Naively, the fact that Einstein’s theory gave the right answer for the orbit of Mercury helped to confirm the theory. But if the Bayesian picture is correct, then fitting the orbit of Mercury should not have changed the (subjective) probability that the theory was true. It should not have affected anyone’s rational credences. The reason is that the evidence — the shape of Mercury’s orbit — was already known. It was already in the background beliefs. Naively, one would think that Pr_new(GTR) = Pr_old(GTR | M) > Pr_old(GTR), where M is the observed orbit of Mercury. But since M was already known, Pr_new(GTR) = Pr_old(GTR | M) = Pr_old(GTR).
In the linked essay above, Glymour considers some ways one might try to escape the problem of old evidence. To take the example he likes best, you could treat the evidence as the fact that the theory predicts the known-in-advance observation. That is, the evidence would be a structural relation between a theory and an observation, rather than just an observation. But there is a lurking worry here that, as Glymour puts it, the degrees of belief become epiphenomena, while the structural relation does all the real work.
Note that that is just like using Newton’s equations to predict rate of fall, even though that will almost always produce a false result (see my discussion here). It’s an approximation, not an exactitude, because exactitude would require far more work for no useful gain. If you really wanted to, the probability of evidence being incorrect can be incorporated into Bayes’ Theorem (using a standard model of total probability). It’s just really a lot of work for no useful gain (especially when we are arguing a fortiori, see Proving History, index “a fortiori”).
No, it isn’t. And if you were up on Bayesian literature, you’d know it isn’t. For why it’s no problem at all (and the scholarship) see Proving History, pp. 277-80, where I discuss exactly the same example you bring up.
(You also don’t seem to know how Bayes’ Theorem works. That Mercury’s orbit was known in b is simply no different from it being observed in e, and one can move items between b and e however you want, as long as the two sets don’t overlap. I discuss this several times in Proving History, index “demarcation”. The probability that e if h is not the probability that e is observed, it’s the probability that e would be observed if h is true. Before Einstein, the probability that that e would be observed on any then known theory was very small; when Einstein came along, his theory then entailed the probability of that e was remarkably high; thus, it became evidence for his theory–one merely has to control for the retrofitting fallacy, but that didn’t apply in that one example, which is notably why that’s the example everyone uses: it’s so rarely the case that retrofitting can’t be the cause of a retrodiction, that there are few examples to choose from. I explain in Proving History, pp. 277-80, why retrofitting wasn’t a causal factor in that case. As has been noted by other experts, whom I cite there.)
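The demarcation point can be illustrated numerically (a toy sketch with invented numbers, assuming the two items of evidence are conditionally independent given h and given ~h): whichever item you treat as old evidence in b and whichever you treat as new evidence in e, the final posterior is the same.

```python
def update(prior, p_e_h, p_e_not_h):
    """One run of Bayes' Theorem on a single item of evidence."""
    return (prior * p_e_h) / (prior * p_e_h + (1 - prior) * p_e_not_h)

prior = 0.5
e1 = (0.8, 0.3)  # hypothetical (P(e1|h), P(e1|~h))
e2 = (0.6, 0.1)  # hypothetical (P(e2|h), P(e2|~h))

via_e1_first = update(update(prior, *e1), *e2)              # e1 folded into b first, then e2 as new evidence
via_e2_first = update(update(prior, *e2), *e1)              # the reverse demarcation
all_at_once  = update(prior, e1[0] * e2[0], e1[1] * e2[1])  # both items treated as e together
print(via_e1_first, via_e2_first, all_at_once)              # all three come out identical
```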
Oh, and BTW, “Old Evidence, Problem of” is even in the index to Proving History. So don’t think you are somehow surprising me with this.
If someone does not want to believe you, you could point out that this principle is actually widely used in robotics and machine learning. It is kind of hard to argue against real world technical applications.
An example: You want to do robot localization. Your robot can sense features of the environment and can move. Immediately after the robot is turned on, it knows nothing about the environment (all locations have the same probability). These are the priors. The robot then gets the first sensory input and computes the posterior probabilities. The process can be repeated using the posteriors as new priors, and the algorithm converges pretty quickly – regardless of the priors, unless you set the initial priors to zero.
Good priors obviously help to speed up the convergence but don’t really matter in the end. If someone is interested, I can recommend the book Probabilistic Robotics, which addresses the technical side of this.
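For readers curious what that looks like, here is a bare-bones sketch of the idea just described (a toy 1-D histogram localization on a ring of cells, with invented numbers; it is not code from that book):

```python
world = ['door', 'door', 'wall', 'wall', 'wall']  # what each cell of the ring looks like
belief = [1.0 / len(world)] * len(world)          # uniform prior: the robot knows nothing yet

P_HIT, P_MISS = 0.9, 0.1  # sensor model: P(reading | cell matches), P(reading | it doesn't)

def sense(belief, reading):
    """Bayesian update of the belief given one (noisy) sensor reading."""
    new = [b * (P_HIT if cell == reading else P_MISS) for b, cell in zip(belief, world)]
    total = sum(new)  # normalizer = total probability of the reading
    return [n / total for n in new]

def move(belief, step):
    """Shift the belief around the ring by `step` cells (exact motion, for simplicity)."""
    return [belief[(i - step) % len(belief)] for i in range(len(belief))]

belief = sense(belief, 'door')  # posterior concentrates on the two door cells
belief = move(belief, 1)
belief = sense(belief, 'door')  # seeing a second door one step later singles out cell 1
print([round(b, 3) for b in belief])
```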
For the benefit of my readers, I assume you are referring to the Thrun-Burgard-Fox text, Probabilistic Robotics (which is indeed Bayesian from start to finish, it even says so in the preface).
The second fallacy you describe is very close to one that, worryingly, is not limited to philosophers. Recently, a high-court judge in the UK ruled that statistical evidence was inadmissible, because he said it was meaningless to associate a probability with an event that had already happened. Either it happened or it didn’t. I wrote that this was grounds for him to lose his job.
Ted Bunn also picked up on the story, and offered perhaps the best response to anyone who thinks this way: challenge them to a game of poker, and insist that they stick to their principles. Either they have been dealt a winning hand, or they have not. It has already happened, so no amount of reasoning can allow a probability to be assigned. In which case it makes no difference whether or not they look at their cards, so they should play blind.
There is also the notion of information entropy, which can actually be used in the context of inference in two ways:
1. The notion of an “entropic prior” can be used to specify under what conditions a “natural” prior can be defined and exactly what this natural “entropic” prior should look like. The entropic prior is the generalization of “If you know nothing, it’s 50/50.” However, it can take into account more sophisticated constraints in an easy and more elegant way than Bayesian priors.
2. The notion of maximizing the entropy of the posterior relative to the prior, subject to the provided constraints, allows one to define a very general way to update your set of beliefs given new information. It’s fully compatible with Bayes’ theorem (which it includes as a special case), but it also includes more general methods of updating, like Jeffrey’s rule, where the observed data includes uncertainty, and even further generalizations.
It’s all rather elegant and it can all be codified in terms of maximizing a single functional called the relative entropy. Everything about inference is then reduced to three problems:
1. Identify the prior. Make sure you know what the universe of discourse is.
2. Figure out what the constraints are. This is the same as identifying what information you have been given.
3. Implement the constraints via direct substitution and/or Lagrange multipliers. The rest is an extremely simple exercise in the calculus of variations.
It turns out that all of inference can be viewed in this way, and, in many cases, this is more mathematically elegant than the highly “discrete” Bayes theorem. They are equivalent, but maximizing concave functionals is easy once things get complicated.
Sounds far too complicated for your average humanities major. Note that I similarly reject many other systems and uses of Bayes’ Theorem in the book on the grounds of lack of utility-for-the-purpose (ascertaining the logical validity of typical propositions about history). That’s not to slight their value in other ways and domains. They just aren’t useful for doing history, any more than binary code is a useful way to write books.
It is a bit more complicated, I’ll admit. However, it also makes certain problems much easier to solve. Here’s an example.
Suppose you have a die, and besides the fact that it’s a die, the only additional information you know is that the “true” average of its rolls is equal to 4. Is there a natural posterior probability distribution that can be associated with this new information given a uniform prior?
The surprising answer is yes, and the easiest way to obtain this answer is by the method of maximum entropy. Bayesian updating is very useful when the new information is in the form of discrete data points. MaxEnt is very useful when the new information is in the form of constraints (i.e. you have information that tells you what something isn’t, rather than what it is). I don’t know if information in the form of (quantifiable) constraints shows up in the humanities, but I am familiar with its use in physics, economics, experimental design, and machine learning.
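For concreteness, here is one way the die example just described could be computed (a sketch only, assuming NumPy and SciPy are available): the maximum-entropy distribution relative to a uniform prior and subject to a mean constraint has the exponential form p_i proportional to exp(λ·i), so one only has to solve for the multiplier λ that makes the mean come out to 4.

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def mean_given_lambda(lam):
    """Mean of the maximum-entropy distribution p_i proportional to exp(lam * i) over faces 1..6."""
    weights = np.exp(lam * faces)
    p = weights / weights.sum()
    return (p * faces).sum()

# Solve for the Lagrange multiplier that yields a mean of exactly 4.
lam = brentq(lambda l: mean_given_lambda(l) - 4.0, -5.0, 5.0)
weights = np.exp(lam * faces)
p = weights / weights.sum()
print(np.round(p, 4))  # probabilities rise gently from face 1 to face 6
```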
It’s up to the individual to decide whether or not such a thing is useful.
Sorry, but no, that is not “the easiest way” to do that. A far easier way is simple argument by hypothesis, as I explain (indeed, with almost the very same example) in Proving History, pp. 257-65.
It might be more appropriate to say that the entropy method is the more precise way to do things like this, but historians neither need nor can use that level of precision.
I suppose the perspective of a physicist and the perspective of a historian are simply different, then. I guess we’ll just have to agree to disagree about which is easier.
Oh well, it was just a suggestion about something I thought was interesting and possibly relevant.
Certainly interesting, and relevant in certain contexts, just not in history.
Richard, there are those of us who have tried to understand your use of the theorem as described above but who just don’t get it, and probably don’t have the time to spend on it.
Should we then not use it and hold back on it, or should we just accept it based on your experience and education?
The first question rather is: What will you use instead?
Whatever it is, either it will conform to BT or it will violate it. If the latter, your conclusion will be logically invalid (PH, pp. 106-18).
So the second question is: How much does it bother you that you won’t know when your reasoning is logically invalid, because you won’t know when it conforms to BT?
If that doesn’t bother you much, then I can’t help you. You have a much bigger problem to solve first.
But if it bothers you enough, you should try harder to understand the logic of BT. Since it requires no more than sixth grade math, I doubt it is beyond your ability to understand it if you try.
Note, however, that you don’t have to understand this specific post to apply BT. This post relates to high level meta-challenges to the underlying logic and epistemology of BT. Just as you don’t need to understand philosophy of science to be a good scientist, you don’t need to understand this meta-level stuff to apply Bayesian reasoning. Emphasis on need to. You certainly can, IMO. But that would require effort, which introduces the question of practicality (is it worth the bother). And that gets us back to the same cycle of questions:
Does it bother you that you don’t understand why the two objections to Bayesian reasoning I rebut here are incorrect?
If not, then you needn’t worry about it. You are then okay with the conclusion that they are incorrect and you can just move on to something more important.
But if it does bother you, then you need to make more of an effort to understand why these two objections to Bayesian reasoning are wrong.
And I think you could. Certainly if you did your best and then asked questions here to get more information about what is still stumping you. Progress would be inevitable I think.
Richard, it would also help if you could clarify your academic qualifications in mathematics. I can’t seem to find much on these.
A lot of people on our campus have asked about this, so if you think it is going away because you block discussion of it, you are mistaken.
AP calculus, a college-level statistics course, a semester of electronics engineering, a year serving as a sonar technician at sea, a Ph.D. in the history of science, and several years of advanced reading and discussion with experts in the philosophy of mathematics generally and Bayesian reasoning specifically. Since my use of BT requires nothing more than sixth grade math, I really don’t even need those qualifications. But in any case, my book (Proving History: Bayes’s Theorem and the Quest for the Historical Jesus) passed peer review by a professor of mathematics. And I have similarly published peer-reviewed work in statistics and probability before that (Richard Carrier, “The Argument from Biogenesis: Probabilities Against a Natural Origin of Life,” Biology and Philosophy 19.5 [Nov 2004]: 739-64).
Rather than obsess over qualifications, though, why not actually address the arguments I make?
Richard,
Just wanted to point out that 50/50 is not the correct prior probability of the moon being made of cheese even with no background knowledge. If it were, then it could similarly be claimed to be 50% for being made of pasta, porcelain, or wood, which of course adds up to more than 100%. Instead, the correct approach, with no background knowledge, would be to assign equal probabilities to all conceivable moon materials. With thousands of possible materials, and no reason to assign additional likelihood to cheese (no background knowledge), we would have a very low probability for cheese. Thus, with no background, the correct assessment would be that any guess regarding the moon’s composition is extremely unlikely. Maybe I’m just nitpicking here.
That’s essentially what I said (“Except it wouldn’t be 50/50 if we had none of that evidence, as if every other object we encountered in the universe is made of cheese.” … in other words, if half of all objects randomly encountered were made of cheese, then a prior of 50/50 would be warranted, so conversely if we adopted an artificial zero knowledge position and asserted a prior of 50/50 for cheese we would be asserting that half of all objects randomly encountered will be made of cheese, which cannot validly be asserted on a state of zero knowledge).
Of course, on actual background knowledge it would not be an equal probability for all materials. We have a lot of background evidence regarding the frequencies of different materials in natural space objects, which gives us better than equal chances for all possible materials (and extremely low chances for absurdities like cheese). Even if we step back and assume a position of, say, an ancient astronomer, where they lacked data on space objects, there still would not be an equal distribution, since inferences could be made based on appearance and causality and what materials were at least then known (not as reliable as inferences we can make now, but far more reliable than just randomly assigning all possible materials the same probability; case in point: many ancient astronomers correctly inferred the moon was made of earth, which is well nigh statistically impossible if all materials had equal probability, so they clearly were using a more reliable standard of inference than that).
If we posit a zero knowledge position, then you wouldn’t model the problem as “made of cheese or not,” because in a state of zero knowledge you’ve never heard of cheese. A person in that state could not make any assertion at all about what the moon was made of, until they started gathering more and more knowledge (hence the rise from cave men thinking it’s a light in the sky, to ancient astronomers concluding it’s a rock of some kind, to moderns being able to deduce ever-more accurate conclusions as access to data increases).
But to be charitable to the commenter’s original point, this essentially was their point (even if they got the math wrong, their idea was correct).
I was asked off-thread what I thought of two remarks about this post on Reddit. I don’t participate on Reddit, because it’s an unregulated cesspool and I don’t believe anything actually productive can ever get done there. Indeed, a critique prefaced with the sentence “Bullshit article is bullshit” (as one of the comments I was asked about is) reads more like something written by a child than by anyone worth taking seriously.
But for the adults in the room:
(1)
Regarding a remark concerning arbitrary priors and Kolmogorov complexity, this is a classic example of overthinking a problem, and illustrates the difference between an armchair thinker and someone who actually uses Bayes’ Theorem in real-world applications (where it has been so thoroughly proven superior to any other epistemology, one has to immediately question anyone who claims it can’t possibly work, just like someone who claims evolution can’t possibly work because “reasons”).
In principle iterative Bayesianism solves the problem of arbitrary priors, since all priors are based on the same data, and all data has to be swept up in the equation eventually anyway, since b + e = all available knowledge without remainder. So it doesn’t matter which prior you start with. The question rather is whether we can complete a Bayesian iteration in every case. And the answer is no; in many cases we have to ballpark it. But that is a problem for all epistemologies, not just Bayesianism (see Epistemological End Game).
I address both facts in my book, in detail (for example, Proving History, pp. 229-56). The bottom line is that the selection of reference classes (hence priors) is not all that arbitrary (and indeed, any reference class you don’t use falls back into the evidence and thus affects the posterior probability anyway: as I show, again, on pp. 229-56). What this critic also seems to be unaware of is the use of a fortiori reasoning with large margins of error, a technique that solves most problems involving the selection of probabilities in BT, and that is the technique I lay out in my book Proving History (index, “a fortiori“).
I further show that all other valid epistemologies can be described by BT, thus all other valid epistemologies are modeled by and thus reduce to Bayesian epistemology. All the problems and limitations of the one are represented in the other, in a logical 1:1 correspondence. So no criticism of Bayesian epistemology can be valid that doesn’t also take down every other epistemology worth having. So you may as well just roll up your sleeves and solve the problem, whatever you claim it is. And in my experience, BT makes doing this a lot easier, because it shows the correct logical relations between evidence and sound belief.
Note that at no point do I have to appeal to anything as needlessly complex as resorting to Kolmogorov complexities to define parameters in BT. That one can do this only shows that what we are doing with BT has an ontological ground, and that without remainder. But one almost never has to actually do that to know that–just as one doesn’t have to actually translate this comment into binary code to know that it can be done.
To carry that analogy further, anyone who is well enough familiar with such a procedure doesn’t have to actually translate this comment into binary code to know roughly how much data space that would consume, and generally one doesn’t ever need to know that beyond roughly: e.g. our data storage and transmission speeds are so large now that anything in the Kb range is insignificant when it comes to calculating times and storage limits, so we usually round to the Mb or maybe the 100s of Kb. “Kolmogorov-style” precision is simply useless. And although someone interested in reducing the information in this comment to its smallest possible bit length might find that an amusing puzzle, the answer is generally of no use to anyone (except maybe cryptologists and engineers who are faced with solving extreme limitations in communications, such as how to increase the efficiency of ELF transmissions for submarines).
(2)
It was suggested that John Pollock’s paper “Problems for Bayesian Epistemology” refutes me, although that isn’t true for this blog post, and is certainly not true for my book, which actually refutes him, at length. Which tells me whoever posted that comment hasn’t read my book.
Indeed, all the “problems” Pollock refers to are mooted by my demonstration that “degrees of belief” reduce to frequencies (Proving History, pp. 265-80, although the preceding section is important to the point as well: pp. 257-65), and thus probability calculus applies by all the same proofs ever developed (and so all his objections have nothing to apply to in my application of BT). I provide a more formal refutation of Pollock’s entire thesis on pp. 106-14 (and relevant to understanding that is what is shown in the preceding section again, pp. 97-106), where I prove that all epistemologies necessarily reduce to BT (so any problems with Bayesian epistemology are equally problematic for any other epistemology you care to name…provided that epistemology has any logical validity; but any epistemologies that violate logic should obviously be rejected).
Pollock is correct about one thing, though: all we can get out of Bayesian reasoning is warrant (“warranted belief”), not “knowledge” in the hyper-specific sense of justified true belief, except insofar as it is probabilistic belief, because we can only have justified true belief that “given the information available to me at time t, the epistemic probability that h is true is P at t” (and not that “h is true,” full stop, the impossible dream of too many a philosopher these days). But it is not hard to show that all epistemologies suffer the same problem, and Bayesianism only exposes the problem in greater clarity. Basically, any epistemology that denies the probabilistic nature of all knowledge claims (and thus acknowledges that anything, literally anything, we claim to be “knowledge” could be false…i.e. it has some nonzero probability of being false) is an epistemology no human being could ever actually deploy (and indeed, even a god could not: PH, p. 331, n. 41). So holding out for such an epistemology is foolish.
I’m posting this for no particular reason, I just thought you may find it funny 🙂
http://xkcd.com/1236/