Does Bayesian statistical analysis involve stronger assumptions than frequentist statistical analysis? It is commonly believed that the answer is yes, because frequentist analysis makes assumptions about the likelihood while Bayesian analysis requires assumptions on both the likelihood and the prior. Since priors are absent from frequentist analysis, it seems reasonable to say that Bayesian analysis needs more assumptions. Quite a few statisticians distrust modeling assumptions on the grounds that they are fundamentally unverifiable, and these statisticians tend to be much more sceptical about Bayesian methods than about frequentist ones.
I held similar views until not too long ago, but I have recently realized that the assumptions underlying frequentist and Bayesian analyses are quite different in nature, and it is no longer at all clear to me that frequentist analysis makes fewer assumptions. In fact, as far as the likelihood is concerned, Bayesian analysis makes much weaker assumptions than frequentist analysis. Specifically, in Bayesian analysis one needs to model only the probability of observing the dataset at hand conditional on the unknown parameter, while in frequentist analysis one typically needs to model, for every possible imaginary dataset, the probability of observing that dataset. I will try to illustrate this using the following elementary problem in statistical inference that is surely familiar to every statistics student.
Problem: Suppose a scientist makes numerical measurements 26.6, 38.5, 34.4, 34, 31, 23.6 of an unknown physical quantity $\theta$. On the basis of these measurements, what can be inferred about $\theta$?
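To get a feel for the data, the summary statistics that both analyses below revolve around can be computed with the Python standard library (the variable names here are mine, not part of the problem):

```python
import statistics

# The six measurements from the problem statement.
x = [26.6, 38.5, 34.4, 34, 31, 23.6]

n = len(x)
xbar = statistics.mean(x)   # sample mean
s = statistics.stdev(x)     # sample standard deviation (divisor n - 1)

print(n, round(xbar, 2), round(s, 2))   # 6 31.35 5.48
```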
Standard frequentist and Bayesian approaches to this problem are described below. They lead to identical solutions, which allows for a transparent comparison of the assumptions underlying the two approaches. I will assume throughout that all the random variables involved have densities.
Let me start with a detailed description of the standard frequentist approach. The observed dataset here consists of the six points $x_1 = 26.6$, $x_2 = 38.5$, $x_3 = 34.4$, $x_4 = 34$, $x_5 = 31$, $x_6 = 23.6$. One can consider any other set of six real numbers as an imaginary alternative dataset. One can presumably also allow alternative datasets to have different numbers of points. The collection of all such imaginary alternative datasets is known as the sample space, whose specification is essential to frequentist inference. We shall take the sample space to be the set $\mathbb{R}^6$ of all 6-tuples of real numbers. If the scientist used a stopping criterion to obtain their data (such as, for example, the criterion of stopping as soon as a measurement less than 25 is observed), then this simple sample space will not work (we would need to allow datasets of varying lengths in that case).
Freq-Step 1: The observed values $x_1, \dots, x_6$ are assumed to be realizations of random variables $X_1, \dots, X_6$.
Freq-Step 2: $X_1, \dots, X_6$ are assumed to be independently distributed according to the normal distribution with mean $\theta$ (which is the unknown quantity of interest) and an unknown variance $\sigma^2$. Mathematically, this means that $X_1, \dots, X_6$ have the joint probability density function
\[
f(x_1, \dots, x_6) = \prod_{i=1}^{6} \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x_i - \theta)^2}{2\sigma^2} \right) \qquad \text{for all } (x_1, \dots, x_6) \in \mathbb{R}^6. \tag{1}
\]
The above assumption describes the probability of obtaining every possible imaginary dataset in the sample space.
Freq-Step 3: Under assumption (1), it is a well-known mathematical fact that $\frac{\bar{X} - \theta}{S/\sqrt{6}}$ has the Student $t$-distribution with $5$ degrees of freedom, where
\[
\bar{X} = \frac{1}{6} \sum_{i=1}^{6} X_i \qquad \text{and} \qquad S^2 = \frac{1}{5} \sum_{i=1}^{6} (X_i - \bar{X})^2.
\]
Freq-Step 4: Because of the distributional result of Freq-Step 3,
\[
\mathbb{P}\left\{ \left| \frac{\bar{X} - \theta}{S/\sqrt{6}} \right| \le t_{5, \alpha/2} \right\} = 1 - \alpha \qquad \text{for every } 0 < \alpha < 1,
\]
where $t_{5, \alpha/2}$ is the $(1 - \alpha/2)$-quantile of the Student $t$-distribution with $5$ degrees of freedom. This is equivalent to saying that $\theta$ is contained in the random interval
\[
\left[ \bar{X} - t_{5, \alpha/2} \frac{S}{\sqrt{6}}, \; \bar{X} + t_{5, \alpha/2} \frac{S}{\sqrt{6}} \right] \tag{2}
\]
with probability $1 - \alpha$.
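The probability statement in Freq-Step 4 is a statement about repeated sampling: if the normal model (1) holds and we imagine drawing the dataset afresh many times, the random interval (2) covers $\theta$ in about a $1-\alpha$ fraction of the draws. A minimal simulation sketch (the "true" $\theta$, $\sigma$ and the hard-coded critical value $t_{5, 0.025} \approx 2.571$ are my illustrative choices, not part of the post):

```python
import math
import random
import statistics

random.seed(0)
theta, sigma = 31.35, 5.5   # hypothetical "true" values, chosen for illustration
t_crit = 2.571              # 0.975-quantile of the t-distribution with 5 df
n, reps = 6, 20000

covered = 0
for _ in range(reps):
    data = [random.gauss(theta, sigma) for _ in range(n)]
    xbar = statistics.mean(data)
    s = statistics.stdev(data)
    half = t_crit * s / math.sqrt(n)
    covered += (xbar - half <= theta <= xbar + half)

print(covered / reps)   # should be close to 0.95
```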
Freq-Step 5: The observed values corresponding to $\bar{X}$ and $S$ are plugged into (2) and the interval is calculated explicitly. With the values listed in the problem, we get $\bar{x} = 31.35$ and $s \approx 5.48$, so that $\left[ 31.35 - 2.24\, t_{5, \alpha/2}, \; 31.35 + 2.24\, t_{5, \alpha/2} \right]$ is the confidence interval for $\theta$. The common choice $\alpha = 0.05$ leads to the $95\%$ confidence interval $(25.60, 37.10)$.
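The arithmetic of Freq-Step 5 can be checked in a few lines (standard library only; the critical value $t_{5, 0.025} \approx 2.571$ is hard-coded since the standard library has no $t$-quantile function):

```python
import math
import statistics

x = [26.6, 38.5, 34.4, 34, 31, 23.6]
n = len(x)
xbar = statistics.mean(x)
s = statistics.stdev(x)

t_crit = 2.571                      # 0.975-quantile of the t-distribution with 5 df
half = t_crit * s / math.sqrt(n)    # half-width of the 95% interval
lo, hi = xbar - half, xbar + half
print(round(lo, 2), round(hi, 2))   # 25.6 37.1
```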
So the frequentist approach required a specification of the sample space $\mathbb{R}^6$ and the precise assumption (1) on the probability of observing every possible imaginary dataset in the sample space. If we only make the assumption (1) for a subset of the datasets in the sample space, the distributional fact in Freq-Step 3 will no longer hold. An alternative frequentist approach to this problem, based on asymptotics, would make, instead of (1), the assumption that $X_1, \dots, X_n$ are independent and identically distributed according to a distribution with mean $\theta$ and a finite variance $\sigma^2$:
\[
f(x_1, \dots, x_n) = \prod_{i=1}^{n} g(x_i) \qquad \text{for all } (x_1, \dots, x_n) \in \mathbb{R}^n \tag{3}
\]
for some density function $g$ satisfying $\int x\, g(x)\, dx = \theta$ and $\int (x - \theta)^2 g(x)\, dx = \sigma^2 < \infty$. This is a more general assumption than (1), but it still needs to hold for all $(x_1, \dots, x_n)$, i.e., for all possible imaginary elements of the sample space.
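Under (3), the interval is instead built from the normal quantile via the central limit theorem, and its coverage is only approximate for finite sample sizes. A simulation sketch with deliberately non-normal data (a shifted exponential, which has the required mean and finite variance; the sample size, true mean, and repetition count are my illustrative choices):

```python
import math
import random
import statistics

random.seed(1)
theta = 31.35                    # hypothetical true mean, for illustration
n, reps, z = 200, 4000, 1.96     # z is the 0.975-quantile of the standard normal

covered = 0
for _ in range(reps):
    # Skewed, non-normal data with mean theta and variance 1.
    data = [theta - 1 + random.expovariate(1.0) for _ in range(n)]
    xbar = statistics.mean(data)
    s = statistics.stdev(data)
    half = z * s / math.sqrt(n)
    covered += (xbar - half <= theta <= xbar + half)

print(covered / reps)   # approximately 0.95, despite the non-normality
```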
Let us now look at the standard Bayesian approach before comparing the two approaches.
Bayes-Step 1: The observed values $x_1, \dots, x_6$ are assumed to be realizations of random variables $X_1, \dots, X_6$. The unknown physical quantity $\theta$ is also considered to be random from the point of view of the scientist/data analyst, who is uncertain about its actual value. The goal is to calculate the conditional density of $\theta$ given $X_1 = x_1, \dots, X_6 = x_6$:
\[
f(\theta \mid x_1, \dots, x_6).
\]
Bayes rule is applied to give
\[
f(\theta \mid x_1, \dots, x_6) \propto f(x_1, \dots, x_6 \mid \theta)\, f(\theta) \tag{4}
\]
where the $\propto$ sign means that the left hand side is equal to the right hand side multiplied by a factor that does not depend on $\theta$ (the factor does depend on $x_1, \dots, x_6$). Modeling assumptions are needed for $f(x_1, \dots, x_6 \mid \theta)$ (likelihood) and $f(\theta)$ (prior).
Bayes-Step 2: For the likelihood model, a parameter $\sigma$ is introduced for the scale of the noise in the measurements, so one can write
\[
f(x_1, \dots, x_6 \mid \theta) = \int_0^\infty f(x_1, \dots, x_6 \mid \theta, \sigma)\, f(\sigma \mid \theta)\, d\sigma.
\]
It is assumed that
\[
f(x_1, \dots, x_6 \mid \theta, \sigma) = \prod_{i=1}^{6} \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x_i - \theta)^2}{2\sigma^2} \right) \tag{5}
\]
for the observed values $x_1 = 26.6, \dots, x_6 = 23.6$ only.
The densities $f(\theta)$ and $f(\sigma \mid \theta)$ can together be considered as the prior specification and, for these, it is assumed that
\[
f(\theta) \propto 1 \qquad \text{and} \qquad f(\sigma \mid \theta) \propto \frac{1}{\sigma} \tag{6}
\]
for all $\theta \in \mathbb{R}$ and all $\sigma > 0$ respectively. These are improper priors.
Bayes-Step 3: Plugging (6) and (5) in (4), we get
\[
f(\theta, \sigma \mid x_1, \dots, x_6) \propto \frac{1}{\sigma^7} \exp\left( -\frac{\sum_{i=1}^{6} (x_i - \theta)^2}{2\sigma^2} \right).
\]
One can pull out the term involving $\theta$ by a simple change of variable in the integral over $\sigma$, resulting in
\[
f(\theta \mid x_1, \dots, x_6) \propto \left( \sum_{i=1}^{6} (x_i - \theta)^2 \right)^{-3} = \left( 5 s^2 + 6 (\theta - \bar{x})^2 \right)^{-3}
\]
where
\[
\bar{x} = \frac{1}{6} \sum_{i=1}^{6} x_i \qquad \text{and} \qquad s^2 = \frac{1}{5} \sum_{i=1}^{6} (x_i - \bar{x})^2.
\]
One can check with the formula for the density of the $t$-distribution that the above is equivalent to
\[
\frac{\theta - \bar{x}}{s/\sqrt{6}} \;\Big|\; x_1, \dots, x_6
\]
having the Student $t$-distribution with $5$ degrees of freedom.
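The integration over $\sigma$ in Bayes-Step 3 can be sanity-checked numerically: $\int_0^\infty \sigma^{-7} \exp(-A/(2\sigma^2))\, d\sigma$ should be proportional to $A^{-3}$, where $A = \sum_i (x_i - \theta)^2$, so the ratio of the integral at two values of $\theta$ must match the ratio of the corresponding sums of squares raised to the power $-3$. A crude trapezoidal sketch (the integration limits and step count are arbitrary choices of mine):

```python
import math

x = [26.6, 38.5, 34.4, 34, 31, 23.6]

def ss(theta):
    # A = sum of squared deviations of the data from theta.
    return sum((xi - theta) ** 2 for xi in x)

def sigma_integral(a, lo=0.5, hi=100.0, steps=100000):
    # Trapezoidal approximation of the integral of
    # sigma**(-7) * exp(-a / (2 * sigma**2)) over sigma in (0, infinity);
    # the integrand is negligible outside [lo, hi] for the values of a used here.
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        sigma = lo + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * sigma ** -7 * math.exp(-a / (2 * sigma ** 2))
    return total * h

ratio_numeric = sigma_integral(ss(31.35)) / sigma_integral(ss(35.0))
ratio_analytic = (ss(31.35) / ss(35.0)) ** -3
print(ratio_numeric, ratio_analytic)   # the two ratios should agree closely
```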
Bayes-Step 4: Because of the distributional result in Bayes-Step 3,
\[
\mathbb{P}\left\{ \left| \frac{\theta - \bar{x}}{s/\sqrt{6}} \right| \le t_{5, \alpha/2} \;\Big|\; x_1, \dots, x_6 \right\} = 1 - \alpha,
\]
which is equivalent to saying that, conditional on $X_1 = x_1, \dots, X_6 = x_6$, the random variable $\theta$ is contained in the fixed interval
\[
\left[ \bar{x} - t_{5, \alpha/2} \frac{s}{\sqrt{6}}, \; \bar{x} + t_{5, \alpha/2} \frac{s}{\sqrt{6}} \right] \tag{7}
\]
with probability $1 - \alpha$.
Bayes-Step 5: The observed values corresponding to $\bar{x}$ and $s$ are plugged into (7) and the interval is calculated explicitly. With the values listed in the problem, we get $\left[ 31.35 - 2.24\, t_{5, \alpha/2}, \; 31.35 + 2.24\, t_{5, \alpha/2} \right]$ as the credible interval for $\theta$. The common choice $\alpha = 0.05$ leads to the $95\%$ credible interval $(25.60, 37.10)$.
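The credible interval can also be obtained without recognizing the $t$-density at all: evaluate the unnormalized posterior $\left( \sum_i (x_i - \theta)^2 \right)^{-3}$ on a fine grid, normalize, and read off the 0.025 and 0.975 quantiles. A rough numerical sketch (the grid range and resolution are my arbitrary choices):

```python
x = [26.6, 38.5, 34.4, 34, 31, 23.6]

def post(theta):
    # Unnormalized posterior density of theta: (sum of squares)^(-3).
    return sum((xi - theta) ** 2 for xi in x) ** -3

# Fine grid covering a wide range of plausible theta values.
lo_grid, hi_grid, steps = 1.0, 61.0, 120000
h = (hi_grid - lo_grid) / steps
grid = [lo_grid + k * h for k in range(steps + 1)]
dens = [post(t) for t in grid]
total = sum(dens)

# Walk the (approximate) posterior CDF to find the central 95% interval.
cdf, lo_q, hi_q = 0.0, None, None
for t, d in zip(grid, dens):
    cdf += d / total
    if lo_q is None and cdf >= 0.025:
        lo_q = t
    if hi_q is None and cdf >= 0.975:
        hi_q = t

print(round(lo_q, 2), round(hi_q, 2))   # matches the t-based interval
```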
The Bayes analysis, which leads to an answer identical to that of the frequentist analysis, relies on the assumption (6) for the prior and the assumption (5) for the likelihood. Let us now compare these with the assumptions (1) (and (3)) underlying frequentist analysis:
- Priors: Bayesian analysis uses the assumption (6) on the prior densities and, of course, there are no such assumptions in frequentist analysis. In this particular case, the prior assumptions are reasonable, and principled arguments (based on invariance under transformations) exist to justify them as a valid mathematical representation of the state of complete ignorance, on the part of the scientist/data analyst, about $\theta$ and $\sigma$ outside of the observed data (see Chapter 12 of the book Probability Theory: The Logic of Science by E. T. Jaynes).
- Likelihood: The frequentist likelihood assumption is (1) while the Bayesian likelihood assumption is (5). Clearly (1) is much stronger than (5), as the former requires equality at all $(x_1, \dots, x_6) \in \mathbb{R}^6$ while the latter requires equality only at the observed values $x_1 = 26.6, \dots, x_6 = 23.6$. From the Bayesian viewpoint, imaginary unobserved datasets are irrelevant to the problem at hand, as one conditions on $X_1 = x_1, \dots, X_6 = x_6$ throughout the analysis. From the frequentist viewpoint, however, alternative datasets become relevant, and one needs explicit assumptions on the probabilities of observing them. Many textbooks write the Bayesian likelihood assumption as $X_1, \dots, X_6 \mid \theta, \sigma \overset{\text{i.i.d.}}{\sim} N(\theta, \sigma^2)$. This should be avoided, as it refers to (1) while the actual Bayesian assumption is the much weaker (5). As I already mentioned, the $t$-distribution result in Freq-Step 3 cannot be obtained using the much weaker assumption (5).
- More on likelihood: Let us now compare the alternative frequentist likelihood assumption (3) with (5). These cannot be directly compared, but one can still note that (3) is assumed for all $(x_1, \dots, x_n) \in \mathbb{R}^n$ while (5) is assumed only for the observed dataset. Besides, frequentist analysis under the assumption (3) relies on asymptotics by sending the sample size $n$ to $\infty$, and (3) is a very strong assumption (although it is weaker than (1)) when $n$ is large; indeed, how does one know that a high-dimensional density can be written in that product form for all $(x_1, \dots, x_n) \in \mathbb{R}^n$? One cannot use any asymptotic results such as the central limit theorem under the Bayesian likelihood assumption (5).
- As functions of $\theta$ and $\sigma$: One may argue that, as a function of the parameters, (1) is weaker than (5) because (1) only needs to hold for the true values of $\theta$ and $\sigma$ while (5) needs to hold for all $\theta$ and $\sigma$. This is true in the example considered here, but frequentist analysis also usually treats likelihoods as functions of the parameters (for example, while calculating maximum likelihood estimators).
Strong assumptions are usually easy to invalidate, and (1) can be invalidated easily when $n$ is large (the histogram of the observed data can be plotted and checked for closeness to a normal density). The assumption (3) is more complicated, and people often just assume that it must be true (but it is still a strong assumption because it needs to hold for all points in the sample space). On the other hand, how would one invalidate the very weak Bayesian likelihood assumption (5)?
An interesting aspect of the specific problem considered here is that both the frequentist and Bayesian approaches lead to the same answer. Specifically, both approaches assert that $\theta \in (25.60, 37.10)$ at the $95\%$ level. Now $\theta$ is a fixed physical constant that either lies in the interval $(25.60, 37.10)$ or does not. In spite of this, one would like to associate a probability of 0.95 with the proposition $\theta \in (25.60, 37.10)$. From a Bayesian perspective, this is not a problem, because the scientist/data analyst solving this problem is uncertain about the proposition, and any uncertainty can be quantified by a conditional probability (here the conditioning is on $X_1 = x_1, \dots, X_6 = x_6$). From a frequentist point of view, however, there is no meaning to the statement “$\theta \in (25.60, 37.10)$ holds with probability 0.95”, because this proposition is either always true or always false, and it cannot be given any meaning in terms of repeated trials. To salvage this problem, the frequentist goes back to the statement which produced this interval:
\[
\mathbb{P}\left\{ \bar{X} - t_{5, 0.025} \frac{S}{\sqrt{6}} \le \theta \le \bar{X} + t_{5, 0.025} \frac{S}{\sqrt{6}} \right\} = 0.95.
\]
This statement (written in terms of the random variables $\bar{X}$ and $S$ instead of their observed values $\bar{x}$ and $s$) can indeed be assigned a probability by the frequentist. This probability assignment is with respect to the random variables $X_1, \dots, X_6$, so the frequentist is forced to consider other values that $X_1, \dots, X_6$ could have taken but did not take in this specific observed dataset. Such a need for considering imaginary alternative datasets does not arise in Bayesian analyses.
To summarize this post: frequentist methods make much more stringent assumptions about the probabilities of observing data than Bayesian methods do. On the other hand, Bayesian methods involve prior assumptions, which frequentist methods have no need for. Which analysis are you more comfortable with?