Consider a pair of jointly distributed random variables and . We think of as the unknown parameter and as the data from which we have to draw inference about . Assume that the marginal distribution of is uniform over a finite set of size i.e., for each .

The *mutual information* between and , denoted by , is defined as the Kullback-Leibler divergence (or relative entropy) between the joint distribution of and and the product of their marginal distributions. Recall that the Kullback-Leibler divergence between two probability measures and is given by where and denote densities of and with respect to a common measure .

A clean expression for can be written down in terms of the conditional distributions, , of conditional on :

where .

Mutual information is fundamental to information theory. It also appears in many places in statistics. For example, it provides, via Fano’s inequality, a lower bound on how well can be estimated on the basis of :

Here is any estimator for based on i.e., any function of that takes values in .

Determining exactly is usually intractable however and one typically works with appropriate bounds on . This post is about upper bounds on . Note that an upper bound on provides a lower bound for the testing error in (1).

Most bounds for are based on the identity

which is a consequence of the fact that for every . Different choices of in (2) give different upper bounds on . One gets, for example,

These bounds are very frequently used in conjunction with Fano’s inequality (1). The last bound is called the Kullback-Leibler diameter of .

I will try to argue below that the bounds in (3) are, in general, quite inaccurate and describe some improved bounds due to Yang and Barron [1] and Haussler and Opper [2].

To demonstrate that these are bad bounds, consider the following alternate, equally crappy, bound for :

To see why this is true, just note that for each which is a consequence of the fact that the density of with respect to is trivially bounded from above by .

The bounds (3) and (4) are complementary; (3) only involves the pairwise Kullback-Leibler divergences for while the bound (4) only considers the cardinality of . To see why (3) can be inaccurate, just consider the simple case when with much larger than . On the other hand, (4) can be a crappy bound when is a set with large cardinality and small Kullback-Leibler diameter.

Yang and Barron [1] proved the following upper bound for that is a simultaneous improvement of both (3) and (4). For every finite set of probability measures with , we have

To see why (5) is an improvement of (3), just fix and take to be the singleton probability measure so that . Then write down the bound given by (5) and then take infimum over . To see why (5) is an improvement of (4), just take to be identical to the class .

Yang and Barron [1] crucially used (5) in conjunction with Fano’s inequality to give simple proofs of minimax lower bounds in many classical nonparametric estimation problems including density estimation and nonparametric regression. Their arguments actually apply to any situation where one has accurate upper and lower bounds on the global metric entropy numbers of the parameter space. They also point out that the weaker bounds in (3) cannot be used to yield optimal minimax lower bounds that depend on global metric entropy properties alone. I will try to explain these in another post.

Inequality (5) has an almost trivial proof. Let denote the density of and denote the density of , all with respect to a common dominating measure . Let and note, from (4), that . Observe now that the ratio of the densities of and is bounded from above by which should prove (5).

It turns out that (5) is a consequence of the following bound for due to Haussler and Opper [2]:

Indeed, one obtains (5) from (6) by replacing the inner sum in the right hand side above by the smallest of the terms. Inequality (6) is proved by a very clever application of Jensen’s inequality. First observe that because , it is enough to show that

for every . Write

Consider now the mapping

so that

The crucial point here is that this mapping is concave. Inequality (6) then follows by Jensen’s inequality. To see why is concave, use Holder’s inequality:

and then take logs on both sides. Haussler and Opper [2] also proved a very nice lower bound for that looks very similar to the upper bound in (6); that might be the topic of a future post.

**References**

[1] D. Haussler and M. Opper. Mutual information, metric entropy and cumulative relative entropy risk. *Annals of Statistics*, 25: 2451-2492, 1997.

[2] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. *Annals of Statistics, 27: 1564-1599, 1999. *