# Upper bounds for mutual information

Consider a pair of jointly distributed random variables ${\Theta}$ and ${X}$. We think of ${\Theta}$ as the unknown parameter and ${X}$ as the data from which we have to draw inference about ${\Theta}$. Assume that the marginal distribution of ${\Theta}$ is uniform over a finite set ${F}$ of size ${N}$ i.e., ${\mathop{\mathbb P} \{\Theta = \theta\} = 1/N}$ for each ${\theta \in F}$.

The mutual information between ${\Theta}$ and ${X}$, denoted by ${I(\Theta, X)}$, is defined as the Kullback-Leibler divergence (or relative entropy) between the joint distribution of ${\Theta}$ and ${X}$ and the product of their marginal distributions. Recall that the Kullback-Leibler divergence between two probability measures ${P}$ and ${Q}$ is given by ${D(P||Q) := \int p \log(p/q) d\mu}$ where ${p}$ and ${q}$ denote densities of ${P}$ and ${Q}$ with respect to a common measure ${\mu}$.

A clean expression for ${I(\Theta, X)}$ can be written down in terms of the conditional distributions, ${P_{\theta}}$, of ${X}$ conditional on ${\Theta = \theta}$:

$\displaystyle I(\Theta, X) = \frac{1}{N} \sum_{\theta \in F} D(P_{\theta}||\bar{P})$

where ${\bar{P} := \frac{1}{N} \sum_{\theta \in F} P_{\theta}}$.

Mutual information is fundamental to information theory. It also appears in many places in statistics. For example, it provides, via Fano’s inequality, a lower bound on how well ${\Theta}$ can be estimated on the basis of ${X}$:

$\displaystyle \inf_{\hat{\Theta}(X)} \mathop{\mathbb P} \left\{\Theta \neq \hat{\Theta}(X) \right\} \geq 1 - \frac{I(\Theta; X) + \log 2}{\log N}. \ \ \ \ \ (1)$

Here ${\hat{\Theta}(X)}$ is any estimator for ${\Theta}$ based on ${X}$ i.e., any function of ${X}$ that takes values in ${F}$.

Determining ${I(\Theta, X)}$ exactly is usually intractable however and one typically works with appropriate bounds on ${I(\Theta, X)}$. This post is about upper bounds on ${I(\Theta, X)}$. Note that an upper bound on ${I(\Theta, X)}$ provides a lower bound for the testing error in (1).

Most bounds for ${I(\Theta; X)}$ are based on the identity

$\displaystyle I(\Theta, X) = \frac{1}{N} \inf_Q \sum_{\theta \in F} D(P_{\theta} || Q) \ \ \ \ \ (2)$

which is a consequence of the fact that ${\sum_{\theta \in F} D(P_{\theta} || Q) = \sum_{\theta \in F} D(P_{\theta} || \bar{P}) + N D(\bar{P} || Q) }$ for every ${Q}$. Different choices of ${Q}$ in (2) give different upper bounds on ${I(\Theta, X)}$. One gets, for example,

$\displaystyle I(\Theta, X) \leq \min_{\theta' \in F} \frac{1}{N} \sum_{\theta \in F} D(P_{\theta} || P_{\theta'}) \leq \frac{1}{N^2} \sum_{\theta, \theta' \in F} D(P_{\theta} || P_{\theta'}) \leq \max_{\theta, \theta' \in F} D(P_{\theta} || P_{\theta'}) . \ \ \ \ \ (3)$

These bounds are very frequently used in conjunction with Fano’s inequality (1). The last bound ${\max_{\theta, \theta' \in F} D(P_{\theta} || P_{\theta'})}$ is called the Kullback-Leibler diameter of ${\{P_{\theta} : \theta \in F\}}$.

I will try to argue below that the bounds in (3) are, in general, quite inaccurate and describe some improved bounds due to Yang and Barron [1] and Haussler and Opper [2].

To demonstrate that these are bad bounds, consider the following alternate, equally crappy, bound for ${I(\Theta, X)}$:

$\displaystyle I(\Theta, X) \leq \log N. \ \ \ \ \ (4)$

To see why this is true, just note that ${D(P_{\theta} || \bar{P}) \leq \log N}$ for each ${\theta \in F}$ which is a consequence of the fact that the density of ${P_{\theta}}$ with respect to ${\bar{P}}$ is trivially bounded from above by ${N}$.

The bounds (3) and (4) are complementary; (3) only involves the pairwise Kullback-Leibler divergences ${D(P_{\theta}||P_{\theta'})}$ for ${\theta, \theta' \in F}$ while the bound (4) only considers the cardinality of ${F}$. To see why (3) can be inaccurate, just consider the simple case when ${F = \{\theta, \theta'\}}$ with ${D(P_{\theta}||P_{\theta'})}$ much larger than ${\log 2}$. On the other hand, (4) can be a crappy bound when ${\{P_{\theta}: \theta \in F\}}$ is a set with large cardinality and small Kullback-Leibler diameter.

Yang and Barron [1] proved the following upper bound for ${I(\Theta, X)}$ that is a simultaneous improvement of both (3) and (4). For every finite set of probability measures ${\{Q_{\alpha}, \alpha \in G\}}$ with ${|G| = M}$, we have

$\displaystyle I(\Theta, X) \leq \log M + \frac{1}{N} \sum_{\theta \in F} \min_{\alpha \in G} D(P_{\theta} || Q_{\alpha}). \ \ \ \ \ (5)$

To see why (5) is an improvement of (3), just fix ${\theta' \in F}$ and take ${\{Q_{\alpha}, \alpha \in G\}}$ to be the singleton probability measure ${\{P_{\theta'}\}}$ so that ${M = 1}$. Then write down the bound given by (5) and then take infimum over ${\theta' \in F}$. To see why (5) is an improvement of (4), just take ${\{Q_{\alpha}, \alpha \in G\}}$ to be identical to the class ${\{P_{\theta}, \theta \in F\}}$.

Yang and Barron [1] crucially used (5) in conjunction with Fano’s inequality to give simple proofs of minimax lower bounds in many classical nonparametric estimation problems including density estimation and nonparametric regression. Their arguments actually apply to any situation where one has accurate upper and lower bounds on the global metric entropy numbers of the parameter space. They also point out that the weaker bounds in (3) cannot be used to yield optimal minimax lower bounds that depend on global metric entropy properties alone. I will try to explain these in another post.

Inequality (5) has an almost trivial proof. Let ${p_{\theta}}$ denote the density of ${P_{\theta}}$ and ${q_{\alpha}}$ denote the density of ${Q_{\alpha}}$, all with respect to a common dominating measure ${\mu}$. Let ${\bar{Q} := \sum_{\alpha \in G} Q_{\alpha}/M}$ and note, from (4), that ${I(\Theta, X) \leq \sum_{\theta \in F} D(P_{\theta} || \bar{Q})/N}$. Observe now that the ratio of the densities of ${P_{\theta}}$ and ${\bar{Q}}$ is bounded from above by ${M \min_{\alpha \in G} p_{\theta}/q_{\alpha}}$ which should prove (5).

It turns out that (5) is a consequence of the following bound for ${I(\Theta, X)}$ due to Haussler and Opper [2]:

$\displaystyle I(\Theta, X) \leq - \frac{1}{N} \sum_{\theta \in F} \log \left(\frac{1}{M} \sum_{\alpha \in G} \exp \left(- D(P_{\theta}||Q_{\alpha}) \right) \right). \ \ \ \ \ (6)$

Indeed, one obtains (5) from (6) by replacing the inner sum in the right hand side above by the smallest of the terms. Inequality (6) is proved by a very clever application of Jensen’s inequality. First observe that because ${I(\Theta, X) \leq \sum_{\theta \in F} D(P_{\theta} || \bar{Q})}$, it is enough to show that

$\displaystyle D(P_{\theta} || \bar{Q}) \leq -\log \left(\frac{1}{M} \sum_{\alpha \in G} \exp \left(- D(P_{\theta}||Q_{\alpha}) \right) \right)$

for every ${\theta \in F}$. Write

$\displaystyle D(P_{\theta} || \bar{Q}) = - \int p_{\theta} \log \left( \frac{1}{M} \sum_{\alpha \in G} \frac{q_{\alpha}}{p_{\theta}} \right) = - \int p_{\theta} \log \left( \frac{1}{M} \sum_{\alpha \in G} \exp \left(\log \frac{q_{\alpha}}{p_{\theta}} \right) \right).$

Consider now the mapping

$\displaystyle F : (u_{\alpha}, \alpha \in G) \mapsto -\log \left(\sum_{\alpha \in G} e^{u_{\alpha}}/M \right).$

so that

$\displaystyle D(P_{\theta} || \bar{Q}) = \int p_{\theta} F \left(\log \frac{q_{\alpha}}{p_{\theta}}, \alpha \in G \right).$

The crucial point here is that this mapping is concave. Inequality (6) then follows by Jensen’s inequality. To see why ${F}$ is concave, use Holder’s inequality:

$\displaystyle \frac{1}{M} \sum_{\alpha \in G} \left(e^{u_{\alpha}}\right)^{1-\gamma} \left(e^{v_{\alpha}}\right)^{\gamma} \leq \left(\frac{1}{M} \sum_{\alpha \in G} e^{u_{\alpha}} \right)^{1-\gamma} \left(\frac{1}{M} \sum_{\alpha \in G} e^{v_{\alpha}} \right)^{\gamma}$

and then take logs on both sides. Haussler and Opper [2] also proved a very nice lower bound for ${I(\Theta, X)}$ that looks very similar to the upper bound in (6); that might be the topic of a future post.

References

[1] D. Haussler and M. Opper. Mutual information, metric entropy and cumulative relative entropy risk. Annals of Statistics, 25: 2451-2492, 1997.

[2] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27: 1564-1599, 1999.