Mutual Information and Representation Learning

February 01, 2020

Preliminaries

Data Processing Inequality

Mutual Information Maximization

In representation learning [1], we want to encode a high-dimensional input into a representation that carries as much information about the input as possible. This translates to maximizing the mutual information between the input and the representation.

I(x; c) = \sum_{x,c}{p(x,c)\log\frac{p(x,c)}{p(x)p(c)}} = \sum_{x,c}{p(x,c)\log\frac{p(x|c)}{p(x)}}

Often this summation is not tractable because of the cardinality of $X$ and $C$.
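To make the summation concrete, here is a toy sketch (NumPy; the joint table is made up for illustration) that evaluates the sum exactly for binary $X$ and $C$. For high-dimensional inputs the number of terms explodes, which is why the sum above is intractable in practice.

```python
import numpy as np

# Hypothetical joint distribution p(x, c) over binary X and C.
p_xc = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_x = p_xc.sum(axis=1, keepdims=True)  # marginal p(x), shape (2, 1)
p_c = p_xc.sum(axis=0, keepdims=True)  # marginal p(c), shape (1, 2)

# I(x; c) = sum_{x,c} p(x,c) * log( p(x,c) / (p(x) p(c)) )
mi = np.sum(p_xc * np.log(p_xc / (p_x * p_c)))
print(mi)  # ~0.086 nats
```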

Deriving the Contrastive Predictive Coding loss

We want to model

f(x, c) \propto \frac{p(x|c)}{p(x)}

and we set it to

f(x, c) = \exp(z^T W c)

where $z$ is the encoded representation of $x$ and $W$ is a learned bilinear transformation.
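Scoring each pair with $f$ and taking a softmax over one positive pair and a set of negatives yields the InfoNCE objective, which lower-bounds $I(x; c)$ [1]. Below is a minimal sketch (assumed PyTorch; the batch size, dimensions, and in-batch negative sampling are illustrative assumptions, not the exact CPC setup):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, c, W):
    """z: (B, d_z) encodings of the targets x, c: (B, d_c) context vectors,
    W: (d_z, d_c) learned bilinear map. Each (z_i, c_i) pair is the positive;
    the other B-1 encodings z_j in the batch act as negatives for c_i."""
    logits = z @ W @ c.t()             # logits[i, j] = z_i^T W c_j = log f(x_i, c_j)
    labels = torch.arange(z.size(0))   # the positive for context c_i is z_i
    return F.cross_entropy(logits.t(), labels)

# Toy usage with random tensors:
B, d_z, d_c = 8, 16, 16
z = torch.randn(B, d_z)
c = torch.randn(B, d_c)
W = torch.randn(d_z, d_c, requires_grad=True)
loss = info_nce_loss(z, c, W)
```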

Mutual Information Minimization

Mutual information minimization occurs in several information bottleneck settings [2], where one wants a representation $z$ that keeps little information about the input $x$ (small $I(x; z)$) while remaining informative about the target.
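One common tool for such a term is a neural estimator of mutual information such as MINE [2]. The following is a hedged sketch (assumed PyTorch; the critic architecture and names are made up) of the Donsker-Varadhan lower bound that MINE optimizes; in a bottleneck-style objective, the resulting estimate of $I(x; z)$ could be added as a penalty to be minimized.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Hypothetical statistics network T(x, z) for the Donsker-Varadhan bound."""
    def __init__(self, d_x, d_z, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_x + d_z, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_mi_estimate(critic, x, z):
    """I(x; z) >= E_{p(x,z)}[T(x,z)] - log E_{p(x)p(z)}[exp(T(x,z))]."""
    joint = critic(x, z).mean()             # paired samples from p(x, z)
    z_marg = z[torch.randperm(z.size(0))]   # shuffle z to mimic p(x)p(z)
    marginal = torch.logsumexp(critic(x, z_marg), dim=0) - torch.log(
        torch.tensor(float(x.size(0))))
    return joint - marginal

# Toy usage with random tensors:
x = torch.randn(128, 10)
z = torch.randn(128, 4)
critic = Critic(d_x=10, d_z=4)
mi_lower_bound = dv_mi_estimate(critic, x, z)
```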

  1. On mutual information maximization for representation learning
    Tschannen, M., Djolonga, J., Rubenstein, P.K., Gelly, S. and Lucic, M., 2019. arXiv preprint arXiv:1907.13625.
  2. Mutual information neural estimation
    Belghazi, M.I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A. and Hjelm, D., 2018. International Conference on Machine Learning, pp. 531--540.