Mutual Information and Representation Learning

February 01, 2020

Preliminaries

Data Processing Inequality

Mutual Information Maximization

In representation learning [1], we want to encode a high-dimensional input into a representation that carries as much information about the input as possible. This translates to maximizing the mutual information between the input and the representation.

I(x; c) = \sum_{x,c}{p(x,c)\log\frac{p(x,c)}{p(x)p(c)}} = \sum_{x,c}{p(x,c)\log\frac{p(x|c)}{p(x)}}

Often this summation is not tractable because of the cardinality of $X$ and $C$.
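To make the summation concrete, here is a toy sketch (NumPy; the joint table is made up for illustration) that evaluates the sum exactly for binary $X$ and $C$. For high-dimensional inputs the number of terms explodes, which is why the sum above is intractable in practice.

```python
import numpy as np

# Hypothetical joint distribution p(x, c) over binary X and C.
p_xc = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_x = p_xc.sum(axis=1, keepdims=True)  # marginal p(x), shape (2, 1)
p_c = p_xc.sum(axis=0, keepdims=True)  # marginal p(c), shape (1, 2)

# I(x; c) = sum_{x,c} p(x,c) * log( p(x,c) / (p(x) p(c)) )
mi = np.sum(p_xc * np.log(p_xc / (p_x * p_c)))
print(mi)  # ~0.086 nats
```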

Deriving the Contrastive Predictive Coding loss

We want to model

f(x, c) \propto \frac{p(x|c)}{p(x)}

and we set it to

f(x, c) = \exp(z^T W c)

where $z$ is the encoded representation of $x$ and $W$ is a learned bilinear transformation.
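Scoring each pair with $f$ and taking a softmax over one positive pair and a set of negatives yields the InfoNCE objective, which lower-bounds $I(x; c)$ [1]. Below is a minimal sketch (assumed PyTorch; the batch size, dimensions, and in-batch negative sampling are illustrative assumptions, not the exact CPC setup):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, c, W):
    """z: (B, d_z) encodings of the targets x, c: (B, d_c) context vectors,
    W: (d_z, d_c) learned bilinear map. Each (z_i, c_i) pair is the positive;
    the other B-1 encodings z_j in the batch act as negatives for c_i."""
    logits = z @ W @ c.t()             # logits[i, j] = z_i^T W c_j = log f(x_i, c_j)
    labels = torch.arange(z.size(0))   # the positive for context c_i is z_i
    return F.cross_entropy(logits.t(), labels)

# Toy usage with random tensors:
B, d_z, d_c = 8, 16, 16
z = torch.randn(B, d_z)
c = torch.randn(B, d_c)
W = torch.randn(d_z, d_c, requires_grad=True)
loss = info_nce_loss(z, c, W)
```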

Mutual Information Minimization

Mutual information minimization occurs in several information bottleneck settings [2], where one wants a representation $z$ that keeps little information about the input $x$ (small $I(x; z)$) while remaining informative about the target.
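One common tool for such a term is a neural estimator of mutual information such as MINE [2]. The following is a hedged sketch (assumed PyTorch; the critic architecture and names are made up) of the Donsker-Varadhan lower bound that MINE optimizes; in a bottleneck-style objective, the resulting estimate of $I(x; z)$ could be added as a penalty to be minimized.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Hypothetical statistics network T(x, z) for the Donsker-Varadhan bound."""
    def __init__(self, d_x, d_z, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_x + d_z, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_mi_estimate(critic, x, z):
    """I(x; z) >= E_{p(x,z)}[T(x,z)] - log E_{p(x)p(z)}[exp(T(x,z))]."""
    joint = critic(x, z).mean()             # paired samples from p(x, z)
    z_marg = z[torch.randperm(z.size(0))]   # shuffle z to mimic p(x)p(z)
    marginal = torch.logsumexp(critic(x, z_marg), dim=0) - torch.log(
        torch.tensor(float(x.size(0))))
    return joint - marginal

# Toy usage with random tensors:
x = torch.randn(128, 10)
z = torch.randn(128, 4)
critic = Critic(d_x=10, d_z=4)
mi_lower_bound = dv_mi_estimate(critic, x, z)
```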

  1. On mutual information maximization for representation learning
    Tschannen, M., Djolonga, J., Rubenstein, P.K., Gelly, S. and Lucic, M., 2019. arXiv preprint arXiv:1907.13625.
  2. Mutual information neural estimation
    Belghazi, M.I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A. and Hjelm, D., 2018. International Conference on Machine Learning, pp. 531--540.