Mutual Information and Representation Learning

February 01, 2020

Preliminaries

Data Processing Inequality

Mutual Information Maximization

In representation learning [1] we want to encode a high-dimensional input into a representation that carries as much information as possible about the input. This translates to maximizing the mutual information between the input and the representation.

I(x; c) = \sum_{x,c}{p(x,c)\log\frac{p(x,c)}{p(x)p(c)}} = \sum_{x,c}{p(x,c)\log\frac{p(x|c)}{p(x)}}

Often this summation is not tractable because of the cardinality of X and C.
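When X and C are small discrete spaces, however, the sum can be evaluated directly. Below is a minimal sketch of that computation, assuming a toy 2x2 joint table (the values and the name p_xc are purely illustrative):

```python
import numpy as np

# Toy joint distribution p(x, c) over two binary variables (illustrative values).
p_xc = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xc.sum(axis=1, keepdims=True)  # marginal p(x), shape (2, 1)
p_c = p_xc.sum(axis=0, keepdims=True)  # marginal p(c), shape (1, 2)

# I(x; c) = sum_{x,c} p(x,c) * log( p(x,c) / (p(x) p(c)) )
mi = np.sum(p_xc * np.log(p_xc / (p_x * p_c)))
print(mi)  # about 0.19 nats for this joint
```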

Deriving the Contrastive Predictive Coding loss

We want to model

f(x, c) \propto \frac{p(x|c)}{p(x)}

and we set it to

f(x, c) = \exp(z^T W c)
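Here z is the encoder output for x and W is a learned matrix, so the score is log-bilinear in the representation and the context. The snippet below is a minimal sketch of how the resulting InfoNCE objective is often computed over a batch, assuming the i-th encoding is the positive example for the i-th context and the remaining encodings in the batch act as negatives; all names and shapes are illustrative:

```python
import numpy as np

def info_nce_loss(z, c, W):
    """Sketch of the InfoNCE loss for a batch of (encoding, context) pairs.

    z : (N, d_z) encodings, c : (N, d_c) context vectors,
    W : (d_z, d_c) learned matrix, so that log f(x, c) = z^T W c.
    The i-th encoding is the positive for the i-th context; the other
    encodings in the batch serve as negatives.
    """
    scores = z @ W @ c.T                                  # (N, N), entry [i, j] = z_i^T W c_j
    scores = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=0, keepdims=True))
    # Cross-entropy that pushes each matching pair (the diagonal) above
    # the negatives scored against the same context.
    return -np.mean(np.diag(log_probs))

# Usage with random placeholder data.
rng = np.random.default_rng(0)
N, d_z, d_c = 8, 16, 16
z = rng.normal(size=(N, d_z))
c = rng.normal(size=(N, d_c))
W = rng.normal(size=(d_z, d_c))
print(info_nce_loss(z, c, W))
```

Minimizing this loss maximizes a lower bound on I(x; c), which is what makes it a practical surrogate for the intractable sum above.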

Mutual Information Minimization


Mutual information minimization arises in information-bottleneck settings, where the representation should retain as little information about the input as possible while remaining predictive of the target [2].
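One way to make such a mutual information term usable as a training signal is the approach of [2], which estimates it with the Donsker-Varadhan lower bound

I(x; z) \geq \mathbb{E}_{p(x,z)}[T(x, z)] - \log \mathbb{E}_{p(x)p(z)}[e^{T(x, z)}]

where T is a critic optimized to tighten the bound. The sketch below only evaluates the bound for a fixed, hand-picked critic on correlated Gaussian samples; in MINE the critic would be a neural network trained to maximize this quantity, and every name here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated Gaussian pairs: z = x + noise, so the true I(x; z) is about 0.8 nats.
n = 10_000
x = rng.normal(size=n)
z = x + 0.5 * rng.normal(size=n)

def critic(x, z):
    # Hand-picked critic T(x, z) that rewards x and z agreeing;
    # in MINE this would be a trained neural network.
    return -0.5 * (x - z) ** 2

# Donsker-Varadhan lower bound: E_{p(x,z)}[T] - log E_{p(x)p(z)}[exp(T)],
# with the product of marginals approximated by shuffling z within the batch.
joint_term = np.mean(critic(x, z))
marginal_term = np.log(np.mean(np.exp(critic(x, rng.permutation(z)))))
print(joint_term - marginal_term)  # roughly 0.46, a lower bound on the true MI
```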

[1] On mutual information maximization for representation learning
Tschannen, M., Djolonga, J., Rubenstein, P.K., Gelly, S. and Lucic, M., 2019. arXiv preprint arXiv:1907.13625.
[2] Mutual information neural estimation
Belghazi, M.I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A. and Hjelm, D., 2018. International Conference on Machine Learning, pp. 531-540.