Introduction

Both variance and entropy are measures of uncertainty. Variance assumes values vary as points in a space separated by distances. In this document, the variance of a random vector refers to the expected squared distance from its mean, which equals the sum of the variances of its components.
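As a quick numerical check of this convention (a minimal sketch assuming NumPy; the sampled vectors are arbitrary and only for illustration), the mean squared distance of samples from their mean matches the sum of the per-component variances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sample of 3-dimensional random vectors (for illustration only).
samples = rng.normal(size=(100_000, 3)) * np.array([1.0, 2.0, 0.5])

mean = samples.mean(axis=0)

# Mean squared distance of the samples from their mean ...
mean_sq_dist = ((samples - mean) ** 2).sum(axis=1).mean()

# ... equals the sum of the variances of the individual components.
sum_component_vars = samples.var(axis=0).sum()

print(mean_sq_dist, sum_component_vars)
print(np.isclose(mean_sq_dist, sum_component_vars))  # True
```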

Random one-hot vectors are a convenient spatial representation for categorical random variables. A one-hot vector has all components equal to 0 except one component that equals 1. This representation has been used in genetics [1]. For genetic loci with only two alleles, the one-hot vector has two mutually redundant components. "Half" of such one-hot vectors are typically used in genetics (e.g. [2] p. 40, [3], [4]). The variance of the "half one-hot vector" is exactly half the variance of its full one-hot vector.
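A minimal sketch of the biallelic case (the allele frequency 0.3 is an arbitrary choice for illustration): the full length-2 one-hot vector has variance $2p(1-p)$, while the single retained component of the "half" vector is Bernoulli with variance $p(1-p)$.

```python
import numpy as np

p = 0.3                       # illustrative allele frequency
probs = np.array([p, 1 - p])  # two alleles -> one-hot vectors of length 2

# Variance of the full one-hot vector: 1 - sum of squared probabilities = 2p(1-p).
var_full = 1.0 - np.sum(probs ** 2)

# "Half" one-hot vector: keep a single indicator component,
# i.e. a Bernoulli(p) variable with variance p(1-p).
var_half = p * (1 - p)

print(var_full, 2 * var_half, np.isclose(var_full, 2 * var_half))  # half the variance
```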

Main Result

Given $N$ independent random one-hot vectors $X_1, X_2, \dots, X_N$, define $X_* = X_1 \times X_2 \times \dots \times X_N$ as the Cartesian product.

The variance of $X_*$ can be adjusted to form a lower bound to the collision entropy, $\operatorname{H}_2(X_*)$, and Shannon entropy, $\operatorname{H}(X_*)$:

$$ -\log_2{\left( 1 - \frac{\operatorname{Var}(X_*)}{N} \right)} \;\le\; \frac{\operatorname{H}_2(X_*)}{N} \;\le\; \frac{\operatorname{H}(X_*)}{N} $$

If every $X_i$ takes only two equally likely values, then the lower bounds reach equality:

$$ -\log_2{\left( 1 - \frac{\operatorname{Var}(X_*)}{N} \right)} = \frac{\operatorname{H}_2(X_*)}{N} = \frac{\operatorname{H}(X_*)}{N} = 1 $$
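The main result can be sanity-checked numerically. The sketch below (assuming NumPy, with randomly generated category counts and probabilities chosen only for illustration) computes the adjusted variance, the collision entropy, and the Shannon entropy, and verifies the chain of inequalities.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 5  # number of independent one-hot vectors (illustrative)

# Random categorical distributions with 2 to 6 categories each.
dists = [rng.dirichlet(np.ones(rng.integers(2, 7))) for _ in range(N)]

# Var(X_*) is the sum of Var(X_i) = 1 - sum_j p_{i,j}^2.
var_total = sum(1.0 - np.sum(p ** 2) for p in dists)

# Collision and Shannon entropy are additive over the independent X_i.
H2 = sum(-np.log2(np.sum(p ** 2)) for p in dists)
H = sum(-np.sum(p * np.log2(p)) for p in dists)

lower_bound = -np.log2(1.0 - var_total / N)

print(lower_bound, H2 / N, H / N)
assert lower_bound <= H2 / N + 1e-12
assert H2 / N <= H / N + 1e-12
```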

Proof

Let $M_i$ be the length of $X_i$ (the number of categorical values represented by $X_i$). Let $p_{i,j}$ denote the probability of $X_i$ taking the $j$-th categorical value.

For every $1 \le i \le N$,

$$ \sum_{j=1}^{M_i} p_{i,j} = 1 $$

The expectation and variance of the $i$-th one-hot vector $X_i$ are

$$ \begin{aligned} \operatorname{E}\!\left(X_i\right) & = \left(\; p_{i,1} \;,\; p_{i,2} \;,\; \dots \;,\; p_{i,M_i} \;\right) \\ \operatorname{Var}(X_i) & = \sum_{j=1}^{M_i} p_{i,j} \left[ (1 - p_{i,j})^2 + \sum_{k \ne j} (0 - p_{i,k})^2 \right] \\ & = \sum_{j=1}^{M_i} p_{i,j} \left[ 1 - 2 p_{i,j} + \sum_{k=1}^{M_i} p_{i,k}^2 \right] \\ & = 1 - 2 \sum_{j=1}^{M_i} p_{i,j}^2 + \sum_{k=1}^{M_i} p_{i,k}^2 \\ & = 1 - \sum_{j=1}^{M_i} p_{i,j}^2 \end{aligned} $$

Thus the variance of $X_i$ equals the probability of two independent samples from $X_i$ being distinct. This probability of distinction has been called logical entropy [5].
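This identity can be illustrated empirically (a sketch assuming NumPy; the three-category distribution is arbitrary): the summed component variances of sampled one-hot vectors approximate the frequency with which two independent draws disagree.

```python
import numpy as np

rng = np.random.default_rng(1)

probs = np.array([0.5, 0.3, 0.2])  # illustrative categorical distribution
n = 200_000

# Sample categories and their one-hot encodings.
cats = rng.choice(len(probs), size=n, p=probs)
one_hot = np.eye(len(probs))[cats]

# Variance of the one-hot vector: sum of its component variances.
var_one_hot = one_hot.var(axis=0).sum()

# Frequency with which two independent samples are distinct ("logical entropy").
cats2 = rng.choice(len(probs), size=n, p=probs)
p_distinct = np.mean(cats != cats2)

print(var_one_hot, p_distinct, 1 - np.sum(probs ** 2))  # all approximately 0.62
```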

The complement

$$ 1 - \operatorname{Var}(X_i) = \sum_{j=1}^{M_i} p_{i,j}^2 $$

is the chance of repetition, which is an expected probability. Taking the negative log gives the Rényi entropy of order 2, also called collision entropy:

$$ -\log_2{\left( 1 - \operatorname{Var}(X_i) \right)} = -\log_2{\left( \sum_{j=1}^{M_i} p_{i,j}^2 \right)} = \operatorname{H}_2(X_i) $$

Since the negative log is a convex function, the negative log of the expected probability (collision entropy) is, by Jensen's inequality, a lower bound to the expected negative log of probability (Shannon entropy):

$$ \operatorname{H}_2(X_i) = -\log_2{\left( \sum_{j=1}^{M_i} p_{i,j}^2 \right)} \le \sum_{j=1}^{M_i} p_{i,j} \left( -\log_2{p_{i,j}} \right) = \operatorname{H}(X_i) $$
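For concreteness, the following sketch (with made-up distributions) computes both entropies and shows $\operatorname{H}_2 \le \operatorname{H}$, with equality in the uniform case:

```python
import numpy as np

def collision_entropy(p):
    """Rényi entropy of order 2 (collision entropy), in bits."""
    return -np.log2(np.sum(p ** 2))

def shannon_entropy(p):
    """Shannon entropy in bits (assumes strictly positive probabilities)."""
    return -np.sum(p * np.log2(p))

skewed = np.array([0.7, 0.2, 0.1])
uniform = np.array([1 / 3, 1 / 3, 1 / 3])

print(collision_entropy(skewed), shannon_entropy(skewed))    # H2 < H
print(collision_entropy(uniform), shannon_entropy(uniform))  # both equal log2(3)
```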

The total variance can be adjusted to equal the average probability of one-hot vector repetition (per one-hot vector):

$$ 1 - \frac{\operatorname{Var}(X_*)}{N} = 1 - \frac{1}{N} \sum_{i=1}^N \operatorname{Var}(X_i) = \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^{M_i} p_{i,j}^2 $$

Applying the negative log with Jensen's inequality then establishes yet another lower bound:

$$ -\log_2{\left( \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^{M_i} p_{i,j}^2 \right)} \le \frac{1}{N} \sum_{i=1}^N \left( -\log_2{\sum_{j=1}^{M_i} p_{i,j}^2} \right) = \frac{1}{N} \sum_{i=1}^N \operatorname{H}_2(X_i) $$
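The Jensen step above can be checked on arbitrary example values of the per-vector collision probabilities $\sum_{j} p_{i,j}^2$ (a minimal sketch assuming NumPy):

```python
import numpy as np

# Hypothetical per-vector collision probabilities sum_j p_{i,j}^2 (illustration only).
collision_probs = np.array([0.50, 0.34, 0.26, 0.58])

# Negative log of the average vs. average of the negative logs (Jensen's inequality).
lhs = -np.log2(collision_probs.mean())
rhs = np.mean(-np.log2(collision_probs))

print(lhs, rhs, lhs <= rhs)  # True, since -log2 is convex
```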

Collision and Shannon entropy are additive for independent variables, so $\sum_{i=1}^N \operatorname{H}_2(X_i) = \operatorname{H}_2(X_*)$ and $\sum_{i=1}^N \operatorname{H}(X_i) = \operatorname{H}(X_*)$. Putting everything together, we get

$$ -\log_2{\left( 1 - \frac{\operatorname{Var}(X_*)}{N} \right)} \;\le\; \frac{\operatorname{H}_2(X_*)}{N} \;\le\; \frac{\operatorname{H}(X_*)}{N} $$
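The additivity used in this last step can be verified directly on a small example (a sketch; the two distributions are arbitrary), by comparing the entropies of an independent joint distribution with the sums of the marginal entropies:

```python
import numpy as np

def collision_entropy(p):
    return -np.log2(np.sum(p ** 2))

def shannon_entropy(p):
    return -np.sum(p * np.log2(p))

# Two independent categorical distributions (arbitrary examples).
p1 = np.array([0.6, 0.4])
p2 = np.array([0.5, 0.3, 0.2])

# Joint distribution of the independent pair: outer product, flattened.
joint = np.outer(p1, p2).ravel()

print(collision_entropy(joint), collision_entropy(p1) + collision_entropy(p2))
print(shannon_entropy(joint), shannon_entropy(p1) + shannon_entropy(p2))
```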

References

1. Menozzi P, Piazza A, Cavalli-Sforza L (1978) Synthetic maps of human gene frequencies in Europeans. Science 201:786–792. https://doi.org/10.1126/science.356262
2. Weir BS (1996) Genetic data analysis II: Methods for discrete population genetic data. Sinauer Associates, Sunderland, MA
3. Weir BS, Hill WG (2002) Estimating F-statistics. Annual Review of Genetics 36:721–750. https://doi.org/10.1146/annurev.genet.36.050802.093940
4. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS Genetics 2:e190. https://doi.org/10.1371/journal.pgen.0020190
5. Ellerman D (2017) Logical information theory: New logical foundations for information theory. Logic Journal of the IGPL 25:806–835. https://doi.org/10.1093/jigpal/jzx022