# Tensor embedding preserves Hamming distance

This is a proof that Tensor Embedding (Joudaki, Rätsch, and Kahles 2020) with $ℓ^2$-norm preserves the Hamming distance.

This is in collaboration with Amir Joudaki.


## Definitions

Notation
• The alphabet is $$\Sigma$$, of size $$|\Sigma| = \sigma$$.
• The set of indices is $$\I := \{(i_1, \dots, i_t) \in [n]^t: i_1 < \dots < i_t\}$$.
• Given a string $$a_1\dots a_n = a\in \Sigma^n$$, we define the $I$-index as $$a_I = (a_{i_1}, \dots, a_{i_t})$$.
• We write $$[ X ]$$ for the indicator variable of event $$X$$, which is $$1$$ when $$X$$ holds and $$0$$ otherwise.
Definition 1: Tensor embedding
Given $$a\in \Sigma^n$$, the tensor embedding $$T_a$$ is the $$\sigma^t$$ tensor given by $$T_a[s] = \sum_{I\in \I} [A_I = s]$$ for each $$s\in \Sigma^t$$.

The normalized tensor embedding distance $$d_{te}$$ between two sequences $$a$$ and $$b$$ is defined as

\begin{equation*} d_{te}(a,b) := \frac 12 \binom{n}{2t-1}^{-1}\cdot \|T_a - T_b\|_2^2. \end{equation*}

Lemma 1: Tensor embedding preserves Hamming distance under $$\ell^2$$ norm
Let $$a$$ be a uniform random sequence of length $$n$$ in $$\Sigma^n$$, and for a fixed mutation rate $$r\in [0,1]$$ let $$b$$ be a sequence where $$a_i$$ is substituted by a new character $$b_i \in Unif(\Sigma \backslash a_i)$$ with probability $$r$$ and $$b_i = a_i$$ otherwise. Then for $$n\gg 2t\sigma$$:

\begin{equation*} \EE_{a,b}[d_{te}(a,b)] = (4/\sigma)^{t-1} \cdot r + O(2t\sigma^{2-t}/n) \cdot r, \end{equation*}

which for DNA with $$\sigma=4$$ and fixed $$t$$ gives $$\EE[d_{te}(a,b)] = (1 + O(n^{-1})) \cdot r$$.

Lemma 2: Variance of Tensor embedding
In the same setting as Lemma 1,

\begin{equation*} \var_{a,b}[d_{te}(a,b)] = TODO. \end{equation*}

## Proof of Lemma 1

By definition we have

\begin{align*} 2\binom{n}{2t-1}d_{te}(a,b) &= \|T_a - T_b\|_2^2 = \sum_{s\in \Sigma^t} \left(\sum_{I\in \I} [a_I = s] - \sum_{I\in \I}[b_I = s]\right)^2 \\ &= \sum_{s\in \Sigma^t} \sum_{I,J\in \I} \Big([a_I = s][a_J=s] - [a_I=s][b_J=s] - [b_I=s][a_J=s] + [b_I=s][b_J=s]\Big). \end{align*}

By symmetry between $$a$$ and $$b$$, the first and last term, and second and third term are equal in expected value, reducing this to

\begin{align*} \EE_{a,b}\left(\|T_a-T_b\|_2^2\right) &=\EE \left(2 \sum_{s\in \Sigma^t} \sum_{I,J\in \I} \Big([a_I = s][a_J=s] - [a_I=s][b_J=s]\Big)\right)\\ &=\EE\left( 2 \sum_{I,J\in \I} \sum_{s\in \Sigma^t}\Big([a_I = s \land a_J=s] - [a_I=s \land b_J=s]\Big)\right)\\ &= 2 \sum_{I,J\in \I}\EE \Big([a_I = a_J] - [a_I=b_J]\Big).\tag{i}\label{eq:delta} \end{align*}

Define the overlap $$q$$ as the number of positions where $$I$$ and $$J$$ are equal, $$q(I, J) := |\{x\in [t]: I_x = J_x\}|$$. We will show using induction on $$t$$ that $$\EE[a_I=b_J]=(\sigma(1-r))^q\sigma^{-t}$$. For $$t=0$$ we have $$I=J=\emptyset$$ and trivially $$\EE[a_I = b_J] = 1$$. For $$t>0$$, write $$I’$$ and $$J’$$ for the tuples $$(I_1, \dots, I_{t-1})$$ and $$(J_1, \dots, J_{t-1})$$. When $$I_t = J_t$$, the characters $$a_{I_t}$$ and $$b_{J_t}$$ are independent of the earlier characters and equal with probability $$1-r$$, and $$q(I’, J’) = q-1$$, so that

\begin{align*} \EE[a_I = b_J] &= (1-r) \EE[a_{I’} = b_{J’}]\\ &= (1-r) \cdot (\sigma(1-r))^{q-1}\sigma^{-(t-1)}\\ &= (\sigma(1-r))^{q}\sigma^{-t}. \end{align*}

When $$I_t \neq J_t$$, assume without loss of generality that $$I_t < J_t$$. Then $$I_x < J_t$$ for all $$x\in [t]$$, resulting in $$b_{J_t}$$ is independent from the characters seen so far. This implies that $$[a_{I_t} = b_{J_t}]$$ is independent from $$[a_{I’} = b_{I’}]$$:

\begin{align*} \EE[a_I = b_J] &= \EE[a_{I_t} = b_{J_t}] \EE[a_{I’} = b_{J’}]\\ &= \sigma \cdot (\sigma(1-r))^q\sigma^{-(t-1)}\\ &= (\sigma(1-r))^q\sigma^{-t}. \end{align*}

We conclude that

\begin{equation*} \EE_{a,b}\big([a_I=a_J]-[a_I=b_J]\big) = \sigma^{-t+q}\big(1-(1-r)^{q(I, J)}\big). \end{equation*}

This difference vanishes for $$q=0$$, and thus in \eqref{eq:delta} we only have to consider $$(I, J)$$ with $$q(I, J) \geq 1$$. The summation can now be rewritten as

\begin{align*} \EE_{a,b}\left(\|T_a-T_b\|_2^2\right) &= 2 \sum_{q=1}^t \sum_{\substack{I,J\in \I:\\ q(I, J) = q}}\EE \Big([a_I = a_J] - [a_I=b_J]\Big)\\ &= 2 \sum_{q=1}^t \sum_{\substack{I,J\in \I:\\ q(I, J) = q}} \sigma^{-t+q}\big(1-(1-r)^q\big)\\ &= 2 \sum_{q=1}^t \sigma^{-t+q}\big(1-(1-r)^q\big)\cdot f_q, \tag{ii}\label{eq:ii} \end{align*}

where $$f_q$$ counts the number of pairs $$(I, J)$$ with $$q(I, J) = q$$. Since $$|I\cap J|\geq q$$, the total number of distinct indices is bounded by $$|I\cup J| \leq 2t-q$$. This directly implies that $$f_q \leq (1+o(1)) \binom{n}{2t-q}$$, which for $$q\geq 2$$ gives

\begin{equation*} \binom{n}{2t-1}^{-1} \binom{n}{2t-q} \cdot \sigma^{-t+q}\big(1-(1-r)^q\big) = O((2t\sigma/n)^{q-1} \sigma^{1-t} r). \end{equation*}

When $$q=1$$ and $$|I\cup J| < 2t-1$$ a similar argument applies, and we are left with the case where $$q=1$$ and $$|I\cup J| = 2t-1$$. We can first choose the $$2t-1$$ distinct values for $$I\cup J$$ in $$\binom n{2t-1}$$ ways, and then assume that $$I\cup J = [2t-1]$$. The overlap can be at any odd position $$2k+1\in\{1,3,\dots, 2t-1\}$$, since $$I$$ and $$J$$ must both have an equal number of distinct elements smaller (resp. larger) than $$2k+1$$. Given the overlap at $$2k+1$$, the $$2k$$ smaller positions can be split into two halves in $$\binom{2k}{k}$$ ways, and similarly for the right half, leading to the following number of $$(I, J)$$ pairs with $$q=1$$ and $$|I\cup J| = 2t-1$$:

\begin{equation*} \binom{n}{2t-1}\cdot\sum_{k=0}^{t-1}\binom{2k}{k} \binom{2(t-1-k)}{t-1-k} =\binom{n}{2t-1}\cdot 4^{t-1}, \end{equation*}

a well-known identity (Jensen 1902; Duarte and de Oliveira 2012). Splitting \eqref{eq:ii} into the cases $$q=1$$ (with $$|I\cap J|=1$$ and $$|I\cap J|>1$$) and $$q\geq 2$$, and assuming that $$n\gg 2t\sigma$$, we get our result:

\begin{align*} \EE(d_{te}(a,b)) &= (4/\sigma)^{t-1} \cdot r+ O(2t\sigma/n \cdot \sigma^{-t} r) + \sum_{q=2}^t O((2t\sigma/n)^{q-1} \cdot \sigma^{1-t} r)\\ &= (4/\sigma)^{t-1} \cdot r + O(2t\sigma^{2-t}/n) \cdot r. \end{align*}

## TODO Proof of Lemma 2

We compute the $m$th moment:

\begin{align*} \EE_{a,b}\|T_a - T_b\|_2^{2m} &= \EE\left(\sum_{s\in \Sigma^t} \left(\sum_{I\in \I} [a_I = s] - \sum_{I\in \I}[b_I = s]\right)^2\right)^m \\ &= \EE \left(\sum_{I, J}\big([a_I=a_J] - [a_I=b_J] - [b_I=a_J] + [b_I=b_J]\big)\right)^m\\ &= \sum_{I^1,J^1}\dots \sum_{I^m,J^m} \EE_{a,b}\prod_{i\in [m]}\big([a_{I^i}=a_{J^i}] - [a_{I^i}=b_{J^i}] - [b_{I^i}=a_{J^i}] + [b_{I^i}=b_{J^i}]\big). \end{align*}

Suppose for the moment that there is an $$i$$ such that $$I^i$$ (or $$J^i$$ for that matter) is disjoint from all $$J^j$$ (resp. $$I^j$$’s). Then, all events involving $$a_{I^i}$$ and $$b_{I^i}$$ are independent from all others. Thus, we may compute the factor for $$i$$ separately, and it equals

\begin{equation*} \EE_{a,b}\big([a_{I^i}=a_{J^i}] - [a_{I^i}=b_{J^i}] - [b_{I^i}=a_{J^i}] + [b_{I^i}=b_{J^i}]\big) = \sigma^{-t} - \sigma^{-t} - \sigma^{-t} + \sigma^{-t} = 0. \end{equation*}

This implies that non-zero terms in the summation can only occur when none of the $$I^i$$ and $$J^i$$ is disjoint from all the others. It follows that $$U:=\left|\bigcup_{i\in [m]} I_i\cup J_i\right| \leq m(2t-1)$$. As in the proof of the expected value, the total number of tuples $$(I_1, J_1, \dots, I_m, J_m)$$ is $$\binom{n}{m(2t-1)} f(t)$$ for some function $$f$$ independent of $$n$$, and the contribution of each tuple will also be independent of $$n$$. As $$n\to \infty$$, all terms with $$U<m(2t-1)$$ will only contribute a fraction $$O(n^{-1})$$ of the terms with $$U=m(2t-1)$$, ass

We can reduce arbitrary $$t$$ to $$t=1$$ and simply multiply everything by $$\sigma^{-m(t-1)}$$, since $$t-1$$ characters of each equality are completely independent of the rest. Thus, assume $$t=1$$, and identify the $1$-tuple $$I^i$$ with the corresponding integer.

For $$m=2$$, we can have either $$(I^1, I^2) = (J^1, J^2)$$ or $$(I^1, I^2) = (J^2, J^1)$$. In the first case, the product comes out as $$(2r)^2 = 4r^2$$, and in the second case it equals $$2r^2$$. Thus, the expected value comes out as

\begin{align*} \EE_{a,b}\|T_a - T_b\|_2^{2m} &= \sum_{I^1,J^1}\dots \sum_{I^m,J^m} \EE_{a,b}\prod_{i\in [m]}\big([a_{I^i}=a_{J^i}] - [a_{I^i}=b_{J^i}] - [b_{I^i}=a_{J^i}] + [b_{I^i}=b_{J^i}]\big)\\ &= \big(1+O(n^{-1})\big) \binom{n}{4t-2} \binom{4t-2}{2} 6r^2 \cdot \sigma^{-2(t-1)}. \end{align*}

This means that the variance is given by

\begin{align*} \var_{a,b}\|T_a - T_b\|_2 &= \big(1+O(n^{-1})\big) \binom{n}{4t-2} \binom{4t-2}{2} 6r^2 \cdot \sigma^{-2(t-1)} - \left((1+O(n^{-1})\cdot 2\binom{n}{2t-1} (4/\sigma)^{t-1} r\right)^2\\ &= \big(1+O(n^{-1})\big) \binom{n}{4t-2} \binom{4t-2}{2} 6r^2 \cdot \sigma^{-2(t-1)} - \left((1+O(n^{-1})\cdot 2\binom{n}{2t-1} (4/\sigma)^{t-1} r\right)^2. \end{align*}

For higher moments, this would generalize to

\begin{align*} \EE_{a,b}\|T_a - T_b\|_2^{2m} &= \big(1+O(n^{-1})\big) \binom{n}{m(2t-1)} \binom{m(2t-1)}{2t-1, \dots, 2t-1} 4^{m(t-1)} f_m r^m \cdot \sigma^{-m(t-1)}. \end{align*}

## References

Duarte, Rui, and António Guedes de Oliveira. 2012. “New Developments of an Old Identity.” https://doi.org/10.48550/ARXIV.1203.5424.
Jensen, J. L. W. V. 1902. “Sur Une Identité D’abel et Sur D’autres Formules Analogues.” Acta Mathematica 26 (0): 307–18. https://doi.org/10.1007/bf02415499.
Joudaki, Amir, Gunnar Rätsch, and André Kahles. 2020. “Fast Alignment-Free Similarity Estimation by Tensor Sketching,” November. https://doi.org/10.1101/2020.11.13.381814.