Tensor embedding preserves Hamming distance

This is a proof that tensor embedding (Joudaki, Rätsch, and Kahles 2020) with the \(\ell^2\)-norm preserves the Hamming distance.

This is in collaboration with Amir Joudaki.

\begin{equation*} \newcommand{\I}{\mathcal I} \newcommand{\EE}{\mathbb E} \newcommand{\var}{\operatorname{Var}} \end{equation*}

Definitions

Notation
  • The alphabet is \(\Sigma\), of size \(|\Sigma| = \sigma\).
  • The set of indices is \(\I := \{(i_1, \dots, i_t) \in [n]^t: i_1 < \dots < i_t\}\), where \([n] := \{1, \dots, n\}\).
  • Given a string \(a = a_1\dots a_n\in \Sigma^n\) and a tuple \(I = (i_1, \dots, i_t)\in \I\), we define the \(I\)-index of \(a\) as \(a_I := (a_{i_1}, \dots, a_{i_t})\).
  • We write \([ X ]\) for the indicator variable of event \(X\), which is \(1\) when \(X\) holds and \(0\) otherwise.
Definition 1: Tensor embedding
Given \(a\in \Sigma^n\), the tensor embedding \(T_a\) is the order-\(t\) tensor with \(\sigma^t\) entries given by \(T_a[s] = \sum_{I\in \I} [a_I = s]\) for each \(s\in \Sigma^t\).

The normalized tensor embedding distance \(d_{te}\) between two sequences \(a\) and \(b\) is defined as

\begin{equation*} d_{te}(a,b) := \frac 12 \binom{n}{2t-1}^{-1}\cdot \|T_a - T_b\|_2^2. \end{equation*}
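To make the definition concrete, here is a minimal Python sketch (the names `tensor_embedding` and `d_te` are ad-hoc, and the brute-force loop over all \(\binom{n}{t}\) index tuples is only meant for tiny inputs):

```python
# Minimal sketch of Definition 1 and the normalized distance d_te.
# Function names are ad-hoc; this is not the paper's implementation.
import math
from itertools import combinations
import numpy as np

def tensor_embedding(a: str, t: int, alphabet: str) -> np.ndarray:
    """T_a[s] counts the increasing index tuples I with a_I = s."""
    idx = {c: i for i, c in enumerate(alphabet)}
    T = np.zeros((len(alphabet),) * t)
    for I in combinations(range(len(a)), t):  # all I with i_1 < ... < i_t
        T[tuple(idx[a[i]] for i in I)] += 1
    return T

def d_te(a: str, b: str, t: int, alphabet: str) -> float:
    """Normalized squared l2 distance between the two tensor embeddings."""
    Ta = tensor_embedding(a, t, alphabet)
    Tb = tensor_embedding(b, t, alphabet)
    return 0.5 * float(np.sum((Ta - Tb) ** 2)) / math.comb(len(a), 2 * t - 1)

print(d_te("ACGTACGT", "ACGAACGT", t=2, alphabet="ACGT"))
```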

Lemma 1: Tensor embedding preserves Hamming distance under \(\ell^2\) norm
Let \(a\) be a uniformly random sequence in \(\Sigma^n\), and for a fixed mutation rate \(r\in [0,1]\) let \(b\) be the sequence where, independently for each position \(i\), \(a_i\) is substituted by a new character \(b_i \sim \operatorname{Unif}(\Sigma \setminus \{a_i\})\) with probability \(r\), and \(b_i = a_i\) otherwise. Then for \(n\gg 2t\sigma\):

\begin{equation*} \EE_{a,b}[d_{te}(a,b)] = (4/\sigma)^{t-1} \cdot r + O(2t\sigma^{2-t}/n) \cdot r, \end{equation*}

which for DNA with \(\sigma=4\) and fixed \(t\) gives \(\EE[d_{te}(a,b)] = (1 + O(n^{-1})) \cdot r\).

Lemma 2: Variance of Tensor embedding
In the same setting as Lemma 1,

\begin{equation*} \var_{a,b}[d_{te}(a,b)] = TODO. \end{equation*}
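While the closed form of the variance is still TODO, both quantities are easy to probe empirically. Below is a rough Monte Carlo sketch for DNA (\(\sigma=4\), \(t=2\)); the sparse `Counter` representation, the `mutate()` helper, and the parameter choices are ad-hoc and only meant as a sanity check:

```python
# Rough Monte Carlo sanity check of Lemma 1 (and a peek at the variance in
# Lemma 2) for DNA, sigma = 4. Helpers and parameters are ad-hoc sketches.
from collections import Counter
from itertools import combinations
import math, random

SIGMA = "ACGT"

def d_te(a: str, b: str, t: int) -> float:
    # Only t-mers that actually occur matter for ||T_a - T_b||^2, so a sparse
    # Counter over the observed a_I avoids materializing the full sigma^t tensor.
    ca = Counter(tuple(a[i] for i in I) for I in combinations(range(len(a)), t))
    cb = Counter(tuple(b[i] for i in I) for I in combinations(range(len(b)), t))
    sq = sum((ca[s] - cb[s]) ** 2 for s in ca.keys() | cb.keys())
    return 0.5 * sq / math.comb(len(a), 2 * t - 1)

def mutate(a: str, r: float) -> str:
    # Substitute each character independently with probability r by a uniform
    # character different from the original, as in the lemma statement.
    return "".join(
        random.choice([c for c in SIGMA if c != x]) if random.random() < r else x
        for x in a)

n, t, r, reps = 100, 2, 0.1, 200
vals = [d_te(a, mutate(a, r), t)
        for a in ("".join(random.choices(SIGMA, k=n)) for _ in range(reps))]
mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / (reps - 1)
print(f"E[d_te] ~ {mean:.3f} (Lemma 1 predicts ~ {r} for sigma=4), Var ~ {var:.5f}")
```

For \(\sigma=4\) and \(t=2\) the empirical mean should land close to \(r\), as Lemma 1 predicts; the printed variance is only an empirical reference point.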

Proof of Lemma 1

By definition we have

\begin{align*} 2\binom{n}{2t-1}d_{te}(a,b) &= \|T_a - T_b\|_2^2 = \sum_{s\in \Sigma^t} \left(\sum_{I\in \I} [a_I = s] - \sum_{I\in \I}[b_I = s]\right)^2 \\ &= \sum_{s\in \Sigma^t} \sum_{I,J\in \I} \Big([a_I = s][a_J=s] - [a_I=s][b_J=s] - [b_I=s][a_J=s] + [b_I=s][b_J=s]\Big). \end{align*}

By symmetry between \(a\) and \(b\), the first and last terms have equal expected value, as do the second and third terms, reducing this to

\begin{align*} \EE_{a,b}\left(\|T_a-T_b\|_2^2\right) &=\EE \left(2 \sum_{s\in \Sigma^t} \sum_{I,J\in \I} \Big([a_I = s][a_J=s] - [a_I=s][b_J=s]\Big)\right)\\ &=\EE\left( 2 \sum_{I,J\in \I} \sum_{s\in \Sigma^t}\Big([a_I = s \land a_J=s] - [a_I=s \land b_J=s]\Big)\right)\\ &= 2 \sum_{I,J\in \I}\EE \Big([a_I = a_J] - [a_I=b_J]\Big).\tag{i}\label{eq:delta} \end{align*}

Define the overlap \(q\) as the number of positions where \(I\) and \(J\) are equal, \(q(I, J) := |\{x\in [t]: I_x = J_x\}|\). We will show by induction on \(t\) that \(\EE[a_I=b_J]=(\sigma(1-r))^q\sigma^{-t}\). For \(t=0\) we have \(I=J=\emptyset\) and trivially \(\EE[a_I = b_J] = 1\). For \(t>0\), write \(I'\) and \(J'\) for the tuples \((I_1, \dots, I_{t-1})\) and \((J_1, \dots, J_{t-1})\). When \(I_t = J_t\), the characters \(a_{I_t}\) and \(b_{J_t}\) are independent of all earlier characters (since \(I_t = J_t\) is larger than every index in \(I'\) and \(J'\)) and equal with probability \(1-r\), and \(q(I', J') = q-1\), so that

\begin{align*} \EE[a_I = b_J] &= (1-r) \EE[a_{I'} = b_{J'}]\\ &= (1-r) \cdot (\sigma(1-r))^{q-1}\sigma^{-(t-1)}\\ &= (\sigma(1-r))^{q}\sigma^{-t}. \end{align*}

When \(I_t \neq J_t\), assume without loss of generality that \(I_t < J_t\). Then \(I_x < J_t\) for all \(x\in [t]\), so that \(b_{J_t}\) is independent of all the other characters involved. This implies that \([a_{I_t} = b_{J_t}]\) is independent of \([a_{I'} = b_{J'}]\), and since now \(q(I', J') = q\):

\begin{align*} \EE[a_I = b_J] &= \EE[a_{I_t} = b_{J_t}] \EE[a_{I'} = b_{J'}]\\ &= \sigma^{-1} \cdot (\sigma(1-r))^q\sigma^{-(t-1)}\\ &= (\sigma(1-r))^q\sigma^{-t}. \end{align*}

Setting \(r=0\) (so that \(b=a\)) in the same formula gives \(\EE[a_I = a_J] = \sigma^{q-t}\), and we conclude that

\begin{equation*} \EE_{a,b}\big([a_I=a_J]-[a_I=b_J]\big) = \sigma^{-t+q}\big(1-(1-r)^{q(I, J)}\big). \end{equation*}
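As a quick sanity check, take \(t=2\), \(I=(1,2)\) and \(J=(1,3)\), so that \(q=1\). The events \(a_1=a_1\) and \(a_1=b_1\) hold with probability \(1\) and \(1-r\) respectively, while \(a_2=a_3\) and \(a_2=b_3\) each hold with probability \(1/\sigma\), so

\begin{equation*} \EE_{a,b}\big([a_I=a_J]-[a_I=b_J]\big) = \frac 1\sigma - \frac{1-r}{\sigma} = \sigma^{-2+1}\big(1-(1-r)^1\big), \end{equation*}

matching the formula above.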

This difference vanishes for \(q=0\), and thus in \eqref{eq:delta} we only have to consider \((I, J)\) with \(q(I, J) \geq 1\). The summation can now be rewritten as

\begin{align*} \EE_{a,b}\left(\|T_a-T_b\|_2^2\right) &= 2 \sum_{q=1}^t \sum_{\substack{I,J\in \I:\\ q(I, J) = q}}\EE \Big([a_I = a_J] - [a_I=b_J]\Big)\\ &= 2 \sum_{q=1}^t \sum_{\substack{I,J\in \I:\\ q(I, J) = q}} \sigma^{-t+q}\big(1-(1-r)^q\big)\\ &= 2 \sum_{q=1}^t \sigma^{-t+q}\big(1-(1-r)^q\big)\cdot f_q, \tag{ii}\label{eq:ii} \end{align*}

where \(f_q\) counts the number of pairs \((I, J)\) with \(q(I, J) = q\). Since \(|I\cap J|\geq q\), the total number of distinct indices is bounded by \(|I\cup J| \leq 2t-q\). There are at most \((1+o(1))\binom{n}{2t-q}\) choices for the set \(I\cup J\), and each such set yields at most \(\binom{2t-q}{t}^2\) pairs \((I, J)\), so \(f_q = O\big(\binom{n}{2t-q}\big)\) with a constant depending only on \(t\). For \(q\geq 2\) this gives

\begin{equation*} \binom{n}{2t-1}^{-1} \binom{n}{2t-q} \cdot \sigma^{-t+q}\big(1-(1-r)^q\big) = O((2t\sigma/n)^{q-1} \sigma^{1-t} r). \end{equation*}

When \(q=1\) and \(|I\cup J| < 2t-1\) a similar argument applies, so we are left with the case \(q=1\) and \(|I\cup J| = 2t-1\). We can first choose the \(2t-1\) distinct values of \(I\cup J\) in \(\binom n{2t-1}\) ways, and then assume that \(I\cup J = [2t-1]\). The overlap can only be at an odd position \(2k+1\in\{1,3,\dots, 2t-1\}\) of \(I\cup J\), since \(I\) and \(J\) must each contain the same number of elements below (resp. above) the overlap. Given the overlap at \(2k+1\), the \(2k\) smaller elements can be distributed over \(I\) and \(J\) (\(k\) each) in \(\binom{2k}{k}\) ways, and similarly the \(2(t-1-k)\) larger elements in \(\binom{2(t-1-k)}{t-1-k}\) ways, leading to the following number of \((I, J)\) pairs with \(q=1\) and \(|I\cup J| = 2t-1\):

\begin{equation*} \binom{n}{2t-1}\cdot\sum_{k=0}^{t-1}\binom{2k}{k} \binom{2(t-1-k)}{t-1-k} =\binom{n}{2t-1}\cdot 4^{t-1}, \end{equation*}

a well-known identity (Jensen 1902; Duarte and de Oliveira 2012). Splitting \eqref{eq:ii} into the cases \(q=1\) (with \(|I\cap J|=1\) and \(|I\cap J|>1\)) and \(q\geq 2\), and assuming that \(n\gg 2t\sigma\), we get our result:

\begin{align*} \EE(d_{te}(a,b)) &= (4/\sigma)^{t-1} \cdot r+ O(2t\sigma/n \cdot \sigma^{-t} r) + \sum_{q=2}^t O((2t\sigma/n)^{q-1} \cdot \sigma^{1-t} r)\\ &= (4/\sigma)^{t-1} \cdot r + O(2t\sigma^{2-t}/n) \cdot r. \end{align*}
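As an aside, the central-binomial identity used for the \(q=1\) count is easy to check numerically with a throwaway sketch:

```python
# Check sum_{k=0}^{t-1} C(2k, k) * C(2(t-1-k), t-1-k) == 4^(t-1) for small t,
# i.e. the identity (Jensen 1902) used in the q = 1 count above.
import math

for t in range(1, 13):
    lhs = sum(math.comb(2 * k, k) * math.comb(2 * (t - 1 - k), t - 1 - k)
              for k in range(t))
    assert lhs == 4 ** (t - 1), (t, lhs)
print("identity verified for t = 1..12")
```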

TODO Proof of Lemma 2

We compute the \(m\)th moment of \(\|T_a - T_b\|_2^2\):

\begin{align*} \EE_{a,b}\|T_a - T_b\|_2^{2m} &= \EE\left(\sum_{s\in \Sigma^t} \left(\sum_{I\in \I} [a_I = s] - \sum_{I\in \I}[b_I = s]\right)^2\right)^m \\ &= \EE \left(\sum_{I, J}\big([a_I=a_J] - [a_I=b_J] - [b_I=a_J] + [b_I=b_J]\big)\right)^m\\ &= \sum_{I^1,J^1}\dots \sum_{I^m,J^m} \EE_{a,b}\prod_{i\in [m]}\big([a_{I^i}=a_{J^i}] - [a_{I^i}=b_{J^i}] - [b_{I^i}=a_{J^i}] + [b_{I^i}=b_{J^i}]\big). \end{align*}

Suppose for the moment that there is an \(i\) such that \(I^i\) (or \(J^i\) for that matter) is disjoint from all \(J^j\) (resp. \(I^j\)'s). Then, all events involving \(a_{I^i}\) and \(b_{I^i}\) are independent of all the others. Thus, we may compute the factor for \(i\) separately, and it equals

\begin{equation*} \EE_{a,b}\big([a_{I^i}=a_{J^i}] - [a_{I^i}=b_{J^i}] - [b_{I^i}=a_{J^i}] + [b_{I^i}=b_{J^i}]\big) = \sigma^{-t} - \sigma^{-t} - \sigma^{-t} + \sigma^{-t} = 0. \end{equation*}

This implies that non-zero terms in the summation can only occur when none of the \(I^i\) and \(J^i\) is disjoint from all the others. It follows that \(U:=\left|\bigcup_{i\in [m]} I^i\cup J^i\right| \leq m(2t-1)\). As in the proof of the expected value, the total number of such tuples \((I^1, J^1, \dots, I^m, J^m)\) is \(\binom{n}{m(2t-1)} f(t, m)\) for some function \(f\) independent of \(n\), and the contribution of each tuple is also independent of \(n\). As \(n\to \infty\), the terms with \(U<m(2t-1)\) therefore contribute only a fraction \(O(n^{-1})\) of the terms with \(U=m(2t-1)\).


We can reduce arbitrary \(t\) to \(t=1\) and simply multiply everything by \(\sigma^{-m(t-1)}\), since \(t-1\) characters of each equality are completely independent of the rest. Thus, assume \(t=1\), and identify the $1$-tuple \(I^i\) with the corresponding integer.

For \(m=2\), we can have either \((I^1, I^2) = (J^1, J^2)\) or \((I^1, I^2) = (J^2, J^1)\). In the first case, the product comes out as \((2r)^2 = 4r^2\), and in the second case it equals \(2r^2\). Thus, the expected value comes out as

\begin{align*} \EE_{a,b}\|T_a - T_b\|_2^{2m} &= \sum_{I^1,J^1}\dots \sum_{I^m,J^m} \EE_{a,b}\prod_{i\in [m]}\big([a_{I^i}=a_{J^i}] - [a_{I^i}=b_{J^i}] - [b_{I^i}=a_{J^i}] + [b_{I^i}=b_{J^i}]\big)\\ &= \big(1+O(n^{-1})\big) \binom{n}{4t-2} \binom{4t-2}{2} 6r^2 \cdot \sigma^{-2(t-1)}. \end{align*}

This means that the variance is given by

\begin{align*} \var_{a,b}\|T_a - T_b\|_2^2 &= \big(1+O(n^{-1})\big) \binom{n}{4t-2} \binom{4t-2}{2} 6r^2 \cdot \sigma^{-2(t-1)} - \left(\big(1+O(n^{-1})\big)\cdot 2\binom{n}{2t-1} (4/\sigma)^{t-1} r\right)^2. \end{align*}

For higher moments, this would generalize to

\begin{align*} \EE_{a,b}\|T_a - T_b\|_2^{2m} &= \big(1+O(n^{-1})\big) \binom{n}{m(2t-1)} \binom{m(2t-1)}{2t-1, \dots, 2t-1} 4^{m(t-1)} f_m r^m \cdot \sigma^{-m(t-1)}. \end{align*}

References

Duarte, Rui, and António Guedes de Oliveira. 2012. “New Developments of an Old Identity.” https://doi.org/10.48550/ARXIV.1203.5424.
Jensen, J. L. W. V. 1902. “Sur une identité d’Abel et sur d’autres formules analogues.” Acta Mathematica 26: 307–18. https://doi.org/10.1007/bf02415499.
Joudaki, Amir, Gunnar Rätsch, and André Kahles. 2020. “Fast Alignment-Free Similarity Estimation by Tensor Sketching.” bioRxiv. https://doi.org/10.1101/2020.11.13.381814.