Tensor embedding preserves Hamming distance

Table of Contents

Definitions
Proof of Lemma 1
TODO Proof of Lemma 2

This is a proof that Tensor Embedding (Joudaki, Rätsch, and Kahles 2020) with $ℓ^2$-norm preserves the Hamming distance.

This is in collaboration with Amir Joudaki.

\begin{equation*} \newcommand{\I}{\mathcal I} \newcommand{\EE}{\mathbb E} \newcommand{\var}{\operatorname{Var}} \end{equation*}

Definitions Link to heading

Notation

The alphabet is $\Sigma$, of size $|\Sigma| = \sigma$.
The set of indices is $\I := \{(i_1, \dots, i_t) \in [n]^t: i_1 < \dots < i_t\}$.
Given a string $a_1\dots a_n = a\in \Sigma^n$, we define the $I$-index as $a_I = (a_{i_1}, \dots, a_{i_t})$.
We write $[ X ]$ for the indicator variable of event $X$, which is $1$ when $X$ holds and $0$ otherwise.

Definition 1: Tensor embedding

Given $a\in \Sigma^n$, the tensor embedding $T_a$ is the $\sigma^t$ tensor given by $T_a[s] = \sum_{I\in \I} [A_I = s]$ for each $s\in \Sigma^t$.

The normalized tensor embedding distance $d_{te}$ between two sequences $a$ and $b$ is defined as

\begin{equation*} d_{te}(a,b) := \frac 12 \binom{n}{2t-1}^{-1}\cdot \|T_a - T_b\|_2^2. \end{equation*}

Lemma 1: Tensor embedding preserves Hamming distance under $\ell^2$ norm

Let $a$ be a uniform random sequence of length $n$ in $\Sigma^n$, and for a fixed mutation rate $r\in [0,1]$ let $b$ be a sequence where $a_i$ is substituted by a new character $b_i \in Unif(\Sigma \backslash a_i)$ with probability $r$ and $b_i = a_i$ otherwise. Then for $n\gg 2t\sigma$:

\begin{equation*} \EE_{a,b}[d_{te}(a,b)] = (4/\sigma)^{t-1} \cdot r + O(2t\sigma^{2-t}/n) \cdot r, \end{equation*}

which for DNA with $\sigma=4$ and fixed $t$ gives $\EE[d_{te}(a,b)] = (1 + O(n^{-1})) \cdot r$.

Lemma 2: Variance of Tensor embedding

In the same setting as Lemma 1,

\begin{equation*} \var_{a,b}[d_{te}(a,b)] = TODO. \end{equation*}

Proof of Lemma 1 Link to heading

By definition we have

\begin{align*} 2\binom{n}{2t-1}d_{te}(a,b) &= \|T_a - T_b\|_2^2 = \sum_{s\in \Sigma^t} \left(\sum_{I\in \I} [a_I = s] - \sum_{I\in \I}[b_I = s]\right)^2 \\ &= \sum_{s\in \Sigma^t} \sum_{I,J\in \I} \Big([a_I = s][a_J=s] - [a_I=s][b_J=s] - [b_I=s][a_J=s] + [b_I=s][b_J=s]\Big). \end{align*}

By symmetry between $a$ and $b$, the first and last term, and second and third term are equal in expected value, reducing this to

\begin{align*} \EE_{a,b}\left(\|T_a-T_b\|_2^2\right) &=\EE \left(2 \sum_{s\in \Sigma^t} \sum_{I,J\in \I} \Big([a_I = s][a_J=s] - [a_I=s][b_J=s]\Big)\right)\\ &=\EE\left( 2 \sum_{I,J\in \I} \sum_{s\in \Sigma^t}\Big([a_I = s \land a_J=s] - [a_I=s \land b_J=s]\Big)\right)\\ &= 2 \sum_{I,J\in \I}\EE \Big([a_I = a_J] - [a_I=b_J]\Big).\tag{i}\label{eq:delta} \end{align*}

Define the overlap $q$ as the number of positions where $I$ and $J$ are equal, $q(I, J) := |\{x\in [t]: I_x = J_x\}|$. We will show using induction on $t$ that $\EE[a_I=b_J]=(\sigma(1-r))^q\sigma^{-t}$. For $t=0$ we have $I=J=\emptyset$ and trivially $\EE[a_I = b_J] = 1$. For $t>0$, write $I’$ and $J’$ for the tuples $(I_1, \dots, I_{t-1})$ and $(J_1, \dots, J_{t-1})$. When $I_t = J_t$, the characters $a_{I_t}$ and $b_{J_t}$ are independent of the earlier characters and equal with probability $1-r$, and $q(I’, J’) = q-1$, so that

\begin{align*} \EE[a_I = b_J] &= (1-r) \EE[a_{I’} = b_{J’}]\\ &= (1-r) \cdot (\sigma(1-r))^{q-1}\sigma^{-(t-1)}\\ &= (\sigma(1-r))^{q}\sigma^{-t}. \end{align*}

When $I_t \neq J_t$, assume without loss of generality that $I_t < J_t$. Then $I_x < J_t$ for all $x\in [t]$, resulting in $b_{J_t}$ is independent from the characters seen so far. This implies that $[a_{I_t} = b_{J_t}]$ is independent from $[a_{I’} = b_{I’}]$:

\begin{align*} \EE[a_I = b_J] &= \EE[a_{I_t} = b_{J_t}] \EE[a_{I’} = b_{J’}]\\ &= \sigma \cdot (\sigma(1-r))^q\sigma^{-(t-1)}\\ &= (\sigma(1-r))^q\sigma^{-t}. \end{align*}

We conclude that

\begin{equation*} \EE_{a,b}\big([a_I=a_J]-[a_I=b_J]\big) = \sigma^{-t+q}\big(1-(1-r)^{q(I, J)}\big). \end{equation*}

This difference vanishes for $q=0$, and thus in \eqref{eq:delta} we only have to consider $(I, J)$ with $q(I, J) \geq 1$. The summation can now be rewritten as

\begin{align*} \EE_{a,b}\left(\|T_a-T_b\|_2^2\right) &= 2 \sum_{q=1}^t \sum_{\substack{I,J\in \I:\\ q(I, J) = q}}\EE \Big([a_I = a_J] - [a_I=b_J]\Big)\\ &= 2 \sum_{q=1}^t \sum_{\substack{I,J\in \I:\\ q(I, J) = q}} \sigma^{-t+q}\big(1-(1-r)^q\big)\\ &= 2 \sum_{q=1}^t \sigma^{-t+q}\big(1-(1-r)^q\big)\cdot f_q, \tag{ii}\label{eq:ii} \end{align*}

where $f_q$ counts the number of pairs $(I, J)$ with $q(I, J) = q$. Since $|I\cap J|\geq q$, the total number of distinct indices is bounded by $|I\cup J| \leq 2t-q$. This directly implies that $f_q \leq (1+o(1)) \binom{n}{2t-q}$, which for $q\geq 2$ gives

\begin{equation*} \binom{n}{2t-1}^{-1} \binom{n}{2t-q} \cdot \sigma^{-t+q}\big(1-(1-r)^q\big) = O((2t\sigma/n)^{q-1} \sigma^{1-t} r). \end{equation*}

When $q=1$ and $|I\cup J| < 2t-1$ a similar argument applies, and we are left with the case where $q=1$ and $|I\cup J| = 2t-1$. We can first choose the $2t-1$ distinct values for $I\cup J$ in $\binom n{2t-1}$ ways, and then assume that $I\cup J = [2t-1]$. The overlap can be at any odd position $2k+1\in\{1,3,\dots, 2t-1\}$, since $I$ and $J$ must both have an equal number of distinct elements smaller (resp. larger) than $2k+1$. Given the overlap at $2k+1$, the $2k$ smaller positions can be split into two halves in $\binom{2k}{k}$ ways, and similarly for the right half, leading to the following number of $(I, J)$ pairs with $q=1$ and $|I\cup J| = 2t-1$:

\begin{equation*} \binom{n}{2t-1}\cdot\sum_{k=0}^{t-1}\binom{2k}{k} \binom{2(t-1-k)}{t-1-k} =\binom{n}{2t-1}\cdot 4^{t-1}, \end{equation*}

a well-known identity (Jensen 1902; Duarte and de Oliveira 2012). Splitting \eqref{eq:ii} into the cases $q=1$ (with $|I\cap J|=1$ and $|I\cap J|>1$) and $q\geq 2$, and assuming that $n\gg 2t\sigma$, we get our result:

\begin{align*} \EE(d_{te}(a,b)) &= (4/\sigma)^{t-1} \cdot r+ O(2t\sigma/n \cdot \sigma^{-t} r) + \sum_{q=2}^t O((2t\sigma/n)^{q-1} \cdot \sigma^{1-t} r)\\ &= (4/\sigma)^{t-1} \cdot r + O(2t\sigma^{2-t}/n) \cdot r. \end{align*}

TODO Proof of Lemma 2 Link to heading

We compute the $m$th moment:

\begin{align*} \EE_{a,b}\|T_a - T_b\|_2^{2m} &= \EE\left(\sum_{s\in \Sigma^t} \left(\sum_{I\in \I} [a_I = s] - \sum_{I\in \I}[b_I = s]\right)^2\right)^m \\ &= \EE \left(\sum_{I, J}\big([a_I=a_J] - [a_I=b_J] - [b_I=a_J] + [b_I=b_J]\big)\right)^m\\ &= \sum_{I^1,J^1}\dots \sum_{I^m,J^m} \EE_{a,b}\prod_{i\in [m]}\big([a_{I^i}=a_{J^i}] - [a_{I^i}=b_{J^i}] - [b_{I^i}=a_{J^i}] + [b_{I^i}=b_{J^i}]\big). \end{align*}

Suppose for the moment that there is an $i$ such that $I^i$ (or $J^i$ for that matter) is disjoint from all $J^j$ (resp. $I^j$’s). Then, all events involving $a_{I^i}$ and $b_{I^i}$ are independent from all others. Thus, we may compute the factor for $i$ separately, and it equals

\begin{equation*} \EE_{a,b}\big([a_{I^i}=a_{J^i}] - [a_{I^i}=b_{J^i}] - [b_{I^i}=a_{J^i}] + [b_{I^i}=b_{J^i}]\big) = \sigma^{-t} - \sigma^{-t} - \sigma^{-t} + \sigma^{-t} = 0. \end{equation*}

This implies that non-zero terms in the summation can only occur when none of the $I^i$ and $J^i$ is disjoint from all the others. It follows that $U:=\left|\bigcup_{i\in [m]} I_i\cup J_i\right| \leq m(2t-1)$. As in the proof of the expected value, the total number of tuples $(I_1, J_1, \dots, I_m, J_m)$ is $\binom{n}{m(2t-1)} f(t)$ for some function $f$ independent of $n$, and the contribution of each tuple will also be independent of $n$. As $n\to \infty$, all terms with $U<m(2t-1)$ will only contribute a fraction $O(n^{-1})$ of the terms with $U=m(2t-1)$, ass

We can reduce arbitrary $t$ to $t=1$ and simply multiply everything by $\sigma^{-m(t-1)}$, since $t-1$ characters of each equality are completely independent of the rest. Thus, assume $t=1$, and identify the $1$-tuple $I^i$ with the corresponding integer.

For $m=2$, we can have either $(I^1, I^2) = (J^1, J^2)$ or $(I^1, I^2) = (J^2, J^1)$. In the first case, the product comes out as $(2r)^2 = 4r^2$, and in the second case it equals $2r^2$. Thus, the expected value comes out as

\begin{align*} \EE_{a,b}\|T_a - T_b\|_2^{2m} &= \sum_{I^1,J^1}\dots \sum_{I^m,J^m} \EE_{a,b}\prod_{i\in [m]}\big([a_{I^i}=a_{J^i}] - [a_{I^i}=b_{J^i}] - [b_{I^i}=a_{J^i}] + [b_{I^i}=b_{J^i}]\big)\\ &= \big(1+O(n^{-1})\big) \binom{n}{4t-2} \binom{4t-2}{2} 6r^2 \cdot \sigma^{-2(t-1)}. \end{align*}

This means that the variance is given by

\begin{align*} \var_{a,b}\|T_a - T_b\|_2 &= \big(1+O(n^{-1})\big) \binom{n}{4t-2} \binom{4t-2}{2} 6r^2 \cdot \sigma^{-2(t-1)} - \left((1+O(n^{-1})\cdot 2\binom{n}{2t-1} (4/\sigma)^{t-1} r\right)^2\\ &= \big(1+O(n^{-1})\big) \binom{n}{4t-2} \binom{4t-2}{2} 6r^2 \cdot \sigma^{-2(t-1)} - \left((1+O(n^{-1})\cdot 2\binom{n}{2t-1} (4/\sigma)^{t-1} r\right)^2. \end{align*}

For higher moments, this would generalize to

\begin{align*} \EE_{a,b}\|T_a - T_b\|_2^{2m} &= \big(1+O(n^{-1})\big) \binom{n}{m(2t-1)} \binom{m(2t-1)}{2t-1, \dots, 2t-1} 4^{m(t-1)} f_m r^m \cdot \sigma^{-m(t-1)}. \end{align*}

References Link to heading

Duarte, Rui, and António Guedes de Oliveira. 2012. “New Developments of an Old Identity.” https://doi.org/10.48550/ARXIV.1203.5424.

Jensen, J. L. W. V. 1902. “Sur Une Identité D’abel et Sur D’autres Formules Analogues.” Acta Mathematica 26 (0): 307–18. https://doi.org/10.1007/bf02415499.

Joudaki, Amir, Gunnar Rätsch, and André Kahles. 2020. “Fast Alignment-Free Similarity Estimation by Tensor Sketching,” November. https://doi.org/10.1101/2020.11.13.381814.