These are some summarizing notes and thoughts on “Static Retrieval Revisited: To Optimality and beyond” by Hu, Kuszmaul, Liang, Yu, Zhang, and Zhou.

1 Problem definitions

Static Retrieval. Given \(n\) keys \(X\subseteq U\) and \(n\) \(v\)-bit values \(f(X) \in [2^v]\), encode \(f: X\to [2^v]\). The goal is to use a minimal number of bits on top of the trivial \(nv\) lower bound, while allowing efficient queries.

A static dictionary is similar, but also encodes the set \(X\) itself: \(f: U \to \{\bot\} \cup [2^v]\) (Hu et al. 2024). Note that (unlike what I initially assumed), a static dictionary does not have to support updates to values.

2 Optimal solutions

When \(v\) is large (\(\Omega(\log n)\)), a minimal perfect hash function (MPHF) takes \(\leq 3\) bits/key to map the keys \(x\in X\) bijectively into \([n]\), after which the value can be retrieved using an array lookup. This takes \(nv + O(n)\) bits in total and has \(O(1)\) queries, which is shown to be on the optimal space-time tradeoff curve.
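As a toy illustration of this construction (my own sketch: brute-force seed search stands in for a real MPHF, which would encode the bijection in \(\leq 3\) bits/key; the hash function is an arbitrary placeholder):

```python
import random

def slot(key, seed, n):
    # Toy hash into [0, n); any mixing function works for this sketch.
    return random.Random(key * 0x9E3779B1 + seed).randrange(n)

def build(keys, values):
    # Brute-force a seed that maps the keys bijectively into [0, n).
    # (Exhaustive seed search is only feasible for tiny n; it merely
    # stands in for a compactly encoded MPHF.)
    n = len(keys)
    seed = 0
    while len({slot(k, seed, n) for k in keys}) < n:
        seed += 1
    table = [0] * n  # the nv-bit value array
    for k, v in zip(keys, values):
        table[slot(k, seed, n)] = v
    return seed, table

def query(seed, table, key):
    # One array lookup after evaluating the (minimal perfect) hash.
    return table[slot(key, seed, len(table))]

keys = [3, 17, 42, 99, 123, 256, 500, 777]
seed, table = build(keys, [10, 20, 30, 40, 50, 60, 70, 80])
```

Storing the seed (or, in general, the MPHF description) is exactly the \(O(n)\)-bit overhead on top of the \(nv\)-bit table.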

When \(v=1\), an alternative solution using only \(O(\log n)\) bits of overhead (or \(O(1)\) words) and constant-time queries is given by Dietzfelbinger and Walzer (2019). The core of the method is to create a full-rank sparse linear system over \(\mathbb F_2\): each key's value is the xor of a few positions of a single bit array, and solving the system yields an array consistent with all keys at once.
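A minimal sketch of such a system (my own simplification: 3 random probe positions per key and an array of size \(2n\) with retry-on-singularity, whereas Dietzfelbinger and Walzer achieve size \(n + O(\log n)\)):

```python
import random

def positions(key, m, seed, k=3):
    # k probe positions per key; a stand-in for the paper's hash functions.
    rng = random.Random(key * 0x9E3779B1 + seed)
    return [rng.randrange(m) for _ in range(k)]

def build(keys, values, m, seed):
    # Solve, over GF(2), one equation per key:
    #   xor of b[p] for p in positions(key)  ==  value.
    # Returns the bit array b, or None if the system is singular.
    pivots = {}  # pivot bit -> (row mask, right-hand side)
    for x, v in zip(keys, values):
        mask = 0
        for p in positions(x, m, seed):
            mask ^= 1 << p
        while mask:
            p = mask.bit_length() - 1
            if p in pivots:
                pm, pv = pivots[p]
                mask, v = mask ^ pm, v ^ pv
            else:
                pivots[p] = (mask, v)
                break
        else:
            if v:
                return None  # inconsistent row: 0 = 1
    b = [0] * m  # free variables stay 0
    for p in sorted(pivots):  # back-substitute, low pivots first
        mask, v = pivots[p]
        rest = mask ^ (1 << p)
        while rest:
            q = rest.bit_length() - 1
            v ^= b[q]
            rest ^= 1 << q
        b[p] = v
    return b

def query(b, key, seed):
    # Constant time: xor together a few bits of the array.
    acc = 0
    for p in positions(key, len(b), seed):
        acc ^= b[p]
    return acc

# Store one bit per key; retry seeds until the random system is solvable.
keys = list(range(50))
vals = [(7 * k + 1) % 2 for k in keys]
seed, b = 0, None
while b is None:
    seed += 1
    b = build(keys, vals, m=2 * len(keys), seed=seed)
```

Note that \(b\) stores no keys at all: a query for a key outside \(X\) simply returns a meaningless bit, which is exactly what distinguishes retrieval from a dictionary.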

For \(v=\Theta(\log n)\), \(v\) copies of this structure (one per bit of the value) can be used, for \(O(v \log n)\) bits of overhead and \(O(v)\) query time, which again is shown to be an optimal tradeoff.1

3 Why an overhead is necessary

Naively, one might assume that \(nv\) bits should be sufficient to store \(f\), but this is not the case. It turns out that any encoding of \(f\) necessarily contains some information about \(X\), and thus some bits must be ‘wasted’ on that, rather than on storing just the values. The paper goes into detail on how (the encoding of) \(f\) can be augmented with additional information \(g\) to recover \(X\). Since there are \(\binom{|U|}{n}\) options for \(X\), this implies that the total space of \(f\) and \(g\) must be at least

\begin{equation*} |f| + |g| \geq nv + \log_2 \binom{|U|}n, \end{equation*}

and thus that

\begin{equation*} |f| \geq nv + \log_2 \binom{|U|}n - |g|. \end{equation*}

As an extremal example, assume that \(v\) equals the word size \(w\) and that each query to \(f\) reads exactly one word. Then, each \(x\in X\) must be mapped to a distinct word of \(f\)’s representation. But this is exponentially unlikely (assuming a uniformly random \(X\)), and would typically require at least \(\log_2(e)\cdot n \approx 1.44n\) bits of entropy. Thus, this gives a lot of information about \(X\) itself.
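Spelled out (a standard Stirling estimate, not from the paper): the probability that \(n\) keys, each landing in one of \(n\) words uniformly, are mapped injectively is

\begin{equation*} \frac{n!}{n^n} \approx \sqrt{2\pi n}\, e^{-n}, \qquad\text{so}\qquad \log_2 \frac{n^n}{n!} \approx n \log_2 e \approx 1.44\, n \text{ bits}. \end{equation*}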

This information could also be encoded into \(f\) itself by extending the static \(nv\)-bit array with an internal MPHF, as described above.

If, on the other hand, each lookup to \(f\) reads \(O(\log n)\) words and takes, for example, their xor, then the resulting linear system has full rank with good probability, and only \(O(\log n)\) extra words are necessary.

4 Augmented Retrieval

When there is a secondary data structure of size similar to \(f\), say \(h: [n] \to [2^v]\), then the space overhead of \(f\) can be absorbed into \(h\)! Whereas before we needed to ensure that the set of random queries \(x\) hit all words of \(f\), now we can mix the constraints for \(h\) into the same array. For example, we can use an array of size \(2n\) and let \(h(i)\) be the xor of (say) cells \(2i\), \(2i+1\), and some random cells. When we then add additional constraints for the values of \(X\), it’s OK if those don’t hit all locations, since the \(h\)-constraints already hit every location at least once anyway.
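A small experiment sketching the shape of this system (my own parameters, not the paper’s: an array of \(2n\) cells, each \(h\)-constraint reading cells \(2i\), \(2i+1\) plus two random cells, each key constraint reading three random cells; whether the combined system is solvable for every right-hand side comes down to its rank over \(\mathbb F_2\)):

```python
import random

def rank_gf2(rows):
    # Rank of GF(2) row vectors given as integer bitmasks.
    pivots = {}
    for mask in rows:
        while mask:
            p = mask.bit_length() - 1
            if p not in pivots:
                pivots[p] = mask
                break
            mask ^= pivots[p]
    return len(pivots)

n = 64
m = 2 * n                      # shared array of 2n cells
rng = random.Random(1)

rows = []
for i in range(n):             # h-constraints: together cover every cell
    mask = (1 << (2 * i)) | (1 << (2 * i + 1))
    for _ in range(2):
        mask |= 1 << rng.randrange(m)
    rows.append(mask)
for _ in range(n):             # key constraints: random cells only
    mask = 0
    for _ in range(3):
        mask |= 1 << rng.randrange(m)
    rows.append(mask)

print(rank_gf2(rows), "/", m)  # rank 2n would mean any h and f values fit
```

The point of the structured \(h\)-rows is that no cell of the array can be ‘missed’, which was the failure mode that forced the overhead in the plain retrieval setting.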

References

Dietzfelbinger, Martin, and Stefan Walzer. 2019. “Constant-Time Retrieval with O(Log M) Extra Bits.” Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.STACS.2019.24.
Hu, Yang, William Kuszmaul, Jingxun Liang, Huacheng Yu, Junkai Zhang, and Renfei Zhou. 2025. “Static Retrieval Revisited: To Optimality and beyond.” arXiv. https://doi.org/10.48550/ARXIV.2510.18237.
Hu, Yang, Jingxun Liang, Huacheng Yu, Junkai Zhang, and Renfei Zhou. 2024. “Optimal Static Dictionary with Worst-Case Constant Query Time.” arXiv. https://doi.org/10.48550/ARXIV.2412.10655.

  1. I wonder why it’s not possible to work over \(\mathbb F_2^v\) to achieve \(O(1)\) queries instead. ↩︎