Elias-Fano coding

Previous slides: Bit vectors, Rank & Select Link to heading

  • Bit vector: storing bits
  • Rank: counting 1 bits
  • Select: finding 1 bits

Goal: Storing lists of integers Link to heading

How do we efficiently store a list of integers \[ 0\leq x_0 \leq x_1 \leq \dots \leq x_{n-1} < U? \]

Theorem 1 Lower bound.

Using stars and bars, there are \(\binom{U+n-1}{n}\) ways of choosing \(n\) elements (with duplicates) from \(\{0, 1, \dots, U-1\}\).

\[ n \log_2 \left(\frac{U+n-1}{n}\right) \leq \log_2 \binom{U+n-1}{n} \leq n \log_2 \left(e\cdot \frac{U+n-1}{n}\right), \] so the list can be stored with at most \(\log_2 e + \log_2\left(\frac{U+n-1}{n}\right)\) bits per key.

For \(n = o(\sqrt U)\) and \(n\to\infty\), \[ \frac 1n \log_2 \binom{U+n-1}{n} \to \log_2 \left(e\cdot \frac{U}{n}\right). \]
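This limit can be checked numerically by comparing the exact bound against its asymptotic form (a small Python sketch; function names are illustrative):

```python
import math

def exact_bits_per_key(n: int, U: int) -> float:
    # log2 C(U+n-1, n) / n, via log-gamma to avoid huge factorials
    ln_binom = math.lgamma(U + n) - math.lgamma(n + 1) - math.lgamma(U)
    return ln_binom / math.log(2) / n

def asymptotic_bits_per_key(n: int, U: int) -> float:
    # log2(e * U / n), the limit for n = o(sqrt(U)), n -> infinity
    return math.log2(math.e * U / n)
```

For a sparse instance such as \(n=10^4\), \(U=10^{12}\), the two values agree to well within \(0.01\) bits/key.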

Storing many small integers (\(n \gg U\)) Link to heading

Question 1.

What if \(n \gg U\)?

Answer 1.

When \(n\gg U\), store how often each element occurs.

  • \(U \log_2 n\) bits
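A minimal sketch of this counts representation (illustrative helper names):

```python
def to_counts(xs, U):
    # counts[v] = number of occurrences of value v in xs
    counts = [0] * U
    for x in xs:
        counts[x] += 1
    return counts

def from_counts(counts):
    # Recover the sorted list from the counts.
    return [v for v, c in enumerate(counts) for _ in range(c)]
```

Each of the \(U\) counts is at most \(n\), hence the \(U\log_2 n\) bits above.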

Storing dense lists (\(n\approx U\)) Link to heading

Question 2.

What if \(n \approx U\)?

(Hint: What if all elements are distinct?)

Answer 2.

When all elements are distinct: store \(U\) bits, one per possible value, indicating whether it is present (\(1\)) or absent (\(0\)).

  • \(U \approx n\) bits

Otherwise: encode all the counts in unary and concatenate them, e.g., a count of 3 becomes 1110.

  • \(U + n\) bits
  • Lower bound for \(n=U\): \[\log_2\binom{U+n-1}{n} = \log_2\binom{2n-1}n = \log_2\big(\Theta(2^{2n}/\sqrt n)\big) \approx 2n\]
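The unary concatenation can be sketched as follows (hypothetical helper name):

```python
def unary_counts(xs, U):
    # Concatenate unary counts: a count c becomes c ones followed by a 0.
    counts = [0] * U
    for x in xs:
        counts[x] += 1
    return "".join("1" * c + "0" for c in counts)
```

For \(xs = [0, 0, 2, 2, 2, 3]\), \(U=4\) this yields `1100111010`: one 1 per element and one 0 per bucket, \(U+n = 10\) bits in total.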

Storing sparse lists (\(n\ll U\)) Link to heading

Question 3.

What if \(n \ll U\)?

Answer 3.

Store directly:

  • \(n \log_2 U\) bits.
  • Lower bound for \(n\ll U\): \(n\log_2\left(e\cdot \frac Un\right) \approx n\log_2(eU/n) = n(\log_2 U + \log_2 e - \log_2 n)\)
    • \(\log_2 n - \log_2 e\) bits/key wasted!
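Plugging illustrative numbers into the formulas above (a hypothetical instance with \(n=10^6\), \(U=2^{40}\)) confirms the gap:

```python
import math

n, U = 10**6, 2**40
naive = math.log2(U)                       # direct storage: 40 bits/key
bound = math.log2(math.e * U / n)          # asymptotic lower bound per key
waste = math.log2(n) - math.log2(math.e)   # the gap derived above, ~18.5 bits/key
```

Here `naive - bound` equals `waste` exactly, by the identity \(\log_2 U - \log_2(eU/n) = \log_2 n - \log_2 e\).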
Question 4.

Where is the “waste”/inefficiency?

Answer 4.

Consecutive elements share the \(\approx \log_2 n\) high bits!

Elias-Fano coding (Elias 1974; Fano 1971; Vigna 2013) Link to heading

Theorem 2 Elias-Fano (EF) coding.

A list of \(n\) integers \(0\leq x_0\leq x_1\leq \dots \leq x_{n-1} < U\) can be stored in \[ n (\log_2 (U/n) +2) \] bits while supporting \(O(1)\) access time.

This is only \(2-\log_2 e \approx 0.5573\) bits/key away from the lower bound when \(n = o(\sqrt U)\)!

Idea:

  • We have the budget to store the \(\ell = \lfloor \log(U/n)\rfloor\) low bits of each key explicitly.
  • But how do we encode all \(\log U - \ell \approx \log n\) high bits using \(\approx 2\) bits/key?
  • Using the \(n\approx U\) case!

Elias-Fano coding Link to heading

Definition 1.

Split each integer into an \(\ell=\lfloor \log(U/n)\rfloor\)-bit low part and a \((\log U - \ell)\)-bit high part.

  • Store all low bits explicitly in \(n \ell\) bits.
  • Write how often each high part occurs in unary.
    • \(\lceil U/2^\ell\rceil\) high parts with \[n \leq \lceil U/2^\ell \rceil \leq 2n.\]
    • In total \(n + \lceil U/2^\ell\rceil \leq 3n\) bits.
  • Thus, the space usage is at most \[n \ell + 3n = n(\lfloor \log (U/n)\rfloor+3).\]
  • More refined analysis shows that actually: \[n\lfloor \log(U/n)\rfloor + n + \lceil U/2^\ell\rceil \leq n (\log (U/n)+2).\]
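A sketch of this encoding in Python, assuming \(1 \leq n \leq U\) (a real implementation packs the parts into tight bit vectors instead of lists and strings; names are illustrative):

```python
def ef_encode(xs, U):
    # Elias-Fano encoding sketch for a sorted list xs of n values in [0, U).
    n = len(xs)
    ell = max(0, (U // n).bit_length() - 1)        # floor(log2(U/n))
    low = [x & ((1 << ell) - 1) for x in xs]       # n * ell bits of low parts
    buckets = -(-U // (1 << ell))                  # ceil(U / 2^ell) high parts
    counts = [0] * buckets
    for x in xs:
        counts[x >> ell] += 1                      # occurrences of each high part
    high = "".join("1" * c + "0" for c in counts)  # unary counts, concatenated
    return ell, low, high
```

For example, \(xs = [1, 3, 7, 14]\) with \(U=16\) gives \(\ell=2\), low parts \([1,3,3,2]\), high parts \(0,0,1,3\), and the high bit string `11010010`.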

Interpretations of the high-bit encoding Link to heading

  • Each input number corresponds to a 1.
  • Each “bucket” for the high parts corresponds to a 0.
  • The 0s separate the unary counts (one run of 1s per high part).
  • The number of 0s directly before a 1 is the jump the high part makes from the previous value.
  • The total number of 0s before a 1 is the value of the high part.

Querying Elias-Fano Link to heading

Definition 2.

To get \(x_i\):

  • Read the \(i\)’th group of \(\ell\) low bits.
  • Get the high part of the \(i\)’th value:

    • Count zeros before the \(i\)’th 1 (counting from 0):

    \[\mathsf{high}(i) = \mathsf{select}_1(i) - i\]
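A sketch of the query, using a naive linear-scan select (the hand-encoded triple below assumes the layout of Definition 1: explicit low bits, and high-part counts in unary as runs of 1s terminated by 0s):

```python
def ef_access(ell, low, high, i):
    # high(i) = select_1(i) - i: the zeros before the i'th 1 give the high part.
    sel = [p for p, b in enumerate(high) if b == "1"][i]  # naive select_1(i);
    # a real implementation answers select_1 in O(1) via an auxiliary index.
    return ((sel - i) << ell) | low[i]

# Hand-encoded example for xs = [1, 3, 7, 14], U = 16 (so ell = 2):
ell, low, high = 2, [1, 3, 3, 2], "11010010"
```

With this triple, `ef_access` recovers \(x_0,\dots,x_3 = 1, 3, 7, 14\).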

Variants Link to heading

  • As discussed: storing sorted lists of integers.
  • For strictly increasing integers: replace \(x_i\) by \(x_i-i\).
  • To store a non-sorted list with small sum: store prefix sums.
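Both tricks in a minimal sketch (illustrative values):

```python
from itertools import accumulate

# Strictly increasing -> non-decreasing with a smaller effective universe:
xs = [2, 5, 6, 9]
shifted = [x - i for i, x in enumerate(xs)]   # still non-decreasing

# Non-sorted list with small sum -> store the non-decreasing prefix sums:
ys = [3, 0, 2, 1]
prefix = list(accumulate(ys))
# Recover each y_i as the difference of adjacent prefix sums:
recovered = [prefix[0]] + [b - a for a, b in zip(prefix, prefix[1:])]
```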

Further reading / sources Link to heading

Bibliography Link to heading

References Link to heading

Elias, Peter. 1974. “Efficient Storage and Retrieval by Content and Address of Static Files.” Journal of the ACM 21 (2): 246–60. https://doi.org/10.1145/321812.321820.
Fano, R.M. 1971. On the Number of Bits Required to Implement an Associative Memory. Memorandum 61, Computation Structures Group. MIT Project MAC Computer Structures Group. https://books.google.ch/books?id=07DeGwAACAAJ.
Pibiri, Giulio Ermanno, and Rossano Venturini. 2017. “Dynamic Elias-Fano Representation.” In LIPIcs, Volume 78, CPM 2017, 78:30:1–30:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.CPM.2017.30.
Vigna, Sebastiano. 2013. “Quasi-Succinct Indices.” In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 83–92. WSDM 2013. ACM. https://doi.org/10.1145/2433396.2433409.

Possible exam questions Link to heading

  • What is the core idea of Elias-Fano coding?
  • What is the stars & bars technique? Relate it to a binomial coefficient.
  • How many “high” bits are there? Why does this make sense?
  • What is the space of the Elias-Fano encoding? Is it far from the lower bound?
  • What are some interpretations of the bits encoding the high parts?
  • How does one recover \(x_i\), i.e., query the Elias-Fano coded list?

Blackboard: Bit vector & Rank Link to heading

Blackboard: Select & Darray Link to heading