These are some (biased) comments on “When Less Is More: Sketching with Minimizers in Genomics” (Ndiaye et al. 2024).
The importance of ordering
the interest lies in constructing a minimizer with a density within a constant factor, i.e., \(O(1/w)\) for any \(k\). With lexicographic ordering, minimizers can achieve such density, but with large \(k\) values (\(\geq \log_{|Σ|}(w)-c\) for a constant \(c\)), which might not be desirable (Zheng, Kingsford, and Marçais 2020). However, random ordering can result in a lower density than that of the lexicographic ordering. Thus, random ordering (implemented with pseudo-random hash functions) is usually used in practice.
- I typically consider \(k = \log_\sigma w\) to be small. Really, only for very small \(k\), up to say \(4\), do random minimizers fail to have density \(O(1/w)\). So in general, reaching \(O(1/w)\) is easy unless \(k\) is very small.
- As shown in Theorem 2 of Zheng, Kingsford, and Marçais (2020), lexicographic minimizers are optimal, in the sense that they have density \(O(1/w)\) if and only if this is possible at all. Some motivation for why random orderings are in fact better in practice would be good.
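To make the \(2/(w+1)\) expected density of random minimizers concrete, here is a minimal sketch (my own, not from the paper; the hash, string length, and parameters \(k=8\), \(w=10\) are arbitrary choices) that measures the empirical density on a random DNA string:

```python
import hashlib
import random

def h(kmer: str) -> int:
    # Deterministic stand-in for a pseudo-random order on k-mers.
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def random_minimizer_positions(s: str, k: int, w: int) -> set[int]:
    """Absolute positions of the minimal k-mer in each window of w k-mers."""
    selected = set()
    for i in range(len(s) - (w + k - 1) + 1):
        # Leftmost k-mer with minimal hash in the window (ties broken by position).
        selected.add(min((h(s[j : j + k]), j) for j in range(i, i + w))[1])
    return selected

rng = random.Random(42)
s = "".join(rng.choice("ACGT") for _ in range(100_000))
k, w = 8, 10
density = len(random_minimizer_positions(s, k, w)) / (len(s) - k + 1)
print(f"empirical density: {density:.4f}, expected ~ {2 / (w + 1):.4f}")
```

On a string this long the measured density lands very close to \(2/(w+1) \approx 0.18\), which is the figure the rest of the discussion compares against.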
Recent investigations indicate that ordering algorithms can achieve a density value of \(1.8/(w + 1)\) (Orenstein et al. 2016), well below the originally proposed lower bound of \(2/(w + 1)\) (Marçais et al. 2019; Roberts et al. 2004).
- I cannot find the \(1.8/(w+1)\) in either Orenstein et al. (2016) or Orenstein et al. (2017).
- It turns out the follow-up paper Marçais et al. (2017) states that DOCKS has density factor \(1.737\) for \(k=w=10\).
- For which \(k\)? For \(k=1\), this is impossible. For \(k>w\), miniception (Zheng, Kingsford, and Marçais 2020) is better at \(1.67/w\), and in fact, mod-minimizer (Groot Koerkamp and Pibiri 2024) is even better and asymptotically reaches density \(1/w\), so this \(1.8/(w+1)\) is quite meaningless anyway.
- A remark that the original lower bound does not apply here, because it rests on overly strong assumptions, would be in order. Otherwise the sentence kind of contradicts itself.
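Plugging in a concrete \(w\) (my own illustrative choice of \(w = 10\)) makes the contradiction explicit: \(1.8/(w+1)\) is smaller than the claimed lower bound \(2/(w+1)\) for every positive \(w\), so the two statements cannot both hold as stated:

```python
w = 10
figures = {
    "claimed lower bound 2/(w+1)": 2 / (w + 1),
    "reported density 1.8/(w+1)": 1.8 / (w + 1),
    "miniception 1.67/(w+1)":     1.67 / (w + 1),
    "asymptotic optimum 1/w":     1 / w,
}
for name, value in figures.items():
    print(f"{name}: {value:.4f}")

# 1.8/(w+1) < 2/(w+1) for every w > 0, so the "lower bound" must rest on
# assumptions that the DOCKS-style orderings violate.
assert 1.8 / (w + 1) < 2 / (w + 1)
```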
Asymptotically optimal minimizers
This dual-minimizer setup has been shown to achieve an upper bound expected density of \(1.67/(w + 1)\), which is lower than the \(2/(w + 1)\) density of traditional random minimizers.
- Again, only for \(k>w\).
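Since the text never spells out what miniception actually does, here is a minimal sketch of my reading of Zheng, Kingsford, and Marçais (2020); the hash and the parameters \(k=12\), \(w=5\), \(k_0=7\) are my own choices. A \(k\)-mer is *charged* if its minimal \(k_0\)-mer is its prefix or suffix, and each window selects the smallest charged \(k\)-mer; taking \(k_0 \geq k - w\) guarantees every window contains at least one charged \(k\)-mer.

```python
import hashlib
import random

def h(x: str) -> int:
    # Deterministic stand-in for a random order.
    return int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(), "big")

def charged(kmer: str, k0: int) -> bool:
    # Charged iff the (leftmost) minimal k0-mer is the prefix or suffix.
    j = min(range(len(kmer) - k0 + 1), key=lambda p: (h(kmer[p : p + k0]), p))
    return j == 0 or j == len(kmer) - k0

def miniception_positions(s: str, k: int, w: int, k0: int) -> set[int]:
    assert k - w <= k0 <= k  # ensures every window has a charged k-mer
    selected = set()
    for i in range(len(s) - (w + k - 1) + 1):
        cand = [(h(s[j : j + k]), j) for j in range(i, i + w)
                if charged(s[j : j + k], k0)]
        selected.add(min(cand)[1])
    return selected

rng = random.Random(1)
s = "".join(rng.choice("ACGT") for _ in range(50_000))
k, w, k0 = 12, 5, 7
d = len(miniception_positions(s, k, w, k0)) / (len(s) - k + 1)
print(f"miniception density: {d:.4f} vs random minimizer ~ {2 / (w + 1):.4f}")
```

For such small \(w\) the improvement over a plain random minimizer is modest; the \(1.67\) density factor is a guarantee for a particular parameter regime, not what you measure at every \((w, k)\).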
the lower bound of the resulting sketch (\(1.67/(w + 1)\)) is higher than the theoretical lower bound (\(1/w\)), which can be achieved using UHS or Polar Sets.
- This should say upper bound instead.
- This paragraph is titled asymptotically optimal minimizers, yet it only talks about miniception, which is not in fact asymptotically optimal. UHS and polar sets are also not really ‘plain’ minimizers.
- Instead, Marçais, DeBlasio, and Kingsford (2018) present an actual asymptotically optimal minimizer scheme based on universal hitting sets, and Groot Koerkamp and Pibiri (2024) present an asymptotically optimal scheme with much lower density in practice.
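The mod-minimizer itself is simple to state; this is a sketch from my reading of Groot Koerkamp and Pibiri (2024), with the hash, string, and parameters (\(k=31\), \(w=5\), and the small constant \(r=4\)) my own choices: find the position \(x\) of the minimal \(t\)-mer in the window, where \(t \equiv k \pmod{w}\) with \(t \geq r\), and sample the \(k\)-mer at position \(x \bmod w\).

```python
import hashlib
import random

def h(x: str) -> int:
    # Deterministic stand-in for a random order on t-mers.
    return int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(), "big")

def mod_minimizer_positions(s: str, k: int, w: int, r: int = 4) -> set[int]:
    t = r + (k - r) % w  # t ≡ k (mod w), with r <= t < r + w
    L = w + k - 1        # window length in characters
    selected = set()
    for i in range(len(s) - L + 1):
        window = s[i : i + L]
        # Leftmost minimal t-mer among all t-mers in the window.
        x = min(range(L - t + 1), key=lambda p: (h(window[p : p + t]), p))
        selected.add(i + x % w)  # sample the k-mer at offset x mod w
    return selected

rng = random.Random(7)
s = "".join(rng.choice("ACGT") for _ in range(50_000))
k, w = 31, 5
d = len(mod_minimizer_positions(s, k, w)) / (len(s) - k + 1)
print(f"mod-minimizer density: {d:.4f}; random ~ {2 / (w + 1):.4f}; 1/w = {1 / w:.4f}")
```

With \(k \gg w\) the measured density sits well below \(2/(w+1)\) and approaches \(1/w\) as \(k\) grows, which is the sense in which the scheme is asymptotically optimal.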