Comments on 'When Less is More' minimizer review

These are some (biased) comments on “When Less Is More: Sketching with Minimizers in Genomics” (Ndiaye et al. 2024).

The importance of ordering

the interest lies in constructing a minimizer with a density within a constant factor, i.e., \(O(1/w)\) for any \(k\). With lexicographic ordering, minimizers can achieve such density, but with large \(k\) values (\(\geq \log_{|Σ|}(w)-c\) for a constant \(c\)), which might not be desirable (Zheng, Kingsford, and Marçais 2020). However, random ordering can result in a lower density than that of the lexicographic ordering. Thus, random ordering (implemented with pseudo-random hash functions) is usually used in practice.

  • I typically consider \(k = \log_\sigma w\) to be small. Really, only for very small \(k\) up to say \(4\), random minimizers do not have density \(O(1/w)\). So in general, reaching \(O(1/w)\) is easy unless \(k\) is very small.
  • As shown in Theorem 2 of Zheng, Kingsford, and Marçais (2020), lexicographic minimizers are optimal, in that they have density \(O(1/w)\) if and only if this is possible at all. Some motivation why random is in fact better in practice would be good.

Recent investigations indicate that ordering algorithms can achieve a density value of \(1.8/(w + 1)\) (Orenstein et al. 2016), well below the originally proposed lower bound of \(2/(w + 1)\) (Marçais et al. 2019; Roberts et al. 2004).

  • I cannot find the \(1.8/(w+1)\) in either Orenstein et al. (2016) or Orenstein et al. (2017).
    • It turns out the followup paper Marçais et al. (2017) states that DOCKS has density factor \(1.737\) for \(k=w=10\).
  • For which \(k\)? For \(k=1\), this is impossible. For \(k>w\), miniception (Zheng, Kingsford, and Marçais 2020) is better at \(1.67/w\), and in fact, mod-minimizer (Groot Koerkamp and Pibiri 2024) is even better and asymptotically reaches density \(1/w\), so this \(1.8/(w+1)\) is quite meaningless anyway.
  • A remark that the original lower bound doesn’t apply because of overly strong assumptions would be in place here. Otherwise the sentence kinda contradicts itself.

Asymptotically optimal minimizers

This dual-minimizer setup has been shown to achieve an upper bound expected density of \(1.67/(w + 1)\), which is lower than the \(2/(w + 1)\) density of traditional random minimizers.

  • Again, only for \(k>w\).

the lower bound of the resulting sketch (\(1.67/(w + 1)\)) is higher than the theoretical lower bound (\(1/w\)), which can be achieved using UHS or Polar Sets.

  • should say upper bound instead.

  • This paragraph is titled asymptotically optimal minimizers, yet it only talks about miniception, which is not in fact asymptotically optimal. UHS and Polar sets are also not really ‘plain’ minimizers.

    Instead, Marçais, DeBlasio, and Kingsford (2018) present an actual asymptotic optimal minimizer scheme based on universal hitting sets, and Groot Koerkamp and Pibiri (2024) present an asymptotic optimal scheme with much lower density in practice.

References

Groot Koerkamp, Ragnar, and Giulio Ermanno Pibiri. 2024. “The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers.” In 24th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P. Pissis and Wing-Kin Sung, 312:11:1–11:23. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.11.
Marçais, Guillaume, Dan DeBlasio, and Carl Kingsford. 2018. “Asymptotically Optimal Minimizers Schemes.” Bioinformatics 34 (13): i13–22. https://doi.org/10.1093/bioinformatics/bty258.
Marçais, Guillaume, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, and Carl Kingsford. 2017. “Improving the Performance of Minimizers and Winnowing Schemes.” Bioinformatics 33 (14): i110–17. https://doi.org/10.1093/bioinformatics/btx235.
Marçais, Guillaume, Brad Solomon, Rob Patro, and Carl Kingsford. 2019. “Sketching and Sublinear Data Structures in Genomics.” Annual Review of Biomedical Data Science 2 (1): 93–118. https://doi.org/10.1146/annurev-biodatasci-072018-021156.
Ndiaye, Malick, Silvia Prieto-Baños, Lucy M. Fitzgerald, Ali Yazdizadeh Kharrazi, Sergey Oreshkov, Christophe Dessimoz, Fritz J. Sedlazeck, Natasha Glover, and Sina Majidian. 2024. “When Less Is More: Sketching with Minimizers in Genomics.” Genome Biology 25 (1). https://doi.org/10.1186/s13059-024-03414-4.
Orenstein, Yaron, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford. 2016. “Compact Universal K-Mer Hitting Sets.” In Algorithms in Bioinformatics, 257–68. Springer International Publishing. https://doi.org/10.1007/978-3-319-43681-4_21.
———. 2017. “Designing Small Universal K-Mer Hitting Sets for Improved Analysis of High-Throughput Sequencing.” Edited by Benjamin J. Raphael. Plos Computational Biology 13 (10): e1005777. https://doi.org/10.1371/journal.pcbi.1005777.
Roberts, Michael, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. 2004. “Reducing Storage Requirements for Biological Sequence Comparison.” Bioinformatics 20 (18): 3363–69. https://doi.org/10.1093/bioinformatics/bth408.
Zheng, Hongyu, Carl Kingsford, and Guillaume Marçais. 2020. “Improved Design and Analysis of Practical Minimizers.” Bioinformatics 36 (Supplement\_1): i119–27. https://doi.org/10.1093/bioinformatics/btaa472.