[WIP] Bitpacking and string searching


\[ \newcommand{\ceil}[1]{\lceil#1\rceil} \newcommand{\floor}[1]{\lfloor#1\rfloor} \]

Intro

This note summarizes some papers I was reading while investigating the history of bitpacking methods such as Myers (1999). This post was sparked by Loving, Hernandez, and Benson (2014), which cites many works of Hyyrö. I then discovered that Hyyrö in fact wrote many papers on bitpacking methods. While reading these, I also found many citations to (approximate) string searching algorithms (aka semi-global alignment, aka (approximate) pattern matching). In particular, Baeza-Yates and Navarro wrote many papers on this topic. Some of those use bitpacking while others don’t. Papers also differ in the distance metric they use (mismatches (Hamming distance), matches (LCS), differences (edit distance), or indel distance) and in the approach they take (automata or DP); see the corresponding sections.

For the classic DP bit-packing methods like Myers (1999), see DP methods.

In the semi-global setting, the convention is that the (short) pattern has length \(m\) (sometimes assumed to be at most the word size \(w\)), and that the text has length \(n\gg m\). Semi-global alignment is also called (approximate) string matching. For global alignment, typically \(m\sim n\). Throughout this post, I assume the alphabet size is constant.

Generally, there is a distinction between practical algorithms and theoretical algorithms. The latter often require e.g. a suffix array for \(O(m)\) runtime and are hence slower in practice than an \(O(m^2)\) DP-based implementation.

PDFs of some papers listed here are hard to find. I’m happy to share them on request.

Review papers

Chang and Lampe (1992) “Theoretical and Empirical Comparisons of Approximate String Matching Algorithms”
Very comprehensive 10-page comparison of the theory of many early algorithms, which are also analysed on random strings. Introduces the term diagonal transition.

TODO: Read this carefully.

Figure 1: Summary of Chang and Lampe (1992). Such a nice table. Empirical runtime should be more widespread.

Baeza-Yates (1992) “Text-Retrieval: Theory and Practice”
15-page review, but it focuses on a slightly different topic.
Navarro (2001) “A Guided Tour to Approximate String Matching”
50-page review paper. Useful as a reference, but long to read.
Navarro and Raffinot (2002) “Flexible Pattern Matching in Strings”
200-page book. I didn’t actually read this; it just appeared as a citation.

DP methods

LCS

Allison and Dix (1986)

“A Bit-String Longest-Common-Subsequence Algorithm”

Figure 2: Formula of Allison and Dix (1986), as shown by Hyyrö (2004).

  • First bit-parallel LCS; well written
  • \(O(n\ceil{m/w})\) LCS
  • Contours
  • Bit-profile, called Alphabet strings
  • Encoding: bits of \(row_i\) store horizontal differences (in \(\{0,1\}\)) within rows.
  • \(6\) operations per word: \[row_i = x\land ((x-((row_{i-1}\ll 1)|1)) \neq x),\] where \(x=row_{i-1} \lor b_i\) and \(\neq\) denotes bitwise XOR (see the sketch after this list).
  • Equivalently (Hyyrö 2004): \[row_i = x\land (\sim(x-((row_{i-1}\ll 1)|1))).\]
  • The \(-\) and \(\ll\) have to be carried through the entire row if it consists of multiple words.
  • The \(|1\) can be absorbed into the carry-through instructions (Hyyrö 2004).
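
A minimal single-word Rust sketch of this recurrence (my own code, not from the paper; it assumes the second string fits in one 64-bit word and byte alphabets, and all names are mine):

    /// LCS length via the Allison–Dix bit-string recurrence (in Hyyrö's ~-form).
    /// Single-word sketch: assumes b.len() <= 64.
    fn lcs_allison_dix(a: &[u8], b: &[u8]) -> u32 {
        assert!(b.len() <= 64);
        // Bit-profile ("alphabet strings"): bit j of profile[c] is set iff b[j] == c.
        let mut profile = [0u64; 256];
        for (j, &c) in b.iter().enumerate() {
            profile[c as usize] |= 1 << j;
        }
        // Bits of `row` are the horizontal 0/1 differences within the current DP row.
        let mut row = 0u64;
        for &c in a {
            let x = row | profile[c as usize];
            row = x & !(x.wrapping_sub((row << 1) | 1));
        }
        row.count_ones() // LCS length = number of set bits in the final row
    }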

Crochemore et al. (2001)

“A Fast and Practical Bit-Vector Algorithm for the Longest Common Subsequence Problem”

Figure 3: Crochemore et al. (2001) LCS contours

Figure 4: Formula of Crochemore et al. (2001), as shown by Hyyrö (2004).

  • \(O(n\ceil{m/w})\) LCS, like Allison and Dix (1986)
  • \(4\) instead of \(5\) bit-wise operations.
  • Contours
  • Bit-profile \(M[y_j]\) for $j$th char of \(y\), and \(M’[y_j]\) is its negation.
  • Encoding: \(\Delta L’_j\) is the negation of the differences in a column, i.e. \(1\) when two consecutive values are the same.
  • \(4\) operations, but \(2\) table accesses: \[\Delta L’_{j+1} = (\Delta L’_j + (\Delta L’_j \land M[y_j])) \lor (\Delta L’_j \land M’[y_j])\]
  • In practice computing \(M’[y_j] = \sim M[y_j]\) on the fly is faster (Hyyrö 2004).
  • Only the \(+\) has to be carried through the entire row if it consists of multiple words.

Hyyrö (2004)

“Bit-Parallel Lcs-Length Computation Revisited”

Figure 5: Formula of Hyyrö (2004).

Figure 6: Tiling the restricted grid of Hyyrö (2004).

  • Reviews Allison and Dix (1986) and Crochemore et al. (2001); very well written.

  • \(O(n \ceil{m/w})\) LCS, or \(O(n\ceil{d/w})\) LCS based on Ukkonen’s band doubling for simple edit distance, i.e. edit distance without substitutions.

  • Bit-profile \(PM_\lambda\) called pattern matching bit-vectors or match vector for \(\lambda\in \Sigma\).

  • Uses same encoding as Crochemore et al. (2001), but calls it \(V’\).

  • One operation/table lookup fewer than Crochemore et al. (2001) (see the sketch after this list):

    \begin{align*} U &= V’ \& PM_{B_j}\\ V’ &= (V’ + U) | (V’ - U) \end{align*}

  • Two carry-through operations between words.

  • Implementation is column-wise.

  • \(11\) operations of overhead in the loop for carries, loop control, and the like.

  • Measured runtime differences between implementations are small (\(<20\%\)) and likely depend mostly on how well the compiler optimizes each version.

  • Figure 6 shows the tiling when given a lower bound on LCS (resp. upper bound on simple edit distance).
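
A minimal single-word Rust sketch of this update (my own code; it assumes the pattern fits in one 64-bit word and byte alphabets, and all names are mine):

    /// LCS length via Hyyrö's (2004) update V' = (V' + U) | (V' - U).
    /// Single-word sketch: assumes a.len() <= 64.
    fn lcs_hyyro(a: &[u8], b: &[u8]) -> u32 {
        let m = a.len();
        assert!(0 < m && m <= 64);
        // Match vectors PM: bit i of pm[c] is set iff a[i] == c.
        let mut pm = [0u64; 256];
        for (i, &c) in a.iter().enumerate() {
            pm[c as usize] |= 1 << i;
        }
        let mask = if m == 64 { !0 } else { (1u64 << m) - 1 };
        // V': complemented vertical differences of the current column
        // (a 1 means two consecutive values in the column are equal).
        let mut v = mask;
        for &c in b {
            let u = v & pm[c as usize];
            v = (v.wrapping_add(u) | v.wrapping_sub(u)) & mask;
        }
        m as u32 - v.count_ones() // LCS length = number of 0-bits in V'
    }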

Hyyrö (2017)

“Mining Bit-Parallel Lcs-Length Algorithms”

  • Using an exhaustive search, it is shown that under reasonable assumptions LCS cannot be solved using \(3\) bitwise operations.
  • A total of \(5\) algorithms using \(4\) operations are found.

Figure 7: The five 4-operation LCS algorithms found by Hyyrö (2017).

Edit distance

Wright (1994)

“Approximate String Matching Using Withinword Parallelism”

Figure 8: DP of Wright (1994). I don’t get it.

Figure 9: Modified DP of Wright (1994). This one is self referential… I get it even less.

  • \(O(nm/w)\) antidiagonal processing of the DP-matrix using bitpacking on \(\mod 4\) values, as in Lipton and Lopresti (1985) (see below).
  • This is implemented without \(\min\) operations by using equalities instead.
  • Uses \(3\) bits per value: one padding for carry to avoid overflowing into the next state.
  • I have some confusions about this paper:
    • The DP recurrence (Figure 8 and Figure 9) seems to be missing \(+1\) for the indel terms. What am I missing??
    • Algorithm 1 has weird initialization (the for loop over \(i\) uses an uninitialized \(j\)??)
    • (Algorithm 2 makes sense again.)

Myers (1999)

“A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming”

Figure 10: Myers (1999) bitpacking algorithm when \(m\leq w\).

Figure 11: Myers (1999) bitpacking algorithm modification for \(m>w\).

  • \(O(n\ceil{m/w})\) Edit distance
  • Semi-global alignment. For long patterns, the technique of Wu, Manber, and Myers (1996) is used for \(O(n \ceil{k/w})\) expected time.
  • Bit-profile \(Peq\)
  • Bitpacking; \(15\) operations assuming horizontal input deltas are \(0\) and no horizontal output deltas are needed.
  • Encoding: Ph, Mh, Pv, Mv indicators store whether Horizontal/Vertical differences are Plus \(1\) or Minus \(1\). Horizontal deltas are standalone bits, and vertical deltas are packed.
  • Core observation: there is a carry effect when there are specific long runs of ones. This is similar to the carry of addition.
  • Core components are \(Xv = Eq | (\Delta v_{in} = M)\) and \(Xh = Eq | (\Delta h_{in} = M)\) (see the sketch after this list).
  • Between blocks in a column, \(h_{out}\) is computed and carried over, instead of carrying the addition and two shift operations individually.
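
A minimal single-word Rust sketch of the \(m\leq w\) search algorithm (my own code, not the paper's pseudocode; it assumes byte strings and unit costs, reports end positions of matches with score at most \(k\), and all names are mine):

    /// Approximate search in the style of Myers (1999), single word (m <= 64).
    fn myers_search(pattern: &[u8], text: &[u8], k: usize) -> Vec<usize> {
        let m = pattern.len();
        assert!(0 < m && m <= 64);
        // Bit-profile Peq: bit i of peq[c] is set iff pattern[i] == c.
        let mut peq = [0u64; 256];
        for (i, &c) in pattern.iter().enumerate() {
            peq[c as usize] |= 1 << i;
        }
        // Pv/Mv: vertical +1/-1 difference indicators of the current column.
        let (mut pv, mut mv) = (!0u64, 0u64);
        let mut score = m; // D[m][j]; column 0 has score m
        let msb = 1u64 << (m - 1);
        let mut ends = Vec::new();
        for (j, &c) in text.iter().enumerate() {
            let eq = peq[c as usize];
            let xv = eq | mv;
            let xh = (((eq & pv).wrapping_add(pv)) ^ pv) | eq;
            let ph = mv | !(xh | pv);
            let mh = pv & xh;
            // The horizontal difference at row m updates the running score.
            if ph & msb != 0 { score += 1; }
            if mh & msb != 0 { score -= 1; }
            // Shift horizontal differences down one row; the top row contributes 0.
            let ph = ph << 1;
            let mh = mh << 1;
            pv = mh | !(xv | ph);
            mv = ph & xv;
            if score <= k {
                ends.push(j); // match ending at text position j (0-based)
            }
        }
        ends
    }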

Figure 12: Myers (1999) block based algorithm for semi-global alignment.

Hyyrö (2001)

Explaining and Extending the Bit-Parallel Algorithm of Myers

Figure 13: Hyyrö (2001) bitpacking algorithm when \(m\leq w\).

  • \(O(n\ceil{m/w})\) edit distance, or \(O(n \ceil{k/w})\) expected time semi-global alignment.
  • Equivalent but slightly different bit algorithm than Myers (1999); core component is \(D0 = Xv | Xh\).
  • Also shows how to do transpositions (Damerau 1964).
  • Good introduction and exposition.
  • Uses \(15\) operations (\(HP_j\ll 1\) can be reused), the same count as Myers (1999); see the sketch below for the distance-computing variant.
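
For illustration, a minimal single-word Rust sketch of the distance-computing variant (my own code; it keeps the \(Xv/Xh\) grouping of the previous section rather than the \(D0\) grouping of this report, and the only change relative to searching is the \(1\) shifted into \(HP\), reflecting the \(+1\) horizontal deltas of the first row; all names are mine):

    /// Unit-cost edit distance via the Myers/Hyyrö bit-vector recurrence.
    /// Single-word sketch: assumes pattern.len() <= 64.
    fn edit_distance(pattern: &[u8], text: &[u8]) -> usize {
        let m = pattern.len();
        assert!(0 < m && m <= 64);
        let mut peq = [0u64; 256];
        for (i, &c) in pattern.iter().enumerate() {
            peq[c as usize] |= 1 << i;
        }
        let (mut pv, mut mv) = (!0u64, 0u64);
        let mut score = m; // D[m][0] = m
        let msb = 1u64 << (m - 1);
        for &c in text {
            let eq = peq[c as usize];
            let xv = eq | mv;
            let xh = (((eq & pv).wrapping_add(pv)) ^ pv) | eq;
            let ph = mv | !(xh | pv);
            let mh = pv & xh;
            if ph & msb != 0 { score += 1; }
            if mh & msb != 0 { score -= 1; }
            // For a distance computation the top row contributes a +1 horizontal
            // delta, hence the `| 1`; for searching, a 0 is shifted in instead.
            let ph = (ph << 1) | 1;
            let mh = mh << 1;
            pv = mh | !(xv | ph);
            mv = ph & xv;
        }
        score // edit distance between pattern and text
    }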

Hyyrö (2003)

“A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances”

Figure 14: Hyyrö (2003) tiles bit-vectors diagonally.

  • \(O(\ceil{d/w}m)\) edit distance (optionally with transpositions).
  • Merges bitpacking (Myers 1999) with band doubling (Ukkonen 1985a).
  • Introduces diagonal tiling, to better match the statically banded region of (Ukkonen 1985a).
  • Diagonal tiling allows the removal of some shifts (in the last two lines of Figure 13), but adds a shift in the opposite direction for \(D0\). This introduces a “backwards” dependency on the vector below-left of it that is not present with rectangular tiling.
  • The pattern-match vector \(PM_j\) is shifted to correct for the unaligned reads.
  • Includes a comparison with the band doubling and bitpacking algorithms of Ukkonen and Myers. Surprisingly, Ukkonen’s algorithm that computes values by increasing distance (i.e. Dijkstra) is reported as faster than the band doubling algorithm, even though Dijkstra is usually considered slow. Sadly no code is provided.

Hyyrö, Fredriksson, and Navarro (2004)

“Increased Bit-Parallelism for Approximate String Matching”

  • For short patterns, when \(m \ll w\), the \(O(\ceil{m/w}n)\) runtime wastes many bits of each word.
  • They show how to search for \(\floor{w/m}\) patterns simultaneously, by packing multiple patterns in a single word, for \(O(\ceil{r/\floor{w/m}}n)\) total time for \(r\) patterns.
  • They show how to search for a single pattern in \(O(n/\floor{w/m})\).
  • They apply the cut-off techniques to improve this to \(O(\ceil{r/\floor{w/k}}n)\) and \(O(n/\floor{w/k})\) expected time respectively when at most \(k\) errors are allowed.
  • To avoid interference when adding/shifting, the most significant bit of each pattern is set to \(0\) beforehand.
  • The score at each position is tracked by packing $m$-bit counters into a single word, together with adding some offset to make detection of \(>k\) values easy.
  • To efficiently search a single pattern, the text is split into \(r:=\floor{w/m}\) chunks. Then, instead of searching multiple patterns against the same text, one can search the same pattern against different texts by or-ing together the bit-profile of the different text characters.

Hyyrö, Fredriksson, and Navarro (2005)

“Increased Bit-Parallelism for Approximate and Multiple String Matching”

This is the journal version of the conference paper Hyyrö, Fredriksson, and Navarro (2004) above, and includes a few more results.

  • It applies the ideas to multiple exact string matching, searching \(r\) patterns in average time \(O(\ceil{r \log_\sigma w/w}n)\), by searching the first \(\ceil{\log_\sigma w}\) characters of each pattern in parallel.

  • It applies to one-vs-all edit distance, where multiple patterns are packed in a word, and similar for LCS.

Hyyrö and Navarro (2002)

“Faster Bit-Parallel Approximate String Matching”

This is the conference paper of the journal paper Hyyrö and Navarro (2004) below.

Hyyrö and Navarro (2006)

“Bit-Parallel Computation of Local Similarity Score Matrices with Unitary Weights”

  • Solves local alignment using bitpacking in \(O(mn \log \min(m,n,w)/w)\) time.

  • Contains a quite nice introduction on global vs semi-global vs local alignment.

    Local similarity computation needs a somewhat different arrangement and, curiously, it seems not expressible using the distance model, but just the score model.

  • Score model \(+1\) for match, \(-1\) for substitution and indel.

  • Difficulties:

    • Absolute scores must be known to do \(\max(0, \cdot)\).
    • Cells can differ by \(2\).
  • Introduces witnesses: Every \(Q = O(\log \min(m,n))\) rows the absolute value is stored and tracked (using bitpacking). For each column, all absolute values are then compared against \(0\) and \(k\) by taking the \(m/Q\) known values and incrementally shifting these down using the known vertical differences.

  • The resulting algorithm is a beast with \(21\) lines of code each containing multiple bit operations.

Indel distance

This is the problem of finding all matches of a pattern in a string with indel distance at most \(k\), where only indels (insertions and deletions) are allowed, and substitutions are not allowed (or equivalently, have cost \(2\)). Note that in the semi-global alignment case this is not exactly equivalent to LCS.

Lipton and Lopresti (1985)

“A Systolic Array for Rapid String Comparison”

Figure 15: DP of Lipton and Lopresti (1985).

  • Introduces a recurrence on the DP matrix using mod 4 arithmetic.
  • Parallel over antidiagonals
  • Note: this has substitution cost \(2\), so it’s actually using indel distance.

Hyyrö, Pinzon, and Shinohara (2005a)

“Fast Bit-Vector Algorithms for Approximate String Matching under Indel Distance”

Modifies existing algorithms for indel-distance.

  • Myers (1999) requires modifications, because along diagonals values can now increase by \(2\) instead of only \(0\) and \(1\). Runtime is \(O(\ceil{m/w}n)\).

  • Wu and Manber (1992) and Baeza-Yates and G. Navarro (1999) are modified by simply removing one case from the DP recurrence. Runtimes are \(O(k\ceil{m/w}n)\) and \(O(\ceil{(k+2)(m-k)/w}n)\) respectively, and are faster than Myers (1999) for small \(k\) and small \(m\).

  • The recurrence for \(D\) in the first paragraph of the introduction (and also later in the introduction) seems to be missing some \(+1\) terms for indels. Or maybe I’m missing something?

Hyyrö, Pinzon, and Shinohara (2005b)

“New Bit-Parallel Indel-Distance Algorithm”

  • Improves the runtime of the indel-distance variant of Myers (1999) introduced in (Hyyrö, Pinzon, and Shinohara 2005a) from \(26\) operations to \(21\) operations.
  • The overall speedup was \(24.5\%\), more than the instruction savings of \(19\%\).
  • Has a much clearer presentation than the previous paper.

Automata methods

These methods are based on automata (as opposed to a DP matrix). This means that instead of storing the distance to each state, they for instance indicate in a bit vector/matrix \(R^d\) whether the state can be reached with cost \(d\). Nevertheless, some of these methods seem very closely related to DP-based methods, and this categorisation is not absolute. Rather, I just needed some way to separate out the papers a bit.

This category also includes e.g. the Knuth-Morris-Pratt algorithm for linear time exact string matching.

Hamming distance

Landau and Vishkin (1986)

“Efficient String Matching with K Mismatches”

  • \(O(k(m\log m+n))\) approximate string matching under Hamming distance.
  • Constructs some kind of jumplists with mismatches to quickly determine the new number of mismatches after shifting the pattern. (But I did not read or understand in detail.)

Baeza-Yates and Gonnet (1992)

“A New Approach to Text Searching”

  • \(O(mn/w)\) string search under Hamming distance; one of the first uses of bitpacking.

    • The shift-add algorithm works by counting the number of mismatches along diagonals of the \(m\times n\) matrix (see the sketch after this list).
  • Already submitted in 1989.

  • Supposedly builds on theory of finite automata, but seems like a pretty direct algorithm to me.

  • Extends to matching with character classes, but still only does Hamming distance.

  • \(O(nm \log k/w)\) when counting up to $k$-mismatches (Hamming distance), by storing each count in \(\log k\) bits (instead of a single match/mismatch bit).

  • Multiple exact string matching in \(O(\ceil{m_{sum}/w}n)\) time.

  • side note: 3 column layout is terrible – too much scrolling up and down.
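
A minimal Rust sketch of the shift-add idea (my own code; for simplicity it uses \(\ceil{\log_2(m+1)}\) bits per counter so counters never overflow, rather than the \(\log k\)-bit counters with overflow handling mentioned above; it assumes \(m\ceil{\log_2(m+1)}\le 64\) and byte strings, and all names are mine):

    /// Report end positions where the pattern matches with at most k mismatches
    /// (Hamming distance), shift-add style. Assumes m * ceil(log2(m+1)) <= 64.
    fn shift_add_search(pattern: &[u8], text: &[u8], k: u64) -> Vec<usize> {
        let m = pattern.len();
        assert!(m >= 1);
        let b = (usize::BITS - m.leading_zeros()) as usize; // ceil(log2(m+1)) bits per counter
        assert!(m * b <= 64);
        // mism[c]: counter field i holds 1 iff pattern[i] != c.
        let mut mism = [0u64; 256];
        for c in 0..256usize {
            for (i, &p) in pattern.iter().enumerate() {
                if p as usize != c {
                    mism[c] |= 1 << (i * b);
                }
            }
        }
        let mask = if m * b == 64 { !0 } else { (1u64 << (m * b)) - 1 };
        let field = (1u64 << b) - 1;
        let top = (m - 1) * b;
        // Field i of `state` counts mismatches of pattern[..=i] vs the last i+1 text chars.
        let mut state = 0u64;
        let mut ends = Vec::new();
        for (j, &c) in text.iter().enumerate() {
            // Shift every counter up one field and add this character's mismatch indicators.
            state = (state << b).wrapping_add(mism[c as usize]) & mask;
            // The counter for the full pattern sits in the top field.
            if j + 1 >= m && ((state >> top) & field) <= k {
                ends.push(j);
            }
        }
        ends
    }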

TODO Baeza-Yates and Gonnet (1994)

“Fast String Matching with Mismatches”

Edit distance

Ukkonen (1985b)

“Finding Approximate Patterns in Strings”

When the pattern length \(m\) is small, one can build an automaton where states correspond to columns of the DP matrix, and transitions to parsing a character of the text.

  • This gives \(O(n)\) processing time once the automaton has been built.
  • There can be up to \(K=\min(3^m, 2^t \cdot |\Sigma|^t\cdot m^{t+1})\) different columns, meaning \(O(m\cdot |\Sigma|\cdot K)\) precomputation can be very slow.
  • All states can be found from the initial state by BFS/DFS.
  • Not practical.

Landau and Vishkin (1985)

“Efficient String Matching in the Presence of Errors”

  • Note that Landau and Vishkin (1986) was submitted before this.

  • Generalizes this earlier paper to edit distance in \(O(m^2 + k^2 n)\) time.

    we first build a table based on analysis of the pattern. Then, we examine the text from left to right checking possible occurrences with respect to one starting location (in the text) at each iteration.

  • Like the Hamming distance paper, this seems quite involved and technical.

  • TODO: read more carefully at some point (although the details do not sound super interesting/relevant).

Landau and Vishkin (1988)

“Fast String Matching with K Differences”

  • Builds on Landau and Vishkin (1985) (earlier conference paper).
  • \(O(m + k^2 n)\) approximate string search.

Wu and Manber (1992)

“Fast Text Searching”

  • Published in same journal issue as Baeza-Yates and Gonnet (1992) and extends it to edit distance.

  • \(O(nkm/w)\) to find all matches of cost at most \(k\).

    • Introduces indicator bit-vectors \(R^d_j[i]\), such that \(R^d\) stores whether there is a path of cost \(d\) to DP state \((i,j)\) (see the sketch after this list).
  • Introduces the partition approach: If \(k\ll m\) errors are allowed, at least one part must match when the pattern is partitioned into pieces of length \(r=\floor{m/(k+1)}\). Thus, one can first do an exact search for multiple patterns, followed by an inexact search around matches.

    (TODO: Cite for A*PA)

  • Solves multiple exact pattern matching by interleaving the (equal length) patterns and shifting left by the number of patterns, instead of by \(1\).

  • Extends to character classes and wild cards, like Baeza-Yates and Gonnet (1992).

  • Extends to small integer costs for operations.

  • For long patterns and \(k>w\), only up to \(w\) errors are considered by default, and more are computed only when a good candidate is found.

  • For regular expressions, nondeterministic finite automata are used.

  • agrep: Approximate grep.
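
A minimal single-word Rust sketch of the \(R^d\) recurrence (shift-and orientation; my own code, assuming \(m\le 64\), \(k<m\), and byte strings, with all names mine):

    /// Report end positions of approximate occurrences with at most k differences
    /// (unit-cost edit distance), Wu–Manber style. Assumes m <= 64 and k < m.
    fn wu_manber_search(pattern: &[u8], text: &[u8], k: usize) -> Vec<usize> {
        let m = pattern.len();
        assert!(0 < m && m <= 64 && k < m);
        let mut peq = [0u64; 256];
        for (i, &c) in pattern.iter().enumerate() {
            peq[c as usize] |= 1 << i;
        }
        // r[d], bit i: pattern[..=i] matches some suffix of the scanned text with <= d errors.
        let mut r: Vec<u64> = (0..=k).map(|d| (1u64 << d) - 1).collect();
        let msb = 1u64 << (m - 1);
        let mut ends = Vec::new();
        for (j, &c) in text.iter().enumerate() {
            let eq = peq[c as usize];
            let mut prev = r[0]; // R^{d-1} of the previous column
            r[0] = ((r[0] << 1) | 1) & eq;
            for d in 1..=k {
                let old = r[d];
                //       match            insertion  substitution   deletion
                r[d] = ((old << 1) & eq) | prev | (prev << 1) | (r[d - 1] << 1) | 1;
                prev = old;
            }
            if r[k] & msb != 0 {
                ends.push(j); // occurrence ending at text position j
            }
        }
        ends
    }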

Bergeron and Hamel (2002)

“Vector Algorithms for Approximate String Matching”

  • Extends (Baeza-Yates and Gonnet 1992) and (Myers 1999) to arbitrary bounded integer weights for edit distance, showing that \(O(c \log c)\) bit vector operations are sufficient per column transition, for \(O(mnc\log(c)/w)\) total runtime.
  • Very formal paper – hard to read and understand what is practically going on.
  • Does not give an actual algorithm/implementation, nor experiments.

Suffix array methods

Hamming distance

Galil and Giancarlo (1986)

“Improved String Matching with K Mismatches”

  • \(O(m+nk)\) for $k$-mismatches string matching
  • Uses a suffix array with LCA for \(O(1)\) extension. At each position, simply extend \(k\) times.

Grossi and Luccio (1989)

“Simple and Efficient String Matching with K Mismatches”

Based on permutations.

  • First finds all occurrences of permutations of the pattern in the text.
  • Extend that to find all permutations of the pattern with at most \(k\) mismatches.
  • For each match, check if the Hamming distance is small.
  • Personal remark: it should be possible to extend this to edit distance.

Edit distance

Landau and Vishkin (1989)

“Fast Parallel and Serial Approximate String Matching”

  • \(O(nk)\) time approximate string matching under edit distance.
  • \(O(\log m + k)\) serial algorithm using \(n\) processors.
  • Uses diagonal transition (Ukkonen 1983, 1985a) together with \(O(1)\) extension using a suffix array. (Similar to Galil and Giancarlo (1986).)

Galil and Park (1990)

“An Improved Algorithm for Approximate String Matching”

  • \(O(kn)\) approximate string matching with \(k\) differences
  • Diagonal transition
  • Builds \(O(m^2)\) lookup table for longest-common-prefix between any two suffixes of the pattern.
  • Uses reference triples \((u,v,w)\): a maximal substring \(u\dots v\) of the text that equals \((u-w)\dots (v-w)\) of the pattern.
  • Using these, diagonals can be extended efficiently.
  • TODO: try to understand this better. It sounds interesting, but needs careful reading.

Chang and Lawler (1990)

“Approximate String Matching in Sublinear Expected Time”

Based on suffix trees.

  • Edit distance in \(O(m)\) space and \(O(nk/m \cdot \log m)\) sublinear expected time on random strings. Worst case \(O(nk)\).

  • quote:

    Previous algorithms require at least \(O(nk)\) time. When \(k\) is as large as a fraction of \(m\), no substantial progress has been made over \(O(nm)\) dynamic programming.

  • Twice cites Ukkonen, personal communication :(

  • Exact matching in sublinear expected time: For positions \(s \in \{m, 2m, 3m, \dots\}\), find all \(j\) such that \(T[s+1, \dots, j]\) is a suffix of the pattern. For random strings, such a \(T[s+1,\dots,j]\) with \(j-s > \log(m)\) is not a suffix with high probability, so there are only \(O((n/m) \log m)\) matches in total, each of which is quickly checked.

  • In the inexact case, for each position \(S_j\) in \(T\) we can query the length \(\ell\) of the longest prefix of \(T[S_j, \dots]\) that is a substring of the pattern, and then jump to \(S_{j+1} = S_j + \ell +1\). If \(S_{j+k+2} - S_j \geq m-k\), it may be possible to match all \(m\) chars here with at most \(k\) mistakes, in which case a DP-based algorithm can be used for verification.

  • To obtain sublinear expected time, the text can be partitioned into chunks of length \((m-k)/2\), so that any match includes at least one whole chunk. Then, we can make \(k+1\) maximum jumps from the start of each chunk. Only if those span the entire chunk is a match possible there.

Chang and Lawler (1994)

“Sublinear Approximate String Matching and Biological Applications”

This is the journal version of the conference paper Chang and Lawler (1990). It seems no additional methods are introduced. (Rather, additional applications are shown.)

Other

Some more papers that I downloaded as part of this reading session, but that turned out somewhat unrelated.

Hyyrö (2008a)

“An Efficient Linear Space Algorithm for Consecutive Suffix Alignment under Edit Distance (Short Preliminary Paper)”

Solves the problem of consecutive suffix alignment, where \(A\) is aligned to prefixes of growing suffixes \(B_{j..n}\) for decreasing \(j\). Gives an \(O((m+n)n)\) time and \(O(m+n)\) space algorithm, which is the first linear-space algorithm for this problem.

This can be used when the end position of a match in approximate string matching is known, and the start position needs to be recovered.

The algorithm description looks very technical, and sadly no high-level overview and/or figures are provided, so I did not read this in detail.

Hyyrö, Narisawa, and Inenaga (2010)

“Dynamic Edit Distance Table under a General Weighted Cost Function”

This generalizes Hyyrö (2008a) to non-unit cost weights.

It has a somewhat intuitive explanation of an earlier algorithm of Kim and Park.

TODO

Many citations link to Lecture Notes in Computer Science instead of the original conference. Ideally this is fixed.

TODO

Chang and Lampe (1992)

“Theoretical and Empirical Comparisons of Approximate String Matching Algorithms”

Baeza-Yates 1989 Improved string searching

Baeza-Yates 1989 Efficient text searching (PhD thesis)

Baeza-Yates 1989 string searching algorithms revisited

Baeza-Yates and Perleberg (1996)

“Fast and Practical Approximate String Matching”

Baeza-Yates and Navarro (1996)

“A Faster Algorithm for Approximate String Matching”

Baeza-Yates and G. Navarro (1999)

“Faster Approximate String Matching”

Fredriksson (2003)

Baeza-Yates and Navarro (2004)

Hyyrö and Navarro (2004)

“Bit-Parallel Witnesses and Their Applications to Approximate String Matching”

  • New bitpacking technique building on ABNDM [TODO] and Myers (1999) bitpacking.
  • Introduces witnesses: sparse states of the DP matrix from which others can be bounded quickly.
  • Improves Myers (1999) as well.
  • \(\alpha = k/m\) is called difference ratio.

Fredriksson and Navarro (2003)

“Average-Optimal Multiple Approximate String Matching” This is the conference paper corresponding to the twice as long journal paper Fredriksson and Navarro (2004).

Fredriksson and Navarro (2004)

“Average-Optimal Single and Multiple Approximate String Matching”

Fredriksson and Grabowski (2005)

“Practical and Optimal String Matching”

Farrar (2006)

“Striped Smith–Waterman Speeds Database Searches Six Times over Other Simd Implementations”

Hyyrö (2008b)

“Improving the Bit-Parallel Nfa of Baeza-Yates and Navarro for Approximate String Matching” bit parallel NFA

Benson, Hernandez, and Loving (2013)

Setyorini et al. (2017)

Goenka et al. (2020)

Mishin, Berezun, and Tiskin (2021)

Loving, Hernandez, and Benson (2014)


References

Allison, Lloyd, and Trevor I. Dix. 1986. “A Bit-String Longest-Common-Subsequence Algorithm.” Information Processing Letters 23 (5): 305–10. https://doi.org/10.1016/0020-0190(86)90091-8.
Baeza-Yates, R.A., and G.H. Gonnet. 1994. “Fast String Matching with Mismatches.” Information and Computation 108 (2): 187–99. https://doi.org/10.1006/inco.1994.1007.
Baeza-Yates, Ricardo A. 1992. “Text-Retrieval: Theory and Practice.” In Proceedings of the Ifip 12th World Computer Congress on Algorithms, Software, Architecture - Information Processing ’92, Volume 1 - Volume I, 465–76. NLD: North-Holland Publishing Co.
Baeza-Yates, Ricardo A., and Chris H. Perleberg. 1996. “Fast and Practical Approximate String Matching.” Information Processing Letters 59 (1): 21–27. https://doi.org/10.1016/0020-0190(96)00083-x.
Baeza-Yates, Ricardo, and Gaston H. Gonnet. 1992. “A New Approach to Text Searching.” Communications of the Acm 35 (10): 74–82. https://doi.org/10.1145/135239.135243.
Baeza-Yates, Ricardo, and Gonzalo Navarro. 1996. “A Faster Algorithm for Approximate String Matching.” Lecture Notes in Computer Science, 1–23. https://doi.org/10.1007/3-540-61258-0_1.
———. 2004. “Text Searching: Theory and Practice.” Studies in Fuzziness and Soft Computing, 565–97. https://doi.org/10.1007/978-3-540-39886-8_30.
Baeza-Yates, and R. G. Navarro. 1999. “Faster Approximate String Matching.” Algorithmica 23 (2): 127–58. https://doi.org/10.1007/pl00009253.
Benson, Gary, Yozen Hernandez, and Joshua Loving. 2013. “A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm.” Lecture Notes in Computer Science, 50–61. https://doi.org/10.1007/978-3-642-38905-4_7.
Bergeron, Anne, and Sylvie Hamel. 2002. “Vector Algorithms for Approximate String Matching.” International Journal of Foundations of Computer Science 13 (01): 53–65. https://doi.org/10.1142/s0129054102000947.
Chang, W. I., and E. L. Lawler. 1994. “Sublinear Approximate String Matching and Biological Applications.” Algorithmica 12 (4–5): 327–44. https://doi.org/10.1007/bf01185431.
Chang, W.I., and E.L. Lawler. 1990. “Approximate String Matching in Sublinear Expected Time.” Proceedings 31st Annual Symposium on Foundations of Computer Science. https://doi.org/10.1109/fscs.1990.89530.
Chang, William I., and Jordan Lampe. 1992. “Theoretical and Empirical Comparisons of Approximate String Matching Algorithms.” Lecture Notes in Computer Science, 175–84. https://doi.org/10.1007/3-540-56024-6_14.
Crochemore, Maxime, Costas S. Iliopoulos, Yoan J. Pinzon, and James F. Reid. 2001. “A Fast and Practical Bit-Vector Algorithm for the Longest Common Subsequence Problem.” Information Processing Letters 80 (6): 279–85. https://doi.org/10.1016/s0020-0190(01)00182-x.
Damerau, Fred J. 1964. “A Technique for Computer Detection and Correction of Spelling Errors.” Communications of the Acm 7 (3): 171–76. https://doi.org/10.1145/363958.363994.
Farrar, Michael. 2006. “Striped Smith–Waterman Speeds Database Searches Six Times over Other Simd Implementations.” Bioinformatics 23 (2): 156–61. https://doi.org/10.1093/bioinformatics/btl582.
Fredriksson, Kimmo. 2003. “Row-Wise Tiling for the Myers’ Bit-Parallel Approximate String Matching Algorithm.” Edited by Mario A. Nascimento, Edleno Silva de Moura, and Arlindo L. Oliveira, Lecture notes in computer science, 2857: 66–79. https://doi.org/10.1007/978-3-540-39984-1_6.
Fredriksson, Kimmo, and Szymon Grabowski. 2005. “Practical and Optimal String Matching,” Spire’05, , 376–87. https://doi.org/10.1007/11575832_42.
Fredriksson, Kimmo, and Gonzalo Navarro. 2003. “Average-Optimal Multiple Approximate String Matching.” In Combinatorial Pattern Matching, edited by Ricardo Baeza-Yates, Edgar Chávez, and Maxime Crochemore, 109–28. Berlin, Heidelberg: Springer Berlin Heidelberg.
———. 2004. “Average-Optimal Single and Multiple Approximate String Matching.” Acm Journal of Experimental Algorithmics 9 (December). https://doi.org/10.1145/1005813.1041513.
Galil, Z, and R Giancarlo. 1986. “Improved String Matching with K Mismatches.” Acm Sigact News 17 (4): 52–54. https://doi.org/10.1145/8307.8309.
Galil, Zvi, and Kunsoo Park. 1990. “An Improved Algorithm for Approximate String Matching.” Siam Journal on Computing 19 (6): 989–99. https://doi.org/10.1137/0219067.
Goenka, Sneha D., Yatish Turakhia, Benedict Paten, and Mark Horowitz. 2020. “Segalign: A Scalable Gpu-Based Whole Genome Aligner.” Sc20: International Conference for High Performance Computing, Networking, Storage and Analysis, November. https://doi.org/10.1109/sc41405.2020.00043.
Grossi, R., and F. Luccio. 1989. “Simple and Efficient String Matching with K Mismatches.” Information Processing Letters 33 (3): 113–20. https://doi.org/10.1016/0020-0190(89)90188-9.
Hyyrö, H. 2001. Explaining and Extending the Bit-Parallel Algorithm of Myers. Julkaisusarja a. University of Tampere, Department of Computer and Information Sciences. https://books.google.ch/books?id=FH7vcQAACAAJ.
Hyyrö, Heikki. 2003. “A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances.” Nordic Journal of Computing 10 (1): 29–39.
———. 2004. “Bit-Parallel Lcs-Length Computation Revisited.” In Proc. 15th Australasian Workshop on Combinatorial Algorithms (Awoca 2004), 16–27.
———. 2008a. “An Efficient Linear Space Algorithm for Consecutive Suffix Alignment under Edit Distance (Short Preliminary Paper).” Lecture Notes in Computer Science, 155–63. https://doi.org/10.1007/978-3-540-89097-3_16.
———. 2008b. “Improving the Bit-Parallel Nfa of Baeza-Yates and Navarro for Approximate String Matching.” Information Processing Letters 108 (5): 313–19. https://doi.org/10.1016/j.ipl.2008.05.026.
———. 2017. “Mining Bit-Parallel Lcs-Length Algorithms.” Lecture Notes in Computer Science, 214–20. https://doi.org/10.1007/978-3-319-67428-5_18.
Hyyrö, Heikki, Kimmo Fredriksson, and Gonzalo Navarro. 2004. “Increased Bit-Parallelism for Approximate String Matching.” Lecture Notes in Computer Science, 285–98. https://doi.org/10.1007/978-3-540-24838-5_21.
———. 2005. “Increased Bit-Parallelism for Approximate and Multiple String Matching.” Acm Journal of Experimental Algorithmics 10 (December). https://doi.org/10.1145/1064546.1180617.
Hyyrö, Heikki, Kazuyuki Narisawa, and Shunsuke Inenaga. 2010. “Dynamic Edit Distance Table under a General Weighted Cost Function.” Lecture Notes in Computer Science, 515–27. https://doi.org/10.1007/978-3-642-11266-9_43.
Hyyrö, Heikki, Yoan Pinzon, and Ayumi Shinohara. 2005a. “Fast Bit-Vector Algorithms for Approximate String Matching under Indel Distance.” Lecture Notes in Computer Science, 380–84. https://doi.org/10.1007/978-3-540-30577-4_44.
———. 2005b. “New Bit-Parallel Indel-Distance Algorithm.” Lecture Notes in Computer Science, 380–90. https://doi.org/10.1007/11427186_33.
Hyyrö, Heikki, and Gonzalo Navarro. 2002. “Faster Bit-Parallel Approximate String Matching.” Lecture Notes in Computer Science, 203–24. https://doi.org/10.1007/3-540-45452-7_18.
———. 2004. “Bit-Parallel Witnesses and Their Applications to Approximate String Matching.” Algorithmica 41 (3): 203–31. https://doi.org/10.1007/s00453-004-1108-z.
———. 2006. “Bit-Parallel Computation of Local Similarity Score Matrices with Unitary Weights.” International Journal of Foundations of Computer Science 17 (06): 1325–44. https://doi.org/10.1142/s0129054106004443.
Landau, Gad M, and Uzi Vishkin. 1989. “Fast Parallel and Serial Approximate String Matching.” Journal of Algorithms 10 (2): 157–69. https://doi.org/10.1016/0196-6774(89)90010-2.
Landau, Gad M., and Uzi Vishkin. 1985. “Efficient String Matching in the Presence of Errors.” 26th Annual Symposium on Foundations of Computer Science (Sfcs 1985). https://doi.org/10.1109/sfcs.1985.22.
———. 1986. “Efficient String Matching with K Mismatches.” Theoretical Computer Science 43: 239–49. https://doi.org/10.1016/0304-3975(86)90178-7.
———. 1988. “Fast String Matching with K Differences.” Journal of Computer and System Sciences 37 (1): 63–78. https://doi.org/10.1016/0022-0000(88)90045-1.
Lipton, Richard J, and Daniel Lopresti. 1985. “A Systolic Array for Rapid String Comparison.” In Proc. Of the Chapel Hill Conf. On Vlsi, 1985, 363–76.
Loving, Joshua, Yozen Hernandez, and Gary Benson. 2014. “Bitpal: A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm.” Bioinformatics 30 (22): 3166–73. https://doi.org/10.1093/bioinformatics/btu507.
Mishin, Nikita, Daniil Berezun, and Alexander Tiskin. 2021. “Efficient Parallel Algorithms for String Comparison.” 50th International Conference on Parallel Processing, August. https://doi.org/10.1145/3472456.3472489.
Myers, Gene. 1999. “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming.” Journal of the Acm 46 (3): 395–415. https://doi.org/10.1145/316542.316550.
Navarro, Gonzalo. 2001. “A Guided Tour to Approximate String Matching.” Acm Computing Surveys 33 (1): 31–88. https://doi.org/10.1145/375360.375365.
Navarro, Gonzalo, and Mathieu Raffinot. 2000. “Fast and Flexible String Matching by Combining Bit-Parallelism and Suffix Automata.” Acm Journal of Experimental Algorithmics 5 (December): 4. https://doi.org/10.1145/351827.384246.
———. 2002. “Flexible Pattern Matching in Strings,” May. https://doi.org/10.1017/cbo9781316135228.
Setyorini, Kuspriyanto, D H Widyantoro, and A Pancoro. 2017. “The Implementation of Bit-Parallelism for Dna Sequence Alignment.” Journal of Physics: Conference Series 835 (May): 012004. https://doi.org/10.1088/1742-6596/835/1/012004.
Ukkonen, Esko. 1983. “On Approximate String Matching.” Lecture Notes in Computer Science, 487–95. https://doi.org/10.1007/3-540-12689-9_129.
———. 1985a. “Algorithms for Approximate String Matching.” Information and Control 64 (1–3): 100–118. https://doi.org/10.1016/s0019-9958(85)80046-2.
———. 1985b. “Finding Approximate Patterns in Strings.” Journal of Algorithms 6 (1): 132–37. https://doi.org/10.1016/0196-6774(85)90023-9.
Wright, Alden H. 1994. “Approximate String Matching Using Withinword Parallelism.” Software: Practice and Experience 24 (4): 337–62. https://doi.org/10.1002/spe.4380240402.
Wu, Sun, U. Manber, and Gene Myers. 1996. “A Subquadratic Algorithm for Approximate Limited Expression Matching.” Algorithmica 15 (1): 50–67. https://doi.org/10.1007/bf01942606.
Wu, Sun, and Udi Manber. 1992. “Fast Text Searching.” Communications of the Acm 35 (10): 83–91. https://doi.org/10.1145/135239.135244.