\[\newcommand{\dol}{\$}\]
These are some comments and new ideas on the paper by Loukides, Pissis, Thankachan, and Zuba (2023).
Later note: there are some bugs in the new ideas I present here, due to my overly simplified understanding of some parts of the problem. But I don’t have time to clean things up.
Comments
Apart from the geometry, everything seems quite elementary.
Well written; the separation into Main Idea, Theorem, Construction, and Querying works well.
- It’s a bit hard to distil the state of the art for each query type before this paper.
Prelims
- Fig 2: slight inconsistency that failure transitions to the root are omitted on the left but shown on the right.
One-to-One
The tricky part here is the rank-select to find the closest preceding \(S_i\) suffix before \(S_j\). Would be cool if it could be replaced with something simpler.
- Fig 4a: \(\dol_i < \texttt{a}\), so really there should be another edge above the \(\dol_i\) to \(r_i\).
- Fig 4b: Where is \(w\)? Should be \(u\)?
- Missed opportunity to have \(l_i\) on the left of \(r_i\).
- “Hence this is equal to \(O(1+\log \log k)\)”: rather, this is bounded by \(O(1+\log \log k)\).
One-to-All
- Reference [2] looks weird?
- Neither [2] nor [10] mentions \(\tau\)-micro-macro decomposition by name exactly.
- Called topology tree in [2]? But those components do not have bounded size.
- Originally called restricted partitions.
- The relation between the \(\tau\)-micro-macro tree and the original tree is not explained, nor are its relevant algorithmic properties. Lemma 13 only talks about construction time.
- Turns out the details of this tree are kinda skipped over completely, but also the only important part is that we store \(\Theta(k)\) information for at most \(O(n/k)\) nodes, and that we can walk up from any node and find such a node in at most \(O(k)\) steps.
- Proof that they can be constructed:
- Start with each node in an independent component.
- Keep merging as long as the merged component has size \(<z\) and at most \(2\) boundary nodes.
- Leaves can …
Report and Count
- Some heavy lifting with geometry theorems here :D
- It’s not completely clear to me whether this was invoked specifically to get the slightly better \(O(\log n / \log \log n)\) instead of \(O(\log n)\). To me a non-black-box \(O(\log n)\) approach seems preferable, and anyway you don’t give the construction time of the \(O(\log n / \log \log n)\) method, so this seems kinda useless.
- This looks complicated initially but seems quite straightforward conceptually. The fix for double counting is also intuitive.
- Maybe some non-geometric algorithms can be used instead that simplify construction and total conceptual complexity. (See below.)
Top-\(K\)
It’s basically a binary search over Count, followed by some tricks for edge cases.
A small rant on \(\tau\)-micro-macro trees
I spent some time digging into citations to figure out how micro-macro trees are constructed, so let me persist it here.
Loukides et al. (2023) uses \(\tau\)-micro-macro decomposition. It is not clear whether this is a node or an edge partition. (The Wikipedia lemma on graph partition assumes node-disjoint clusters, as did I.)
Gagie et al. (2013) uses micro-macro decomposition and includes exactly the same figure as Loukides et al. (2023), but does not explain the construction. It is not clear whether the graph is partitioned by vertices or by edges. In fact, the paragraphs in these two papers are extremely similar. It cites:
Alstrup, Secher, and Spork (1997) (Lemma 2) uses micro trees and macro trees specifically for components of size at most \(\log n\). For the proof, it refers to:
Frederickson (1991) introduces the restricted partition of order \(z\), which partitions the vertices into clusters of size at most \(2z-2\), and states:
It is not hard to show that the number of clusters in a restricted partition of order \(z\) is \(\Theta(m/z)\).
which seems to be the first introduction of this concept, but does not provide an explicit construction, and in fact, this does not seem so obvious.
Frederickson (1997) is a second published version of the same paper as above. (Things were slow: received Feb. 1992, accepted June 1995, published April 1997.) Sadly Google prefers the previous paper, since this one has changed to:
It is not hard to show that the number of clusters in a restricted partition of order \(z\) is \(\Theta(m/z)\). We do this after the proof of the upcoming Lemma 2.2.
And indeed a paragraph with proof has been inserted, but it depends on properties of the multilevel partitions that are also introduced but not relevant here.
Alstrup et al. (1997) is cited by Loukides et al. (2023) and introduces topology trees which are trees of nested clusters of unbounded size, where each cluster has a boundary of size at most \(2\). I don’t yet see how this relates to micro-macro trees.
Bille and Gørtz (2011) is also cited by Loukides et al. (2023), and actually explains the construction algorithm in Lemma 5.1. It creates an edge-disjoint partition. It seems the proof that the number of clusters is \(O(n/s)\) is not super straightforward and requires some careful bounding of different types of clusters.
It refers back to Frederickson (1997), and Alstrup et al. (1997), stating that the construction is effectively the same, but this doesn’t seem super obvious to me. (But on the other hand, there really is only one sort of construction one can do here, so they all must be closely related.)
Anyway, the recursive algorithm \(cluster(v)\) presented in this last paper is (approximately) as follows (a rough code sketch follows the list):
- Consider each child \(u\) of \(v\) separately.
- If the subtree \(T(u)\) below \(u\) has at most \(\tau\) vertices, make a component out of \(\{v\}\cup V(T(u))\).
- Otherwise, choose a node \(w\) of maximum depth below \(u\) with at most \(\tau\) vertices between \(w\) and \(u\). Make an internal cluster out of these vertices, together with \(u\) and \(w\), and recurse on \(cluster(w)\).
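To make this concrete, here is a rough Python transcription of that recursion. All names are mine, and the handling of side subtrees hanging off the descent path (absorbed into the cluster while it still fits) is my reading of “together with \(u\) and \(w\)”; this sketches my paraphrase above, not Bille and Gørtz’s actual Lemma 5.1.

```python
def subtree_sizes(children, root=0):
    """children[v] is the list of v's children; returns subtree sizes."""
    order = [root]
    for v in order:                      # preorder via list extension
        order.extend(children[v])
    size = [1] * len(children)
    for v in reversed(order):            # children before parents
        for c in children[v]:
            size[v] += size[c]
    return size

def subtree_vertices(v, children):
    out, stack = [], [v]
    while stack:
        x = stack.pop()
        out.append(x)
        stack.extend(children[x])
    return out

def cluster(v, children, size, tau, clusters):
    """Rough sketch of the cluster(v) recursion paraphrased above."""
    for u in children[v]:
        if size[u] <= tau:
            # Case 1: small subtree; leaf cluster {v} + V(T(u)), boundary {v}.
            clusters.append([v] + subtree_vertices(u, children))
        else:
            # Case 2: descend from u towards the largest child subtree,
            # absorbing everything else, while the cluster stays <= tau.
            w, acc = u, [u]
            while True:
                kids = children[w]
                if not kids:
                    break
                heavy = max(kids, key=lambda c: size[c])
                side = [c for c in kids if c != heavy]
                extra = 1 + sum(size[c] for c in side)
                if size[heavy] <= tau or len(acc) + extra > tau:
                    break
                for c in side:
                    acc += subtree_vertices(c, children)
                w = heavy
                acc.append(w)
            clusters.append(acc)          # internal cluster, boundary {u, w}
            cluster(w, children, size, tau, clusters)
```

Starting from `clusters = []` and calling `cluster(root, children, subtree_sizes(children), tau, clusters)` covers the tree; clusters overlap only in boundary nodes, which at least matches the flavour of the constructions cited above.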
Ideas for simplification
Replace \(\tau\)-micro-macro tree
How about something simpler, like the following (a code sketch follows the list):
- Sort all nodes by decreasing depth. (\(O(n)\) using bucket/radix sort)
- Going from deep to not-deep: Walk up \(k-1\) steps, marking each visited vertex as SKIP.
- If reaching a vertex already marked SKIP: stop.
- If reaching a vertex marked SAVE, stop.
- Otherwise, mark the $k$th parent as SAVE.
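A minimal sketch of this pass, assuming the tree is given by parent pointers and depths; the off-by-ones around “walk up \(k-1\) steps” and the root handling are my guesses:

```python
def mark_save(parent, depth, k):
    """SKIP/SAVE marking as described above. parent[v] = parent of v
    (parent[root] == -1), depth[v] = depth of v."""
    n = len(parent)
    buckets = [[] for _ in range(max(depth) + 1)]   # bucket sort, O(n)
    for v in range(n):
        buckets[depth[v]].append(v)

    UNMARKED, SKIP, SAVE = 0, 1, 2
    mark = [UNMARKED] * n
    for d in range(len(buckets) - 1, -1, -1):       # deep to not-deep
        for v in buckets[d]:
            u, stopped = v, False
            for _ in range(k - 1):                  # walk up k-1 steps
                if mark[u] != UNMARKED:             # hit SKIP or SAVE: stop
                    stopped = True
                    break
                mark[u] = SKIP
                if parent[u] == -1:                 # ran into the root
                    stopped = True
                    break
                u = parent[u]
            if not stopped and mark[u] == UNMARKED:
                mark[u] = SAVE                      # the k-th vertex on the walk
    return mark
```

Each step of a walk marks a previously unmarked vertex (or terminates the walk), so the pass after sorting is \(O(n)\); a SAVE is only placed after \(k-1\) fresh SKIP marks, so there should be \(O(n/k)\) SAVE vertices, and chasing the marking walks suggests every vertex has a SAVE within \(O(k)\) steps above it.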
Heavy-Light-Decomposition (HLD) for \(Count\) queries in \(O(\log n)\) time
This is a simpler (more elementary/classical) approach that has \(O(n)\) memory, \(O(n)\) construction time, and \(O(\log n)\) query time (as opposed to the \(O(\log n)\) or \(O(\log n/\log \log n)\) time of Theorem 19/20).
A Count query \(Count(i, l)\) is equivalent to: find the number of outgoing \(\dol_j\) edges on the path \(P\) starting in the node \(v\) of \(S_i\) in \(ST_R\) and going up to depth \(l\). (Possibly only counting multiple \(\dol_j\) edges once.)
For each node \(u\) in \(ST_R\), store the total number of outgoing \(\dol_\cdot\) edges as \(t_u\).
If there is a \(\dol_j\) edge going out of both \(u\) and some node \(w\) strictly below \(u\), subtract \(1\) from \(t_{c(u,w)}\), where \(c(u, w)\) is the unique child of \(u\) that is an ancestor of \(w\).
Care must be taken when \(c(u,w)\) is the start of \(P\), in which case we must not subtract the \(1\). To avoid this, one solution is to insert an additional node on the edge where the \(-1\) is stored, instead of accumulating it into the child directly. Or the \(-1\) can simply be stored in \(c(u,w)\), but independently of the count \(t_{c(u,w)}\).
Alternatively, we could add \(1\) to all other (non-\(c(u,w)\)) children of \(u\). As long as the alphabet is constant, that overhead is OK. This is similar to cutting the rectangles with vertical cuts (bottom of Fig 6b), while the previous method is rather similar to horizontal cutting of rectangles (top of Fig 6b).
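For concreteness, a one-DFS sketch computing the \(t_u\) together with the separately stored \(-1\)s (the second variant above); `dollar[u]` giving the set of \(j\) with a \(\dol_j\) edge out of \(u\) is my assumed input format:

```python
def adjusted_counts(children, dollar, root=0):
    """t[u] = #outgoing $_j edges at u; delta[u] holds the -1 adjustments
    separately, so the start of P can ignore them. dollar[u] = iterable
    of j with a $_j edge out of u."""
    n = len(children)
    t = [len(dollar[u]) for u in range(n)]
    delta = [0] * n
    path = [root]                  # current root-to-node DFS path
    open_j = {}                    # j -> stack of path indices with a $_j edge

    def dfs(u):                    # recursive for brevity; a deep tree
        for j in dollar[u]:        # would need an explicit stack
            if open_j.get(j):
                i = open_j[j][-1]            # nearest ancestor with a $_j edge
                delta[path[i + 1]] -= 1      # c(ancestor, u): next node on path
            open_j.setdefault(j, []).append(len(path) - 1)
        for c in children[u]:
            path.append(c)
            dfs(c)
            path.pop()
        for j in dollar[u]:
            open_j[j].pop()

    dfs(root)
    return t, delta
```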
We want to compute \(Count(i,l) = \sum_{u\in P} t_u\).
Consider the heavy-light-decomposition \(HLD_R\) of \(ST_R\).
Each path from a node \(v\) of \(ST_R\) to the root intersects at most \(\lg n\) components of \(HLD_R\). In particular this holds for \(P\).
Apart from the top component containing \(v\), each such component intersection covers exactly a prefix of the component.
We can precompute and store prefix sums in each component in \(O(n)\) total time.
The top intersection is a (non-prefix) interval of some component. This sum is simply the difference of two prefix sums.
Construction time is
- \(O(n)\) for the HLD (using DFS)
- \(O(n)\) for the prefix sums
Query time is \(O(\log n)\): we process at most \(\lg n\) HLD components, in \(O(1)\) time each (one prefix-sum lookup per component).
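A compact sketch of the whole pipeline, with names of my choosing. Two simplifications: \(l\) is a tree depth here rather than the paper’s string depth (the same structure works for string depths with slightly messier indexing), and the double-counting adjustments are assumed to be already folded into the per-node values \(t[v]\), e.g. via the inserted edge-nodes described above.

```python
class HldCount:
    """HLD + per-heavy-path prefix sums. children[v]: list of children
    (node 0 = root); t[v]: adjusted count at v. count(v, l) = sum of t
    over ancestors u of v (inclusive) with depth(u) >= l."""

    def __init__(self, children, t):
        n = len(children)
        par = [-1] * n
        order = [0]                          # BFS order: parents first
        for v in order:
            for c in children[v]:
                par[c] = v
                order.append(c)
        size = [1] * n
        for v in reversed(order):            # children before parents
            if par[v] != -1:
                size[par[v]] += size[v]
        heavy = [max(children[v], key=lambda c: size[c]) if children[v] else -1
                 for v in range(n)]
        self.par, self.t = par, t
        self.depth = [0] * n
        self.head = [0] * n
        self.idx = [0] * n                   # position on own heavy path
        self.pref = {}                       # head -> prefix sums of t
        for v in order:
            if par[v] != -1:
                self.depth[v] = self.depth[par[v]] + 1
            if par[v] != -1 and v == heavy[par[v]]:
                h = self.head[par[v]]        # continue parent's heavy path
                self.head[v] = h
                self.idx[v] = self.idx[par[v]] + 1
                self.pref[h].append(self.pref[h][-1] + t[v])
            else:
                self.head[v] = v             # v starts a new heavy path
                self.pref[v] = [t[v]]

    def count(self, v, l):
        total = 0
        while v != -1 and self.depth[v] >= l:
            h = self.head[v]
            if self.depth[h] >= l:           # whole prefix of this path
                total += self.pref[h][self.idx[v]]
                v = self.par[h]
            else:                            # partial: top intersection
                lo = l - self.depth[h]       # first index at depth >= l
                total += self.pref[h][self.idx[v]] - self.pref[h][lo - 1]
                break
        return total
```

Construction is the two \(O(n)\) passes listed above; a query touches at most \(\lg n\) heavy paths and does one prefix-sum lookup on each.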
Finding the largest \(l\) with \(Count(i, l) \geq K\) in \(O(\log n)\) time
This is similar to \(Top(i,K)\), but does not report the actual strings.
The problem is now to find the largest \(l\) such that \(\sum_{u\in P} t_u \geq K\). A naive approach is to walk up the tree \(ST_R\), starting at the node of \(S_i\), and going up until the accumulated sum of \(t_u\) is \(\geq K\).
Using the above HLD, we can again split the path into HLD-components and walk up one component at a time, until the sum to the start of the component is large enough. To find the precise start, we can do a binary search inside the HLD-component. This takes \(O(\log n)\) time for walking up the HLD-components, and \(O(\log n)\) time to binary search inside that component, for \(O(\log n)\) total query time.
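Continuing the `HldCount` sketch from the previous section, the binary search can reuse the stored prefix sums (again tree depths, names mine):

```python
def largest_l(hc, v, K):
    """Largest l with hc.count(v, l) >= K, or None if even the full root
    path sums to < K. Walks up heavy paths, then binary searches inside
    the path where the accumulated sum first reaches K."""
    acc = 0
    while v != -1:
        h = hc.head[v]
        seg = hc.pref[h][hc.idx[v]]              # sum of t over h..v
        if acc + seg >= K:
            # Deepest node on h..v whose suffix sum (plus acc) reaches K.
            # Monotone because Count(i, l) is non-increasing in l, even if
            # individual t[v] are negative after the adjustments.
            lo, hi = 0, hc.idx[v]
            while lo < hi:
                mid = (lo + hi + 1) // 2
                if acc + seg - hc.pref[h][mid - 1] >= K:
                    lo = mid
                else:
                    hi = mid - 1
            return hc.depth[h] + lo              # depth of the found node
        acc += seg
        v = hc.par[h]
    return None
```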
Reporting matching strings
To add reporting to both \(Count\) and \(Top\) queries, we can do the following:
For each node \(u\) of \(ST_R\), store a list/set \(T_u := \{(j, d(u)) : u\dol_j \in E_{ST_R}\}\).
Instead of \(\sum_{u\in P} t_u\), we are now interested in \(\bigcup_{u\in P} T_u\):
- the union can be either concatenation of lists of tuples \((j, d(u))\),
- or only taking the maximum \(d(u)\) for each \(j\),
- or only merging sets of \(\{j\}\).
We can’t store prefix-unions within HLD-components, because that could take too much space.
Instead, we can store for each node a pointer to the closest ancestor \(u\) that contains a non-empty \(T_u\). Then we can simply follow these pointers up until the start of \(P\) is reached, in total \(O(output)\) time.
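A sketch of this pointer-chasing, for the simple variant without the double-counting fix; `up[u]` (the pointer to the nearest ancestor with non-empty \(T\)) and `T[u]` are the structures described above, names mine:

```python
def report(v, l, T, up, depth):
    """Collect the union of T[u] over ancestors u of v (inclusive) with
    depth(u) >= l by hopping the nearest-non-empty-ancestor pointers;
    O(output) hops. The same j may still appear multiple times here."""
    out = []
    u = v if T[v] else up[v]          # start at v, or its first useful ancestor
    while u != -1 and depth[u] >= l:
        out.extend(T[u])
        u = up[u]
    return out
```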
To prevent double-counting, we can instead store a pointer to the closest ancestor that contains a \((j, d)\) that is not already present in the current subtree.
One possible issue here is when many parents contain one new unseen \(j\), but also many \(j\) that were already seen before. In that case we keep iterating over these and discarding them. This could give total runtime \(O(\log n + K^2)\) in the worst case.
This can be fixed by using the alternative method of pushing the \((j,d)\) marker to all children that don’t have \(j\) yet (ie the vertical slicing in the bottom of Fig 6b). Since there could be many children, this could increase memory usage by a factor \(\sigma\). Instead, we can insert intermediate nodes for all children left of \(c(u,w)\) and all children right of \(c(u,w)\) and add \((j,d)\) to these intermediate nodes. Similar to how there will only be at most \(2n\) rectangles, there will also be at most \(O(n)\) added nodes for this, and still each node contains at least one new element when walking up paths, so that the overall complexity remains \(O(\log n + K)\).
Both of these types of pointers can be computed in \(O(n)\) time and \(O(n+k)\) space using DFS.
For the \(Report\) queries, this gives \(O(\log n + output)\) runtime.
For \(Top(i, K)\) queries, we can first determine the correct level \(l\), and then report (a size \(K\) subset of) \(Report(i, l)\) in \(O(K)\) time. There is no overhead if \(Count(i, l)\gg K\), since we can just stop merging elements as soon as the output set reaches size \(K\).
Comparison
All are \(O(n)\) memory.
query | construction | query time (paper) | note | new cons. + query time | note |
---|---|---|---|---|---|
One-to-One(i,j) | \(n \log \log k\) | \(\log \log k\) | ST + Rank-Select | - | |
One-to-All(i) | \(n\) | \(k\) | Aho-Corasick | \(k\) | simpler tree |
Count(i,l) | ? | \(\log n /\log \log n\) | black box | - | |
Report(i,l) | ? | \(\log n /\log \log n + out\) | black box | - | |
Top(i,K) | ? | \(\log^2 n /\log \log n + K\) | black box | - | |
Count(i,l) | \(n\log n\) | \(\log n\) | geometry | \(n\), \(\log n\) | HLD |
Report(i,l) | \(n\log n\) | \(\log n +out\) | geometry | \(n\), \(\log n + out\) | HLD |
Top(i,K) | \(n\log n\) | \(\log^2 n +K\) | geometry | \(n\), \(\log n + K\) | HLD |
Closing thoughts
- Why do we use Aho-Corasick automaton for One-to-All queries, but Suffix Tree for all other queries?
- Can we use ST for One-to-All queries?
- Can we use AC for the other queries?
- Can we get rid of the rank-select for One-to-One queries to improve construction time? We only do one specific kind of query on them. (Rank-Select is kinda complicated.)
- Can we answer \(TopAll(K)\) queries that return \(Top(i, K)\) for all \(k\) strings in \(O(n + kK)\) time? (Does not necessarily need separate construction and query time.)
- Can we extend to fuzzy matching, allowing some errors?
- Can we use A* to efficiently construct a fuzzy string-graph, by only considering sufficiently good candidates?
- If my reading is correct, Simpson and Durbin (2010) computes all edges of length \(\geq \tau\) of \(AlltoAll\) in \(O(n+output)\) using the FM-index. It can also directly return all irreducible edges (of length \(\geq \tau\)) in \(O(n)\) total time, which seems very nice and in a way the best we can wish for.
WIP research proposal is here.