Augmenting Suffix Trees,
with Applications

by
Yossi Matias, S. Muthukrishnan,
    Suleyman Cenk Sahinalp, Jacob Ziv
 
 

Summarized by Genady Garber





Introduction
In this paper presented two algorithmic problems, one from the area of Data Compression and the other from Information Retrieval. The main results are highly efficient (linear or near linear time) algorithms for these problems. All these algorithms rely on the suffix tree, a versatile data structure in combinatorial pattern matching. Suffix trees, with suitably simple augmentations, have found numerous applications in string processing. In these applications too, was augmented the suffix tree with extra edges and additional information.

Problems & Backgroung
 1.  The Document Listing Problem
       We are given a set of documents T = (T1, . . . , Tk) for preprocessing. Given a query pattern P , the   problem is to output a list of all the documents that contain P as a substring.  A related query is where we are required to merely report the number of documents that contain P. The previously known  algorithm that solves this problem in O(|P|) time is based on data structures for computing lowest common ancestor (LCA) queries. The document listing problem is of great interest in information retrieval and has independently been formulated in many scenarios, for example in discovering gene homologies.

-The Suffix-DAG Data Structure
o Build the suffix-DAG of documents T1, . . ., Tk, in O(t) = O(S tk), using O(t) space
o The suffix-DAG of T, denoted by SD(T), contains the generalized suffix tree, GST(T), of
   the set T at its core.
o A GST(T) is defined to be the compact trie of all the suffixes of each of the documents
    in T.
o Each leaf node l in GST(T) is labeled with the list  of documents which have a suffix
   represented by the path from the root to l.
o The substring represented by a path from the root to any given  node n denoted by P(n)
o The nodes of SD(T) are the nodes of GST(T)
o The edges of SD(T) are of two types:
        * the skeleton edges of SD(T) are the edges of GST(T)
        * the supportive edges of SD(T) are defined as follows:
           for any nodes n1 and n2 in SD(T) there is a pointer edge from n1 to n2 if and only if:
                - n1 is an ancestor of n2
                - among the suffix trees ST(T1), ST(T2), . . . , ST(Tk) there exists at least one,
                   say  ST(Ti), which has two nodes ,   n1,i and n2,i such that:
                        # P(n1) = P(n1,i) ,P(n2) = P(n2,i)
                        # n1,i  is the parent of  n2,i
                - such an edge is labeled with i for all relevant documents in Ti
o In order to respond to the count and list queries, one of the standard data structures
   that support least common ancestor (LCA) queries on SD(T) in O(1) time was built.
o Also each internal node n of SD(T) contains
        * array that stores its supportive edges in pre-order fashion
        * number of documents which include P(n) as substring
 

2. The (a,b)-HYZ Compression Problem
    Given a binary string T of length t. We are asked to replace disjoint blocks (substrings) of size b with desirably shorter codewords. The codewords are selected in a way that it would be possible for a corresponding decompression algorithm to compute the original T out of the string of codewords. This is done as follows. Say the first i-1 blocks has been compressed. To compute codeword cj for block j should be determined its context :
the context of block T[i:l] is the longest substring T[k:i-1], k<i of size at most a such that T[k:l] occures earlier in T. The codeword cj is the ordered pair <g,l> where g is the length of context and l is rank of block j with respect to the context, according to some predetermined ordering. The (a,b)­HYZ compression scheme is based on the intuition that similar symbols in data appear in similar contexts.

-Example
o T = 010011010
o a = 2 , b = 1
o the context of block 9 is T[7 : 8] = 01
o the two substrings which follow earlier occurrences of this context are T[3] = 0
   and  T[6]  =  1
o the lexicographic order of block 9 amount these substrings is 1

-Theorem 1
There is an algorithm to implement the compression scheme C(a,b)  which  runs in O(tb) time and requires O(tb) space, independent of a

-Theorem 2
There is an algorithm to implement the compression method Ca,b for a = log t and b = log log t in O(t) time using O(t) space
Proof Sketch
For any descendents wi of v which are b chars apart from v, the DataStructure enables to compute lexicographic order of the path between any wi and v, allowing easy computation of codeword of block of size brepresented by path between v and wi
    * The algorithm exploits the fact that the context size is bounded, and its seeks
        similarities between suffixes of the input up to a small size.

Results and conclusions

Document listing problem
Processing k documents in linear time ( O(Si ti) ) and space
Time to answer a query with pattern P is   O( |P| logk + out)

(a,b)-HYZ compression problem
 1)   unbounded a, complexity O(tb)
Gives linear time for  b = O(1)
The only previously known algorithm, where for
 b = O(1), the author presents an O(ta) time algorithm, and for unbounded a this running time is O(  )
2)  a = O(log t), b = log log t, complexity O(t)
There is no any previously known algorithms