"Storage management for evolving Data bases" - Summary.

Based on the paper by

J. Kleinberg , R. Motwani , P. Raghavan , S. Venkatasubramanian

Summarized by Natalya Segal

 

Motivation and Abstract:

The problem of maintaining large volumes of data that arrive continuously over time is increasingly prevalent in databases and digital libraries. The frequent need for the most recently arrived data, motivate creating sliding window indices -  indices were data items reside for a fixed time, afterwards considered expired and can be wiped out from the index.  As the work required to update the index structure after the deletion of a small number of  expired items often outweighs the savings obtained by reclaiming the freed storage space,  in the rapidly changing indices such as sliding window ones we'll often prefer using  flexible time window management - window management where an  index is divided into the sub blocks, called here compartments and data items are deleted by compartments. A compartment is deleted if and only if all of it's data items are expired.
 
This approach reduces the cleaning up cost at the expanse of storing expired items, so  introducing the problem of minimizing the storage required. The presented paper describes a technique for optimizing the data items  to compartments assignment, so that the storage required for the index will be minimal (Total Load Minimization Problem) and the data items will be well balanced over compartments (Maximum Load Minimization Problem). The second will minimize a query time in the worst case.

Introduction:

Assuming that time window size and the compartments number are  fixed we'd like to minimize Total Load and Maximal Load in the data items to compartments assignment, computed out of the off line logging information available. Such an approach advantage is in performing all the computations off line and introducing no on line overhead and the obvious disadvantage is in inapplicability to the pure on line model with no a priory knowledge of either items sizes or their arrival times.

Thus the pure on line input model is considered separately. There is suggested the simple BALANCED PARTITION approximation algorithm providing approximation ratio of k/(k-1) for the Total Load problem and 2k-1/(k-1) for that of  the Maximum Load. This approximation ratios can be achieved only provided the tight upper bound on the total load of the live items in the system, non tight one will result in producing assignments far from optimal.. The computation time is O(1).

Returning to the off line optimal assignment computation, we can perform it successfully in polynomial time for both finite off line and periodic input models, when concerning Total Load minimization problem. For the Maximum Load minimization problem, that is NP-Complete even in the finite input model, as it contains the Bin packing problem, the same technique will provide 2-approximation ratio.

The Problem Formal Statement:

Results Presented in the Paper and Their Usage:

For the Total Load Minimization problem:

On line input model:

    No Polynomial time algorithm can provide the optimal solution for the problem in the pure on line model. Although the paper proposes a simple on-line algorithm - BALANCED PARTITION, that achieves approximation ratio of k/(k-1), were the number if compartments equals kWhen the number of compartments is 2 the slight improvement, called AGGRESSIVE BALANCED PARTITION, provide 2-competitive technique. For this algorithm we need to a priori know the tight upper bound on the total load of the live items in the system, which can be approximated out of the database logging information. Non tight bound will result in producing assignments far from optimal.

    Advantages: The algorithm introduces no on line overhead, off line computation are done in O(1) time.
    Disadvantages: First, we need to reveal the tight upper bound on the total load of the live items in the system. Second the assignment produced is not the optimal one.
     
     

    Balanced Partition

    If we are given T = the maximum total size of active items at any point in time, compute L =  T/(k-1), were k is a number of compartments. BALANCED-PARTITION stacks up items in each compartment Pi until the total size first exceeds L, then moves to the next compartment. 
     
    Aggressive Balanced Partition
    The same as  BALANCED-PARTITION but  the last item is not taken, so the first compartment load never exceeds L.
     Figure1

Periodic input model:

    There is a Polynomial time algorithm producing the optimal assignment from the point of  view of the total load. The optimal assignment pre computation takes  O(log(Smax * m) * m2 ) and introduces no on line overhead. The algorithm is based on the (k, I) - sparse path finding problem in p-graphs (The algorithm and  explanations follow).
     

    Finite off line input model:

    There is a Polynomial time algorithm, running in O(log(Smax * m) * m ), were m is the number of items  to arrive and Smax equals to the maximum size of a single data item. All the computations are done before the items begin to arrive, so introducing no overhead in time when the items should be really assigned to the index compartments.

    The finite off line model algorithm is of low self usage and is mostly used as a step towards the development of an analogous algorithm for the periodic model as well as for the approximation for the Max Load problem in finite off line and periodic models. Although it can be self used for the off line reorganization of the existing index, were all the data sizes and arrival times in the index are already known.

       

      For the Maximum Load Minimization problem:

Here we need one more definition. Let's define Compression in an assignment F, as an interval of time (t, t') in which the assignment "returns" to a compartment still containing items. If there are no such returns in the assignment F, we say that F contains no compression

The problem of minimizing the maximum load is NP - Complete, since it contains the Bin Packing problem as a special case. The major difficulty in the maximum load minimization problem, compared to the total load minimization one, is that the optimal assignment Fopt need not be compression free.

Finite off line and Periodic input models:

However, compression free assignments have a number of practical advantages - particularly, temporal locality - and thus remain an interesting case to focus on. The sparse path algorithm - that is used to find the optimal assignment from the point of total load minimization - can be used to find an optimal compression free assignment for the Max Load minimization problem in polynomial time. This assignment is 2 * optimal assignment from the Max Load point of view.

  On line input model:

BALANCED-PARTITION maintains maximum load of  L + S max, were Smax = the maximum size of a single item and  is 2k-1/(k-1) - competitive
 
 

Sparse Path Algorithm: The Technique:

The technique reduces the problem of finding an optimal assignment to the problem of finding a (k, I) - sparse path in graph GN, were k is a number of compartments, I - interval set reflecting items being simultaneously alive. In GN, N stands for the assignment load bound and  the graph edges reflect that the data items - translated to the graph vertices - maintain a load (either maximum or total, depending on the problem) that does not exceed N. The graph type: either  path graph  or p-graph reflect the input model: finite off line and periodic respectively.

Then there is an assignment of weight N   ó  there is a (k, I) - sparse path in graph GN. See presentation slides for the proofs.

For simplicity, the idea explanations are given for the finite off line case, although it works with some little graph enhancements (turning the path graph into the p-graph) for the periodic case as well.
 
So let's take a closer look at the technique. This will require some more definitions to come.

More Definitions:

  • Path graph G = (V, E): V = {v 1, ., vn} and for each i = 1, ., n-1 there exist j > i such that vi has out edges to the nodes: vi, . , vj. Examples can be found in the  presentation slides.
  • Path P is (k, I) - sparse ó no interval in I contains k vertices of P.
  • f:{1, ., n} à{1, ., n};       f(i) = j    ó   Xj is the last item to arrive before the expiration of Xi
  • A path graph GN = (V, E): V = {v1, ., vn+1};  E: there is a directed edge (vià vj)   ó i < j and Sum Sk,  for k ranging from i to f(j-1), is less or equals N.
  • I = {Ivi = (vj,...,vf(j-1))} i = 1, ... m,  I'vi = (vi,...,vj) set of contiguous indices, where for each vi in GN, let vj be the rightmost endpoint of any interval in I, containing vi   => I'={Iv1, ., Ivn} is called interval system.
  • The PTime (k, I')-sparse path finding algorithm is presented bellow:

    p(v) - Path, starting from v and iteratevly following longest out edges, pz(v) - As p(v), but no vertex in Z is visited.
    q(v, d) - Vertex visited on the d-th step of p(v), qz(v, d) - Vertex visited on the d-th step of pz(v) 
         

        (k, I) - Sparse Path Finding Algorithm :

            (1) Initialize Z = Empty Set; 
            (2) Until Z does not change:   
                for u = vn, vn-1, ..., v1:  
                  if ( qZ(u, k-1) is in Iu )  
                    put u in sZ;
                  }
                }
              }
            (3) if (v1 is in Z)  
                no (k, I') sparse spanning path exists;
               
              else  {
                PZ(v1) is such a path;
              }
        Figure 2
      To find an optimal assignment we perform a binary search on N, ranging from 1 to maximal total load of the live items in the system, for each N construct GN and run the (k, I) - Sparse Path Finding Algorithm. If such a path exists, we'll construct the optimal assignment out of it by  taking  the data items corresponding to the nodes in the path to begin the assignment blocks.
     

    More Information:

    For detailed explanations and examples see the presentation slides.
     
    For general information on the indexing techniques see: - "Wave Indices: Indexing Evolving databases." by N. Shivakumar and H. Garsia-Molina.