Storage management for evolving Data bases

"Storage management for evolving Data bases" - Summary.

Based on the paper by

J. Kleinberg , R. Motwani , P. Raghavan , S. Venkatasubramanian

Summarized by Natalya Segal

Motivation and Abstract:

The problem of maintaining large volumes of data that arrive continuously over time is increasingly prevalent in databases and digital libraries. The frequent need for the most recently arrived data, motivate creating sliding window indices - indices were data items reside for a fixed time, afterwards considered expired and can be wiped out from the index. As the work required to update the index structure after the deletion of a small number of expired items often outweighs the savings obtained by reclaiming the freed storage space, in the rapidly changing indices such as sliding window ones we'll often prefer using flexible time window management - window management where an index is divided into the sub blocks, called here compartments and data items are deleted by compartments. A compartment is deleted if and only if all of it's data items are expired.

This approach reduces the cleaning up cost at the expanse of storing expired items, so introducing the problem of minimizing the storage required. The presented paper describes a technique for optimizing the data items to compartments assignment, so that the storage required for the index will be minimal (Total Load Minimization Problem) and the data items will be well balanced over compartments (Maximum Load Minimization Problem). The second will minimize a query time in the worst case.

Introduction:

Assuming that time window size and the compartments number are fixed we'd like to minimize Total Load and Maximal Load in the data items to compartments assignment, computed out of the off line logging information available. Such an approach advantage is in performing all the computations off line and introducing no on line overhead and the obvious disadvantage is in inapplicability to the pure on line model with no a priory knowledge of either items sizes or their arrival times.
Thus the pure on line input model is considered separately. There is suggested the simple BALANCED PARTITION approximation algorithm providing approximation ratio of k/(k-1) for the Total Load problem and 2k-1/(k-1) for that of the Maximum Load. This approximation ratios can be achieved only provided the tight upper bound on the total load of the live items in the system, non tight one will result in producing assignments far from optimal.. The computation time is O(1).
Returning to the off line optimal assignment computation, we can perform it successfully in polynomial time for both finite off line and periodic input models, when concerning Total Load minimization problem. For the Maximum Load minimization problem, that is NP-Complete even in the finite input model, as it contains the Bin packing problem, the same technique will provide 2-approximation ratio.

The Problem Formal Statement:

Basic Definitions:

P1, ., Pk - k compartments; Assignment F: Items -> Compartments
Load L(P_i) = Total size of items, live or expired, that were assigned to Pi since it was last inactive
Weight at time t of F W _t (F) = Sum L(P_i ) for all Pi, active at time t.
W(F) = max W_t(F) for all t. Called Total load
W_Max(F) = max L(P_i) for all Pi active at time t. Called Maximum load

The formal introduction of the input models:

Standard on line input model:
Data items arrive continuously over time and we assume neither a priory knowledge of the item sizes nor of their arrival times.
Periodic input Model
Infinite sequence of item arrivals is constructed from the periodically repeated finite one with known sizes and arrival times. More precisely, we are given a basic period, T, consisting of item sizes s₁, ..., sm and times 0 <= t_i <= T. This quantities generalize the infinite sequence of arrivals as follows: an item of size Si arrives at time jT + t_i for j = 0, 1, ... . We assume that T = W * p > 0 for some integer p, otherwise we'll merge several periods.
Finite off line input model:
Here there is a finite sequence of the item arrivals, with known arrival times as well as the items sizes.
If items X₁, .,Xn arrive at times t₁, ., t_n, we assume that there is no i such that: t _i+1 - t_i >= w. Otherwise we simply decompose the initial problem into independent sub problems at all such gaps.

Results Presented in the Paper and Their Usage:

For the Total Load Minimization problem:

On line input model:

No Polynomial time

optimal

approximation ratio of k/(k-1),

Advantages: The algorithm introduces no on line overhead, off line computation are done in O(1) time.
Disadvantages: First, we need to reveal the tight upper bound on the total load of the live items in the system. Second the assignment produced is not the optimal one.

Balanced Partition

If we are given T = the maximum total size of active items at any point in time, compute L = T/(k-1), were k is a number of compartments. BALANCED-PARTITION stacks up items in each compartment Pi until the total size first exceeds L, then moves to the next compartment.
Aggressive Balanced Partition The same as BALANCED-PARTITION but the last item is not taken, so the first compartment load never exceeds L.

Figure1

Periodic input model:

Polynomial time

p-graphs

Finite off line input model:

Polynomial time

The finite off line model algorithm is of low self usage and is mostly used as a step towards the development of an analogous algorithm for the periodic model as well as for the approximation for the Max Load problem in finite off line and periodic models. Although it can be self used for the off line reorganization of the existing index, were all the data sizes and arrival times in the index are already known.

For the Maximum Load Minimization problem:

Here we need one more definition. Let's define Compression in an assignment F, as an interval of time (t, t') in which the assignment "returns" to a compartment still containing items. If there are no such returns in the assignment F, we say that F contains no compression
The problem of minimizing the maximum load is NP - Complete, since it contains the Bin Packing problem as a special case. The major difficulty in the maximum load minimization problem, compared to the total load minimization one, is that the optimal assignment F_opt need not be compression free.

Finite off line and Periodic input models:
However, compression free assignments have a number of practical advantages - particularly, temporal locality - and thus remain an interesting case to focus on. The sparse path algorithm - that is used to find the optimal assignment from the point of total load minimization - can be used to find an optimal compression free assignment for the Max Load minimization problem in polynomial time. This assignment is 2 * optimal assignment from the Max Load point of view.
On line input model:
BALANCED-PARTITION maintains maximum load of L + S max, were Smax = the maximum size of a single item and is 2k-1/(k-1) - competitive

Sparse Path Algorithm: The Technique:

The technique reduces the problem of finding an optimal assignment to the problem of finding a (k, I) - sparse path in graph GN, were k is a number of compartments, I - interval set reflecting items being simultaneously alive. In GN, N stands for the assignment load bound and the graph edges reflect that the data items - translated to the graph vertices - maintain a load (either maximum or total, depending on the problem) that does not exceed N. The graph type: either path graph or p-graph reflect the input model: finite off line and periodic respectively.
Then there is an assignment of weight N ó there is a (k, I) - sparse path in graph GN. See presentation slides for the proofs.

For simplicity, the idea explanations are given for the finite off line case, although it works with some little graph enhancements (turning the path graph into the p-graph) for the periodic case as well.

So let's take a closer look at the technique. This will require some more definitions to come.
More Definitions:

Path graph G = (V, E): V = {v ₁, ., vn} and for each i = 1, ., n-1 there exist j > i such that vi has out edges to the nodes: vi, . , vj. Examples can be found in the presentation slides.

Path P is (k, I) - sparse ó no interval in I contains k vertices of P.

f:{1, ., n} à{1, ., n};       f(i) = j    ó   Xj is the last item to arrive before the expiration of Xi

A path graph GN = (V, E): V = {v₁, ., v_n+1}; E: there is a directed edge (v_ià v_j)   ó i < j and Sum S_k, for k ranging from i to f(j-1), is less or equals N.

I = {Iv_i = (vj,...,v_f(j-1))} i = 1, ... m, I'v_i = (vi,...,v_j) set of contiguous indices, where for each v_i in GN, let v_j be the rightmost endpoint of any interval in I, containing v_i=> I'={Iv₁, ., Iv_n} is called interval system.

The PTime (k, I')-sparse path finding algorithm is presented bellow:
p(v) - Path, starting from v and iteratevly following longest out edges, pz(v) - As p(v), but no vertex in Z is visited.
q(v, d) - Vertex visited on the d-th step of p(v), qz(v, d) - Vertex visited on the d-th step of pz(v)

(k, I) - Sparse Path Finding Algorithm :

(1) Initialize Z = Empty Set;
(2) Until Z does not change: {

for u = vn, vn-1, ..., v1: {
if ( qZ(u, k-1) is in Iu ) {
put u in sZ;
}
}
}
(3) if (v1 is in Z) {

no (k, I') sparse spanning path exists;
}
else {

PZ(v1) is such a path;
}

Figure 2

To find an optimal assignment we perform a binary search on N, ranging from 1 to maximal total load of the live items in the system, for each N construct GN and run the (k, I) - Sparse Path Finding Algorithm. If such a path exists, we'll construct the optimal assignment out of it by taking the data items corresponding to the nodes in the path to begin the assignment blocks.

More Information:
For detailed explanations and examples see the presentation slides.

For general information on the indexing techniques see: - "Wave Indices: Indexing Evolving databases." by N. Shivakumar and H. Garsia-Molina.