**Join Algorithms For Online Aggregation**.

By Peter J. Haas and Joseph M. Hellerstein.

Summarized by: Michal Ozery.

Introduction:

This paper presents a new family of join algorithms, called **ripple joins**, for online processing of aggregation queries in a relational DBMS. Examples of aggregation operators are SUM, COUNT, AVG, STDEV, etc. As is well known, traditional offline join algorithms are designed to minimize the time to completion. Ripple joins, in contrast, are designed to minimize the time until an acceptably precise estimate of the query result is available; this time is independent of the size of the input relations. Since large-scale aggregation queries are typically used to get a "big picture" of a data set, the ripple join approach seems more attractive.

Current systems process queries in "blocking" mode, meaning no feedback is given until the query terminates. In contrast, the output of a ripple join is continuously displayed to the user as a series of progressively refined confidence intervals of the form [*X_n* - *h_n*, *X_n* + *h_n*], each containing the true result with a fixed probability *p*. Ripple joins are adaptive, adjusting their behavior in accordance with the statistical properties of the data. Moreover, they permit the user to trade off the time between successive updates of the confidence interval against the amount by which the confidence-interval length (2*h_n*) decreases: naturally, the longer the time between updates, the bigger the decrease in interval length.

**Assumption**: all the tuples of the input relations are processed in a random order.

Overview of Ripple Join.

The main idea of the ripple join is based on sampling. Since all the tuples are assumed to be randomly ordered, retrieving a new tuple can be considered a sampling step. After each sampling step, the statistics are updated and the confidence interval is recalculated.

In the simplest version of the two-table ripple join, one previously-unseen random tuple is retrieved from each of R and S at each sampling step. These new tuples are joined with the previously-seen tuples and with each other. Thus the Cartesian product *R x S* is swept out as depicted in Figure 1.

Figure 1: The elements of *R x S* that have been seen after *n* sampling steps of the "square" ripple join (*n* = 1, 2, 3, 4).

The "square" version of the ripple join described above draws samples from R and S at the same rate. However, in order to provide the shortest possible confidence intervals, it is often necessary to sample one relation at a higher rate than another. This requirement leads to the general "rectangular" version of the ripple join as depicted in Figure 2.

Figure 2: The elements of *R x S* that have been seen after *n* sampling steps of the "rectangular" ripple join (*b_1* = 3, *b_2* = 2, *n* = 1, 2, 3).

The general algorithm with *K* (>= 2) base relations *R_1, …, R_K* retrieves *b_i* previously-unseen random tuples from *R_i* at each sampling step, for 1 <= *i* <= *K*. The per-relation sampling rates *b_i* are called "aspect ratios". (Figure 2 corresponds to the special case in which *K* = 2, *b_1* = 3, *b_2* = 2.) Note that when *b_1* = 1 and *b_i* = |*R_i*| for *i* > 1, the result is the well-known "Nested Loops" join algorithm.
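To make the sweep pattern concrete, here is a minimal Python sketch of the two-table rectangular ripple join (the square version of Figure 1 is the special case *b_1* = *b_2* = 1). It assumes the inputs are in-memory lists already in random order; all names are illustrative, and the confidence-interval bookkeeping is left to the caller.

```python
def ripple_join(R, S, join_pred, b1=1, b2=1):
    """Two-table rectangular ripple join sweep (square when b1 == b2 == 1).

    R, S      : in-memory lists of tuples, assumed already in random order
    join_pred : function (r, s) -> bool implementing the join condition
    b1, b2    : aspect ratios -- tuples drawn from R and S per sampling step

    Yields joining pairs; after each sampling step the caller would update
    its running aggregate and recompute the confidence interval.
    """
    seen_R, seen_S = [], []
    i = j = 0
    while i < len(R) or j < len(S):
        new_R = R[i:i + b1]        # b1 previously-unseen tuples of R
        new_S = S[j:j + b2]        # b2 previously-unseen tuples of S
        i, j = i + len(new_R), j + len(new_S)
        # Join the new R-tuples with everything already seen from S ...
        for r in new_R:
            for s in seen_S:
                if join_pred(r, s):
                    yield (r, s)
        # ... the new S-tuples with everything already seen from R ...
        for s in new_S:
            for r in seen_R:
                if join_pred(r, s):
                    yield (r, s)
        # ... and the new tuples with each other (the corner block),
        # sweeping out one more layer of R x S.
        for r in new_R:
            for s in new_S:
                if join_pred(r, s):
                    yield (r, s)
        seen_R.extend(new_R)
        seen_S.extend(new_S)
```

Each sampling step thus joins the newly drawn tuples against everything seen so far, sweeping out one more "L"-shaped layer of the product, exactly as in the figures.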

Variations of Ripple Join.

- __Block Ripple Join__ - the sampling units are blocks of *alpha* tuples. (In the classic ripple join, *alpha* = 1.)
- __Index Ripple Join__ - ripple join doesn't require any indexing. However, if an index already exists on the join attributes of a relation R, its preferable aspect ratio is |R| (the index makes sweeping an entire row of the product fast). In this case the ripple join is identical to "Nested Loops".
- __Hash Ripple Join__ (for equi-joins only) - all the sampled tuples are kept in hash tables in memory, so evaluating the join condition of a newly sampled tuple against previously sampled tuples is very fast (saving I/O); see the sketch below. This hashing scheme breaks down, of course, when the hash tables no longer fit in memory. However, several practical tests show that very tight confidence intervals can often be achieved long before memory fills up.
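As a rough illustration of the hash variant, the sketch below keeps one in-memory hash table per input and probes before inserting, so that each joining pair is produced exactly once. This is only a schematic rendering of the idea; the function and its arguments are hypothetical, not the paper's code.

```python
from collections import defaultdict

def hash_ripple_join(R, S, key_r, key_s):
    """Square hash ripple join sketch for an equi-join (b1 = b2 = 1).

    key_r / key_s extract the join attribute from tuples of R / S.
    For brevity the inputs are assumed to have equal length (zip stops
    at the shorter one), and both hash tables are assumed to fit in
    memory -- the point at which the real algorithm must fall back to
    another strategy.
    """
    hash_R = defaultdict(list)  # join key -> R-tuples sampled so far
    hash_S = defaultdict(list)  # join key -> S-tuples sampled so far
    for r, s in zip(R, S):      # one new random tuple from each input per step
        # Probe the opposite table before inserting the new tuples, so
        # each joining pair (including (r, s) itself) is yielded exactly once.
        for s_old in hash_S[key_r(r)]:
            yield (r, s_old)
        hash_R[key_r(r)].append(r)
        for r_old in hash_R[key_s(s)]:
            yield (r_old, s)
        hash_S[key_s(s)].append(s)
```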

Calculation of Confidence Intervals.

The computation of the confidence interval is based upon the Central Limit Theorem: the series of sampling steps can be viewed as a series of independent, identically distributed random variables.

After each sampling step, the confidence interval (*X_n* - *h_n*, *X_n* + *h_n*) has to be updated. The estimate of the final result, *X_n*, is based on the sample set under the assumption that it is representative of the complete Cartesian product (uniformity). (E.g., if we have sampled 0.001 of the Cartesian product and the running SUM is *x*, then *X_n* = 1000 * *x*.) A sketch of this scale-up appears below.
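A minimal sketch of the scale-up estimator for SUM, with hypothetical numbers chosen to match the example above:

```python
def running_sum_estimate(partial_sum, n_r, n_s, size_R, size_S):
    """Scale the SUM computed over the sampled part of R x S up to an
    estimate X_n for the full Cartesian product, under the uniformity
    assumption that the sampled fraction is representative."""
    sampled_fraction = (n_r * n_s) / (size_R * size_S)
    return partial_sum / sampled_fraction

# Having swept 0.001 of R x S (10 x 10 tuples out of 1000 x 100) with a
# running SUM of x = 42.0, the estimate is X_n = 1000 * x:
print(running_sum_estimate(42.0, 10, 10, 1000, 100))  # -> 42000.0
```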

The confidence-interval half-width *h_n* is estimated from a linear combination of the variances of statistical "data points" computed over the input relations. Those "data points" are the values of the conditional random variables {*X*[*R_i*, *r*] : 1 <= *i* <= *K*}; a conditional random variable *X*[*R*, *r*] is the result of the aggregate over the Cartesian product with relation *R* reduced to {*r*} (e.g., *S* x {*r*} instead of *S* x *R*). Since *X*[*R*, *r*] can't be computed without sweeping all relations except *R*, it is approximated in the same manner that *X_n* is estimated.
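The paper's actual half-width computation combines the per-relation variances via its formula (4.9); the sketch below substitutes a single variance estimate into the textbook CLT form, just to show the shape of the calculation. The function name and inputs are illustrative.

```python
import math
from statistics import NormalDist

def half_width(variance_estimate, n, p=0.95):
    """CLT-style confidence-interval half-width h_n.

    variance_estimate stands in for the paper's linear combination of
    per-relation "data point" variances (its formula (4.9)); a single
    number is used here only so the shape of the computation is visible.
    """
    z = NormalDist().inv_cdf((1 + p) / 2)  # two-sided critical value for p
    return z * math.sqrt(variance_estimate / n)

# After n = 400 sampling steps with an estimated variance of 2.5, the
# 95%-confidence half-width is roughly 0.155:
print(half_width(2.5, 400))
```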

Adaptation of Ripple Join.

Obviously, the bigger the aspect ratios, the longer it takes to update the confidence interval (a larger sample set is processed per step). However, the formulas show that *h_n* shrinks as the aspect ratios grow. We therefore face an optimization problem: minimizing *h_n* while keeping the update time within the bound defined by the user. After each sampling step, the statistical properties of the data are updated, causing *h_n* to change as well. Thus every sampling step yields a new optimization problem and new optimal values for the aspect ratios, as sketched below.
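A toy sketch of this per-step re-optimization, assuming a simplified proxy in which the variance contribution of relation *i* scales like *d_i*/*b_i*. (The paper derives the real objective from its variance formulas and solves it analytically; the brute-force search and all inputs here are hypothetical, only for illustration.)

```python
from itertools import product

def choose_aspect_ratios(var_terms, step_cost, budget, max_b=10):
    """Pick aspect ratios (b_1, ..., b_K) minimizing a proxy for h_n
    subject to a per-step time budget.

    var_terms : hypothetical per-relation variance contributions d_i;
                the h_n proxy scales like sum(d_i / b_i), shrinking as
                the b_i grow -- mirroring the tradeoff in the text
    step_cost : cost of retrieving and joining one tuple of each relation
    budget    : user-chosen bound on the time between interval updates
    """
    best, best_h = None, float("inf")
    for bs in product(range(1, max_b + 1), repeat=len(var_terms)):
        cost = sum(b * c for b, c in zip(bs, step_cost))
        if cost > budget:
            continue  # violates the user's update-time bound
        h = sum(d / b for d, b in zip(var_terms, bs))
        if h < best_h:
            best, best_h = bs, h
    return best

# Two relations whose "data points" differ in variability: sampling the
# second one faster wins under a budget of 6 tuple-retrievals per step.
print(choose_aspect_ratios([1.0, 2.0], [1.0, 1.0], budget=6))  # -> (2, 4)
```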

**Personal Notes** (By Michal Ozery)

- In the performance section there are two examples.

- The first example is meant to illustrate the effectiveness of index/hash ripple join in comparison to other ripple joins, like block ripple. The main difference between block ripple join and hash ripple join is that in the latter the sampled tuples are kept in memory. We can always start with hash ripple join and enjoy a "quick" start, and then fall back to block ripple join when we run out of memory. Moreover, when the sample set isn't big enough, our estimates are bound to be poor; therefore with block ripple join we can always expect a "lame" start, which can be avoided if we use hash ripple join as a kind of online initialization.
- In the second example, different aspect ratios were compared. This example is probably meant to show the effectiveness of "good" aspect ratios compared to "bad" ones. However, showing how "bad" aspect ratios affect the result of the algorithm also pointed at a weakness of ripple joins. Ripple join results are based on an approximation (the Central Limit Theorem) which itself relies on further approximations (the variance approximation, formula (4.9) in the paper). This can cause some of the great instabilities in the algorithm's answer (as seen in the "bad" aspect ratios graph). In addition, even the ripple join with the "good" aspect ratios seemed to stabilize at an error (*h_n*) of about 1/10 of the total range. This might be a "bigger" picture than a user expects to get.

- The idea of giving an estimated answer based on a sample set is self-evident when it comes to huge relations. One of the main results of this paper is the significance of the way we sample: different sampling rates, the aspect ratios, were shown to change the performance of the algorithm. Ripple join doesn't require any a priori knowledge and is adaptive to the data; its adaptation is seen in the convergence of the initial aspect ratios to the optimal ones. However, it might be worth considering whether and how preliminary knowledge about the input relations could be kept and maintained so that the algorithm's starting point is better, i.e., so that the aspect ratios are not initialized to 1 (equal sampling rates) but to more suitable values that might save a long period of unstable results. The calculations presented in this paper refer to the statistical properties of the set of tuples from specific relations that satisfy a certain join condition. Even so, there might be a way to analyze the distribution of one attribute of a relation such that we can draw conclusions about any sub-distribution of its tuples. In any case, if the user has a set of "preferred" join queries, it could be useful to maintain statistics just for those queries, so that the next time the user invokes one of them we have better initial values for the aspect ratios.