Integrating Association Rule Mining with Relational Database
Systems: Alternatives and Implications
Sunita Sarawagi          Shiby Thomas          Rakesh Agrawal
sunita@almaden.ibm.com    sthomas@cise.ufl.edu    ragrawal@almaden.ibm.com
Summarized by: Pini Faran
031779895
 
 

Introduction

  Data mining on large data warehouses is becoming extremely important these days. By the term data mining we mean the ability to extract interesting and useful information from the data warehouse. The article presents several architectures which couple the mining process with database systems, as opposed to earlier architectures for the mining operation. These architectures are benchmarked for performance, and are also examined along other architectural dimensions, such as space consumption, ease of coding, and portability.
 

    Most early mining systems, as hinted above, had loose coupling between the mining algorithm and the physical place where the data to be processed was stored. Mining was performed mainly over file systems, and specialized data structures and algorithms were devised for it. Only recently has it been proposed to integrate mining with databases, including adding extensions to known query languages (mainly SQL) to support mining. Hence, DMQL and M-SQL were born.
 

    But why use a query language to implement mining algorithms in the first place? Using an existing query language has several advantages:
 

  1. Use of database indexing and query processing for a performance boost
  2. No need to devise specialized data structures for the mining process
  3. Use of SQL parallelism
  4. Space management

Mining Architectures

       The main architecture, suggested by the article's authors, has the following layout: the mining procedure is expressed using extended SQL or a graphical language. The procedure is processed by a preprocessor + optimizer, which generates either SQL-92 code for a relational engine or SQL-OR code for an object-relational engine. This architecture will be named the SQL architecture, and it will be compared with the following alternative architectures:

Read directly from DBMS: Data is read tuple by tuple from the DBMS into the mining process using a cursor. There are two variations of this approach: loose coupling, in which the DBMS runs in a different address space from the mining process, and stored procedure, where the mining process is implemented as a stored procedure, so it is executed within the address space of the DBMS (this saves process context switching in the CPU). In both flavors, the mining results are stored back in the DBMS.

Cache-Mine: This option is a variation of the stored procedure approach presented above. After the entire data is read from the DBMS, it is cached temporarily on the local disk. This cached data is discarded once the mining process terminates.

User-defined function (UDF): The mining algorithm is expressed as a collection of user-defined functions (UDFs). A UDF is simply an SQL procedure that returns a scalar value. Note that here all the processing is done by the UDFs, and little use is made of the DBMS advantages detailed above (the DBMS simply provides the input to the UDFs). This architecture can be seen as an extreme case of the SQL-OR approach in the SQL architecture (in the SQL architecture, the UDFs are lightweight).
 
 

The Problem

    The problem to be solved by the different architectures defined above is the association rule mining problem. An association rule can be defined as follows: given a set of transactions (a transaction is a set of items), the expression X -> Y is an association rule, where X and Y are sets of items. The intuitive meaning of the rule is that transactions which contain the items in X tend to also contain the items in Y. An example of such a rule might be: "30% of transactions that contain beer also contain diapers; 2% of all transactions contain both these items". Here 30% is called the confidence of the rule, while 2% is referred to as the support of the rule.

    The association rule mining problem is therefore defined as finding all the rules that satisfy a user-defined minimum support and minimum confidence. The problem can be decomposed into two sub-problems:

  1. Find all itemsets whose support exceeds the minimum support; these are called the frequent itemsets.
  2. From each frequent itemset, generate the rules that satisfy the minimum confidence.

The first sub-problem is the most time consuming (the second sub-problem's input is a very small portion of the data, obtained from the results of the first sub-problem), and the article concentrates on its solution.

The Solution

    The solution used is the Apriori algorithm. The algorithm performs several passes over the data; each pass k has two phases: candidate generation, in which we create a superset Ck of all frequent k-itemsets (sets with k items), and support counting, in which we determine which of the elements of Ck are actually frequent in the real data. The itemsets of Ck that meet the minimum support compose Fk, from which the candidates of the next pass are generated.

The Implementation

    We represent the input data in a 2-column transaction table T, where we keep pairs of tid (transaction identifier) and item (item identifier).
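    Using this representation, the support and confidence of a single rule can already be computed in plain SQL-92. The following is a minimal sketch for the beer -> diapers example above, assuming items are stored as the literal values shown; the table and column names follow the representation just described:

    SELECT bd.cnt * 1.0 / tot.cnt AS support,     -- fraction of all transactions
           bd.cnt * 1.0 / b.cnt   AS confidence   -- fraction of beer transactions
    FROM  (SELECT COUNT(DISTINCT t1.tid) AS cnt   -- transactions with both items
             FROM T t1, T t2
            WHERE t1.tid = t2.tid
              AND t1.item = 'beer'
              AND t2.item = 'diapers') bd,
          (SELECT COUNT(DISTINCT tid) AS cnt FROM T) tot,
          (SELECT COUNT(DISTINCT tid) AS cnt FROM T
            WHERE item = 'beer') b

    Mining, of course, must find all such rules at once rather than check one rule; that is what the Apriori implementations below do.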

    The first part of the Apriori algorithm, candidate generation, can be easily implemented using SQL-92. We perform a join step, in which we make a k-way join of Fk-1 with itself (an SQL join) to obtain Ck, and then perform the prune step, in which we remove from Ck every itemset that has at least one (k-1)-subset which is not in Fk-1 (the logic behind this step is that frequent sets of size k can only emerge from frequent sets of size k-1; thus, we cannot be frequent if our "parents" were infrequent). A sketch is given below.
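    A minimal sketch of candidate generation for pass k = 3, assuming each Fk / Ck table has columns item1, ..., itemk; the first join pairs frequent 2-itemsets sharing a 1-item prefix, and the extra join implements the prune step:

    INSERT INTO C3
    SELECT i1.item1, i1.item2, i2.item2
    FROM   F2 i1, F2 i2, F2 i3
    WHERE  i1.item1 = i2.item1     -- join step: shared prefix
      AND  i1.item2 < i2.item2     -- ordering avoids duplicate candidates
      AND  i3.item1 = i1.item2     -- prune step: the remaining 2-subset
      AND  i3.item2 = i2.item2     --   (item2, item3) must also be frequent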

    The second part of the algorithm (support counting) is more time consuming. Several implementations (the first two are SQL-92 based, the others are SQL-OR based) will be presented, each with a short discussion, followed by a summary of the performance tests that were conducted to compare them.

K-way joins: In the k-th pass, we join Ck with k copies of the transaction table, and perform a group by on the resulting itemsets (to count how often each candidate occurs). For k=2 we can use a small optimization which replaces C2 with the 2-way join of F1 with itself.
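    A minimal sketch of the query shape for pass k = 2, shown with C2 for clarity (i.e., without the pass-2 optimization mentioned above); the minimum support appears as an illustrative absolute count:

    INSERT INTO F2
    SELECT c.item1, c.item2, COUNT(*)
    FROM   C2 c, T t1, T t2
    WHERE  t1.item = c.item1
      AND  t2.item = c.item2
      AND  t1.tid  = t2.tid        -- both items occur in the same transaction
    GROUP BY c.item1, c.item2
    HAVING COUNT(*) >= 100         -- illustrative minimum support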

Subquery-based: Here we exploit the common prefixes among the itemsets of Ck (recall that candidates "share" the frequent sets they were created from) to reduce the number of necessary joins. In the l-th subquery, we find all tids that match the distinct itemsets formed by the first l columns of Ck; this is done by joining the output of the (l-1)-th subquery with T and those distinct prefixes. For pass 2, the special optimization of the K-way joins approach is used.
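    A sketch of the subquery shape, instantiated for k = 2 (in pass 2 the article actually uses the K-way join optimization; this only illustrates how subquery Ql is built from Ql-1, T, and the distinct l-column prefixes of Ck):

    SELECT q2.item1, q2.item2, COUNT(*)
    FROM  (SELECT d2.item1, d2.item2, q1.tid
             FROM T t2,
                  (SELECT DISTINCT item1, item2 FROM C2) d2,
                  (SELECT d1.item1, t1.tid          -- subquery Q1
                     FROM T t1,
                          (SELECT DISTINCT item1 FROM C2) d1
                    WHERE t1.item = d1.item1) q1
            WHERE q1.item1 = d2.item1
              AND q1.tid   = t2.tid
              AND t2.item  = d2.item2) q2           -- subquery Q2
    GROUP BY q2.item1, q2.item2
    HAVING COUNT(*) >= 100                          -- illustrative minimum support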

    Note: both approaches above, which rely strictly on SQL-92 (without object-relational extensions), were found to be much slower than the SQL-OR solutions presented below, so they will be omitted from the discussion that follows. The SQL-OR implementations are detailed below:

GatherJoin: Generates all k-item combinations of the items contained in a transaction, joins them with Ck, and performs a group by on the join result. To do so, we use two table functions (a table function is a virtual table which generates tuples on the fly; it is an object-relational extension): Gather, which generates for each transaction the list of items contained in it, and Comb-K, which returns all k-item combinations of a transaction. We then join the combinations with Ck and group by the result. The special pass-2 optimization is applied here as well, joining F1 with itself to obtain C2. A sketch of the query shape follows.
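    A sketch of the GatherJoin query shape for pass k = 2. The table function invocation syntax below is hypothetical (real syntax is DBMS-specific), and Comb2 stands for Comb-K with k = 2:

    SELECT c.item1, c.item2, COUNT(*)
    FROM   C2 c,
           TABLE (Gather(T)) g,     -- one (tid, item-list) row per transaction
           TABLE (Comb2(g))  t      -- all 2-item combinations of each list
    WHERE  t.item1 = c.item1
      AND  t.item2 = c.item2
    GROUP BY c.item1, c.item2
    HAVING COUNT(*) >= 100          -- illustrative minimum support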
 
GatherCount: This is a variation of the GatherJoin approach in which, for the second pass, we perform the group by operation inside the table function. To do so, we represent C2 as a 2-dimensional array. Performing this conversion for k > 2 is more complicated, as it would force us to maintain Ck in a hash table.

GatherPrune: A variation of GatherJoin in which the join with Ck is performed inside the table function (so the number of combinations to be joined with Ck is reduced). To do so, however, we need to convert Ck into a BLOB (Binary Large OBject) and pass it as an argument to the table function.

Horizontal: This is yet another variation of the GatherJoin approach, in which the Gather table function first transforms the input data into horizontal format (i.e., one row per transaction, listing all the items it contains), and then continues like GatherJoin.

Vertical: This is a somewhat different approach from GatherJoin and its variations. Here we create, for each item in the data, a BLOB containing all the transactions (tids) in which it appears. In the support counting phase, for each itemset in Ck, the tid-lists of all its items are collected and the size of their intersection is counted. A sketch follows.
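    A sketch of the Vertical approach for pass k = 2, using two hypothetical SQL-OR functions: SaveTids, an aggregate that collects the tids of an item into a BLOB, and CountIntersect, a UDF that returns the size of the intersection of two such tid-lists:

    INSERT INTO TidTable            -- phase 1: one tid-list BLOB per item
    SELECT item, SaveTids(tid)
    FROM   T
    GROUP BY item;

    INSERT INTO F2                  -- phase 2: support = |intersection of tid-lists|
    SELECT c.item1, c.item2, CountIntersect(t1.tids, t2.tids)
    FROM   C2 c, TidTable t1, TidTable t2
    WHERE  t1.item = c.item1
      AND  t2.item = c.item2
      AND  CountIntersect(t1.tids, t2.tids) >= 100   -- illustrative minimum support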

SQL-OR Comparisons

    All the SQL-OR approaches given above were tested using 4 datasets (measuring the execution time of the Apriori algorithm under the different implementations). The Horizontal approach was far worse than the others, because of its overhead in transforming the data into horizontal format, so it will be omitted from the discussion from here on.

    The Vertical approach has the best overall performance of them all. Note that, as opposed to Horizontal, the time to convert the data to vertical format is smaller here, since only the frequent items are converted, and they are far fewer than the transactions.

    GatherJoin's performance was worse in the higher passes of the algorithm, because of the large number of combinations generated by the table function. It did perform better than Vertical, however, in some cases in the second pass of the algorithm. In other cases, GatherCount was the better option.

    Two factors that affect the choice between GatherJoin, GatherCount and Vertical are the number of frequent items (on which the cost of Vertical depends quadratically), and the average number of frequent items per transaction (on which the cost of GatherJoin depends).

Architecture Comparisons

    Following the conclusions above, the final SQL-based architecture presented for comparison with the other architectures is a hybrid scheme which chooses the best of the three approaches GatherJoin, GatherCount and Vertical for each pass of the algorithm, based on cost estimation.

    The tests were conducted using the same datasets that were used for the previous comparisons. The following paragraphs summarize different aspects of the comparison between the architectures.

Timing comparison: Cache-Mine was found to have the best performance in most cases, closely followed by the SQL architecture. Stored procedure and UDF had by far the worst timing performance.

Scale-up comparison: Repeating the tests with an increasing number of transactions and an increasing average transaction length did not change the picture: all approaches scale linearly in both cases, but Cache-Mine leads with the lowest curve.

Space comparison: Here UDF and stored procedure require the least amount of space. Cache-Mine and SQL need extra storage comparable in size to the data itself (for the disk cache in Cache-Mine, and for the Vertical format in SQL, which converts the data into a new representation).

Porting: When porting from one OS to another (same DBMS), the SQL and UDF architectures are more portable (the other two approaches are file-system dependent). Porting from one DBMS to another (same OS), however, is harder for SQL and UDF, since these approaches rely on SQL-OR features (UDFs, table functions) which are not implemented in every DBMS.

Final notes

    The Cache-Mine alternative was found to be, in most cases, the best alternative to use. The results do show, however, that the SQL architecture is a legitimate alternative, which has its advantages in several aspects (such as porting), and using it does not incur a heavy performance penalty.

    In my view, what is a little lacking in the article is a more detailed description of how the other approaches (Cache-Mine, stored procedure, UDF) are implemented. Having this knowledge would have given a wider understanding of the results of the final comparison between the architectures (that is, the results for the SQL architecture can be explained from the rest of the article, but the results for the other three approaches are given as-is, without knowing what implementation was used for each architecture).
 