Scalable Mining For Classification Rules in Relational Databases

Scalable Mining For Classification Rules in Relational Databases

Min Wang Bala Iyer Jeffrey Scott Vitter

Article Summery By Nadav Grossaug

Abstract

The article presents some of the problems the authors think as important in the classification algorithms that are used and known today :

all state of the art algorithms require O(N) in memory data structure

the best previously known classifier needs quadratic number of I/O for large N

these problems becomes critical when N increases.

The authors present a new classification algorithm called MIND (MINing in Databases) which is :

scalable over I/O and #CPU (can be run at parallel with no extra cost)

Implemented over Relational Database using SQL – which makes its implementation easy.

Do not need an O(N) in memory Data Structure.

The Classification problem (informally)

We are given a training set (or DETAIL table) consisting of many training examples. Each training example is a row with multiple attributes, one of which is a class label. The objective of the classification is to process the DETAIL table and produce a classifier, which contains a description(model) for each class. The models will be used to classify future data for which the class labels are unknown.

The Algorithm

This pseudo code gives the Overview for the algorithm to Create the Decision tree Classifier. , The DETAIL table is the set of training Examples, the gini index is the splitting index (definition of gini index can be found in the article).

GrowTree(DETAIL TABLE)

Initialize tree T and put all records of DETAIL in root

while (some leaf in T is not a STOP node)

for each attribute i do

evaluate gini index for each non-STOP leaf at

each split value with respect to attribute i

for each non-STOP leaf do

get the overall best split for it;

partition the records and grow the tree for one more level according to best splits;

mark all small or pure leaves as STOP nodes;

return T;

As seen the most I/O and CPU consuming part of the algorithm is computing the best split point for each non-STOP node, this is done by computing the gini index for each split point and each attribute for that node.

Another I/O consuming task is to update the node each record belongs to after a split ,in MIND implementation update of the records is not needed, this is done by a function that gets as input the classifier Tree (at its present condition) and the record and return the node for which the record belongs, this function can be seen as another attribute to the DETAIL table called leaf_num. This is the key reason for I/O redaction as no Updates are needed after a split.

Database Implementation

The authors describe MIND Database implementation Over Relational Database using SQL. As mentioned above the most consuming part is computing the best split for each node ,this implementation will show how these is done.

For Each Attribute and each level of the tree we compute a dimension table DIMi (where i is the number of attributes ) which hold the count for each attribute value, leaf_num and class, this can be done by i passes over DETAIL table with regular SQL:

INSERT INTO DIMi

SELECT leaf_num,class,attri,count(*)

FROM DETAIL

WHERE leaf_num,<> STOP

GROUP BY leaf_num,class,attri

or by one pass using a special user-defined function.

Size of DIMi = #leaves * #distinct values of attri * #classes can usually be memory resident (when attri has discrete values). From know all I/O and manipulation will be done only on DIMi, from DIMi we can by a series of simple SQL statements get the best split for each node on this level.

When best split for each node is known the partitioning stage can be preformed, each non-STOP leaf is split, the split condition is added to it and two new leafs, the changes are updated in the classifier tree so no UPDATE to records in DETAIL table need to be done in order to indicate which record belongs to which node (it is done at run time by the function described earlier.

Performance

When all DIMi tables are in memory all other manipulation on them is done in memory ,by that the number of passes over DETAIL table is the DEPTH of the classifier TREE (typically the classifier DEPTH is around 10-20), which bring as to I/O complexity of

Experimental Results show very clearly MIND superiority over SPRINT (which is the state of the art classification algorithm known until know) when DETAIL table size is larger then one million records (over the authors Database).

Conclusions

MIND works well because

MIND rephrases the classification to a Database problem

MIND avoid UPDATES the DETAIL table

Parallelism and Scaling Are achieved by the use of RDBMS

MIND uses a user function to get the performance gain in the DIMi creation.

Remarks On the Article

First of all the idea on combining the Relational Database Benefits with the Classification problem is a very good idea especially with the use of the user defined functions to reduce I/O costs, But there are few details that I think need to be cleared:

Some Implementation details are in my opinion not counted in the I/O calculation, that’s because every SQL statement is always executed in memory and then in disk (when committed) so all DIMi creation and other SQL statements also create I/O complexity (usually smaller relatively to #records).

I think there are a lot of fine details that are not in the article and influence the performance of MIND and SPRINT like the Database transaction log, the memory allocated for Database use and memory size in the computer test (maybe with larger amount of memory SPRINT will run better),also the explanation for what to do when DIMi is larger then memory are very brief.

The Use of defined function to calculate the node each record is in helps to reduce I/O and practically eliminates the need to update all the data ,it can be used without the Database implementation, it is interesting to see an implementation of SPRINT with a function like that compared to MIND.

Lacking a benchmark the Authors used a synthetic Discrete uniformly Distributed Database which fits very well to their assumption on MIND, it is interesting to see MIND performs on more real data with different distribution and discrete values from larger ranges (which effect on the size of DIMi and its ability to be memory resident).