Scalable Mining For Classification Rules in Relational Databases

Min Wang Bala Iyer Jeffrey Scott Vitter


Article Summery By Nadav Grossaug


The article presents some of the problems the authors think as important in the classification algorithms that are used and known today :

these problems becomes critical when N increases.

The authors present a new classification algorithm called MIND (MINing in Databases) which is :

The Classification problem (informally)

We are given a training set (or DETAIL table) consisting of many training examples. Each training example is a row with multiple attributes, one of which is a class label. The objective of the classification is to process the DETAIL table and produce a classifier, which contains a description(model) for each class. The models will be used to classify future data for which the class labels are unknown.


The Algorithm

This pseudo code gives the Overview for the algorithm to Create the Decision tree Classifier. , The DETAIL table is the set of training Examples, the gini index is the splitting index (definition of gini index can be found in the article). 


Initialize tree T and put all records of DETAIL in root

while (some leaf in T is not a STOP node)

for each attribute i do

evaluate gini index for each non-STOP leaf at

each split value with respect to attribute i

for each non-STOP leaf do

get the overall best split for it;

partition the records and grow the tree for one more level according to best splits;

mark all small or pure leaves as STOP nodes;

return T;

As seen the most I/O and CPU consuming part of the algorithm is computing the best split point for each non-STOP node, this is done by computing the gini index for each split point and each attribute for that node.

Another I/O consuming task is to update the node each record belongs to after a split ,in MIND implementation update of the records is not needed, this is done by a function that gets as input the classifier Tree (at its present condition) and the record and return the node for which the record belongs, this function can be seen as another attribute to the DETAIL table called leaf_num. This is the key reason for I/O redaction as no Updates are needed after a split.


Database Implementation

The authors describe MIND Database implementation Over Relational Database using SQL. As mentioned above the most consuming part is computing the best split for each node ,this implementation will show how these is done.

For Each Attribute and each level of the tree we compute a dimension table DIMi (where i is the number of attributes ) which hold the count for each attribute value, leaf_num and class, this can be done by i passes over DETAIL table with regular SQL:


SELECT leaf_num,class,attri,count(*)


WHERE leaf_num,<> STOP

GROUP BY leaf_num,class,attri

or by one pass using a special user-defined function.

Size of DIMi = #leaves * #distinct values of attri * #classes can usually be memory resident (when attri has discrete values). From know all I/O and manipulation will be done only on DIMi, from DIMi we can by a series of simple SQL statements get the best split for each node on this level.

When best split for each node is known the partitioning stage can be preformed, each non-STOP leaf is split, the split condition is added to it and two new leafs, the changes are updated in the classifier tree so no UPDATE to records in DETAIL table need to be done in order to indicate which record belongs to which node (it is done at run time by the function described earlier.



When all DIMi tables are in memory all other manipulation on them is done in memory ,by that the number of passes over DETAIL table is the DEPTH of the classifier TREE (typically the classifier DEPTH is around 10-20), which bring as to I/O complexity of

Experimental Results show very clearly MIND superiority over SPRINT (which is the state of the art classification algorithm known until know) when DETAIL table size is larger then one million records (over the authors Database).


MIND works well because


Remarks On the Article

First of all the idea on combining the Relational Database Benefits with the Classification problem is a very good idea especially with the use of the user defined functions to reduce I/O costs, But there are few details that I think need to be cleared: