Authors:

Andrew Moore
Mary Soon Lee
Brigham Anderson

School of Computer Science and Robotics Institute
Carnegie Mellon University, Pittsburgh PA 15213

Summarized by:

Nir Meirav

School of Mathematical Science
Tel-Aviv University, Tel Aviv 69978
ISRAEL

Abstract

The two papers introduce a new data structure and algorithm for quick counting in machine learning datasets. Counting is a fundamental task for machine learning algorithms that operate on datasets of symbolic attributes, so it is important to make it computationally efficient, especially on large datasets. The authors give a very sparse data structure, the ADtree, that minimizes memory use and thus remains applicable to large datasets, together with an algorithm operating on it that constructs contingency tables. Under several assumptions, the cost of these operations is shown to be independent of the number of records in the dataset and log-linear in the number of non-zero entries in the contingency table.

The authors give an example of applying their method to the problem of discovering association rules in large datasets. They show how the method can significantly accelerate an exhaustive search for rules compared to traditional counting, presenting results on a variety of datasets involving many records and attributes.
 
 

1. Caching Sufficient Statistics

Many machine learning algorithms operating on datasets of symbolic attributes need to do frequent counting. The work is also applicable to Online Analytical Processing (OLAP) and Data Mining (DM), where operations on large datasets, such as multidimensional database accesses, DataCube operations, and discovering association rules, could benefit from fast counting.

Notation:

We are given a dataset with R records and M attributes. The attributes are called a1, a2, ..., aM. The value of attribute ai in the kth record is a small integer in the range {1, 2, ..., ni}, where ni is called the arity of attribute ai.
Figure 1 shows an example of a dataset.
 

A query is a set of (attribute = value) pairs whose left-hand sides form a subset of {a1, ..., aM}, arranged in increasing order. The total number of queries is (n1 + 1)(n2 + 1) ··· (nM + 1), because each attribute can either appear with one of its values or not appear at all (equivalently, take the "don't care" value ai = *). Some examples of queries are

(a1 = 1); (a2 = 3, a3 = 1); ()

The count of a query, denoted C(Query), is simply the number of records in the dataset matching all the (attribute = value) pairs in Query. For the above three queries and the dataset given in Figure 1, the counts are

C(a1 = 1) = 3

C(a2 = 3, a3 = 1) = 4

C() = 6
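
To make the definitions concrete, here is a minimal brute-force counter in Python. The toy dataset is my own invention (Figure 1 is not reproduced here), chosen so that the three counts above come out as quoted; a query is represented as a dict mapping attribute index to value.

def count(dataset, query):
    """Return C(query): the number of records matching every pair."""
    return sum(
        all(record[attr] == value for attr, value in query.items())
        for record in dataset
    )

# Six records over attributes a1, a2, a3 (indices 0, 1, 2).
dataset = [
    (1, 1, 1), (1, 3, 1), (1, 3, 1),
    (2, 3, 1), (2, 3, 1), (2, 2, 2),
]
print(count(dataset, {0: 1}))        # C(a1 = 1) = 3
print(count(dataset, {1: 3, 2: 1}))  # C(a2 = 3, a3 = 1) = 4
print(count(dataset, {}))            # C() = 6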

Contingency tables:

Each subset of attributes ai(1), ..., ai(n) has an associated contingency table, denoted ct(ai(1) ... ai(n)). This is a table with a row for each possible set of values for ai(1), ..., ai(n). The row corresponding to ai(1) = v1, ..., ai(n) = vn records the count C(ai(1) = v1, ..., ai(n) = vn). The dataset in Figure 1 has 8 contingency tables (Figure 2).
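
The same scan-based counter extends directly to contingency tables. A minimal sketch, reusing count() and the toy dataset from the previous example; contingency_table is a hypothetical helper of my own, not the authors' code:

from itertools import product

def contingency_table(dataset, attrs, arities):
    """ct(attrs): one entry per combination of values of attrs."""
    return {
        values: count(dataset, dict(zip(attrs, values)))
        for values in product(*(range(1, arities[a] + 1) for a in attrs))
    }

# ct(a2, a3): a2 has arity 3, a3 has arity 2. A 3-attribute dataset
# has 2^3 = 8 such tables, one per subset of the attributes.
print(contingency_table(dataset, (1, 2), [2, 3, 2]))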
 
 




It is quite easy to see that there exists a mechanism that counts in constant time: simply cache the contingency table for each query. But doing so for even a very small dataset could require a vast amount of memory. For example, a dataset with 20 attributes, each of arity 2, yields (2 + 1)^20 = 3^20 ≈ 3.5 × 10^9 possible queries, i.e., gigabytes of cached counts.

The ADtree data structure:

Figure 3 describes an ADtree. An ADnode (shown as a rectangle) has child nodes called “Vary nodes” (shown as ovals). Each ADnode represents a query and stores the number of records that match the query. The Vary aj child of an ADnode has one child for each of the nj values of attribute aj. The kth such child represents the same query as Vary aj’s parent, with the additional constraint aj = k.

Note that an ADnode whose most recently constrained attribute is ai has only the children

Vary ai+1, ... , Vary aM

because information for Vary nodes with indices below i + 1 can be obtained from an already existing path in the ADtree.
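
In code, the tree's shape might be captured as follows; a minimal Python sketch in which the class names and fields (ADNode, VaryNode) are my own choices, not the authors':

class ADNode:
    """A rectangle: one query, its count, and its Vary children."""
    def __init__(self, count):
        self.count = count
        self.vary = {}    # attribute index j -> VaryNode for "Vary a_j"

class VaryNode:
    """An oval: for attribute a_j, one ADNode child per value k."""
    def __init__(self):
        self.mcv = None       # most common value, used by the sparse tree
        self.children = {}    # value k -> ADNode with the extra constraint a_j = k

Per the note above, a node only stores Vary children for attributes after its most recently constrained one.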
 

Cache reduction I: cutting off nodes with count zero

We can store a NULL pointer instead of a node for any query that matches zero records; all specializations of such a query have count zero too. In the worst case we still have (n1 + 1)(n2 + 1) ··· (nM + 1) possible nodes, but in practice this can save a considerable amount of memory.
 

Cache reduction II: the sparse ADtree

Each Vary aj node in the above ADtree stores nj subtrees. Instead, we find the most common value of aj (call it the MCV) and store a NULL in place of the MCVth subtree; the remaining nj − 1 subtrees are unchanged. For binary attributes, the number of nodes that need to be stored in the sparse ADtree is now bounded by 2^M instead of 3^M, because omitting one value of each aj leaves at most n1 · n2 ··· nM possible queries. We next show how it is possible to compute a full contingency table from the sparse ADtree.
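
Putting the two reductions together, here is a sketch of the recursive construction: my own illustration built on the classes and toy dataset above, where rows holds the indices of the records matching the current query and attributes are indexed from 0.

def build_adnode(dataset, arities, first_attr, rows):
    """Build the ADNode for the query matched by `rows`, expanding
    only the Vary a_j children with j >= first_attr."""
    node = ADNode(len(rows))
    for j in range(first_attr, len(arities)):
        vary = VaryNode()
        # Partition the matching rows by their value of attribute a_j.
        by_value = {k: [] for k in range(1, arities[j] + 1)}
        for r in rows:
            by_value[dataset[r][j]].append(r)
        # Cache reduction II: omit the most common value's subtree.
        vary.mcv = max(by_value, key=lambda k: len(by_value[k]))
        for k, sub in by_value.items():
            if k == vary.mcv:
                continue                     # recovered later by subtraction
            elif not sub:
                vary.children[k] = None      # cache reduction I: count zero
            else:
                vary.children[k] = build_adnode(dataset, arities, j + 1, sub)
        node.vary[j] = vary
    return node

# Build the sparse ADtree for the toy dataset (arities of a1, a2, a3).
root = build_adnode(dataset, [2, 3, 2], 0, list(range(len(dataset))))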

Computing contingency tables from the sparse ADtree

We are given an ADtree and an arbitrary set of attributes {ai(1), ..., ai(n)} for which we wish to quickly compute the contingency table. Notice that the conditional contingency table ct(ai(1) ... ai(n) | Query) can be built recursively. We first build

ct(ai(2) ... ai(n) | ai(1) = 1, Query)
ct(ai(2) ... ai(n) | ai(1) = 2, Query)
...
ct(ai(2) ... ai(n) | ai(1) = ni(1), Query)

and then combine them into one table.

Note too that we do not have to specify the query explicitly; instead, we provide the corresponding ADnode of the ADtree.

The next algorithm shows how to build a contingency table for a set of attributes and an ADnode:
 
MakeContab( {ai(1) ... ai(n)}, ADN )
    if n = 0, return the single-row table containing ADN's count
    VN := the Vary ai(1) subnode of ADN
    MCV := VN.MCV
    for k := 1, 2, ..., ni(1)
        if k ≠ MCV
            ADNk := the ai(1) = k subnode of VN
            CTk := MakeContab( {ai(2) ... ai(n)}, ADNk )
    CTMCV := MakeContab( {ai(2) ... ai(n)}, ADN ) − Σk≠MCV CTk
    Return the concatenation of CT1 ... CTni(1)

 
When we iterate through the algorithm, we are unable to compute the conditional contingency table CTMCV for ai(1) = MCV directly, because this subtree is missing. But notice the following property of contingency tables:
 
ct(ai(2) ... ai(n) | Query) = Σk=1,...,ni(1) ct(ai(2) ... ai(n) | ai(1) = k, Query).

We can use the above algorithm to obtain ct(ai(2) ... ai(n) | Query) by calling

MakeContab( {ai(2) ... ai(n)}, ADN )

and thus

CTMCV = MakeContab( {ai(2) ... ai(n)}, ADN ) − Σk≠MCV CTk
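
Continuing the sketch, here is a Python rendering of MakeContab over the tree built above. It represents a contingency table as a flat list of counts in lexicographic value order, and it assumes attrs is given in increasing order, matching the Vary-children convention; this is my own rendering, not the authors' code.

def make_contab(attrs, node, arities):
    """Return ct(attrs | node's query) as a flat list of counts."""
    if not attrs:
        return [node.count if node is not None else 0]
    first, rest = attrs[0], attrs[1:]
    rest_size = 1
    for a in rest:
        rest_size *= arities[a]
    if node is None:                  # pruned: every count below is zero
        return [0] * (arities[first] * rest_size)
    vn = node.vary[first]
    tables = {k: make_contab(rest, child, arities)
              for k, child in vn.children.items()}
    # The MCV slice is missing from the tree: recover it by subtracting
    # the other slices from the table conditioned on the parent's query.
    parent = make_contab(rest, node, arities)
    others = list(tables.values())
    tables[vn.mcv] = [p - sum(t[i] for t in others)
                      for i, p in enumerate(parent)]
    return [c for k in range(1, arities[first] + 1) for c in tables[k]]

# ct(a2, a3) from the sparse tree; matches the brute-force table above.
print(make_contab((1, 2), root, [2, 3, 2]))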

The complexity of building a contingency table with the above method, when all n attributes have arity k, is given by the following recurrence relation

C(1) = k
C(n) = kC(n − 1) + (k − 1)k^(n−1) if n > 1

because we make k calls to MakeContab with n − 1 attributes (k − 1 calls for the values k ≠ MCV, plus one call for the parent table used to recover k = MCV), and recovering the k = MCV slice subtracts k − 1 contingency tables of k^(n−1) entries each. The solution for this relation is

C(n) = (1 + n(k − 1))k^(n−1)

If we did not cache any data, we would need O(nR + k^n) operations, where R is the number of records in the dataset. When k^n << R, this method is significantly cheaper than simple full table scans. Since we are interested in large datasets, this result is very promising: the authors show that on datasets where R > 100,000, the improvement achieved is of orders of magnitude.

Note also that this result is independent of M, the number of attributes in the dataset.

Leaf-Lists

It is not worth building the ADtree data structure for a small number of records. Instead, we set a number Rmin to be the number of rows under which we do not expand the subtree, but simply store a set of pointers into the dataset. The major consequence of this change is that the dataset now needs to be retained in main memory.
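
A possible rendering of this cutoff on top of build_adnode above; Rmin = 16 is an arbitrary value for illustration, and in a full implementation the recursive calls inside build_adnode would also go through this check.

R_MIN = 16   # hypothetical threshold; tuned per dataset in practice

def build_with_leaf_lists(dataset, arities, first_attr, rows):
    if len(rows) < R_MIN:
        # Too few rows to justify a subtree: keep pointers back into
        # the dataset and answer queries below this node by scanning.
        node = ADNode(len(rows))
        node.leaf_list = list(rows)
        return node
    # Otherwise expand the Vary children exactly as before.
    return build_adnode(dataset, arities, first_attr, rows)

A matching base case that scans node.leaf_list would have to be added to make_contab, which is why the dataset itself must stay resident in memory.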
 
 

2. Fast Learning of Association Rules

The problem of discovering association rules in large databases has received considerable research attention. Complex association rules are capable of representing concepts such as "PurchasedChips = TRUE" and "PurchasedSoda = FALSE" and "CustomerType = OCCASIONAL" => "AgeRange = YOUNG". Such rules are very useful in data mining applications, being both intuitive and actionable.

Problem Definition

Define a dataset as in the previous section. We define a literal as an attribute-value pair such as "education = master". Let L be the set of all possible literals for a dataset. An association rule is an implication of the form S1 => S2, where S1, S2 ⊆ L and S1, S2 are disjoint.

Each rule has a measure of statistical significance called support. For a set of literals S ⊆ L, the support of S, denoted supp(S), is the number of records in the dataset that match all the attribute-value pairs in S. We define the support of the rule S1 => S2 to be supp(S1 ∪ S2), the support of the two literal sets taken together. A measure of the strength of the rule is called confidence, and is defined to be

supp(S1 ∪ S2) / supp(S1)
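
These definitions are easy to state as code; a minimal sketch reusing the toy dataset from the first example, where a set of literals is a Python set of (attribute, value) pairs (my own convention):

def support(dataset, literals):
    """supp(S): records matching every (attribute = value) pair in S."""
    return sum(all(r[a] == v for a, v in literals) for r in dataset)

def confidence(dataset, s1, s2):
    """Confidence of the rule S1 => S2."""
    return support(dataset, s1 | s2) / support(dataset, s1)

# Hypothetical rule: (a1 = 1) => (a3 = 1).
s1, s2 = {(0, 1)}, {(2, 1)}
print(support(dataset, s1 | s2), confidence(dataset, s1, s2))   # 3 1.0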

Mining association rules that predict a user-supplied target set of literals S2 requires calculating large numbers of rule confidences and supports. This means counting through the dataset over and over, once for each given set of literals, so even a small improvement in this task has a great influence on the overall performance.

Figure 4 shows the different datasets used by the authors to test their improved counting. The experiments ran the CN2 algorithm, which searches for association rules, comparing two methods of counting: a simple table scan versus the ADtree data structure and algorithm described above. Notice the short build time of the ADtree and the low amount of memory it consumes.
 
 

Figure 5 shows some results.

Note that on all datasets except the BIRTH dataset, a significant speedup was achieved. The problem with the BIRTH dataset was its sparseness: 95% of the values of its 70 attributes are "False". This caused the algorithm to encounter MCVs many times, thereby spawning longer searches.

3. Open Problems

Here are two examples of open problems that the authors describe. The articles give more than just two, but these are the most serious problems in my opinion:

The ADtree is designed entirely for symbolic attributes. When facing numeric attributes, the solution so far is discretization, but this is useless for range queries involving numeric attributes.

Although the tree can be built cheaply and lazily, the ADtree cannot be updated cheaply with a new record, because each new record may match up to 2^M nodes in the tree in the worst case.