Scalable Techniques For Mining Causal Structures

(Summary)

The article describes techniques for finding causal relationships between items when mining market basket data. Knowledge of such causal relationships can greatly enhance the understanding of the data.

A basic tenet of classical statistics is that correlation does not imply causation: a causal relationship between two items cannot be established from a statistical association between them alone.

In this paper, the authors explore the applicability of constraint-based methods for causal discovery.

The main insight in constraint-based methods is that the information about statistical independence and dependence relationships among a set of variables can be used to constrain the possible causal relationships among a subset of those variables.

For example, from the statistical relation between two variables X and Y alone we cannot infer that X is the cause of Y (or vice versa). But if there is a third variable W, then by testing the dependence and independence relationships among all three variables it is sometimes possible to conclude that X causes Y.

There are two major rules for determining a causal relationship between two items: the CCC causality rule and the CCU causality rule.

The CCC causality rule is as follows:

If A, B and C are three variables that are pairwise correlated (CCC: all three pairs (A, B), (B, C) and (A, C) are correlated), and A and C become independent when conditioned on B, then one of the following relations can be inferred:

A ← B → C (B causes A and C);

A → B → C (A causes B and B causes C);

C → B → A (C causes B and B causes A).

If there is some prior knowledge that A has no causes, then the only possible relation is:

A → B → C (A causes B and B causes C).
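
To make the rule concrete, here is a minimal Python sketch of a CCC check. The predicates `correlated` and `cond_independent` are assumed helpers (they could be built on the chi-squared tests discussed below); all names are illustrative, not the paper's actual implementation.

```python
def apply_ccc(a, b, c, correlated, cond_independent, known_exogenous=()):
    """CCC rule sketch. `correlated(x, y)` and `cond_independent(x, y, m)`
    are assumed statistical predicates; `known_exogenous` lists variables
    known a priori to have no causes. Edges (x, y) mean x -> y; the result
    is a list of candidate edge sets."""
    # Precondition: all three pairs are correlated.
    if not (correlated(a, b) and correlated(b, c) and correlated(a, c)):
        return []
    # Look for a middle variable m that separates the other two.
    for x, m, y in ((a, b, c), (b, a, c), (a, c, b)):
        if cond_independent(x, y, m):
            if x in known_exogenous:    # x has no causes => x -> m -> y
                return [[(x, m), (m, y)]]
            if y in known_exogenous:    # y has no causes => y -> m -> x
                return [[(y, m), (m, x)]]
            return [
                [(m, x), (m, y)],       # x <- m -> y
                [(x, m), (m, y)],       # x -> m -> y
                [(y, m), (m, x)],       # y -> m -> x
            ]
    return []
```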

The CCU causality rule is as follows:

If A, B and C are three variables such that (A, B) and (A, C) are correlated while (B, C) is uncorrelated (CCU: two pairs are correlated and one pair is uncorrelated), and B and C become dependent when conditioned on A, then the following relation can be inferred:

B → A ← C (B causes A and C causes A).
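
A corresponding sketch for the CCU rule, under the same assumptions about the test predicates:

```python
def apply_ccu(a, b, c, correlated, uncorrelated, cond_dependent):
    """CCU rule sketch: (a, b) and (a, c) are correlated, (b, c) is
    uncorrelated, and b and c become dependent when conditioned on a.
    Both edges are then oriented into the common variable a. The
    predicate names are assumed helpers, not the paper's API."""
    if (correlated(a, b) and correlated(a, c) and uncorrelated(b, c)
            and cond_dependent(b, c, a)):
        return [(b, a), (c, a)]   # b -> a <- c
    return []
```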

The algorithms described in the paper use the CCC and CCU rules to determine the causal relations between variables. The tests for the CCC and CCU rules are local, i.e., they examine only three variables at a time and do not explore the relationships among all variables of the causal Bayesian network.

The local focus of the algorithms keeps them simple and relatively fast for the size of the databases they work on, but at the same time they may produce a fragmented picture of the causal relationships. For example, if there is a hypothetical causal network:

W → X → Y → Z → A, the algorithms may output the following pairwise causal relationships: X → A, X → Z, Y → Z, Y → A, Z → A. These are not all the possible causal relations, and the global relationship among the variables is lost.

Another limitation of the algorithms is the accumulated statistical error. Statistical tests are performed to decide whether the CCC or CCU rule is satisfied for a given triple of variables. Each test for dependence, independence, conditional dependence or conditional independence between two variables carries some probability of error. Since the tests are performed many times, the errors accumulate, which may lead to faulty causal relations in the output.

The statistical tests in the algorithms are based on the chi-squared test. The chi-squared value computed for two variables is compared against a threshold; depending on whether the value exceeds the threshold, the variables are declared correlated or uncorrelated, respectively. The probability of error is highest near the threshold.

To reduce this error, two thresholds are chosen: one for correlation and one for uncorrelation. If the chi-squared value falls between the two thresholds, the pair is left unclassified. This may cause the algorithms to discard some causal relations, but it reduces the statistical error and thus increases the credibility of the output.
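
As an illustration, the two-threshold classification might look like the following sketch. The threshold values are placeholders (roughly the one-degree-of-freedom chi-squared cutoffs for p = 0.001 and p = 0.5), and scipy is used only for convenience; the paper does not prescribe a specific implementation.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative placeholder thresholds; the paper does not fix these values.
CORRELATION_THRESHOLD = 10.83
UNCORRELATION_THRESHOLD = 0.45

def classify_pair(table):
    """Classify a 2x2 contingency table of two items.

    Returns "correlated", "uncorrelated", or "undetermined" when the
    chi-squared statistic falls between the two thresholds.
    """
    chi2, _, _, _ = chi2_contingency(np.asarray(table), correction=False)
    if chi2 >= CORRELATION_THRESHOLD:
        return "correlated"
    if chi2 <= UNCORRELATION_THRESHOLD:
        return "uncorrelated"
    return "undetermined"  # discarded, to keep the statistical error low

# Example: basket counts by presence/absence of two items.
# classify_pair([[400, 100], [100, 400]])  -> "correlated"
```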

The naïve algorithm iterates over all triples of items (variables) and checks whether the current triple satisfies the conditions for CCC or CCU causality.
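
A sketch of this naive triple loop, assuming a `pair_label(x, y)` classifier built like `classify_pair` above (taking two items and returning their label) and `test_ccc` / `test_ccu` checks like the earlier sketches with the statistical predicates already bound in:

```python
from itertools import combinations

def naive_miner(items, pair_label, test_ccc, test_ccu):
    """Naive algorithm sketch: iterate over every triple of items and
    dispatch to the CCC or CCU check based on the pairwise labels.
    Each check returns a (possibly empty) list of inferred relations."""
    inferences = []
    for a, b, c in combinations(items, 3):
        pairs = ((a, b), (b, c), (a, c))
        labels = [pair_label(x, y) for x, y in pairs]
        if labels.count("correlated") == 3:
            inferences.extend(test_ccc(a, b, c))        # CCC candidate
        elif labels.count("correlated") == 2 and "uncorrelated" in labels:
            # The variable shared by the two correlated pairs is the
            # collider candidate of the CCU rule.
            u, v = next(p for l, p in zip(labels, pairs)
                        if l == "uncorrelated")
            common = ({a, b, c} - {u, v}).pop()
            inferences.extend(test_ccu(common, u, v))   # CCU candidate
        # triples with an "undetermined" pair are not processed further
    return inferences
```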

The algorithm is improved by first finding all CC-paths of a specific item A (a CC-path of item A is, for example, B-A-C where (B, A) and (C, A) are correlated). The algorithm iterates over all CC-paths for each item and checks whether the third pair is correlated or uncorrelated (in the example, the third pair is (B, C)). If the third pair is correlated, the triple is checked for CCC causality; if it is uncorrelated, the triple is checked for CCU causality; otherwise (the third pair is neither correlated nor uncorrelated) the triple is not processed further.
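
The CC-path refinement might look as follows, with the same assumed helpers:

```python
from itertools import combinations

def cc_paths_miner(items, pair_label, test_ccc, test_ccu):
    """CC-path refinement sketch: for each item a, first gather the items
    correlated with a, then enumerate CC-paths b - a - c and classify only
    the remaining pair (b, c). Helper names follow the earlier sketches."""
    inferences = []
    for a in items:
        partners = [x for x in items
                    if x != a and pair_label(a, x) == "correlated"]
        for b, c in combinations(partners, 2):   # CC-path b - a - c
            third = pair_label(b, c)
            if third == "correlated":            # all three pairs correlated
                inferences.extend(test_ccc(a, b, c))
            elif third == "uncorrelated":        # CCU pattern, collider a
                inferences.extend(test_ccu(a, b, c))
            # "undetermined": the triple is not processed further
    return inferences
```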

The same approach can be applied to all CU-paths of a specific item A (a CU-path of item A is, for example, B-A-C where (B, A) is correlated and (C, A) is uncorrelated). This variant can check only for CCU causality: it verifies whether the third pair is correlated.

The runtime of these algorithms depends on the number of correlated and uncorrelated pairs of items in the database.

The CC-paths algorithm runs faster when there are fewer correlated pairs than uncorrelated ones; for the CU-paths algorithm it is just the opposite.

In addition, when checking for CCC causality the CC-paths algorithm needs some prior knowledge about the data in order to narrow down the possible causal relations among the three variables. Without such prior knowledge, the CCC causality rule is not very helpful or meaningful.

The described algorithms find causal relations with the efficiency needed for the large data sets they work on. They do not find all possible causal relations, but this is acceptable in the data mining context, where the main goal is to explore the data rather than to test a hypothesis.

Finding causal relationships is useful for a variety of reasons. One is that it can help in visualizing relationships among variables. Another is that, unlike correlation, causation is asymmetric; in the context of text analysis, for example, this asymmetry can help identify phrases.

A third reason is that there are many more correlation relationships than causal relationships, and it is much easier to work with a small set of rules than with a large one. Causal relationships can thus be very useful for understanding the data and can support decision-making based on it.

There are still a variety of unresolved and unexplored issues in the area of causal relationships.

The authors of the article suggest several ideas for further research, such as:

  1. Can we devise efficient algorithms for using incremental causal information to perform disambiguation?
  2. Is there a principled way to resolve bi-directional causality for disambiguation?
  3. Is there a way to determine optimal values for the correlation and uncorrelation thresholds for a given data set?
  4. Is it possible to replace the cutoff values with efficient estimates of the probability of correlation?
  5. How do hidden variables influence the causal relationships within a local subset of variables?
  6. Are there heuristics that can be used to make the algorithms more efficient?

Here are some ideas of mine for improving the algorithms:

1) The output causal relations form a directed graph, and it is difficult for the user to see the global picture of the relations between the variables. An algorithm for finding longest disjoint paths in a graph can be run on the output of the causal discovery algorithms; the new output would give the user a much more global picture of the causal relations between the items.
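
One possible realization of this idea, using networkx; the greedy extraction below is just one interpretation of "longest disjoint paths" and assumes the mined edge set forms a DAG (cyclic or bi-directional output would need to be resolved first):

```python
import networkx as nx

def global_view(edges):
    """Greedily extract node-disjoint long paths from the mined causal
    edges (pairs (x, y) meaning x -> y). Assumes the edge set forms a
    DAG; cyclic or bi-directional output would need resolving first."""
    g = nx.DiGraph(edges)
    chains = []
    while g.number_of_edges() > 0:
        path = nx.dag_longest_path(g)   # longest chain in what remains
        chains.append(path)
        g.remove_nodes_from(path)       # keep the extracted paths disjoint
    return chains
```

On the example output above (X → A, X → Z, Y → Z, Y → A, Z → A), this would recover a chain such as X → Z → A, which is more informative to the user than five scattered pairwise relations.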

2) The algorithms output only causal relationships between variables. Sometimes, however, it is possible to conclude with certainty that there is no causation between two variables. An additional output could list the pairs of variables for which the algorithm can tell for sure that no causal relation exists between them. The absence of causation can aid the user's decision-making in the same way that its presence can.
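
One way this could work: whenever the CCU rule fires, the two tail variables were found marginally independent, so (under the usual faithfulness assumption of constraint-based methods) the pair can also be reported as having no causal link. A hypothetical sketch:

```python
def mine_with_exclusions(ccu_triples, test_ccu):
    """Sketch of the proposed extra output. `ccu_triples` yields CCU
    candidate triples (a, b, c) with a the collider candidate, and
    `test_ccu` is the earlier sketch. When the rule fires, b and c were
    found marginally independent, so the pair (b, c) is also reported
    as having no causal relation (hypothetical helper names)."""
    edges, no_causation = [], []
    for a, b, c in ccu_triples:
        found = test_ccu(a, b, c)        # [(b, a), (c, a)] or []
        if found:
            edges.extend(found)
            no_causation.append((b, c))  # certainly not causally linked
    return edges, no_causation
```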

3) When CCC causality is checked, it is important to have some prior knowledge about one of the variables in the triple so that the number of possible causal relations can be reduced. If there are several variables w1, w2, …, wk for which we have prior knowledge that they have no causes, we can run the algorithms first for all triples containing w1, then for all triples containing w2, and so on. At the end, we can take the conjunction (intersection) or the union of the outputs received from the individual runs (see the sketch below).

The causal relations in the conjunction output will have higher credibility.

There will be more causal relations in the union output.

Either way, both outputs can help enhance the understanding of the data.
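
A minimal sketch of this combination step, treating each run's output as a set of directed edges (helper names are mine, not the paper's):

```python
def combine_runs(runs):
    """Combine the edge sets produced by runs seeded with the different
    no-cause variables w1, ..., wk (assumes at least one run). The
    conjunction favours credibility; the union favours coverage."""
    edge_sets = [set(run) for run in runs]
    conjunction = set.intersection(*edge_sets)
    union = set.union(*edge_sets)
    return conjunction, union
```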