Exercise 4

Classification and Clustering

Question 1 (theoretical warming-up).

Show that the misclassification rate, the information index and the Gini index satisfy the properties of an impurity measure, that is, the corresponding function g(p1,...,pJ):
  1. has its unique maximum at (1/J,...,1/J)
  2. achieves its minimum only at (1,0,...,0), (0,1,0,...,0), ..., (0,...,0,1)
  3. is a symmetric function of p1,...,pJ, i.e. it is invariant to permutations of the pj's
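
Before attempting the proofs, it may help to check the three properties numerically. The sketch below assumes the usual definitions (misclassification rate 1 - max pj, information index -sum pj log pj, Gini index 1 - sum pj^2); the function names are illustrative and not part of any package. It is a sanity check, not a proof.

```r
# Candidate impurity measures for a class-probability vector p
# (p >= 0, sum(p) == 1). Definitions are assumed, names are illustrative.
misclass <- function(p) 1 - max(p)
entropy  <- function(p) { q <- p[p > 0]; -sum(q * log(q)) }  # 0 * log(0) taken as 0
gini     <- function(p) 1 - sum(p^2)

measures <- list(misclassification = misclass, information = entropy, gini = gini)

J <- 3
uniform <- rep(1 / J, J)   # candidate maximiser
vertex  <- c(1, 0, 0)      # one of the candidate minimisers

# Values at the uniform point and at a vertex of the simplex
check <- sapply(measures, function(g) c(uniform = g(uniform), vertex = g(vertex)))
print(check)

# Symmetry: reordering the probabilities should not change the value
p <- c(0.5, 0.3, 0.2)
symmetric <- sapply(measures, function(g) isTRUE(all.equal(g(p), g(rev(p)))))
print(symmetric)
```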

Question 2.

The data set iris (available in R) contains data on 50 flowers from each of 3 species of iris: Setosa, Versicolor and Virginica (150 flowers in total). Four measurements were made on each flower: sepal length and width, and petal length and width.

The data were collected by Edgar Anderson in 1935. Fisher was the first statistician to study them, in 1936, and since then they have become a famous test case for various classification procedures. 70 years have passed... and now a new generation of young statisticians, equipped with modern statistical techniques partially unknown to Fisher, faces the challenge of these data. Apply the classification procedures you have studied or know (discriminant analysis, logistic regression, k-nearest neighbour, neural networks, classification trees, SVM, etc.). Some brave volunteers may even try boosting, bagging and/or random forests. Discuss ways to compare different classifiers. Comment on the results.
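
One possible way to compare classifiers is by their leave-one-out cross-validated error rates. The sketch below does this for two of the methods mentioned, linear discriminant analysis (lda from the MASS package) and k-nearest neighbours (knn.cv from the class package). The choice k = 5 is arbitrary and only for illustration; other classifiers and other comparison schemes (e.g. a separate test set) are equally legitimate.

```r
library(MASS)   # lda
library(class)  # knn.cv

X <- iris[, 1:4]
y <- iris$Species

# Leave-one-out predictions from linear discriminant analysis
lda_loo <- lda(Species ~ ., data = iris, CV = TRUE)$class

# Leave-one-out predictions from k-nearest neighbours (k = 5 is arbitrary)
set.seed(1)  # knn.cv breaks ties at random
knn_loo <- knn.cv(train = X, cl = y, k = 5)

# Leave-one-out misclassification rates
err <- c(lda = mean(lda_loo != y), knn5 = mean(knn_loo != y))
print(err)

# Confusion matrix for one of the classifiers
print(table(truth = y, lda = lda_loo))
```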

Question 3.

We have already discussed several issues about wines in one of the previous homework exercises. Now we continue with this important and fascinating topic.

The file Winea.dat contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines (the first 13 columns of the data file):

  1. Alcohol
  2. Malic acid
  3. Ash
  4. Alcalinity of ash
  5. Magnesium
  6. Total phenols
  7. Flavanoids
  8. Nonflavanoid phenols
  9. Proanthocyanins
  10. Color intensity
  11. Hue
  12. OD280/OD315 of diluted wines
  13. Proline

The last three columns are indicator variables for the first, second and third cultivar, respectively.

  1. Compare various classifiers and choose a good classifier (or classifiers) for identifying the cultivar from the results of the chemical analysis of a wine. Is there a need to perform the complete chemical analysis? What are the main characteristics that distinguish between cultivars?
  2. The file Wineb.dat contains another set of analogous data. Use it to evaluate your classifier(s) and comment on the results.
  3. L'Haim!
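
As a starting point, the sketch below reads a wine file, turns the three indicator columns into a single class factor, fits a classification tree on Winea.dat and evaluates it on Wineb.dat. The file layout (whitespace-delimited, no header row) is an assumption, and read_wine is a hypothetical helper, not an existing function. The tree (rpart) is just one of many possible classifiers; its variable importance gives a first hint about which constituents matter.

```r
library(rpart)

# Hypothetical helper. Assumption: whitespace-delimited file, no header row,
# constituent columns first, followed by three 0/1 indicator columns.
read_wine <- function(path) {
  raw <- read.table(path, header = FALSE)
  p <- ncol(raw) - 3                      # number of constituent columns
  x <- raw[, seq_len(p)]
  names(x) <- paste0("V", seq_len(p))
  ind <- as.matrix(raw[, p + 1:3])        # indicator columns
  # The position of the 1 in each row gives the cultivar number
  x$cultivar <- factor(max.col(ind, ties.method = "first"), levels = 1:3)
  x
}

if (file.exists("Winea.dat") && file.exists("Wineb.dat")) {
  train <- read_wine("Winea.dat")
  test  <- read_wine("Wineb.dat")

  fit  <- rpart(cultivar ~ ., data = train, method = "class")
  pred <- predict(fit, newdata = test, type = "class")

  print(table(truth = test$cultivar, predicted = pred))  # test-set confusion matrix
  print(fit$variable.importance)                         # which constituents the tree uses
}
```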

Question 4.

Microarrays are considered a breakthrough technology in biology and genetics, allowing the simultaneous quantitative study of thousands of genes. DNA microarrays measure the expression of a gene in a cell by measuring the amount of mRNA present for that gene. A typical gene expression dataset collects the expression values from a series of DNA microarray experiments. A large file NCI.dat contains human tumor microarray data. The samples are 64 cancer tumors from different patients. The data form a 6830x64 matrix, with each column containing the expression measurements of the 6830 genes for a given patient. Important research questions arising in a microarray study are which genes are most similar across samples, and whether certain genes show especially high/low expression for certain cancer samples. Although we do in fact know the sample labels indicating the type of cancer for each patient in the sample, it is probably useful to view the problem as an unsupervised learning (clustering) problem and to examine post hoc which labels fall into which clusters.
  1. Apply the K-means clustering algorithm with K running from 1 to 10. Choose an "optimal" number of clusters.
  2. Try hierarchical clustering algorithms: agglomerative and divisive. In both cases compare the results for single linkage, complete linkage and group average.
  3. For all the clustering algorithms you have used, compare their results with the sample labels of the patients given in the file Label.dat. Comment on the success of the various clustering procedures at grouping together samples of the same cancer.
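
The steps above could be organised roughly as in the following sketch. Since the samples are the columns of the data matrix, it is transposed before clustering. The helper cluster_samples and its arguments k_max and k_cut are hypothetical names introduced for illustration; the file formats (whitespace-delimited numeric table without header for NCI.dat, one label per line for Label.dat) are assumptions, and k_cut = 3 is an arbitrary placeholder. Divisive clustering is done here with diana from the cluster package, which is one option among several.

```r
# Hypothetical helper: expr is a genes x samples numeric matrix.
cluster_samples <- function(expr, k_max = 10, k_cut = 3) {
  x <- t(as.matrix(expr))                 # rows = samples, columns = genes

  # K-means: total within-cluster sum of squares for K = 1, ..., k_max
  wss <- sapply(seq_len(k_max), function(k)
    kmeans(x, centers = k, nstart = 10)$tot.withinss)

  # Agglomerative clustering with three linkages, plus divisive (diana)
  d <- dist(x)                            # Euclidean distances between samples
  methods <- c("single", "complete", "average")
  trees <- lapply(methods, function(m) hclust(d, method = m))
  names(trees) <- methods
  trees$diana <- as.hclust(cluster::diana(d))

  # Cut every tree into k_cut clusters: one column of memberships per method
  cuts <- sapply(trees, cutree, k = k_cut)
  list(wss = wss, cuts = cuts)
}

if (file.exists("NCI.dat")) {
  expr <- as.matrix(read.table("NCI.dat"))       # assumed format, see above
  res  <- cluster_samples(expr)
  plot(res$wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")

  if (file.exists("Label.dat")) {
    labels <- trimws(readLines("Label.dat"))     # assumed: one label per line
    print(table(label = labels, cluster = res$cuts[, "complete"]))
  }
}
```

A plot of the within-cluster sum of squares against K is only one heuristic for choosing K; the cross-tabulation against the labels is meant for the post hoc comparison, not for tuning the clustering.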

Computational Note for R users:

Here is a (partial) list of R functions for classification and clustering. See the corresponding help files for details of their use. You will probably first need to install several libraries from CRAN: