Exercise 4
Classification and Clustering
Question 1 (theoretical warm-up).
Show that the misclassification rate, the information index and the Gini index satisfy
the properties of an impurity measure, that is,
the corresponding g(p1,...,pJ):
- has its only maximum at (1/J,...,1/J)
- achieves its minimum only at (1,0,...,0), (0,1,0,...,0), ...,
(0,...,0,1)
- is a symmetric function of p1,...,pJ,
i.e. it is invariant to permutations of the pj's.
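For R users, a small numerical sanity check may be reassuring before (not instead of) the proof. The sketch below assumes the standard definitions: misclassification rate 1 - max_j pj, information (entropy) index -sum_j pj log pj, and Gini index 1 - sum_j pj^2; adjust if your lecture notes normalize them differently.

## standard impurity measures for a vector of class proportions p = (p1,...,pJ)
misclass <- function(p) 1 - max(p)
info     <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # convention 0*log(0) = 0
gini     <- function(p) 1 - sum(p^2)

## quick check for J = 3: the uniform distribution should give the largest value,
## a degenerate distribution the smallest (zero)
p.unif <- rep(1/3, 3)
p.pure <- c(1, 0, 0)
sapply(list(misclass = misclass, info = info, gini = gini),
       function(g) c(uniform = g(p.unif), pure = g(p.pure)))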
Question 2.
The data set iris (available in R) contains the data on 50
flowers from each of 3 species of iris: Setosa, Versicolor and Virginica
(150 flowers in total).
4 measurements have been made on each flower: sepal length and width, and
petal length and width.
The data were collected by Edgar Anderson in 1935. Fisher was the first
statistician to study them, in 1936, and from then on they became a famous test case
for various classification procedures. 70 years have passed... and now
a new generation of young statisticians, equipped with knowledge of modern
statistical techniques partially unknown to Fisher, faces the challenge of these
data.
Apply classification procedures you have studied/know
(discriminant analysis, logistic regression, k-nearest neighbour, neural
networks, classification trees, SVM, etc.). Some brave volunteers may even
try boosting, bagging and/or random forests.
Discuss ways to compare different classifiers and comment on the results.
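For concreteness, here is a minimal R starting point for this question (a sketch only, not the required analysis): linear discriminant analysis and a classification tree on iris, assessed on a random 50/50 train/test split. Replace the split with cross-validation, and the two classifiers with whichever ones you study, as you see fit.

library(MASS)   # lda
library(tree)   # classification trees

set.seed(1)
train <- sample(nrow(iris), 75)           # random half of the 150 flowers

fit.lda  <- lda(Species ~ ., data = iris[train, ])
pred.lda <- predict(fit.lda, iris[-train, ])$class

fit.tree  <- tree(Species ~ ., data = iris[train, ])
pred.tree <- predict(fit.tree, iris[-train, ], type = "class")

## test misclassification rates of the two classifiers
mean(pred.lda  != iris$Species[-train])
mean(pred.tree != iris$Species[-train])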
Question 3.
We have already discussed several issues about wines in one of the previous
homework
exercises. Now we continue talking about this important and fascinating topic.
The file Winea.dat contains
the results of a chemical analysis of wines grown in the same region in Italy but derived
from three different cultivars.
The analysis determined the quantities of 13 constituents
found in each of the three types of wines (the first 13 columns of the data file):
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
The last three columns are indicator variables for the first, second and third
cultivar respectively.
- Compare various classifiers and choose a good classifier (or classifiers) for detecting
the cultivar from the results of the chemical analysis of a wine.
Is there a need to perform the complete chemical analysis? What are the main
characteristics that distinguish between cultivars?
- The file Wineb.dat contains
another set of analogous data. Use it for evaluating your classifier(s) and
comment on the results (a starting sketch in R is given below).
- L'Haim!
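A possible starting sketch in R for this question, assuming Winea.dat and Wineb.dat are whitespace-separated without a header row and with the columns ordered as described above (13 measurements followed by the 3 cultivar indicators):

winea <- read.table("Winea.dat")
wineb <- read.table("Wineb.dat")

## collapse the three indicator columns into a single cultivar factor (levels 1, 2, 3)
winea$cultivar <- factor(max.col(winea[, 14:16]))
wineb$cultivar <- factor(max.col(wineb[, 14:16]))

library(MASS)                                    # lda, as one possible classifier
fit  <- lda(cultivar ~ ., data = winea[, -(14:16)])
pred <- predict(fit, wineb[, -(14:16)])$class
table(predicted = pred, true = wineb$cultivar)   # evaluation on Wineb.dat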
Question 4.
Microarrays are considered a breakthrough technology in biology and genetics,
allowing the simultaneous quantitative study of thousands of genes.
DNA microarrays measure the expression of a gene in a cell by measuring the
amount of mRNA present for that gene.
A typical gene expression dataset collects the expression values from a series
of DNA microarray experiments.
A large file
NCI.dat contains the human tumor microarray data.
The samples are 64 cancer tumors from different patients.
The data form a 6830 x 64 matrix, with each column
representing the expression measurements of the 6830 genes for a given patient.
Important research questions arising in microarray studies are which
genes are most similar across samples and whether certain
genes show especially high/low expression for certain cancer samples.
Although, in fact, we do know the sample labels indicating the type of cancer for
the patients in the sample, it is
probably useful to view the problem as an unsupervised learning (clustering)
problem and examine post hoc which labels fall into which clusters.
- Apply the K-means clustering algorithm with K running from 1 to
10. Choose an ``optimal'' number of clusters.
- Try hierarchical clustering algorithms: agglomerative and divisive. In both
cases use single linkage, complete linkage
and group average, and compare the results.
- For all the clustering algorithms you have used, compare their results with the
sample labels of the patients given in the file Label.dat.
Comment on the success of the various clustering procedures at
grouping together samples of the same cancer (see the R sketch below).
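A rough R sketch of the clustering steps, assuming NCI.dat is whitespace-separated with genes in rows and samples in columns, and Label.dat holds one cancer-type label per sample:

nci    <- as.matrix(read.table("NCI.dat"))   # 6830 genes x 64 samples
labels <- scan("Label.dat", what = "")       # one label per sample
X <- t(nci)                                  # cluster the samples: 64 x 6830

## K-means for K = 1,...,10; plot total within-cluster SS to pick an "optimal" K
set.seed(1)
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "K", ylab = "total within-cluster SS")

## agglomerative clustering with three different linkages
d <- dist(X)
hc.single   <- hclust(d, method = "single")
hc.complete <- hclust(d, method = "complete")
hc.average  <- hclust(d, method = "average")

## compare, for example, a cut into 4 clusters with the true labels
table(cluster = cutree(hc.complete, k = 4), label = labels)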
Computational Note for R users:
Here is a (partial) list of R functions for classification and clustering.
See the corresponding help files for details of their use. You will probably
need to first install several libraries from CRAN:
- Classification
- linear discriminant analysis - lda (library MASS)
- quadratic discriminant analysis - qda (library MASS)
- multinomial logistic regression - multinom (library nnet)
- neural networks - nnet (library nnet)
- k-nearest neighbour - knn (library class)
- CART - tree (library tree; see also prune.tree, cv.tree,
predict.tree) or rpart (library rpart)
- SVM - svm (library e1071)
- bagging - bagging (library adabag)
- boosting (AdaBoost) - boosting (library adabag)
- random forests - randomForest (library randomForest)
- Clustering
- K-means - kmeans
- K-medoids - pam (library cluster)
- agglomerative clustering - hclust, agnes (library cluster)
- divisive clustering - diana (library cluster)
- you may also need the function dist to calculate pairwise distances
between objects in the data.
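For example, the packages above that do not ship with a standard R installation can be fetched in one step (which ones you actually need depends on the methods you choose):

## MASS, nnet, class, rpart and cluster come with R; the rest are on CRAN
install.packages(c("tree", "e1071", "adabag", "randomForest"))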