Exercise 4
Classification and Clustering
Theoretical warm-up
Question 1.
Consider the set of linear classifiers ${\cal C}_L(d)=\{\eta_{\beta}({\bf x})=\mathrm{sign}(\beta^T {\bf x}),\ \beta \in \mathbb{R}^d\}$ acting on ${\bf x} \in \mathbb{R}^d$. Prove that its VC-dimension is $VC({\cal C}_L(d))=d$
by the following steps (of course, you may use a different proof!); a small numerical sanity check in R follows the hints:
- $VC({\cal C}_L(d)) \geq d$ (lower bound).
- Hint:
consider the canonical basis ${\bf e}_1,\ldots,{\bf e}_d$ of $\mathbb{R}^d$ and prove that for any vector ${\bf y} \in \{-1,1\}^d$, there exists $\beta \in \mathbb{R}^d$ such that
$\mathrm{sign}(\beta^T {\bf e}_j)=y_j,\; j=1,\ldots,d$.
- $VC({\cal C}_L(d)) < d+1$ (upper bound).
- Hint: for any vectors ${\bf x}_1,\ldots,{\bf x}_{d+1} \in \mathbb{R}^d$, there exists a nonzero vector ${\bf w} \in \mathbb{R}^{d+1}$ such that
$\sum_{j=1}^{d+1}w_j {\bf x}_j={\bf 0}$ (why?). Without loss of generality, we can assume that $w_{j_0}>0$ for some $j_0$. Define $y_j=\mathrm{sign}(w_j),\;j=1,\ldots,d+1$, and
show that there is no $\beta \in \mathbb{R}^d$ such that $\mathrm{sign}(\beta^T {\bf x}_j)=y_j$ for all $j=1,\ldots,d+1$
(consider $\sum_{j=1}^{d+1}w_j \beta^T {\bf x}_j$).
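As a sanity check (not a proof) of the lower-bound hint, the following R sketch verifies for $d=4$ that taking $\beta={\bf y}$ realizes every sign pattern ${\bf y} \in \{-1,1\}^d$ on the canonical basis:

```r
## Numerical check of the lower-bound hint for d = 4 (illustration only):
## with beta = y we have sign(beta^T e_j) = sign(y_j) = y_j for every j.
d <- 4
E <- diag(d)                                        # columns are e_1, ..., e_d
Y <- as.matrix(expand.grid(rep(list(c(-1, 1)), d))) # all 2^d sign patterns y
shattered <- all(apply(Y, 1, function(y) all(sign(drop(t(E) %*% y)) == y)))
shattered                                           # TRUE: e_1, ..., e_d are shattered
```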
Question 2.
Show that the misclassification rate, the information index and the Gini index satisfy the
properties of an impurity measure (the three indices are recalled, with a small R check, after the list), that is, the corresponding $\gamma(p_1,\ldots,p_L)$:
- has its only maximum at $(1/L,\ldots,1/L)$
- achieves its minimum only at $(1,0,\ldots,0),\,(0,1,0,\ldots,0),\,\ldots,\,(0,\ldots,0,1)$
- is a symmetric function of $p_1,\ldots,p_L$, i.e. it is invariant to permutations of the $p_l$'s.
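For reference, the three indices for class proportions $p_1,\ldots,p_L$ are the misclassification rate $1-\max_l p_l$, the information (entropy) index $-\sum_{l} p_l \log p_l$ (with the convention $0\log 0=0$) and the Gini index $\sum_{l} p_l(1-p_l)$. A minimal R sketch evaluating them at a uniform and at a pure distribution:

```r
## Impurity measures for a vector of class proportions p (sum(p) == 1)
misclass <- function(p) 1 - max(p)
entropy  <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # 0*log(0) := 0
gini     <- function(p) sum(p * (1 - p))

L <- 3
indices <- list(misclass = misclass, entropy = entropy, gini = gini)
sapply(indices, function(g) c(uniform = g(rep(1 / L, L)),  # candidate maximum
                              pure    = g(c(1, 0, 0))))    # candidate minimum
```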
Applications of classification and clustering
Question 3.
The data set
iris (available in R) contains data on 50 flowers from each of 3 species of iris: Setosa, Versicolor and
Virginica (150 flowers in total). Four measurements were made on each flower: sepal length and width, and petal length
and width.
The data were collected by Edgar Anderson in 1935. Fisher was the first statistician to study them, in 1936, and from then
on the data set became a famous test case for various classification procedures. Many years have passed... and now a new generation
of young statisticians, equipped with modern statistical techniques partially unknown to Fisher, faces the
challenge of this data. Apply the classification procedures you have studied/know (discriminant analysis, logistic regression,
k-nearest neighbour, neural networks, classification trees, SVM, ensemble methods, etc.). Discuss ways to compare different classifiers. Comment on the results.
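As a starting point, here is a minimal R sketch using lda (library MASS) and knn (library class), both listed in the Computational Notes below, with a random train/test split; treat it as a template rather than a full analysis:

```r
library(MASS)   # lda
library(class)  # knn

set.seed(1)
idx   <- sample(nrow(iris), 100)   # 100 flowers for training, 50 for testing
train <- iris[idx, ]
test  <- iris[-idx, ]

## Linear discriminant analysis
fit.lda  <- lda(Species ~ ., data = train)
pred.lda <- predict(fit.lda, test)$class
table(predicted = pred.lda, true = test$Species)

## 5-nearest-neighbour classifier on the four measurements
pred.knn <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)
mean(pred.knn != test$Species)     # test misclassification rate
```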
Question 4.
All of us suffer from a tremendous number of spam (junk) emails. Hence, it
has become very important to design an automatic spam detector that can
filter out spam before it reaches users' mailboxes.
Various spam detectors try to identify spam messages by several
characteristics, like the relative frequencies of a series of commonly occurring words
(e.g., business, address, internet), the percentage of certain characters (e.g., the characters ( and !),
and the average and/or longest length of uninterrupted sequences of capital
letters in the email message.
The file Spam.dat
contains information on 4601 email messages. For each of them
the true type (spam (1) or not (0)) is given in the last column, along with the values of 57 predictors of the kinds described above.
Click here for more details on the data.
- Randomly select 1000 email messages from the data and set them aside as a test set for further evaluation.
- Use the remaining messages as a training set and apply the various classification procedures you have learned (discriminant analysis, logistic regression,
k-NN, (deep) neural networks, SVM, CART, boosting, random forests, etc.); a starting sketch follows after this list. Tune the corresponding parameters of the procedures for a better fit. Compare the various approaches and comment on the results.
- Now apply all the procedures, fitted on the training set, to the test set. Comment on the results.
- Draw brief final conclusions.
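A minimal R sketch of the split-and-fit workflow, assuming Spam.dat reads with read.table as 57 predictor columns followed by the 0/1 outcome in the last column (verify the actual file format first); logistic regression and a random forest stand in here for the full list of procedures:

```r
library(randomForest)

## Assumed format (check first!): 57 predictors, 0/1 spam indicator last
spam <- read.table("Spam.dat")
names(spam)[ncol(spam)] <- "type"
spam$type <- factor(spam$type)

set.seed(1)
test.idx <- sample(nrow(spam), 1000)   # hold out 1000 messages as the test set
train <- spam[-test.idx, ]
test  <- spam[test.idx, ]

## Logistic regression
fit.glm  <- glm(type ~ ., data = train, family = binomial)
pred.glm <- ifelse(predict(fit.glm, test, type = "response") > 0.5, "1", "0")
mean(pred.glm != test$type)            # test misclassification rate

## Random forest (tune mtry, ntree, etc.)
fit.rf <- randomForest(type ~ ., data = train)
mean(predict(fit.rf, test) != test$type)
```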
Question 5.
Microarrays are considered a breakthrough technology in biology and genetics, allowing the simultaneous
quantitative study of thousands of genes. DNA microarrays measure the expression of a gene in a cell by measuring the amount
of mRNA present for that gene. A typical gene expression dataset collects the expression values from a series of DNA
microarray experiments. A large file
NCI.dat contains the human tumor
microarray data. The samples are 64 cancer tumors from different patients. The data form a $6830 \times 64$ matrix, with each column
representing expression measurements for the 6830 genes for a given patient. Important research questions arising in
microarray studies are to understand which genes are most similar across samples and whether certain genes show especially
high/low expression for certain cancer samples. Although we do in fact know the sample labels indicating the type of cancer for the
patients in the sample, it is probably useful to view the problem as an
unsupervised learning (clustering) problem and to examine post hoc which labels fall into which clusters.
- Apply the K-means clustering algorithm with different K. Choose an "optimal" number of clusters.
- Try hierarchical clustering algorithms: agglomerative and divisive. For the agglomerative case, compare the results for single linkage, complete linkage and group average; then compare the agglomerative and divisive clusterings (a starting R sketch follows after this list).
- For all the clustering algorithms you have used, compare their results with the sample labels of the patients given in the
file
Label.dat. Comment on the success of the various clustering procedures at grouping together samples
of the same cancer.
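A minimal R sketch, assuming NCI.dat reads with read.table as a 6830 x 64 numeric table (genes in rows, samples in columns) and Label.dat holds one cancer label per sample (verify both formats first):

```r
library(cluster)   # agnes, diana

nci    <- as.matrix(read.table("NCI.dat"))   # assumed 6830 x 64, genes x samples
labels <- scan("Label.dat", what = "")       # assumed one label per sample

X <- t(nci)                                  # cluster the 64 samples, not the genes
set.seed(1)
km <- kmeans(X, centers = 4, nstart = 20)    # try several values of K
table(cluster = km$cluster, label = labels)

## Agglomerative clustering with three linkages, then divisive clustering
d <- dist(X)
for (m in c("single", "complete", "average")) {
  ag <- agnes(d, method = m)
  print(table(cluster = cutree(as.hclust(ag), k = 4), label = labels))
}
dv <- diana(d)
table(cluster = cutree(as.hclust(dv), k = 4), label = labels)
```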
Computational Notes for R users
Here is a (partial) list of R functions for classification and clustering. See the corresponding help files for details
of their use. You will probably need to first install several libraries from
CRAN:
- Classification
  - linear discriminant analysis - lda
  - quadratic discriminant analysis - qda
  - multinomial logistic regression - multinom (library nnet)
  - (deep) neural networks -
    - nnet (library nnet) (allows a single hidden layer)
    - neuralnet (library neuralnet) (allows deep neural networks but no categorical predictors)
  - k-nearest neighbour - knn (library class)
  - CART -
    - tree (see also prune.tree, cv.tree, predict.tree) (library tree)
    - rpart (library rpart)
  - SVM - svm (library e1071)
  - bagging and random forests - randomForest (library randomForest)
  - boosting - gbm (library gbm)
- Clustering
  - K-means - kmeans
  - K-medoids - pam (library cluster)
  - agglomerative clustering - agnes (library cluster)
  - divisive clustering - diana (library cluster)
  - you may also need the function dist to calculate pairwise distances between objects in the data; a short example follows.
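For instance, a typical call chain combining dist with the cluster-library functions (shown on the iris measurements purely for illustration):

```r
library(cluster)
d  <- dist(scale(iris[, 1:4]))       # pairwise Euclidean distances
ag <- agnes(d, method = "complete")  # agglomerative clustering, complete linkage
plot(ag)                             # banner and dendrogram plots
cutree(as.hclust(ag), k = 3)         # cut the tree into 3 clusters
```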