Final Project

Question 1. Improvisations on Real Data.

The file Engine.dat contains high-speed temperature measurements taken in a certain part of an airplane engine at equally spaced time intervals.

Apply the various techniques you have learned in the course (or know from other sources), and create new ones if necessary, to denoise and analyse this data. Discuss the results, comment on them, and share your motivation, hopes, successes and disappointments... Give a brief but comprehensive analysis. At the end, choose the most suitable estimator(s) and comment on the results.
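
As one possible starting point, here is a minimal sketch of a running-median smoother. It runs on a synthetic signal, not on Engine.dat, whose layout is not described here. The names running_median and mse, the window width and the noise level are all illustrative assumptions, not a prescribed method.

```python
import math
import random
from statistics import median


def running_median(y, half_width):
    """Smooth y with a centered running median; the window shrinks at the edges."""
    n = len(y)
    return [median(y[max(0, i - half_width):min(n, i + half_width + 1)])
            for i in range(n)]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


# Synthetic stand-in for a noisy, equally spaced signal (NOT the Engine.dat data).
rng = random.Random(0)
n = 400
truth = [math.sin(2 * math.pi * i / n) for i in range(n)]
noisy = [t + rng.gauss(0, 0.3) for t in truth]

smoothed = running_median(noisy, half_width=10)
print(f"MSE noisy:    {mse(noisy, truth):.4f}")
print(f"MSE smoothed: {mse(smoothed, truth):.4f}")
```

With real data the true signal is unknown, so the window width has to be chosen by other means (for example, cross-validation or visual inspection), and a running median is only one of many candidate estimators.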

Question 2. Multivariate Solo.

This question is somewhat different. First, I tell you from the very beginning that the data you'll get are simulations of some known (for now, only to me!) and quite simple multivariate functions corrupted by random noise. Your task is to reconstruct these functions from the noisy data.

Second, unlike the other questions, where all of you received the same data for the band's improvisation, here you have a chance for your own solo. Everyone will get his/her own data. The number of independent variables (x's) varies from two to three (please don't take it personally!), but the dependent variable (y) is always the last column of the data file.

First try various nonparametric methods to smooth the data; this will hopefully give you some intuition, or at least a hint, about a parametric form of the unknown (to you!) function. Try several nonparametric methods to strengthen your intuition. Once you have a guess at your function, you can test it with the corresponding parametric regression model.
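
To illustrate the nonparametric step, here is a minimal sketch of a Nadaraya-Watson kernel smoother for two independent variables. The test function f, the sample size, the noise level and the bandwidth are hypothetical choices made only for this example; they say nothing about the actual solo data.

```python
import math
import random


def nw_estimate(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate at point x0 with a Gaussian product kernel."""
    num = den = 0.0
    for x, y in zip(xs, ys):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x0, x))
        w = math.exp(-0.5 * sq_dist / bandwidth ** 2)
        num += w * y
        den += w
    return num / den


# Hypothetical test function, chosen only for illustration.
def f(x1, x2):
    return x1 + 2.0 * x2


rng = random.Random(1)
xs = [(rng.random(), rng.random()) for _ in range(500)]
ys = [f(*x) + rng.gauss(0, 0.2) for x in xs]

x0 = (0.5, 0.5)
fit = nw_estimate(x0, xs, ys, bandwidth=0.1)
print(f"estimate at {x0}: {fit:.3f}, true value: {f(*x0):.3f}")
```

Evaluating such an estimate on a grid and plotting it against each x may suggest a parametric form, which can then be fitted and checked by ordinary regression.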

Here are the data themes for your solo improvisations:

If anyone's solo theme is missing, please let me know immediately!

Question 3. Classification Passions.

All of us suffer from a tremendous number of spam (junk) emails. Hence, it has become very important to design an automatic spam detector that can filter out spam before it reaches users' mailboxes. Various spam detectors try to identify spam messages by several characteristics of the email message, such as the relative frequencies of a series of commonly occurring words (e.g., business, address, internet), the percentage of certain characters (e.g., ch(, ch!), and the average and/or longest length of uninterrupted sequences of capital letters.

The file Spam.dat contains information from 4061 email messages. For each of them, the true type (spam (1) or not (0)) is available (see the last column), along with 57 predictors (the relative frequencies and other characteristics described above). Click here for more details on the data.

  1. Randomly select 1000 email messages from the data and set them aside as a test set for further evaluation.
  2. Use the remaining messages as a training set and apply the various classification procedures you have learned (discriminant analysis, LDA, logistic regression, k-NN, neural networks, SVM, CART, boosting, random forests, etc.). Tune the corresponding parameters of the procedures for a better fit. Compare the various approaches and comment on the results.
  3. Now apply all the procedures fitted on the training set to the test set. Comment on the results.
  4. Make brief final conclusions.
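
The workflow in the steps above can be sketched as follows: a random hold-out split, followed by one simple classifier (k-NN) evaluated on the test set. The data here are synthetic two-class points standing in for Spam.dat, and the helper names train_test_split, knn_predict and error_rate are hypothetical, written for this sketch only.

```python
import random


def train_test_split(rows, n_test, seed=0):
    """Shuffle a copy of rows and hold out n_test of them as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(x, train, k):
    """Majority vote among the k training rows nearest to x (Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(x, row[:-1])),
    )[:k]
    votes = sum(row[-1] for row in nearest)
    return 1 if 2 * votes > k else 0


def error_rate(train, test, k):
    """Fraction of test rows whose predicted label differs from the true one."""
    wrong = sum(knn_predict(row[:-1], train, k) != row[-1] for row in test)
    return wrong / len(test)


# Synthetic two-class data standing in for Spam.dat (label in the last column).
rng = random.Random(2)
rows = [[rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0), label]
        for label in (0, 1) for _ in range(150)]

train, test = train_test_split(rows, n_test=100)
print(f"train size: {len(train)}, test size: {len(test)}")
print(f"5-NN test error: {error_rate(train, test, k=5):.3f}")
```

On the real data, k (like any other tuning parameter) should be chosen using the training set only, for instance by cross-validation, so that the test error remains an honest estimate.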


Good Luck!