Question 1. Improvisations on Real Data.
The file Engine.dat contains the results of high-speed
temperature measurements in a certain part of an airplane engine, taken at
equally spaced time intervals.
Apply various techniques you've learned in the course (or know from
any other sources), and create new ones if necessary, to denoise and analyse these data.
Discuss the results, comment on them, share your motivation, hopes, successes
and disappointments... Give a brief but comprehensive analysis.
At the end, choose the most suitable estimator(s) and comment on the results.
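Since the contents of Engine.dat are not reproduced here, the sketch below illustrates just one of the simplest candidate techniques, a centered moving-average smoother, on a synthetic noisy series standing in for the engine measurements. The signal, noise level, and window size are all illustrative assumptions.

```python
import numpy as np

def moving_average(y, window=11):
    """Smooth a 1-D series with a centered moving average.
    Edge points are averaged over the neighbours actually available."""
    kernel = np.ones(window)
    # 'same'-length convolution, normalized by the effective window size at each point
    return (np.convolve(y, kernel, mode="same")
            / np.convolve(np.ones_like(y), kernel, mode="same"))

# Synthetic stand-in for the measurements in Engine.dat:
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
signal = np.sin(2 * np.pi * t)
noisy = signal + rng.normal(scale=0.3, size=t.size)
denoised = moving_average(noisy, window=11)

# The smoother should reduce the mean squared error relative to the raw series:
print(np.mean((noisy - signal) ** 2), np.mean((denoised - signal) ** 2))
```

In practice one would compare this against, e.g., kernel, spline, or wavelet smoothers, tuning the bandwidth (here, the window size) by visual inspection or cross-validation.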
Question 2. Multivariate Solo.
This question is somewhat different.
First, I tell you from the very beginning that
the data you'll get are simulations of some quite simple multivariate
functions (known, for now, only to me!) corrupted by random noise.
Your task is to try to reconstruct these functions back from noisy data.
Second, unlike other questions, where all of you received the same data for
a band improvisation, here you have a chance for your own solo.
Everyone will get his/her
own data. The number of independent variables (x's) varies
from two to three (please, don't take it personally!),
but the vector of the dependent variable (y) is always the
last column in the data file.
Try first various nonparametric methods to smooth the data - it will hopefully
give you some intuition, or at least a hint about a parametric form of the
unknown (to you!) function. Try several nonparametric methods to
strengthen your intuition.
Once you have a guess on your function, you can test it by the
corresponding parametric regression model.
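The suggested workflow (smooth nonparametrically to form a guess, then confirm the guess by parametric regression) might be sketched as follows. The simulated data, the k-NN smoother, and the quadratic guess are all illustrative assumptions; only the file layout (x's first, y in the last column) comes from the assignment.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data in the described format: x's in the first columns, y last.
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.1, size=n)
data = np.column_stack([X, y])

def knn_smooth(X, y, x0, k=15):
    """Nonparametric k-NN regression estimate at the point x0."""
    d = np.linalg.norm(X - x0, axis=1)
    return y[np.argsort(d)[:k]].mean()

# Step 1: smooth along a slice to build intuition, e.g. vary x1 with x2
# fixed at 0 -- here the smoothed curve looks quadratic in x1.
grid = np.linspace(-0.9, 0.9, 25)
slice_fit = [knn_smooth(X, y, np.array([g, 0.0])) for g in grid]

# Step 2: test the guessed parametric form y = b0 + b1*x1^2 + b2*x2
# by ordinary least squares.
design = np.column_stack([np.ones(n), X[:, 0] ** 2, X[:, 1]])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(beta, 2))  # should be close to the true (1, 2, -1)
```

If the fitted coefficients are stable and the residuals look like pure noise, the guessed parametric form is a plausible reconstruction; otherwise, return to step 1 with different slices or smoothers.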
- When one deals with noisy data, it is always possible that two statisticians
finish with different models for the same data set (and not only because they are
Jewish! (-:)). Argue for your model, and even
if it is not exactly the real one, I may accept your point!
- What to do if despite all attempts you can't detect the
underlying function? Well, present the analysis you've made.
Here are the data themes for your solo improvisations:
If your theme for the solo is missing, please let me know immediately!
Question 3. Classification Passions.
All of us suffer from a tremendous amount of spam (junk) email. Hence, it
has become very important to design an automatic spam detector that could
filter out spam before it arrives in users' mailboxes.
Various spam detectors try to identify spam messages by several
characteristics, such as the relative
frequencies of a series of commonly occurring words
(e.g., business, address, internet),
the percentage of certain characters (e.g., '(' and '!'),
the average and/or longest length of uninterrupted sequences of capital
letters, etc. in the email message.
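To make the feature descriptions above concrete, here is a minimal sketch of how such characteristics could be computed from one raw message; the function name, the word list, and the example message are illustrative, not how Spam.dat was actually built.

```python
import re

def spam_features(text, words=("business", "address", "internet")):
    """Compute a few features of the kind described above for one message:
    relative word frequencies (per 100 words), '!' character percentage,
    and the longest run of consecutive capital letters."""
    tokens = re.findall(r"[A-Za-z]+", text)
    n_words = max(len(tokens), 1)
    lower = [t.lower() for t in tokens]
    word_freq = {w: 100.0 * lower.count(w) / n_words for w in words}
    excl_pct = 100.0 * text.count("!") / max(len(text), 1)
    caps_runs = re.findall(r"[A-Z]+", text)
    longest_caps = max((len(r) for r in caps_runs), default=0)
    return word_freq, excl_pct, longest_caps

msg = "FREE internet business!!! Visit our internet address NOW"
word_freq, excl_pct, longest_caps = spam_features(msg)
print(word_freq, excl_pct, longest_caps)
```

A message is then represented by one numeric row of such features, with the spam/non-spam label appended as the last column.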
The file Spam.dat
contains information from 4061 email messages. For each of them
the true outcome (spam (1) or not (0)) is available (see
the last column), along with the values of 57 various predictors
(see above).
Click here for more details on the data.
- Select randomly 1000 email messages from the data and leave them
aside as a test set for further evaluation.
- Use the remaining messages as a training set and apply various
classification procedures you have learned (discriminant analysis,
LDA, logistic regression,
k-NN, neural networks, SVM, CART, boosting, random forests, etc.). Tune
the corresponding parameters of the procedures for a better fit. Compare
the various approaches and comment on the results.
- Now apply all the procedures fitted on the training set to the test set.
Comment on the results.
- Make brief final conclusions.
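The split-train-evaluate steps above can be sketched end to end with one of the listed procedures, k-NN. Since Spam.dat itself is not shown here, the sketch uses synthetic labeled data and a smaller split (200 of 600 instead of 1000 of 4061); those numbers, the feature dimension, and the choice of k are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for Spam.dat: features first, 0/1 label in the last column.
n, p = 600, 5
X = rng.normal(size=(n, p))
labels = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Random train/test split, as in the first step of the question.
idx = rng.permutation(n)
test_idx, train_idx = idx[:200], idx[200:]
Xtr, ytr = X[train_idx], labels[train_idx]
Xte, yte = X[test_idx], labels[test_idx]

def knn_predict(Xtr, ytr, x0, k=15):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d = np.linalg.norm(Xtr - x0, axis=1)
    return int(ytr[np.argsort(d)[:k]].mean() > 0.5)

# Fit on the training set only; evaluate on the held-out test set.
pred = np.array([knn_predict(Xtr, ytr, x, k=15) for x in Xte])
test_error = np.mean(pred != yte)
print(f"test misclassification rate: {test_error:.3f}")
```

The same train/test protocol applies unchanged to the other procedures (LDA, logistic regression, SVM, random forests, ...); tuning parameters such as k should be chosen on the training set, e.g. by cross-validation, never on the test set.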
- the deadline is 18 February (8 Adar), 2013
- the final score is a weighted average of Theoretical Part (~45%),
Practical Part (~45%) and homework exercises (~10%)
- if something is not clear or you have questions, call me (5389) or email
felix@post.tau.ac.il
Good Luck!