School of Mathematical Sciences
Monday, May 31, 2010
Schreiber 006, 12:15
Tel Aviv University
On the (Ab)use of Principal Component Analysis in Genetics
Principal Component Analysis (PCA) is a statistical approach for summarizing, presenting and denoising high-dimensional data. It has statistical and geometrical interpretations, but mathematically it boils down to an eigen-decomposition problem: the top principal components are the leading eigenvector-eigenvalue pairs of the covariance matrix of the data.
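As a minimal illustration of the eigen-decomposition view described above, the following Python sketch computes the top principal components of a small synthetic data set. The function name pca_eig, the random data and the choice of two components are assumptions made for this example only; they are not taken from the talk.

```python
import numpy as np

def pca_eig(X, k):
    """Top-k principal components via eigen-decomposition of the sample covariance.

    X is an (n_samples, n_variables) array; pca_eig is a hypothetical helper.
    """
    Xc = X - X.mean(axis=0)                  # center each variable (column)
    cov = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance matrix
    evals, evecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]      # indices of the k largest eigenvalues
    components = evecs[:, order]             # leading eigenvectors (one per column)
    scores = Xc @ components                 # data projected onto those eigenvectors
    return evals[order], components, scores

# Synthetic correlated data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
evals, components, scores = pca_eig(X, 2)
```

In this sketch each leading eigenvalue equals the sample variance of the data along the corresponding eigenvector, which is the sense in which the top components "summarize" the data.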
PCA has been widely used in Genetics for decades and is a central tool in many practices in this area, yet its biological implications are often over-interpreted or misinterpreted.
In this talk I will first review some of the major uses of PCA in Genetics, with examples of results of high scientific impact (and popular interest) derived from PCA.
I will then concentrate on the seminal "Science" paper by Menozzi, Piazza and Cavalli-Sforza in 1978, and the book by the same authors in 1994, which established the use of PCA of genetic data for making inferences about human history and migration. Specifically, the 1978 paper concluded that the Neolithic expansion (circa 6000 BC) had a major effect on the European genetic landscape. In 2008, a Nature Genetics paper by Novembre and Stephens claimed that the results in these original works "resemble mathematical artifacts" which are expected even if no long-range migration was involved in shaping the genetic landscape. Their arguments are based on properties of Toeplitz matrices and their eigen-decompositions. I will re-examine the properties of the original data and the relevant mathematical results, and demonstrate that the arguments of Novembre and Stephens do not apply in this case. A critical re-analysis of the original data will lead us to conclude that the original results from 1978 are statistically valid, though their historical interpretation is difficult to verify.
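The Toeplitz property mentioned above can be illustrated with a small Python sketch. It builds a symmetric Toeplitz matrix whose entries decay with the distance between indices and counts the sign changes in its leading eigenvectors. The size n, the decay rate rho and the helper sign_changes are hypothetical choices for this example; the sketch is not the analysis of Novembre and Stephens, nor the re-analysis presented in the talk.

```python
import numpy as np

n = 50       # number of equally spaced locations (hypothetical)
rho = 0.9    # correlation between neighbouring locations (hypothetical)

idx = np.arange(n)
# Symmetric Toeplitz matrix: entry (i, j) depends only on |i - j|.
K = rho ** np.abs(idx[:, None] - idx[None, :])

evals, evecs = np.linalg.eigh(K)     # eigenvalues in ascending order
top = evecs[:, ::-1][:, :4]          # four leading eigenvectors, largest eigenvalue first

def sign_changes(v):
    """Count how often consecutive entries of v differ in sign."""
    s = np.sign(v)
    return int(np.sum(s[1:] != s[:-1]))

changes = [sign_changes(top[:, j]) for j in range(4)]
print(changes)
```

For this matrix the number of sign changes grows with the index of the component, so the leading eigenvectors look like waves of increasing frequency even though the matrix encodes nothing beyond distance-dependent similarity.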
Coffee will be served at 12:00 before the lecture
at Schreiber building 006