Statistical methods to interpret genotypic data

Woolaston, Alex; Tier, Bruce; Murison, Robert

Title

Publication Date

2007

Author(s)

Woolaston, Alex

Tier, Bruce

Murison, Robert

Type of document

Thesis Doctoral

Language

en

Entity Type

Publication

UNE publication id

une:6697

Abstract

Recent developments in genetic techniques have provided high throughput tools such as single nucleotide polymorphism (SNP) chips and cDNA microarrays to assist in genetic selection. Such high throughput devices necessitate new statistical approaches so that the massive amounts of data gathered can be exploited in an effective manner. This thesis describes some statistical methods that can be applied to SNP data and microarray data. Firstly, the use of SNP data to predict molecular breeding value (MBV) is studied. Principal component analysis (PCA) is used to summarize the variation of the high dimensional SNP space within a smaller dimensional projection space of principal components (PCs). It is demonstrated, with both simulated and real data, how the PCs can be used in principal component regression (PCR) to predict the MBV of dairy cattle from their SNP values alone. Highly reliable estimated breeding values (EBVs) are available for the real animals. A cross-validation method is used to predict MBVs for dairy sires, and a correlation of 0.69 is obtained between the EBVs and the estimated MBVs for these real data. The impact of erroneous SNP values, missing SNP values and the number of genotyped animals with known EBVs is also examined. Through simulation, it is found that a proportion of erroneous SNP values greater than 2% reduces the accuracy of prediction, whereas the number of missing SNP values has little impact on the accuracy of prediction. As expected, an increase in the number of animals with already known EBVs increases the accuracy of prediction. Kernel regression is used to predict MBV from the intrinsically discrete SNP data. Binomial kernels, which treat the SNP values as discrete variables, and a Gaussian kernel, which imposes a continuous structure on the marker data, are employed and compared. It is empirically demonstrated that the Gaussian kernel outperforms the binomial kernel when used in Nadaraya-Watson kernel regression.
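The kernel regression comparison described above can be illustrated with a minimal sketch. It is not the thesis implementation: the exact form of the binomial kernel is not given in this record, so the Hamming-distance form used here, the bandwidth `h`, the smoothing parameter `lam`, the 0/1/2 genotype coding and the toy data are all assumptions made for illustration.

```python
import math

def gaussian_kernel(x, z, h):
    """Gaussian kernel: treats the SNP codes as points in continuous space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * h * h))

def binomial_kernel(x, z, lam):
    """Assumed binomial-type kernel: depends only on how many loci differ."""
    d = sum(1 for a, b in zip(x, z) if a != b)  # number of mismatching loci
    p = len(x)
    return (lam ** d) * ((1.0 - lam) ** (p - d))

def nadaraya_watson(x_new, X, y, kernel):
    """Nadaraya-Watson estimate: kernel-weighted average of known values."""
    weights = [kernel(x_new, xi) for xi in X]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed; fall back to the plain mean.
        return sum(y) / len(y)
    return sum(w * yi for w, yi in zip(weights, y)) / total

# Hypothetical toy data: 4 animals, 4 SNPs coded 0/1/2, with known EBVs.
X = [[0, 1, 2, 1], [2, 1, 0, 1], [0, 0, 2, 2], [2, 2, 0, 0]]
y = [1.0, -1.0, 1.5, -1.5]
x_new = [0, 1, 2, 2]

pred_gauss = nadaraya_watson(x_new, X, y, lambda a, b: gaussian_kernel(a, b, h=1.0))
pred_binom = nadaraya_watson(x_new, X, y, lambda a, b: binomial_kernel(a, b, lam=0.2))
print(pred_gauss, pred_binom)
```

In both cases the predicted MBV is a weighted average of the known EBVs, so it always lies within their range; the kernels differ only in how similarity between genotypes is measured.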
Secondly, statistical methods to account for the nuisance spatial trends found in microarray slides are assessed. Wavelets are proposed as a method of modeling spatial effects in two-colour cDNA microarrays where the spatial error component may be represented as a fractal surface. This method is compared with smoothing splines plus first order autoregressive detrending using data collected from mice in a time-course experiment. Two schemes for selecting control genes are also assessed for these data: (i) pre-determined control genes and (ii) genes that are neither over- nor under-expressed throughout the experiment. It is shown that the spatial adjustment and the set of control genes can influence the interpretation of test genes. Results from this microarray study are also used to generate simulated data to assess the models for removing spatial trends. The wavelet thresholding approach is the most successful when the nuisance spatial trends in the images are rough and fractal, but there is little difference between the models for images with smoother spatial bias.
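The general idea of wavelet thresholding can be sketched in one dimension. This is a simplified illustration only: the thesis works with two-dimensional slide images and does not specify its wavelet family or threshold rule in this record, so the Haar wavelet, the soft-threshold rule, the threshold value `t` and the sample row of log-ratios below are all assumptions.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(signal):
    """Full Haar decomposition of a sequence whose length is a power of two."""
    approx = list(signal)
    details = []  # detail coefficients, finest level first
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding from the coarsest level upward."""
    approx = list(approx)
    for level in reversed(details):
        rebuilt = []
        for s, d in zip(approx, level):
            rebuilt.append((s + d) / SQRT2)
            rebuilt.append((s - d) / SQRT2)
        approx = rebuilt
    return approx

def soft_threshold(value, t):
    """Shrink a coefficient toward zero by t; small coefficients become zero."""
    return math.copysign(max(abs(value) - t, 0.0), value)

def estimate_trend(signal, t):
    """Estimate a spatial trend by shrinking the Haar detail coefficients."""
    approx, details = haar_forward(signal)
    shrunk = [[soft_threshold(d, t) for d in level] for level in details]
    return haar_inverse(approx, shrunk)

# Hypothetical row of log-ratios across one slide row (length 8).
row = [0.9, 1.1, 1.0, 1.2, 0.1, -0.1, 0.0, 0.2]
trend = estimate_trend(row, t=0.2)
adjusted = [v - s for v, s in zip(row, trend)]  # spatially adjusted values
print(trend)
print(adjusted)
```

With `t=0` the reconstruction returns the input unchanged, and with a very large `t` the estimated trend collapses to the row mean; intermediate thresholds keep large, abrupt spatial changes while discarding small fluctuations.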
