Motivation: DNA methylation is a molecular modification of DNA that plays crucial roles in the regulation of gene expression. In particular, CpG-rich regions are frequently hypermethylated in cancer tissues but unmethylated in normal tissues. However, relatively little methodological literature exists on case-control association studies for high-dimensional DNA methylation data, compared with that for microarray gene expression. One key feature of DNA methylation data is the grouped structure among CpG sites within a gene, which may be highly correlated. In this article, we propose a penalized logistic regression model for correlated DNA methylation CpG sites within genes from high-dimensional array data. Our regularization procedure combines the l1 penalty with a squared l2 penalty on degree-scaled differences of the coefficients of CpG sites within one gene, so it induces both sparsity and smoothness in the correlated regression coefficients. We combine the penalized procedure with stability selection, which provides a selection probability for each regression coefficient and thereby supports a stable and confident selection of the methylation CpG sites that may be truly associated with the outcome.
Results: Using simulation studies, we demonstrate that the proposed procedure outperforms existing mainstream regularization methods such as the lasso and the elastic net when data are correlated within a group. We also apply our method to identify important CpG sites and corresponding genes for ovarian cancer from over 20 000 CpGs generated from the Illumina Infinium HumanMethylation27K BeadChip. Some of the identified genes are potentially associated with cancers.
Contact: sw2206@columbia.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
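To make the penalty described under Motivation concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that CpG sites within a gene are linked in a simple chain (the abstract does not specify the within-gene graph), that a site's degree is its number of links, and that each coefficient is divided by the square root of its degree before differencing. The names chain_edges, grouped_penalty, lam1 and lam2 are hypothetical.

```python
import math


def chain_edges(gene_ids):
    """Link consecutive CpG sites that share a gene label.

    Assumption for illustration only: the within-gene graph is a chain.
    """
    return [
        (i, i + 1)
        for i in range(len(gene_ids) - 1)
        if gene_ids[i] == gene_ids[i + 1]
    ]


def grouped_penalty(beta, edges, lam1, lam2):
    """Return lam1 * l1 term + lam2 * squared l2 term on degree-scaled differences."""
    # Degree of each CpG site = number of within-gene links it takes part in.
    degree = [0] * len(beta)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Scale each coefficient by 1/sqrt(degree); isolated sites are left unscaled.
    scaled = [b / math.sqrt(d) if d > 0 else b for b, d in zip(beta, degree)]

    l1_term = sum(abs(b) for b in beta)  # encourages sparsity
    smooth_term = sum((scaled[u] - scaled[v]) ** 2 for u, v in edges)  # encourages smoothness
    return lam1 * l1_term + lam2 * smooth_term


# Toy example: three CpG sites in gene "A", one in gene "B".
gene_ids = ["A", "A", "A", "B"]
beta = [1.0, 1.0, 1.0, 0.0]
edges = chain_edges(gene_ids)
penalty = grouped_penalty(beta, edges, lam1=1.0, lam2=1.0)
```

In a full analysis this penalty would be added to the negative logistic log-likelihood, and the fit would be repeated on random subsamples to estimate selection probabilities; both steps are omitted here.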
Bacterial heat-labile (LT) enterotoxins signal through tightly regulated interactions with host cell gangliosides. LT-IIa and LT-IIb of Escherichia coli bind preferentially to gangliosides with a NeuAcα2-3Galβ1-3GalNAc terminus, with key distinctions in specificity. LT-IIc, a newly discove ...
Phosphate deficiency is characteristic for many natural habitats, resulting in different physiological responses in plants and bacteria including the replacement of phospholipids by glycolipids and other phosphorous-free lipids. The plant pathogenic bacterium Agrobacterium tumefaciens, whi ...
A tryptophan side chain was introduced into subsite +1 of family GH-18 (class V) chitinases from Nicotiana tabacum and Arabidopsis thaliana (NtChiV and AtChiC, respectively) by the mutation of a glycine residue to tryptophan (G74W-NtChiV and G75W-AtChiC). The specific activity toward glyco ...