Sequencing technologies, especially RNA-Seq, are becoming increasingly attractive to researchers as platforms for studying cancer. One of the most important goals in analysing RNA-Seq experiments is to identify genes that are differentially expressed (DE) across specified conditions in an experimental design. In addition, researchers may be interested in gene set tests or in particular pathways of cell development.
RNA-Seq data takes the form of integer counts, and several statistical issues need to be addressed. First, RNA-Seq count data has to be fitted with an appropriate statistical model that accounts for all sources of variation in a complex experimental design. Secondly, the variability of the counts, as well as the mean-variance relationship, needs to be estimated. The cost of RNA-Seq experiments often limits studies to a small number of replicate libraries, which makes it difficult to obtain reliable variance estimates for each gene. The problem is further complicated by the fact that different genes may have different variabilities. Finally, a powerful but statistically rigorous testing method is required for detecting DE genes and for gene set testing.
Here we propose a complete statistical pipeline for gene-level analyses of RNA-Seq data. For RNA-Seq experiments with complex designs, linear models are fitted to the data after transformation. An empirical Bayes strategy is proposed for variance estimation. It stabilizes the variance estimates by borrowing information across genes while still allowing each gene to have its own variability. It can be shown that the empirical Bayes strategy performs better than using the variance estimated either from the pooled data or from each individual gene alone. The stabilized variance estimate is then used in a moderated test for DE. Similar strategies can be used for testing changes in pathways or expression signatures.
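The variance stabilization step can be illustrated with a minimal sketch. It assumes a simple two-group comparison on already-transformed expression values, and it uses one common form of moderated variance: a weighted average of the gene's own variance and a prior variance, weighted by their degrees of freedom. The function names and the prior values `prior_s2` and `prior_df` are hypothetical; in a real analysis the prior would be estimated from all genes rather than supplied by hand, and the abstract does not specify the exact estimator used.

```python
import math
from statistics import mean


def residual_variance(group_a, group_b):
    """Residual variance and degrees of freedom for a two-group model."""
    mean_a, mean_b = mean(group_a), mean(group_b)
    ss = sum((x - mean_a) ** 2 for x in group_a)
    ss += sum((x - mean_b) ** 2 for x in group_b)
    df = len(group_a) + len(group_b) - 2
    return ss / df, df


def moderate_variance(s2, df, prior_s2, prior_df):
    """Shrink a gene-wise variance towards a prior value.

    The result lies between s2 and prior_s2; prior_df controls how
    strongly the gene-wise estimate is pulled towards the prior.
    """
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)


def moderated_t(group_a, group_b, prior_s2, prior_df):
    """Two-group t-statistic using the moderated variance.

    Returns the statistic and its (augmented) degrees of freedom.
    prior_s2 and prior_df are assumed given here (hypothetical inputs).
    """
    s2, df = residual_variance(group_a, group_b)
    s2_tilde = moderate_variance(s2, df, prior_s2, prior_df)
    se = math.sqrt(s2_tilde * (1 / len(group_a) + 1 / len(group_b)))
    t = (mean(group_b) - mean(group_a)) / se
    return t, df + prior_df


# Toy values for a single gene (illustrative only, not real data).
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t_stat, total_df = moderated_t(control, treated, prior_s2=3.0, prior_df=4)
```

Setting `prior_df` to zero recovers the ordinary gene-wise t-statistic, while a very large `prior_df` approaches the use of a single common variance for all genes. The empirical Bayes estimate sits between these two extremes.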