RNA-Seq gene expression estimation with read mapping uncertainty

B Li, V Ruotti, RM Stewart, JA Thomson, CN Dewey
Bioinformatics, 2010 - academic.oup.com
Abstract
Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically.
Results: We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. Our improved accuracy is the result of handling read mapping uncertainty with a statistical model and the estimation of gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method is capable of modeling non-uniform read distributions. Simulations with our method indicate that a read length of 20–25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed.
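To make the idea of handling read mapping uncertainty with a statistical model concrete, the following is a minimal sketch, not the authors' implementation. It assumes a deliberately simplified model in which each read is only known to be compatible with a set of transcripts, and it fits the transcript fractions by expectation-maximization (EM), a standard choice for this kind of mixture model that the abstract itself does not name. Transcript length, sequencing error, and non-uniform read distributions are all ignored here. The function names `em_abundances` and `gene_level`, the toy read data, and the transcript-to-gene mapping are hypothetical.

```python
from collections import defaultdict


def em_abundances(read_alignments, n_transcripts, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    read_alignments: one list of candidate transcript indices per read.
    Simplifying assumption: every candidate alignment is equally good,
    so only the current abundance estimate decides the allocation.
    """
    reads = [r for r in read_alignments if r]  # skip reads with no alignment
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: split each read fractionally among its candidates,
        # in proportion to the current abundance estimates.
        for candidates in reads:
            total = sum(theta[t] for t in candidates)
            for t in candidates:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the fractional counts.
        theta = [c / len(reads) for c in counts]
    return theta


def gene_level(theta, transcript_to_gene):
    """Gene-level estimate as the sum of its isoform-level estimates."""
    genes = defaultdict(float)
    for t, frac in enumerate(theta):
        genes[transcript_to_gene[t]] += frac
    return dict(genes)


# Toy data (hypothetical): transcripts 0 and 1 are isoforms of gene "A",
# transcript 2 belongs to gene "B".
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4 + [[2]] * 4
theta = em_abundances(reads, n_transcripts=3)
genes = gene_level(theta, {0: "A", 1: "A", 2: "B"})
```

In this toy case the four ambiguous reads are neither discarded nor split evenly: they are shared between the two isoforms according to the evidence from uniquely mapping reads. The gene-level value for "A" is unaffected by how the ambiguity between its own isoforms is resolved, which illustrates why summing isoform estimates is a natural way to obtain gene-level estimates.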
Availability: An initial C++ implementation of our method that was used for the results presented in this article is available at http://deweylab.biostat.wisc.edu/rsem.
Contact: cdewey@biostat.wisc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Oxford University Press