An international research group co-led by the Institute for Integrative Systems Biology (I2SysBio), located at the University of Valencia Science Park, has published in Nature Methods the largest comparative study to date on methods for analyzing long-read sequencing data from the human transcriptome. The study analyzed the different technologies and computational tools available for long-read sequencing of RNA, the molecules that genes rely on to perform their function. It found a greater diversity of RNA than expected, which could have major implications for the study of disease, aging and the very complexity of life on Earth.
For years, an international consortium known as the Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) evaluated methods and technologies in long-read RNA sequencing experiments. Now, this global consortium, in which CSIC plays a key role, has published the results of this effort, offering guidance for the future of RNA sequencing experimentation and analysis. The work, published in the journal Nature Methods, evaluates the strengths and weaknesses of the two main long-read RNA sequencing platforms, Oxford Nanopore Technologies and Pacific Biosciences, as well as the computational methods used to analyze the data.
RNA is the molecular compound in cells that transmits information from DNA to proteins through the processes of transcription and translation, which are universal to all living things. Long-read RNA sequencing makes it possible to look at whole RNA molecules and identify small changes in the way genes give rise to proteins. These small changes are critical to the constitution of complex organisms such as humans, and failures in their synthesis are associated with various diseases. Researchers use long-read RNA sequencing to identify these changes and associate them with various biological processes.
“Although the human genome has been sequenced from end to end, we still face great challenges in defining exactly how genes give rise to the enormous diversity of RNA and protein molecules that make up a living being. This knowledge is very important, because small changes in the DNA-to-RNA step can lead to pathologies,” explains Ana Conesa, CSIC research professor at I2SysBio and one of the researchers who have led this consortium. Her team evaluated the RNA predictions proposed by 14 bioinformatics laboratories around the world, using the SQANTI3 software developed by this group at I2SysBio, one of the reference bioinformatics tools in the field.
Higher-than-expected RNA diversity
More than 427 million long-read sequences were generated and analyzed in the study. The data came from humans, mice and manatees. The use of manatee data allowed the methods to be tested in a species without a reference genome. “It was important to test the techniques in a non-model species, as it is increasingly common to see studies with long-read RNA sequencing in these not-so-well-studied organisms. This lack of prior information must be taken into account during the analyses because it can directly affect our results,” says Francisco J. Pardo Palacios, predoctoral researcher at I2SysBio and first author of this work.
After extensive data collection and analysis, the consortium produced a set of recommendations for RNA sequencing. In general, long-read sequencing approaches perform much better than short-read sequencing, with the quality of the reads, rather than their abundance, being the key factor for accuracy. In addition, the consortium found a surprising number of undocumented transcripts in the human and mouse genomes. “We have seen that there is a much greater diversity of RNAs than we thought. We are seeing that each individual, even each cell, has its own personal transcriptome. The next step is to find out what relevance this has in disease, aging and species diversity,” summarizes Ana Conesa.
The paper concludes that there is no single best approach to long-read RNA sequencing. Instead, it describes best practices for the different objectives that individual studies may have. Existing technologies differ in error rates, sequencing throughput, and read length, so researchers must prioritize whichever matters most for their area of study. “I think this will help a lot of people who want to further develop the technology; there is still room for improvement in many of these methods,” concludes Angela Brooks, a researcher at the University of California, Santa Cruz (USA) and co-author of the study.