A research group led by I2SysBio develops a new computational tool to investigate genome complexity

17/04/2024

The Institute for Integrative Systems Biology (I2SysBio), located at the University of Valencia Science Park, has published in Nature Methods its own software for analyzing data obtained by long-read sequencing of the genome. The system makes it possible to discover new RNA molecules and assign them a function in the formation of tissues, which will deepen our knowledge of how the organism forms and of its diseases.

The complexity of an organism emerges from its genome, the book that contains its DNA instructions for life. Sequencing, the method for reading this book, has evolved towards reading ever longer fragments of the genome. In this field, a research group led by the Institute for Integrative Systems Biology (I2SysBio), located at the University of Valencia Science Park (PCUV), has published in Nature Methods an improved version of its own computer program. The program can discover new transcripts (the RNA molecules that genes use to synthesize proteins and create tissues) from data produced by long-read sequencing instruments, and assign them a function in the formation of the organism.

Long-read sequencing is the third generation of genome sequencing methods. Compared to short-read methods, which read fragments of about 200 nucleotides (the 'letters' that make up genes), long-read methods can obtain reads 100 times longer, about 20,000 nucleotides, leaving fewer 'gaps' in the genome information to be filled by bioinformatics tools. This was one of the reasons Nature Methods itself named long-read sequencing 'Method of the Year 2022'.

A few years earlier, in 2018, researcher Ana Conesa, then at the University of Florida, developed software called SQANTI to analyze the information extracted by these long-read methods. Now, her research team at I2SysBio has published in Nature Methods a substantial improvement of this software, which can be used freely with data from the leading commercial long-read sequencing systems, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT).

"Long-read techniques better analyze the complexity of human transcripts and the transcriptome," says Conesa. The transcriptome is the portion of the genome that is read in each cell to give rise to tissues and organs. A single gene can give rise, through small changes in the structure of the RNA it encodes, to a great diversity of transcripts, and with them to proteins with different cellular functions. "Short-read sequencing cannot solve this puzzle. Long-read sequencing better reconstructs the functional complexity of the human transcriptome, which is key to studying certain diseases, especially neurological diseases and cancer," says the CSIC researcher.


To better understand the complexity of the organism and diseases

The version now released, SQANTI3, addresses earlier problems arising from RNA degradation or from the analysis of individual molecules, and introduces notable improvements. The program can now discover new transcripts that were not in the genome databases these computer programs rely on. In addition, using artificial intelligence techniques, the software can assign functional information to each new transcript, "something essential for understanding the functional complexity of the organism and of diseases," Conesa remarked.

I2SysBio's Garnatxa computing cluster, which has 15 computing nodes capable of providing 950 parallel computing threads, was used to develop this software. In addition, the Gene Expression Genomics group led by Ana Conesa at I2SysBio participates in ELIXIR, one of the strategic infrastructures of the European Strategy Forum on Research Infrastructures (ESFRI), which allows life science laboratories across Europe to share and store their data.

The University of Florida and Pacific Biosciences collaborated in the development of SQANTI3. Pacific Biosciences, one of the companies that commercializes long-read sequencing technology through its PacBio system, recommends the Spanish software for analyzing its data. The software is free to use and already has "thousands of users all over the world," according to Conesa, although "the success of this tool also requires more technical personnel to attend to the numerous requests we receive." In this context, the researcher has co-led the recent launch of the CSIC Computational Biology and Bioinformatics Connection, a platform to connect people, methods and resources in these fields at the CSIC.

In the media

https://www.csic.es/es/actualidad-del-csic/el-csic-desarrolla-una-nueva-herramienta-informatica-para-investigar-la-complejidad-del-genoma