Team from I2SysBio finds errors in coronavirus gene sequences included in the world’s largest database

04/04/2025

A study led by the Institute of Integrative Systems Biology (I2SysBio) has discovered 'artifacts' among the sequences that show deletion repair in the virus that causes COVID-19, a change that affects infection and vaccine response. Many of the sequences with repair mutations in the virus spike protein, the key to infecting human cells, were due to errors in data processing.

A multidisciplinary team led by the Institute of Integrative Systems Biology (I2SysBio), located in the scientific-academic area of the Science Park of the University of Valencia and a joint center of the Spanish National Research Council (CSIC) and the University of Valencia (UV), has just published a study that offers a new perspective on the ability of the SARS-CoV-2 virus to mutate and infect humans. Through a review of the most widely used virus genetic database during the pandemic, the research team found 'false positives' among apparent deletion repair events, a process that restores sections of the viral genome that affect the ability of the virus to replicate or evade the host’s immune system. The work, published in the journal Virus Evolution, also involves researchers from the CSIC’s Valencia Institute of Biomedicine (IBV) and the La Fe Health Research Institute (IIS-La Fe).

The work, led by the I2SysBio Pathogenomics group in collaboration with the Viral Biology group of the same institute, offers an innovative perspective on certain rare genetic changes in the SARS-CoV-2 spike protein, the 'key' that the coronavirus uses to infect our cells. The research focused on so-called deletion repair events in this protein, in which the virus appears to correct its genome.

After large-scale data mining of GISAID, the most widely used SARS-CoV-2 genome database during the pandemic, the team found that several of the initial findings were probably due to errors introduced by data processing in large genetic databases. The computational methods used to analyse millions of viral sequences can be misleading, giving the impression that the virus repairs its deletions more often than it actually does. By comparing these already processed data with information obtained directly from genome sequencing (sequencing reads), the team has been able to obtain a more realistic view of the genetic changes that the virus undergoes.

Fewer than 60% of repair events confirmed

"Using the GISAID gene sequence repository, we estimated a very high frequency of these deletion repair events, which are expected to be rare", explains Mireia Coscollá Devís, the I2SysBio researcher leading the study. "We realized that the sequences in the GISAID database are processed differently by each laboratory and contained many false positives for this type of marker. Thus, although in some cases we were able to confirm that it was a real phenomenon, most of them were the result of sequence processing", she reveals.

Thus, "we saw that fewer than 60 percent of deletion repair events could be confirmed. Although we have not been able to quantify it exactly for all of them, we can compare the proportions of the marker in various databases, and we see that it is 5 to 51 times less frequent than it appeared in the processed databases", the I2SysBio researcher calculates.

Although these repair events are rare, the study shows that when they occur, they can subtly affect the behavior of the virus. "For example, certain repairs can modify the way in which the virus enters cells or influence the response to antibodies generated by vaccination," says Coscollá, something that the research team demonstrated through in vitro experiments.


Exchange of pathogen genomic data

"Our research highlights the importance of carefully examining genetic data to avoid wrong conclusions," says the I2SysBio researcher. The World Health Organization (WHO) recommends a policy of sharing genomic data on pathogens to protect public health. However, in Spain there is no central collection of human, animal and environmental pathogen sequence data, nor is there a policy for the exchange of anonymized data between health and scientific institutions. This makes it difficult to monitor and respond to infectious diseases, including the monitoring of antimicrobial resistance, the researchers say.

The work has been funded by the Ministry of Science, Innovation and Universities and by the European Union with Next Generation EU/PRTR funds through the CSIC’s PTI+ Global Health. In addition, it is supported by the Generalitat Valenciana and the European Social Fund through grant CIACIF/2022/333. The computational work was carried out on Garnatxa, the high performance computing (HPC) cluster of the Institute of Integrative Systems Biology.

 


 

 

Source: CSIC Delegation of Comunitat Valenciana

Miguel Álvarez-Herrera, Paula Ruiz-Rodriguez, Beatriz Navarro-Domínguez, Joao Zulaica, Brayan Grau, María Alma Bracho, Manuel Guerreiro, Cristóbal Aguilar‐Gallardo, Fernando González-Candelas, Iñaki Comas, Ron Geller, Mireia Coscollá, Genome data artifacts and functional studies of deletion repair in the BA.1 SARS-CoV-2 spike protein, Virus Evolution, 2025; https://doi.org/10.1093/ve/veaf015

 
