Special Software Enhances Accuracy of Genome Annotation for Animals and Plants

"BRAKER3 marks a significant development in bioinformatics and provides academics all over the world access to a high-performance tool for genome annotation"

21-Jun-2024

The newly developed software BRAKER3 gives scientists all over the world access to a high-performance tool for genome annotation, i.e. for identifying and labelling the relevant features of a genomic sequence. The software represents a considerable advance in bioinformatics research. It was developed by researchers at the University of Greifswald in collaboration with colleagues at the Georgia Institute of Technology in Atlanta (USA). The international bioinformatics team, led in Greifswald by Prof. Dr. Mario Stanke, has now presented the software in the journal Genome Research. BRAKER3 exploits the fact that the same genes can be found in a similar form in different species, even when their common evolutionary ancestor lies far in the past, as is the case for a butterfly and a fruit fly. The development of the software was financed by the US National Institutes of Health.

The precise determination of the structure of protein-coding genes in genome sequences is key to the biological understanding of life. The success of numerous experiments depends to a great extent on error-free genome annotation. The cataloguing of protein-coding genes in eukaryotic genomes is therefore one of the greatest challenges faced by the Earth BioGenome Project, which aims to sequence the genomes of at least 1.5 million eukaryotic species. Eukaryotes are organisms whose cells have a nucleus; they include animals (among them humans), plants, and fungi. Individual genome projects can serve purposes such as the targeted treatment of diseases transmitted by animals, the study of gene functions in insects, or the breeding of plants.

A central problem faced by many genome annotation tools stems from supervised learning: the underlying mathematical models require training examples, consisting of genes from the target species, in order to adjust their parameters to that species. Here the BRAKER3 team builds on the experience gained from previous versions of the software and now also uses combined evidence from transcriptomic and protein data in this training step. In contrast to the previous versions of the tool, both types of evidence can now be considered simultaneously.
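The idea of adjusting model parameters to a target species can be illustrated with a deliberately simple sketch. It is not BRAKER3's actual model or code: the function name `estimate_codon_frequencies`, the choice of codon frequencies as the "parameters", and the two example sequences are all assumptions made purely for illustration.

```python
from collections import Counter


def estimate_codon_frequencies(training_genes):
    """Estimate relative codon frequencies from example coding sequences.

    Hypothetical toy stand-in for "training": the returned frequencies play
    the role of species-specific parameters learned from known genes.
    """
    counts = Counter()
    for seq in training_genes:
        seq = seq.upper()
        # Walk the sequence in non-overlapping triplets (codons),
        # ignoring any incomplete triplet at the end.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# Made-up training examples standing in for genes of the target species.
training_genes = ["ATGGCTGCTTAA", "ATGGCTAAATAA"]
params = estimate_codon_frequencies(training_genes)
```

Without such training genes from the target species there is nothing to count, which is the difficulty the paragraph above describes.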

In benchmark tests with 11 species, BRAKER3 clearly outperformed the previous versions. The improvement is particularly pronounced in species with large and complex genomes, such as the mouse and the chicken. Furthermore, the new version of the software is much more precise than alternative programmes that have been used extensively in the past.

“BRAKER3 represents a considerable advance in the accuracy and automation of eukaryotic genome annotation, especially for large and structurally complex genomes,” explains Lars Gabriel from the University of Greifswald’s Institute of Mathematics, lead author of the publication.

“The new software version is a tool that is already being used by a large and rapidly growing number of users. The team’s efforts to design the software so that it runs in isolated packages that contain all of the components the programme requires, and on various computer systems without extra adjustments, have been particularly well received by the international research community. This approach, known as ‘containerization’, was shaped decisively by the excellent high-performance computing infrastructure at Greifswald’s University Computer Centre,” says Dr. Katharina Hoff from the University of Greifswald’s Institute of Mathematics. She has been working on the development of BRAKER for many years.

“BRAKER3 marks a significant development in bioinformatics and provides academics all over the world access to a high-performance tool for genome annotation. In the next stages of development, the developers will enhance and train large language models in a targeted way, as genomes can be understood as a ‘language’ of biology in which the encoded genes follow a strict grammar,” explains Prof. Dr. Mario Stanke, Head of the Bioinformatics Research Group at the University of Greifswald’s Institute of Mathematics.
