22 Jul 2025
14:00, Doctoral defense, Room 85 of IC2
Topic
Protein Function Annotation using Machine Learning and Local Alignment
Student
Gabriel Bianchin de Oliveira
Advisor
Zanoni Dias - Co-advisor: Hélio Pedrini
Brief summary
With the advancement of sequencing techniques in recent decades, millions of proteins have had their amino acid sequences determined through laboratory experiments. However, identifying the specific characteristics of each protein, such as its functions, is still costly and time-consuming, since it requires complex experimental procedures. Understanding the functions performed by proteins is essential to many scientific applications, since proteins play fundamental roles in the biological processes of living organisms. To narrow the gap between the number of proteins with known sequences and the number with manually annotated functions, several studies have applied computational methods to help discover protein functions. Computational techniques based on amino acid sequences have already shown good results, especially natural language processing approaches such as Transformer-based models and sequence alignment with tools such as DIAMOND and BLAST. Nevertheless, the task remains open, which highlights its complexity and the continuing need for methodological advances. In this research, we present two machine learning-based methods that use natural language processing techniques, two ensemble methods that combine the predictions of machine learning approaches with local alignment, and intermediate models. In the evaluation on the CAFA5-derived dataset, which is the most recent CAFA challenge dataset and the main reference for the protein function classification task, the proposed methods outperformed the approaches in the literature, establishing themselves as the new state of the art in protein function prediction using only the amino acid sequence.
Finally, we present memory-optimized versions, which require less computational capacity to achieve results comparable to the original versions, in addition to a web server containing the optimized versions of the proposed methods.
Examination Board
Full members:
| Zanoni Dias | IC / UNICAMP |
| Ana Ligia Barbour Scott | CMCC / UFABC |
| Carlos Henrique da Silveira | ICT / UNIFEI |
| Guilherme Pimentel Telles | IC / UNICAMP |
| Marcelo da Silva Reis | IC / UNICAMP |
Substitutes:
| Alexandre Mello Ferreira | IC / UNICAMP |
| Raquel Cardoso de Melo Minardi | DCC / UFMG |
| Felipe Rodrigues da Silva | Embrapa |