11 Oct 2024
09:00 Master's Defense (fully remote)
Title
Analysis of self-supervised approaches for fine-tuning language models for Portuguese tasks
Student
Gian Franco Joel Condori Luna
Advisor
Marcelo da Silva Reis - Co-advisor: Didier Augusto Vega Oliveros
Brief summary
Organizations often face the limitation of having a small amount of labeled data to calibrate and refine their language models (LMs) in specific contexts. This scarcity of annotated data translates into a significant challenge for the development and improvement of LMs, since the quality and quantity of data are critical factors in the performance and generalization of the model. On the other hand, the acquisition or creation of labeled data is characterized by its high demand in terms of time and financial resources; this complicated and expensive process can represent a significant barrier for organizations, limiting their ability to implement effective machine learning solutions tailored to their specific needs. The literature shows that similar problems have been solved through self-supervised fine-tuning, using different pre-training approaches. However, to our knowledge, there was no description and evaluation of such training protocols for LMs in Portuguese. Thus, in this dissertation we propose how to adapt the BERTimbau Portuguese LM pre-training protocol to a self-supervised fine-tuning procedure, accompanied by an evaluation of how this procedure can affect generalization and downstream tasks when using unlabeled data. We performed several experiments with three datasets from different contexts, in which we unfroze different numbers of layers in the model and used different learning rate settings, thus determining an optimal training regime for the self-supervised fine-tuning protocol. The results using sentiment analysis as a downstream task, with labeled data from the same datasets, indicated that unfrosting only the last layer already yields good results, which would allow users with limited computational resources to obtain excellent results with the method. In addition, the effectiveness of self-supervised fine-tuning on larger datasets was highlighted, suggesting its potential for future research in more advanced pre-trained LMs.
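As a concrete illustration of the training regime described in the summary, the sketch below shows one way to freeze all of BERTimbau except its last encoder layer and run masked-language-model fine-tuning on unlabeled text with the Hugging Face transformers library. This is a minimal sketch under stated assumptions, not the dissertation's actual implementation: the checkpoint name, corpus placeholder, and hyperparameters are illustrative choices, not values taken from the work.

# Hedged sketch: self-supervised (MLM) fine-tuning of BERTimbau with only the
# last encoder layer unfrozen. Checkpoint name and hyperparameters are
# illustrative assumptions, not taken from the dissertation.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "neuralmind/bert-base-portuguese-cased"  # public BERTimbau checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Freeze every parameter, then unfreeze only the last encoder layer and the
# MLM head, mirroring the "unfreeze only the last layer" regime from the summary.
for param in model.parameters():
    param.requires_grad = False
for param in model.bert.encoder.layer[-1].parameters():
    param.requires_grad = True
for param in model.cls.parameters():  # MLM prediction head
    param.requires_grad = True

# `texts` is a placeholder for the user's unlabeled, in-domain corpus.
texts = ["Exemplo de texto sem rótulo do domínio alvo."]
encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)

class UnlabeledDataset:
    """Minimal map-style dataset wrapping tokenized, unlabeled text."""
    def __init__(self, enc):
        self.enc = enc
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}

# The collator randomly masks tokens, so labels come from the text itself.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bertimbau-ssft",  # hypothetical output path
        num_train_epochs=1,
        per_device_train_batch_size=8,
        learning_rate=5e-5,
    ),
    train_dataset=UnlabeledDataset(encodings),
    data_collator=collator,
)
trainer.train()

After this self-supervised stage, the adapted checkpoint would be fine-tuned in the usual supervised way on the (small) labeled set for the downstream task, e.g. sentiment analysis as in the dissertation's experiments.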
Examination Board
Titular members:
Marcelo da Silva Reis IC / UNICAMP
Andre Santanche IC / UNICAMP
Thiago Alexandre Salgueiro ICMC / USP
Substitutes:
Rafael de Oliveira Werneck IC / UNICAMP
Ronaldo Cristiano Prati CMCC / UFABC