03 May 2021
14:00 Master's Defense (fully remote)
Theme
Using Transformers and Emoji in the Sentiment Classification of Texts from Social Networks
Student
Tiago Martinho de Barros
Advisor / Co-advisor
Hélio Pedrini (Advisor) / Zanoni Dias (Co-advisor)
Brief summary
Recent advances in Natural Language Processing have brought better solutions to a number of interesting tasks, such as Linguistic Acceptability, Question Answering, Reading Comprehension, Natural Language Inference, and Sentiment Analysis. In this work, we focus on Sentiment Analysis, a research field dedicated to the computational study of sentiments. Sentiment Analysis has many practical applications, such as recommendation systems, monitoring user satisfaction, and predicting election outcomes. These tasks are important for the advancement of Artificial Intelligence, as they are challenging and applicable to many problems. The traditional approach is to build a specific classifier for each task; however, with the popularization of pre-training followed by fine-tuning, it has become very common to use the same architecture for different problems, fine-tuned with data from the task at hand. Methods such as ULMFiT, ELMo, BERT, and their derivatives have achieved substantial success in many Natural Language Processing tasks, but they share a disadvantage: pre-training these models from scratch requires substantial amounts of data and computational resources. In this work, we propose a new methodology for sentiment classification in texts, based on BERT and focused on emoji, treating them as an important source of sentiment rather than as simple input tokens. In addition, a pre-trained BERT model can be used as a starting point for our model, significantly reducing the total training time required. We evaluated additional pre-training on texts containing at least one emoji, and we also employed data augmentation to improve the generalization ability of our model. Experiments on two datasets of tweets in Brazilian Portuguese, TweetSentBR and 2000-tweets-BR, show that our methodology produces results competitive with previously published methods and with BERT.
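As an illustrative sketch only (not the author's actual pipeline), one way to treat emoji as a distinct sentiment signal, rather than as ordinary input tokens, is to separate them from the surrounding text before classification. The helper below is a hypothetical example using approximate Unicode emoji ranges:

```python
import re

# Approximate Unicode ranges covering common emoji (illustrative, not exhaustive).
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

def split_text_and_emoji(tweet: str):
    """Return the tweet text with emoji removed, plus the list of emoji found."""
    emoji = EMOJI_PATTERN.findall(tweet)
    text = EMOJI_PATTERN.sub("", tweet).strip()
    return text, emoji

text, emoji = split_text_and_emoji("Adorei o filme 😍🔥")
# The plain text and the emoji list can then be fed to a model
# as separate sources of sentiment.
```

In a BERT-based setup, the extracted emoji could then be encoded separately (for example, mapped to sentiment scores or dedicated embeddings) and combined with the text representation.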
Examination Board
Members:
Hélio Pedrini IC / UNICAMP
Rodrigo Frassetto Nogueira NeuralMind
Esther Luna Colombini IC / UNICAMP
Substitutes:
Gerberth Adín Ramírez Rivera IC / UNICAMP
David Menotti Gomes DInf / UFPR