01 April 2024
13:30, Doctoral defense, Room 85 of IC2
Theme
Improved Image Filling Based on Vision Transformers and Pencil-Sketch
Student
Jose Luis Flores Campana
Advisor
Hélio Pedrini; Co-advisor: Helena de Almeida Maia
Brief summary
Image filling is a computer vision technique focused on restoring damaged or missing regions of an image. Since the advent of deep neural networks, especially convolutional neural networks (CNNs), image filling has made great progress in restoring damaged images. However, the limited receptive field of CNNs can lead to unreliable results, since they struggle to capture the global context of the image. Recently, Transformers have been adopted in computer vision to address this limitation: they learn long-range dependencies through self-attention mechanisms, which makes them well suited to producing realistic results when images contain large missing regions and complex scenes. However, the quadratic computational and memory costs of self-attention make Transformers prohibitive for high-resolution images and resource-constrained devices. To overcome this problem, we propose a Vision Transformer architecture with variable hyperparameters that (i) subdivides the feature maps into a variable number of multiscale slices, (ii) distributes the feature map across a variable number of heads to balance the complexity of the self-attention operation, and (iii) includes a new strategy based on depth-wise convolution to reduce the number of feature-map channels fed to each Transformer block. Furthermore, to generate more consistent results, some approaches incorporate auxiliary information to guide the model's understanding of structural information. Therefore, to address the inconsistency between structure and texture, and to avoid generating artifacts, we developed a new image-filling method that uses pencil-sketch information to guide the restoration of both structural elements and texture. Unlike previous work that employs edges, lines, or segmentation maps, we leverage the richness of pencil-sketch information and the ability of Transformers to learn long-range dependencies to properly combine structure and texture, producing more consistent results. We conducted experiments on three datasets from the literature: Places2, CelebA, and Paris StreetView. Our method consistently achieved the best results for the FID and LPIPS metrics on CelebA and obtained results competitive with state-of-the-art methods on Places2 and Paris StreetView. Furthermore, our model performed best in terms of model size, number of parameters, and FLOPs.
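To make the channel-reduction idea concrete, below is a minimal sketch, not the thesis implementation: it assumes a PyTorch-style block (names such as ReducedChannelTransformerBlock, reduced_channels, and num_heads are illustrative) in which a depth-wise plus point-wise convolution shrinks the channel dimension before multi-head self-attention is applied over the spatial positions.

# Illustrative sketch only; module and hyperparameter names are assumptions,
# not the architecture described in the thesis.
import torch
import torch.nn as nn


class ReducedChannelTransformerBlock(nn.Module):
    """Reduces feature-map channels with a depth-wise separable convolution,
    then applies multi-head self-attention over the spatial positions."""

    def __init__(self, in_channels: int, reduced_channels: int, num_heads: int):
        super().__init__()
        # Depth-wise convolution (one filter per channel) followed by a
        # point-wise 1x1 convolution that shrinks the channel dimension,
        # lowering the cost of the subsequent self-attention.
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3,
                      padding=1, groups=in_channels),             # depth-wise
            nn.Conv2d(in_channels, reduced_channels, kernel_size=1),  # point-wise
        )
        self.norm = nn.LayerNorm(reduced_channels)
        # A variable number of heads balances the self-attention complexity.
        self.attn = nn.MultiheadAttention(reduced_channels, num_heads,
                                          batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)                      # (B, C', H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C') token sequence
        tokens = self.norm(tokens)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = tokens + attended              # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = ReducedChannelTransformerBlock(in_channels=64,
                                           reduced_channels=32, num_heads=4)
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 32, 32, 32])

Shrinking the channel (embedding) dimension and varying the number of heads are the two knobs that keep the quadratic cost of self-attention manageable on larger feature maps, which is the motivation stated in the abstract.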
Examination Board
Members:
Hélio Pedrini IC / UNICAMP
Marcelo da Silva Reis IC / UNICAMP
Leo Sampaio Ferraz Ribeiro IC / UNICAMP
Samuel Botter Martins Banco Itaú
Luiz Maurílio da Silva Maciel ICE/UFJF
Substitutes:
Andre Santanche IC / UNICAMP
Fátima de Lourdes dos Santos Nunes Marques EACH / USP
Ronaldo Cristiano Prati CMCC / UFABC