Many works on violent video classification have proposed solutions ranging from local descriptors to deep neural networks. Most approaches extract features from the entire video representation. However, some scenes may contain noisy and irrelevant parts that mislead the classifier. We investigated the effectiveness of attention-based models in addressing this problem. We extended the original implementations to work with multimodal features using a late-fusion approach. We performed experiments on three datasets with different concepts of violence: Hockey Fights, MediaEval 2015, and RWF-2000. We conducted quantitative experiments, comparing the performance of attention-based models against traditional methods, and qualitative experiments, analyzing the relevance scores produced by the attention-based models. Attention-based models surpassed their traditional counterparts in all cases. They also achieved better results than many more computationally expensive approaches, highlighting the advantage of their use.
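To make the fusion scheme concrete, below is a minimal sketch of late fusion over attention-based models, assuming one attention-pooling classifier per modality whose clip-level violence probabilities are averaged at the decision level. The module names, feature dimensions, sequence lengths, and equal fusion weights are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Scores each frame-level feature, pools frames with softmax
    attention weights, and predicts a clip-level violence probability."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )
        self.classify = nn.Linear(feat_dim, 1)

    def forward(self, x):  # x: (batch, frames, feat_dim)
        w = torch.softmax(self.score(x), dim=1)  # per-frame relevance
        pooled = (w * x).sum(dim=1)              # attention-weighted average
        return torch.sigmoid(self.classify(pooled)).squeeze(-1), w

# Late fusion: one attention model per modality; probabilities are
# combined at the decision level (dimensions/weights are hypothetical).
visual_model, audio_model = AttentionPool(512), AttentionPool(128)
visual = torch.randn(4, 30, 512)  # e.g. 30 frames of visual features
audio = torch.randn(4, 30, 128)   # e.g. 30 aligned audio features
p_vis, w_vis = visual_model(visual)
p_aud, w_aud = audio_model(audio)
p_fused = 0.5 * p_vis + 0.5 * p_aud  # fused violence probability
```

The attention weights `w_vis` and `w_aud` returned alongside each prediction correspond to the per-segment relevance scores examined in the qualitative analysis.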