News

Currently, video question answering (VideoQA) algorithms relying on video-text pretraining models employ intricate unimodal encoders and multimodal fusion Transformers, which often lead to decreased ...
As such, a natural question arises as to how to expand the receptive fields in convolutional neural networks to achieve the superior performance comparable with that of transformers. Large kernel ...