Abstract
Transformer has recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks utilize self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions in Transformer for a more powerful representation capability. Based on this idea, we propose a novel Transformer model, the Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions, in an inter-block and intra-block dual manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. This alternate strategy enables DAT to capture the global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements the two self-attention mechanisms from the corresponding dimensions, and SGFN introduces additional non-linear spatial information into the feed-forward network. Extensive experiments show that our DAT surpasses current methods.
Introduction
Fig. 1: Visual comparison (×4) on Urban100. CSNLN, SwinIR, and CAT-A suffer from blurring artifacts.
Single image super-resolution (SR) is a traditional low-level vision task that focuses on recovering a high-resolution (HR) image from a low-resolution (LR) counterpart. As an ill-posed problem with multiple potential solutions for a given LR input, various approaches have emerged to tackle this challenge in recent years. Many of these methods utilize convolutional neural networks (CNNs). However, convolution adopts a local mechanism, which hinders the establishment of global dependencies and restricts the performance of the model. Recently, the Transformer, originally proposed in natural language processing (NLP), has performed notably in multiple high-level vision tasks. The core of the Transformer is the self-attention (SA) mechanism, which is capable of establishing global dependencies. This property alleviates
the limitations of CNN-based algorithms. Considering the potential of the Transformer, some researchers have attempted to apply it to low-level tasks, including image SR. They explore efficient usages of the Transformer on high-resolution images from different perspectives to mitigate the high complexity of global self-attention. For the spatial aspect, some methods apply local spatial windows to limit the scope of self-attention. For the channel aspect, the "transposed" attention is proposed, which calculates self-attention along the channel dimension rather than the spatial dimension. These methods all exhibit remarkable results due to the strong modeling ability in their respective dimensions. Spatial window self-attention (SW-SA) is able to model fine-grained spatial relationships between pixels. Channel-wise self-attention (CW-SA) can model relationships among feature maps, thus exploiting global image information. Generally, both extracting spatial information and capturing channel context are crucial
to the performance of Transformer in image SR. Motivated by the aforementioned findings, we propose
the Dual Aggregation Transformer (DAT) for image SR. Our DAT aggregates spatial and channel features in an inter-block and intra-block dual manner to obtain powerful representation capability. Specifically, we alternately apply spatial window and channel-wise self-attention in successive dual aggregation Transformer blocks (DATBs). Through this alternate strategy, our DAT can capture both spatial and channel context and realize inter-block feature aggregation between different dimensions. Moreover, the two self-attention mechanisms complement each other. Spatial window self-attention enriches the spatial expression of each feature map, helping to model channel dependencies. Channel-wise self-attention provides global information between features for spatial self-attention, expanding the receptive field of window attention. Meanwhile, since self-attention mechanisms focus on modeling global information, we add convolution in parallel to self-attention, to complement the Transformer with locality. To enhance the fusion of the two branches and aggregate both spatial and channel information within a single self-attention module, we propose the adaptive interaction module (AIM). It consists of two interaction operations, spatial-interaction (S-I) and channel-interaction (C-I), which act between the two branches to exchange information. Through S-I and C-I, AIM adaptively re-weights the feature maps of the two branches along the spatial or channel dimension, according to the self-attention mechanism in use.
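To make this idea concrete, below is a minimal PyTorch-style sketch of the interaction: a self-attention branch and a parallel convolution branch exchange gating signals, one spatial (a per-pixel map) and one channel-wise (a per-channel vector). The module names, reduction ratio, activation, and the exact pairing of which branch receives which gate are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpatialInteraction(nn.Module):
    """Turn one branch's feature into a per-pixel gate in (0, 1)."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1),
            nn.GELU(),
            nn.Conv2d(dim // reduction, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):            # x: (B, C, H, W)
        return self.body(x)          # (B, 1, H, W)


class ChannelInteraction(nn.Module):
    """Turn one branch's feature into a per-channel gate in (0, 1)."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1),
            nn.GELU(),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):            # x: (B, C, H, W)
        return self.body(x)          # (B, C, 1, 1)


class AdaptiveInteraction(nn.Module):
    """Cross re-weighting between an attention branch and a conv branch.

    `mode` selects which gate each branch receives; the pairing below
    (spatial attention -> conv branch gets the channel gate, and vice versa)
    is an assumption made for illustration.
    """
    def __init__(self, dim, mode="spatial"):
        super().__init__()
        self.mode = mode
        self.spatial_gate = SpatialInteraction(dim)
        self.channel_gate = ChannelInteraction(dim)

    def forward(self, attn_feat, conv_feat):     # both: (B, C, H, W)
        if self.mode == "spatial":               # paired with SW-SA
            conv_feat = conv_feat * self.channel_gate(attn_feat)
            attn_feat = attn_feat * self.spatial_gate(conv_feat)
        else:                                    # paired with CW-SA
            conv_feat = conv_feat * self.spatial_gate(attn_feat)
            attn_feat = attn_feat * self.channel_gate(conv_feat)
        return attn_feat + conv_feat             # fused output
```

Here `attn_feat` would come from SW-SA or CW-SA and `conv_feat` from the parallel convolution branch mentioned above; wrapping the two self-attention types with such an interaction corresponds to the adaptive variants described next.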
Besides, with AIM, we design two new self-attention mechanisms, adaptive spatial self-attention (AS-SA) and adaptive channel self-attention (AC-SA), based on spatial window and channel-wise self-attention, respectively. Furthermore, the other component of the Transformer block, the feed-forward network (FFN), extracts features through fully-connected layers and thus ignores modeling spatial information. In addition, redundant information between channels obstructs further advances in feature representation learning. To cope with these issues, we design the spatial-gate feed-forward network (SGFN), which introduces a spatial-gate (SG) module between the two fully-connected layers of the FFN. The SG module is a simple gating mechanism (depth-wise convolution and element-wise multiplication). The input feature of SG is partitioned into two segments along the channel dimension, one for convolution and one for the multiplicative bypass (see the sketch after the contribution list below). Our SG module complements the FFN with additional non-linear spatial information and relieves channel redundancy. In general, based on AIM and SGFN, DAT realizes intra-block feature aggregation. Overall, with the above three designs, our DAT aggregates spatial and channel information in an inter-block and intra-block dual way to achieve strong feature expressions. Consequently, as displayed in Fig. 1, our DAT achieves superior visual results against recent state-of-the-art SR methods. Our contributions are three-fold:
- We design a new image SR model, dual aggregation Transformer (DAT). Our DAT aggregates spatial and channel features in an inter-block and intra-block dual manner to obtain powerful representation ability.
- We alternately adopt spatial and channel self-attention, realizing inter-block spatial and channel feature aggregation. Moreover, we propose AIM and SGFN to achieve intra-block feature aggregation.
- We conduct extensive experiments to demonstrate that our DAT outperforms state-of-the-art methods, while maintaining lower computational complexity and a smaller model size.
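As a concrete illustration of the SGFN described above, the following minimal PyTorch-style sketch splits the hidden features after the first fully-connected layer along the channel dimension, passes one half through a depth-wise convolution, and uses it as a spatial gate on the other half via element-wise multiplication. The expansion ratio, activation, kernel size, and token layout (B, H*W, C) are assumptions made for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SGFN(nn.Module):
    """Spatial-gate feed-forward network (illustrative sketch)."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        # depth-wise 3x3 convolution applied to half of the hidden channels
        self.dwconv = nn.Conv2d(hidden // 2, hidden // 2, kernel_size=3,
                                padding=1, groups=hidden // 2)
        self.fc2 = nn.Linear(hidden // 2, dim)

    def forward(self, x, h, w):
        # x: (B, H*W, C) token sequence from the preceding attention layer
        x = self.act(self.fc1(x))
        x1, x2 = x.chunk(2, dim=-1)                  # split along channels
        b, n, c = x1.shape
        x1 = x1.transpose(1, 2).reshape(b, c, h, w)  # to (B, C/2, H, W)
        x1 = self.dwconv(x1).flatten(2).transpose(1, 2)
        return self.fc2(x1 * x2)                     # spatial gate, then project


# toy usage on a 32x32 feature map with 64 channels
tokens = torch.randn(1, 32 * 32, 64)
out = SGFN(dim=64)(tokens, 32, 32)                   # -> (1, 1024, 64)
```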