

The recent amalgamation of transformer and convolutional designs has led to steady improvements in the accuracy and efficiency of models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains a state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT that uses structural reparameterization to lower the memory access cost by removing skip connections in the network. We further apply train-time overparameterization and large kernel convolutions to boost accuracy, and empirically show that these choices have minimal effect on latency. We show that our model is 3.5× faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9× faster than EfficientNet, and 1.9× faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks (image classification, detection, segmentation and 3D mesh regression), with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models.
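
As a rough illustration of the structural reparameterization mentioned above, the sketch below folds an identity skip connection into a single-channel convolution kernel. It is a simplified, assumed setup rather than the FastViT implementation: batch normalization is omitted, the kernel size is assumed odd, and the helper names (`depthwise_conv2d`, `fold_skip_into_kernel`) and toy values are hypothetical.

```python
def depthwise_conv2d(x, kernel):
    """Zero-padded 'same' 2D convolution of one channel with one odd-sized kernel."""
    k = len(kernel)
    pad = k // 2
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di][dj] * x[ii][jj]
            out[i][j] = acc
    return out


def fold_skip_into_kernel(kernel):
    """Return a kernel k' such that conv(x, k') equals conv(x, k) + x."""
    k = len(kernel)
    folded = [row[:] for row in kernel]
    folded[k // 2][k // 2] += 1.0  # the identity map is a delta at the centre tap
    return folded


# Toy single-channel input and kernel (illustrative values only).
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[0.1, 0.2, 0.1], [0.0, -0.5, 0.0], [0.3, 0.1, 0.2]]

# Training-time form: convolution branch plus skip connection.
conv_out = depthwise_conv2d(x, kernel)
train_time = [[c + s for c, s in zip(c_row, x_row)] for c_row, x_row in zip(conv_out, x)]

# Inference-time form: a single convolution with the folded kernel, no skip branch.
inference = depthwise_conv2d(x, fold_skip_into_kernel(kernel))

max_diff = max(abs(a - b) for ra, rb in zip(train_time, inference) for a, b in zip(ra, rb))
print(max_diff)
```

Because the identity mapping equals a convolution whose kernel is 1 at the centre and 0 elsewhere, the two outputs agree up to floating-point error, so the skip branch does not have to be evaluated (or its input kept in memory) at inference.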

Vision Transformers have achieved state-of-the-art performance on several tasks such as image classification, detection and segmentation. However, these models have traditionally been computationally expensive. Recent works have proposed methods to lower the compute and memory requirements of vision transformers. Recent hybrid architectures effectively combine the strengths of convolutional architectures and transformers to build architectures that are highly competitive on a wide range of computer vision tasks. Our goal is to build a model that achieves a state-of-the-art latency-accuracy trade-off.

Corresponding authors: {panasosaluvasu, anuragr}

Recent vision and hybrid transformer models follow the Metaformer architecture, which consists of a token mixer with a skip connection followed by a Feed Forward Network (FFN) with another skip connection. These skip connections account for a significant overhead in latency due to increased memory access cost. To address this latency overhead, we introduce RepMixer, a fully reparameterizable token mixer that uses structural reparameterization to remove the skip connections. The RepMixer block also uses depthwise convolutions for spatial mixing of information, similar to ConvMixer [55]. However, the key difference is that our module can be reparameterized at inference to remove any branches.

To further improve on latency, FLOPs and parameter count, we replace all dense k×k convolutions with their factorized version, i.e. depthwise followed by pointwise convolutions. This is a common approach used by efficient architectures to improve on efficiency metrics, but naively using this approach hurts performance, as seen in Table 1. In order to increase the capacity of these layers, we use linear train-time overparameterization as introduced in prior work. These additional branches are only introduced during training and are reparameterized at inference.

In addition, we use large kernel convolutions in our network.
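
To make the effect of the factorized convolutions described above concrete, the following sketch counts weights for a dense k×k convolution and for its depthwise-plus-pointwise replacement. The channel sizes are hypothetical and chosen only for illustration, biases are ignored, and the function names are not taken from the paper.

```python
def dense_conv_params(k, c_in, c_out):
    """Weights in a dense k×k convolution (bias ignored)."""
    return k * k * c_in * c_out


def factorized_conv_params(k, c_in, c_out):
    """Weights in a depthwise k×k convolution followed by a pointwise 1×1 convolution."""
    depthwise = k * k * c_in   # one k×k filter per input channel
    pointwise = c_in * c_out   # 1×1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
k, c_in, c_out = 3, 64, 128

dense = dense_conv_params(k, c_in, c_out)            # 3*3*64*128 = 73728
factorized = factorized_conv_params(k, c_in, c_out)  # 3*3*64 + 64*128 = 8768
ratio = dense / factorized

print(dense, factorized, round(ratio, 2))
```

For a stride-1 layer with the same output resolution, multiply-accumulate counts scale both numbers by the same spatial factor, so the ratio carries over to FLOPs. The large drop in weights is also why the factorized layers lose capacity, which is what the train-time overparameterization is meant to compensate for.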
We use large kernels because, although self-attention-based token mixing is highly effective for attaining competitive accuracy, it is inefficient in terms of latency. Therefore, we incorporate large kernel convolutions in the Feed Forward Network (FFN) layer and patch embedding layers. These changes have minimal impact on the overall latency of the model while improving performance.

Thus, we introduce FastViT, which is based on three key design principles: i) use of the RepMixer block to remove skip connections, ii) use of linear train-time overparameterization to improve accuracy, and iii) use of large convolutional kernels to substitute self-attention layers in early stages.

FastViT achieves significant improvements in latency compared to other hybrid vision transformer architectures while maintaining accuracy on several tasks: image classification, object detection, semantic segmentation and 3D hand mesh estimation. We perform a comprehensive analysis by deploying recent state-of-the-art architectures on an iPhone 12 Pro device and an NVIDIA RTX-2080Ti desktop GPU.

In Figure 1, we show that, at an ImageNet Top-1 accuracy of 83.9%, our model is 4.9× faster than EfficientNet-B5 [50], 1.6× faster than EfficientNetV2-S [51], 3.5× faster than CMT-S [17] and 1.9× faster than ConvNeXt-B on an iPhone 12 Pro mobile device. At an ImageNet Top-1 accuracy of 84.9%, our model is just as fast as NFNet-F1 [1] on a desktop GPU while being 66.7% smaller, using 50.1% fewer FLOPs, and running 42.8% faster on a mobile device. At a latency of 0.8ms on an iPhone 12 Pro mobile device, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne-S0. For object detection and instance segmentation on MS COCO using a Mask-RCNN head, our model attains comparable performance to CMT-S while incurring 4.3× lower backbone latency. For semantic segmentation on ADE20K, our model improves over PoolFormer-M36 [67] by 5.2%, while incurring 1.5× lower backbone latency on an iPhone 12 Pro mobile device. On the 3D hand mesh estimation task, our model is 1.9× faster than MobileHand [16] and 2.8× faster than the recent state-of-the-art MobRecon [4] when benchmarked on a GPU.

In addition to accuracy metrics, we also study the robustness of our models to corruption and out-of-distribution samples, which does not always correlate well with accuracy. For example, PVT [60] achieves highly competitive performance on the ImageNet dataset, but has very poor robustness to corruption and out-of-distribution samples, as reported in Mao et al. [38]. In real-world applications, using a robust model in such a scenario can significantly improve user experience. We demonstrate the robustness of our architecture on popular benchmarks and show that our models are highly robust to corruption and out-of-distribution samples while being significantly faster than competing robust models.

In summary, our contributions are as follows:

  • We introduce FastViT, a hybrid vision transformer that uses structural reparameterization to obtain lower memory access cost and increased capacity, achieving a state-of-the-art accuracy-latency trade-off.
  • We show that our models are the fastest in terms of latency on two widely used platforms: mobile device and desktop GPU.
  • We show that our models generalize to many tasks: image classification, object detection, semantic segmentation, and 3D hand mesh regression.
  • We show that our models are robust to corruption and out-of-distribution samples and significantly faster than competing robust models.
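
The "increased capacity" in the first contribution comes from linear train-time overparameterization, where extra parallel branches exist only during training. The sketch below shows, under simplifying assumptions, why such branches can be merged at inference: convolution is linear, so summing the branch kernels (after zero-padding smaller ones to a common size) gives one equivalent kernel. Batch normalization is omitted here, and the branch layout, helper names and values are hypothetical rather than taken from the FastViT code.

```python
def conv2d_same(x, kernel):
    """Zero-padded 'same' 2D convolution of one channel with one odd-sized kernel."""
    k = len(kernel)
    pad = k // 2
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += kernel[di][dj] * x[ii][jj]
    return out


def pad_to(kernel, size):
    """Zero-pad a small odd-sized kernel to size×size, keeping it centred."""
    k = len(kernel)
    off = (size - k) // 2
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[i + off][j + off] = kernel[i][j]
    return out


def merge_branches(kernels, size):
    """Sum parallel linear branches into a single size×size kernel."""
    merged = [[0.0] * size for _ in range(size)]
    for kern in kernels:
        padded = pad_to(kern, size)
        for i in range(size):
            for j in range(size):
                merged[i][j] += padded[i][j]
    return merged


# Hypothetical training-time branches: two 3×3 kernels and one 1×1 kernel.
branches = [
    [[0.1, 0.0, 0.2], [0.0, 0.5, 0.0], [0.3, 0.0, 0.1]],
    [[0.0, 0.2, 0.0], [0.1, -0.25, 0.1], [0.0, 0.2, 0.0]],
    [[0.5]],
]
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# Training time: every branch is evaluated and the outputs are summed.
outs = [conv2d_same(x, kern) for kern in branches]
train_time = [[sum(o[i][j] for o in outs) for j in range(3)] for i in range(3)]

# Inference time: one convolution with the merged kernel.
merged = merge_branches(branches, 3)
inference = conv2d_same(x, merged)

max_diff = max(abs(a - b) for ra, rb in zip(train_time, inference) for a, b in zip(ra, rb))
print(max_diff)
```

The merged layer has the cost of a single 3×3 convolution regardless of how many branches were used in training, which is consistent with the claim that the extra branches do not affect inference latency.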

