Article

PVT v2: Improved baselines with pyramid vision transformer

Wenhai Wang (Shanghai AI Laboratory, Shanghai 200232, China; Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China)
Enze Xie (Department of Computer Science, the University of Hong Kong, Hong Kong 999077, China)
Xiang Li (School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210014, China)
Deng-Ping Fan (Computer Vision Lab, ETH Zurich, Zurich 8092, Switzerland)
Kaitao Song (School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210014, China)
Ding Liang (SenseTime, Beijing 100080, China)
Tong Lü (Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China)
Ping Luo (Department of Computer Science, the University of Hong Kong, Hong Kong 999077, China)
Ling Shao (Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates)

2022

Abstract

Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines that improve the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT .
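
As a rough illustration of designs (i) and (ii), the sketch below counts tokens and query-key pairs. It is not taken from the paper or its repository. The kernel, stride, and padding values, the 7x7 pooled key grid, and the function names are all assumptions chosen for the example. It assumes that linear cost is obtained by pooling keys and values to a fixed size, which is one common way to do so; the paper and the linked code describe the actual layer.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a strided convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def num_tokens(image_size: int, kernel: int, stride: int, padding: int) -> int:
    """Number of tokens produced by a square patch embedding."""
    side = conv_output_size(image_size, kernel, stride, padding)
    return side * side


def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Query-key pairs scored by one attention layer (a proxy for its cost)."""
    return num_queries * num_keys


# Assumed settings: non-overlapping patches use kernel == stride, while
# overlapping patches use a larger kernel with padding so neighbours share pixels.
NON_OVERLAP = dict(kernel=4, stride=4, padding=0)
OVERLAP = dict(kernel=7, stride=4, padding=3)

# Assumed fixed key/value grid after pooling (7x7 = 49 keys).
POOLED_KEYS = 7 * 7

if __name__ == "__main__":
    for image_size in (224, 448):
        n = num_tokens(image_size, **OVERLAP)
        full = attention_pairs(n, n)               # grows quadratically in n
        pooled = attention_pairs(n, POOLED_KEYS)   # grows linearly in n
        print(image_size, n, full, pooled)
```

Under these assumptions the overlapping embedding yields the same number of tokens as the non-overlapping one, so the overlap changes what each token sees rather than the sequence length. Doubling the image side multiplies the token count by 4, the full-attention pair count by 16, and the pooled-attention pair count by only 4.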

Identifiers

DOI: 10.1007/s41095-022-0274-8
