A vision backbone is the core feature-extraction module of a computer vision model. It extracts hierarchical image features, typically with convolutional or Transformer-based layers, and supplies them to downstream tasks such as segmentation and detection. Because every downstream component builds on these features, the choice of backbone directly affects model performance and efficiency.
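
To make "hierarchical features" concrete, here is a minimal pure-Python sketch of a toy convolutional backbone. It is illustrative only: the names `toy_backbone`, `conv3x3`, `relu`, and `downsample2x` are hypothetical, and the fixed `EDGE` and `BLUR` kernels are assumptions that stand in for the weights a real backbone would learn. Each stage applies a convolution, a nonlinearity, and stride-2 downsampling, so later stages produce smaller maps that summarize larger image regions.

```python
# Toy backbone sketch. All names are hypothetical and chosen for illustration.

def conv3x3(image, kernel):
    """'Same' 3x3 cross-correlation with zero padding on a 2D list of floats."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy + 1][dx + 1]
            out[y][x] = acc
    return out


def relu(fmap):
    """Clamp negative responses to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def downsample2x(fmap):
    """Keep every second row and column (stride 2)."""
    return [row[::2] for row in fmap[::2]]


def toy_backbone(image, kernels):
    """Run one stage per kernel and return the feature map from every stage."""
    features = []
    x = image
    for kernel in kernels:
        x = downsample2x(relu(conv3x3(x, kernel)))
        features.append(x)
    return features


# Assumption: hand-picked kernels in place of learned weights.
EDGE = [[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]
BLUR = [[1 / 9.0] * 3 for _ in range(3)]

# 16x16 test image with a vertical step edge: left half 0, right half 1.
image = [[1.0 if x >= 8 else 0.0 for x in range(16)] for _ in range(16)]

pyramid = toy_backbone(image, [EDGE, BLUR, BLUR])
shapes = [(len(f), len(f[0])) for f in pyramid]
print(shapes)  # prints [(8, 8), (4, 4), (2, 2)]
```

The returned list is a small feature pyramid at strides 2, 4, and 8 relative to the input. A detection or segmentation head would consume one or more of these maps, which is why the cost and quality of the backbone carry through to the whole model.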