Enhanced image-text retrieval based on CLIP with YOLOv10 and Next-ViT
9 January 2025
Xue Qian, Bo Liu
Proceedings Volume 13486, Fourth International Conference on Computer Vision, Application, and Algorithm (CVAA 2024); 134862K (2025) https://doi.org/10.1117/12.3055876
Event: Fourth International Conference on Computer Vision, Application, and Algorithm (CVAA 2024), 2024, Chengdu, China
Abstract
In recent years, the CLIP model has achieved remarkable success in image-text retrieval tasks through contrastive learning. However, CLIP still exhibits certain limitations when handling complex backgrounds and small objects. To address these challenges, this paper proposes two key innovations. First, during inference, the YOLOv10 model is employed to detect and crop small objects and essential background regions in the image, enhancing CLIP's ability to comprehend complex scenes. Second, the Next-ViT network is used as the backbone for image encoding; its more efficient multi-scale feature extraction improves retrieval accuracy for small objects while making the model easier to deploy in industrial scenarios. Experimental results demonstrate that these two innovations significantly enhance CLIP's performance in image-text retrieval and strike a balance between accuracy and efficiency across various vision tasks.
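
The sketch below illustrates the kind of inference-time pipeline the abstract describes, assuming the ultralytics and Hugging Face transformers packages: YOLOv10 detections are cropped from the image, and a stock CLIP model scores the full image together with the crops against candidate captions. This is not the authors' implementation; the Next-ViT image-encoder replacement is not reproduced here, and the max-over-views fusion rule is only one plausible choice.

# Illustrative sketch only (not the authors' code): YOLOv10 crops + CLIP scoring.
# The Next-ViT backbone swap described in the paper is NOT shown; a stock CLIP
# image encoder stands in for it.
import torch
from PIL import Image
from ultralytics import YOLO
from transformers import CLIPModel, CLIPProcessor

detector = YOLO("yolov10n.pt")  # hypothetical choice of YOLOv10 weights
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_scores(image_path: str, captions: list[str]) -> torch.Tensor:
    """Score an image against candidate captions using the full image
    plus YOLOv10 crops of detected objects."""
    image = Image.open(image_path).convert("RGB")

    # 1. Detect objects and crop them from the image.
    boxes = detector(image)[0].boxes.xyxy.cpu().tolist()
    crops = [image.crop(tuple(map(int, b))) for b in boxes]
    views = [image] + crops  # full image plus object crops

    # 2. Encode all views and captions with CLIP.
    inputs = processor(text=captions, images=views, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)

    # 3. Aggregate: take the best-matching view for each caption
    #    (max over views is one plausible fusion rule; the paper may differ).
    return out.logits_per_text.max(dim=1).values  # shape: (num_captions,)

Scoring the object crops alongside the full image gives captions that mention small objects a chance to match a dedicated view, which the global image embedding alone might under-represent.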
© (2025) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Xue Qian and Bo Liu "Enhanced image-text retrieval based on CLIP with YOLOv10 and Next-ViT", Proc. SPIE 13486, Fourth International Conference on Computer Vision, Application, and Algorithm (CVAA 2024), 134862K (9 January 2025); https://doi.org/10.1117/12.3055876
KEYWORDS: Object detection, Image processing, Transformers, Image enhancement, Image retrieval, Object recognition, Visualization