Abstract: The Contrastive Language-Image Pre-Training (CLIP) model, pretrained on large image-text corpora, has demonstrated significant improvements on vision and language tasks and has been ...