CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction Paper • 2310.01403 • Published Oct 2, 2023
CLIM: Contrastive Language-Image Mosaic for Region Representation Paper • 2312.11376 • Published Dec 18, 2023
OMG-Seg: Is One Model Good Enough For All Segmentation? Paper • 2401.10229 • Published Jan 18, 2024
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation Paper • 2503.21979 • Published Mar 27, 2025