Pre-training followed by fine-tuning is the standard paradigm for vision-language models, but as these models scale, full fine-tuning incurs high storage costs and becomes increasingly difficult to optimize.
Recent advances in NLP introduced low-rank adaptation (LoRA), which makes fine-tuning more efficient by training only low-rank update matrices while keeping the pre-trained weights frozen; however, this low-rank constraint introduces significant approximation error.
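For reference, a minimal sketch of the low-rank update that LoRA trains, written in PyTorch; the class and parameter names here are illustrative and not taken from the paper. The pre-trained weight stays frozen and only the two small factors are learned.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update (LoRA sketch).

    Effective weight: W + (alpha / r) * B @ A, where W is the frozen pre-trained
    weight, A is (r x in_features) and B is (out_features x r).
    """

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)        # pre-trained weight stays frozen
        self.base.bias.requires_grad_(False)
        # Only these low-rank factors are trained.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: update starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank correction (B A) x.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


if __name__ == "__main__":
    layer = LoRALinear(768, 768, r=8)
    y = layer(torch.randn(4, 768))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"output shape: {tuple(y.shape)}, trainable params: {trainable}/{total}")
```

Because the rank r is much smaller than the layer width, only a small fraction of the parameters receive gradients, which is the storage and optimization saving that motivates this line of work.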
The proposed momentum imitation learning (MoIL) method improves on LoRA by directly optimizing this approximation error and making the adaptation process more efficient; in experiments across several vision-language tasks, it achieves better performance while updating only a minimal number of parameters.