Fine-tuning Multimodal Large Language Models (MLLMs) with parameter-efficient methods like Low-Rank Adaptation (LoRA) is crucial for task adaptation. However, imbalanced training dynamics across modalities often lead to suboptimal accuracy due to negative interference, a challenge typically addressed with inefficient heuristic methods such as manually tuning separate learning rates. To overcome this, we introduce MARS (Multimodal Adaptive Rank Search), an approach to discover optimal rank pairs that balance training dynamics while maximizing performance. Our key innovation, a proposed framework of dual scaling laws, enables this search: one law models module-specific convergence time to prune the search space to candidates with aligned dynamics, while the other predicts final task performance to select the optimal pair from the pruned set. By re-purposing the LoRA rank as a controller for modality-specific convergence speed, MARS outperforms baseline methods and provides a robust, automated strategy for optimizing MLLM fine-tuning.
Imbalanced training dynamics between the Vision Encoder (VE) and the LLM lead to suboptimal performance: when the two modules converge at different rates, training suffers either a performance bottleneck or oscillations.
When the vision encoder is under-adapted (lr_VE ≪ lr_LLM), it creates a performance bottleneck that caps overall model capability.
When the LLM is under-adapted (lr_LLM ≪ lr_VE), training becomes significantly unstable and oscillates.
MARS discovers optimal LoRA rank pairs that align convergence times (t_VE ≈ t_LLM), enabling stable training dynamics and maximum performance.
We identify and provide evidence that imbalanced training dynamics in MLLM fine-tuning, originating from a two-fold disparity (learning capacity and required learning budget), represent a key source of suboptimal performance.
We are the first to propose and validate dual scaling laws for MLLM fine-tuning: Scaling Law-P (models performance) and Scaling Law-C (models module-specific convergence time), making the rank search feasible.
MARS outperforms baselines with up to 12.0% higher ScienceQA accuracy and 13.2% lower LLaVA Bench perplexity, while demonstrating robust generality and an 11.5× reduction in total search and fine-tuning time.
MARS transforms the intractable search for optimal LoRA ranks into an efficient, guided procedure through a two-step process enabled by our dual scaling laws.
(a) Scaling Law-P: Performance as a function of dataset size for different LLM ranks.
(b) Scaling Law-C: Convergence time as a function of dataset size for different LLM ranks.
Predicts final task performance. Serves as the objective function to select the optimal rank pair from pruned candidates.
Models convergence time for each module. Used to prune the search space to candidates with aligned dynamics (t_VE ≈ t_LLM).
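The fitting step behind Scaling Law-C can be sketched as a simple power-law fit over dataset size. The exact functional form used by MARS is not stated here, so the form `t(D) = a · D^b`, the helper `fit_power_law`, and the synthetic measurements below are illustrative assumptions only:

```python
import numpy as np

def fit_power_law(dataset_sizes, convergence_times):
    """Fit t(D) = a * D^b by ordinary least squares in log-log space.

    Returns (a, b): the prefactor and the exponent of the fitted law.
    """
    log_d = np.log(np.asarray(dataset_sizes, dtype=float))
    log_t = np.log(np.asarray(convergence_times, dtype=float))
    # In log space the power law is linear: log t = b * log D + log a.
    b, log_a = np.polyfit(log_d, log_t, 1)
    return np.exp(log_a), b

# Synthetic, noise-free measurements generated from t(D) = 0.5 * D^0.7.
sizes = [1e3, 1e4, 1e5, 1e6]
times = [0.5 * d**0.7 for d in sizes]
a, b = fit_power_law(sizes, times)
```

Fitting one such curve per module (and per candidate rank) at small dataset sizes is what would let the predicted convergence times be extrapolated and compared without running full fine-tuning.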
MARS uses Scaling Law-C to enforce a balance condition (t_VE ≈ t_LLM). This drastically prunes the search space to candidate pairs predicted to exhibit stable, harmonized training dynamics.
From the pruned set of stable candidates, MARS uses Scaling Law-P to predict the final task accuracy for each pair and selects the one with the best predicted outcome.
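The prune-then-select procedure above can be written in a few lines. This is a minimal sketch: the rank grid, the tolerance, the toy predictors (`t_ve`, `t_llm`, `perf`), and the function name `mars_search` are hypothetical stand-ins for the fitted Scaling Law-C and Scaling Law-P models:

```python
def mars_search(rank_pairs, predict_time_ve, predict_time_llm,
                predict_perf, tol=0.1):
    """Step 1: keep pairs with aligned convergence (|t_VE - t_LLM| <= tol).
    Step 2: among survivors, pick the pair with the best predicted performance."""
    pruned = [
        (r_ve, r_llm) for r_ve, r_llm in rank_pairs
        if abs(predict_time_ve(r_ve) - predict_time_llm(r_llm)) <= tol
    ]
    if not pruned:
        raise ValueError("no rank pair satisfies the balance condition")
    return max(pruned, key=lambda pair: predict_perf(*pair))

# Toy predictors: convergence time falls with rank; performance saturates.
t_ve = lambda r: 8.0 / r
t_llm = lambda r: 16.0 / r
perf = lambda r_ve, r_llm: 1 - 1 / (r_ve + r_llm)

pairs = [(4, 8), (8, 16), (16, 32), (8, 8), (16, 8)]
best = mars_search(pairs, t_ve, t_llm, perf)
# (8, 8) and (16, 8) are pruned for mismatched convergence times;
# among the balanced pairs, (16, 32) has the best predicted performance.
```

The key design point is that the expensive objective (Scaling Law-P) is only evaluated on the small pruned set, which is what turns the otherwise intractable rank search into a cheap guided procedure.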
Left: Comparison across diverse multimodal benchmarks demonstrating broad generalization. Right: Fine-grained capability breakdown on MMStar.
| Model | Benchmark | LoRA (⋆, 1e-5) | LoRA (⋆, 1e-6) | LoRA (⋆, 1e-7) | MARS |
|---|---|---|---|---|---|
| LLaVA-OV-0.5B | LLaVA (↓) | 2.7336 | 2.771 | 2.8472 | 2.7188 |
| | ScienceQA (↑) | 71.06 | 61.88 | 59.28 | 72.85 |
| LLaVA-OV-7B | LLaVA (↓) | 2.2317 | 2.295 | 2.4346 | 2.1875 |
| | ScienceQA (↑) | 72.26 | 69.86 | 67.27 | 74.25 |
| Qwen2.5-VL-3B | LLaVA (↓) | 3.6156 | 3.7415 | 4.1557 | 3.5925 |
| | ScienceQA (↑) | 78.04 | 76.45 | 76.25 | 79.24 |
| Qwen2.5-VL-7B | LLaVA (↓) | 3.5032 | 3.5908 | 3.8716 | 3.3879 |
| | ScienceQA (↑) | 79.84 | 76.25 | 74.25 | 79.64 |
| Model | AdaLoRA | GeoLoRA | Full-rank | LoRA (⋆, 16) | LoRA (⋆, 32) | MARS |
|---|---|---|---|---|---|---|
| LLaVA Bench (perplexity ↓) | | | | | | |
| LLaVA-OV-0.5B | 2.8973 | 2.8801 | 2.7209 | 2.7336 | 2.7331 | 2.7188 |
| LLaVA-OV-7B | 2.5189 | 2.4888 | 2.2693 | 2.2317 | 2.4420 | 2.1875 |
| ScienceQA (accuracy % ↑) | | | | | | |
| LLaVA-OV-0.5B | 62.28 | 63.52 | 69.66 | 71.06 | 69.86 | 72.85 |
| LLaVA-OV-7B | 66.27 | 67.81 | 70.46 | 72.26 | 73.65 | 74.25 |
@article{cho2026mars,
title={MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search},
author={Cho, Minkyoung and Jang, Insu and Jin, Shuowei and Zhao, Zesen and Jothi, Adityan and Can, Ethem F. and Chen, Min-Hung and Mao, Z. Morley},
journal={arXiv preprint},
year={2026}
}