MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search

1University of Michigan    2NVIDIA

Abstract

Fine-tuning Multimodal Large Language Models (MLLMs) with parameter-efficient methods such as Low-Rank Adaptation (LoRA) is crucial for task adaptation. However, imbalanced training dynamics across modalities often lead to suboptimal accuracy due to negative interference, a challenge typically addressed with inefficient heuristics such as manually tuning separate learning rates. To overcome this, we introduce MARS (Multimodal Adaptive Rank Search), an approach that discovers optimal LoRA rank pairs which balance training dynamics while maximizing performance. Our key innovation, a framework of dual scaling laws, makes this search tractable: one law models module-specific convergence time and is used to prune the search space to candidates with aligned dynamics, while the other predicts final task performance and is used to select the optimal pair from the pruned set. By re-purposing the LoRA rank as a controller for modality-specific convergence speed, MARS outperforms baseline methods and provides a robust, automated strategy for optimizing MLLM fine-tuning.

Why Is MARS Needed?

Imbalanced training dynamics between the Vision Encoder (VE) and LLM lead to suboptimal performance. When modules converge at different rates, it causes either performance bottlenecks or training oscillations.

Case 1: Vision Encoder Under-adapted (VE Slow)

[Figure: VE-slow training diagram and results]

When the vision encoder is under-adapted (lr_VE ≪ lr_LLM), it creates a performance bottleneck that limits the overall model capability.

Case 2: LLM Under-adapted (LLM Slow)

[Figure: LLM-slow training diagram and results]

When the LLM is under-adapted (lr_LLM ≪ lr_VE), it causes significant training instability and oscillations.

Case 3: MARS Solution (Balanced)

MARS discovers optimal LoRA rank pairs that align convergence times (t_VE ≈ t_LLM), enabling stable training dynamics and maximum performance.

Key Contributions

1. Identifying the Core Problem

We identify and provide evidence that imbalanced training dynamics in MLLM fine-tuning, originating from a two-fold disparity (learning capacity and required learning budget), represent a key source of suboptimal performance.

2. Dual Scaling Laws Framework

We are the first to propose and validate dual scaling laws for MLLM fine-tuning: Scaling Law-P, which predicts final task performance, and Scaling Law-C, which models module-specific convergence time. Together, they make the rank search feasible.

3. Superior Performance & Efficiency

MARS outperforms baselines with up to 12.0% higher ScienceQA accuracy and 13.2% lower LLaVA Bench perplexity, while demonstrating robust generality and an 11.5× reduction in total search and fine-tuning time.

MARS Methodology

MARS transforms the intractable search for optimal LoRA ranks into an efficient, guided procedure through a two-step process enabled by our dual scaling laws.

Dual Scaling Laws: The Predictive Foundation

[Figure: (a) Scaling Law-P: performance as a function of dataset size for different LLM ranks. (b) Scaling Law-C: convergence time as a function of dataset size for different LLM ranks.]

Scaling Law-P (Performance)

p = A / ((r_VE)^α_m · (r_LLM)^α_l · (D_f)^β) + E

Predicts final task performance. Serves as the objective function to select the optimal rank pair from pruned candidates.

Key Findings
Performance is sensitive to VE-LLM rank interplay.
Optimal rank pair involves a trade-off with dataset size.
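As a minimal sketch of how Scaling Law-P serves as an objective function, the law can be evaluated directly. Every coefficient below (A, α_m, α_l, β, E) is an illustrative placeholder, not a value fitted in the paper:

```python
# Hedged sketch of Scaling Law-P: p = A / (r_ve^a_m * r_llm^a_l * D_f^b) + E.
# All coefficients are illustrative placeholders, not fitted values.

def scaling_law_p(r_ve, r_llm, d_f, A=5.0, alpha_m=0.3, alpha_l=0.5, beta=0.4, E=2.0):
    """Predicted loss-like metric (e.g., perplexity): decays toward the
    irreducible floor E as ranks and dataset size grow."""
    return A / (r_ve ** alpha_m * r_llm ** alpha_l * d_f ** beta) + E

# Larger LLM rank -> better (lower) predicted metric, approaching E.
print(scaling_law_p(8, 16, 10_000))
print(scaling_law_p(8, 64, 10_000))
```

With these placeholder coefficients, the predicted metric improves monotonically with either rank, capturing the sensitivity to the VE-LLM rank interplay noted above.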

Scaling Law-C (Convergence)

t_i = k_i · (r_i)^γ_i · (D_f)^δ_i + E_i

Models convergence time for each module. Used to prune the search space to candidates with aligned dynamics (t_VE ≈ t_LLM).

Key Findings
Increasing dataset size increases convergence time.
Increasing rank size decreases convergence time.
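A comparable sketch for Scaling Law-C, again with hypothetical coefficients: a negative rank exponent γ_i encodes the finding that larger ranks converge faster, while a positive δ_i encodes that larger datasets take longer:

```python
# Hedged sketch of Scaling Law-C: t_i = k_i * r_i^gamma_i * D_f^delta_i + E_i.
# Coefficients are illustrative placeholders, not fitted values.

def scaling_law_c(rank, d_f, k=100.0, gamma=-0.5, delta=0.6, e=50.0):
    """Predicted convergence time for one module (VE or LLM)."""
    return k * rank ** gamma * d_f ** delta + e

# More data -> longer convergence; higher rank -> faster convergence.
print(scaling_law_c(16, 10_000))
print(scaling_law_c(64, 10_000))
```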

Two-Step Search Process

1. Pruning via Convergence Balancing

MARS uses Scaling Law-C to enforce a balance condition (t_VE ≈ t_LLM). This drastically prunes the search space to candidate pairs predicted to exhibit stable, harmonized training dynamics.

2. Selection via Performance Prediction

From the pruned set of stable candidates, MARS uses Scaling Law-P to predict the final task accuracy for each pair and selects the one with the best predicted outcome.
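Putting the two steps together, the search can be sketched as a grid enumeration: prune rank pairs whose predicted convergence times disagree by more than a tolerance, then select among the survivors by predicted performance. All coefficients, the rank grid, and the 10% tolerance below are hypothetical choices for illustration, not the paper's fitted values:

```python
from itertools import product

# Hypothetical Scaling Law-C coefficients per module: (k, gamma, delta, E).
# gamma < 0: larger rank converges faster; delta > 0: more data takes longer.
COEFF_C = {"ve": (120.0, -0.4, 0.6, 40.0), "llm": (200.0, -0.6, 0.6, 60.0)}

def t_conv(module, rank, d_f):
    """Scaling Law-C: predicted convergence time k * r^gamma * D_f^delta + E."""
    k, gamma, delta, e = COEFF_C[module]
    return k * rank ** gamma * d_f ** delta + e

def p_pred(r_ve, r_llm, d_f, A=5.0, a_m=0.3, a_l=0.5, beta=0.4, E=2.0):
    """Scaling Law-P: predicted loss-like metric (lower is better)."""
    return A / (r_ve ** a_m * r_llm ** a_l * d_f ** beta) + E

def mars_search(ranks=(4, 8, 16, 32, 64), d_f=10_000, tol=0.10):
    # Step 1: prune to pairs whose predicted convergence times align (t_ve ~ t_llm).
    balanced = [
        (r_ve, r_llm)
        for r_ve, r_llm in product(ranks, repeat=2)
        if abs(t_conv("ve", r_ve, d_f) - t_conv("llm", r_llm, d_f))
        <= tol * t_conv("llm", r_llm, d_f)
    ]
    # Step 2: among balanced pairs, select the one with the best predicted metric.
    return min(balanced, key=lambda pair: p_pred(*pair, d_f), default=None)

print(mars_search())
```

Under these placeholder laws only a handful of the 25 candidate pairs survive pruning, and the final selection reduces to a cheap argmin over predictions rather than fine-tuning every pair.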

Results

Evaluation of Generalist Capabilities


Left: Comparison across diverse multimodal benchmarks demonstrating broad generalization. Right: Fine-grained capability breakdown on MMStar.

Comparison with Fixed-Rank Tuning (Different Learning Rates)

| Model | Benchmark | LoRA (⋆, 1e-5) | LoRA (⋆, 1e-6) | LoRA (⋆, 1e-7) | MARS |
|---|---|---|---|---|---|
| LLaVA-OV-0.5B | LLaVA (↓) | 2.7336 | 2.771 | 2.8472 | 2.7188 |
| | ScienceQA (↑) | 71.06 | 61.88 | 59.28 | 72.85 |
| LLaVA-OV-7B | LLaVA (↓) | 2.2317 | 2.295 | 2.4346 | 2.1875 |
| | ScienceQA (↑) | 72.26 | 69.86 | 67.27 | 74.25 |
| Qwen2.5-VL-3B | LLaVA (↓) | 3.6156 | 3.7415 | 4.1557 | 3.5925 |
| | ScienceQA (↑) | 78.04 | 76.45 | 76.25 | 79.24 |
| Qwen2.5-VL-7B | LLaVA (↓) | 3.5032 | 3.5908 | 3.8716 | 3.3879 |
| | ScienceQA (↑) | 79.84 | 76.25 | 74.25 | 79.64 |

Comparison with Adaptive Rank Search Baselines

| Model | AdaLoRA | GeoLoRA | Full-rank | LoRA (⋆, 16) | LoRA (⋆, 32) | MARS |
|---|---|---|---|---|---|---|
| LLaVA Bench (perplexity ↓) | | | | | | |
| LLaVA-OV-0.5B | 2.8973 | 2.8801 | 2.7209 | 2.7336 | 2.7331 | 2.7188 |
| LLaVA-OV-7B | 2.5189 | 2.4888 | 2.2693 | 2.2317 | 2.4420 | 2.1875 |
| ScienceQA (accuracy % ↑) | | | | | | |
| LLaVA-OV-0.5B | 62.28 | 63.52 | 69.66 | 71.06 | 69.86 | 72.85 |
| LLaVA-OV-7B | 66.27 | 67.81 | 70.46 | 72.26 | 73.65 | 74.25 |

BibTeX

@article{cho2026mars,
    title={MARS: Harmonizing Multimodal Convergence via Adaptive Rank Search},
    author={Cho, Minkyoung and Jang, Insu and Jin, Shuowei and Zhao, Zesen and Jothi, Adityan and Can, Ethem F. and Chen, Min-Hung and Mao, Z. Morley},
    journal={arXiv preprint},
    year={2026}
}