VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Yihao Wang1,2,4,*,♢ Pengxiang Ding2,3,4,*,† Lingxiao Li1,4,5 Can Cui2,4 Zirui Ge3,4
Xinyang Tong2,4 Wenxuan Song4,6 Han Zhao2,3,4 Wei Zhao2,4 Pengxu Hou6
Siteng Huang2 Yifan Tang1 Wenhui Wang1 Ru Zhang1,✉ Jianyi Liu1 Donglin Wang2,✉
1 Beijing University of Posts and Telecommunications 2 Westlake University 3 Zhejiang University
4 OpenHelix Team 5 State Key Laboratory of Networking and Switching Technology 6 The Hong Kong University of Science and Technology (Guangzhou)
* Equal Contribution: yh-wang@bupt.edu.cn; dingpx2015@gmail.com
✉ Corresponding Author. † Project Lead. ♢ Work done during an internship at Westlake University.

Why Propose VLA-Adapter?

Background.   Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale vision-language model on robotic data. While this approach greatly enhances performance, it also incurs significant training costs.

An intuitive idea.   Can efficiency be improved while maintaining performance simply by shrinking the backbone? We illustrate this using the recent SOTA method OpenVLA-OFT, comparing OpenVLA-7B+OFT, Prismatic-VLMs (LLaMA2-7B)+OFT, and Prismatic-VLMs (Qwen2.5-0.5B)+OFT. The results are shown in Table 1: simply shrinking the backbone is not enough on its own. So, an effective paradigm for bridging VL to A is needed to reduce the backbone scale while maintaining performance!

Table 1. Impact of different backbones on OpenVLA-OFT performance.
"Success rate" is the performance on the LIBERO-Long benchmark.

In this work.   We investigate how to effectively bridge vision-language (VL) representations to the action (A) space. We then introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale vision-language models and extensive pre-training.

Performance.   VLA-Adapter not only achieves state-of-the-art performance with only a 0.5B-parameter backbone, but also offers the fastest inference speed reported to date. Furthermore, VLA-Adapter enables training a powerful VLA model in just 8 hours on a single consumer-grade GPU, significantly lowering the barrier to deploying VLA models.

Figure 1. Characteristics of VLA-Adapter. "↓" indicates that smaller values are better,
and vice versa. "Performance" is the average success rate on the four LIBERO suites.

1. Pipeline

Brief Description.   The Vision-Language Model (VLM) follows the Prismatic-VLMs architecture. At each layer, the Raw features (the vision and language representations inside the VLM) and the ActionQuery features (additional learnable tokens) are integrated with the corresponding-layer action latent via Bridge Attention. The degree to which Raw features are injected is learnable, which ensures both performance and training stability. The Policy network has the same number of layers as the VLM, and its parameters total only 97M (million) when the backbone is Qwen2.5-0.5B.

Figure 2. The pipeline of VLA-Adapter.
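
To make the per-layer bridging concrete, the following is a minimal PyTorch-style sketch of one Bridge Attention layer. It assumes hypothetical module names, tensor shapes, and a tanh-gated scalar for the learnable injection degree; the actual implementation may arrange projections, normalization, and residuals differently.

    import torch
    import torch.nn as nn

    class BridgeAttentionLayer(nn.Module):
        """Sketch of one Policy layer conditioned on the matching VLM layer.

        The action latent attends to the Raw features and the ActionQuery
        features of the same VLM layer, plus itself. The Raw-feature
        contribution is scaled by a learnable gate, while the ActionQuery
        contribution is injected fully (see Key Finding 6).
        """

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.cross_raw = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.cross_aq = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.raw_gate = nn.Parameter(torch.zeros(1))  # learnable injection degree
            self.norm = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, action_latent, raw_feat, aq_feat):
            # action_latent: (B, T, dim)    action tokens of this Policy layer
            # raw_feat:      (B, N_vl, dim) vision-language tokens from the same VLM layer
            # aq_feat:       (B, N_q, dim)  ActionQuery tokens from the same VLM layer
            h = action_latent
            raw_out, _ = self.cross_raw(h, raw_feat, raw_feat)
            aq_out, _ = self.cross_aq(h, aq_feat, aq_feat)
            self_out, _ = self.self_attn(h, h, h)
            # ActionQuery features are fully injected; Raw features are gated.
            h = h + aq_out + torch.tanh(self.raw_gate) * raw_out + self_out
            return h + self.ffn(self.norm(h))

The Policy stacks one such layer per VLM layer, so each Policy layer consumes the conditions produced at the corresponding depth of the backbone.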

2. Questions & Key Findings

2.1 About Conditions for Policy

- Question 1.1. Which Layer of Features within The VLM Is More Effective for The Policy Network?
- Question 1.2. Are The ActionQuery Features a Better Choice Than The Raw Features?
- Question 1.3. How Many ActionQueries Are Enough?
- Question 1.4. How Can The Policy Better Leverage The Conditions from The VLM?

Figure 3. The key component is the exploration of effective conditions. Four conditions regarding "layer" and "type" are shown
on the right. "Attention" includes cross-attention with the conditions and self-attention with itself. Since the four kinds of
conditions here are not the complete VLA-Adapter conditions, the "Attention" here is not exactly "Bridge Attention".

- To answer Question 1.1:

   · Key Finding 1. Regarding Raw features, the middle-layer latent performs better than the deep-layer latent. Deep-layer Raw features are biased towards semantic information and are less effective for action generation. The middle-layer Raw features effectively integrate image and text information, retain richer multimodal details, and facilitate action generation.

   · Key Finding 2. Regarding ActionQuery features, the deep-layer latent performs better than the other layers. Since the ActionQuery is trained from scratch, its deep-layer features aggregate richer multimodal information and promote action generation more effectively than the shallow layers.

Figure 4. Comparison of four conditions in VLA-Adapter on LIBERO-Long. The blue and green lines are single-layer Raw features and ActionQuery features, as in Figures 3(a) and 3(b). The blue and green columns are all-layer
Raw features and ActionQuery features, as in Figures 3(c) and 3(d).

   · Key Finding 3. Multi-layer features perform better. We observed that using all-layer features outperforms any single layer. This not only improves performance but also saves the effort of selecting the best layer during design. (A sketch of collecting all-layer conditions follows below.)
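
The sketch below illustrates how all-layer conditions could be collected, assuming a Hugging Face Qwen2.5-0.5B backbone and hypothetical variable names; the real Prismatic-VLMs pipeline also involves a vision encoder and projector, which are omitted here.

    import torch
    import torch.nn as nn
    from transformers import AutoModelForCausalLM

    # Assumed setup (not the official code): a Qwen2.5-0.5B language backbone and
    # 64 learnable ActionQuery tokens appended after the vision-language tokens.
    backbone = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
    hidden = backbone.config.hidden_size
    num_queries = 64  # chosen per Key Finding 5
    action_query = nn.Parameter(torch.randn(1, num_queries, hidden) * 0.02)

    def collect_conditions(vl_embeds: torch.Tensor):
        """vl_embeds: (B, N_vl, hidden) multimodal input embeddings."""
        B = vl_embeds.size(0)
        inputs = torch.cat([vl_embeds, action_query.expand(B, -1, -1)], dim=1)
        out = backbone(inputs_embeds=inputs, output_hidden_states=True)
        # out.hidden_states is a tuple with one entry per layer (plus the embeddings).
        raw_feats = [h[:, :-num_queries] for h in out.hidden_states[1:]]
        aq_feats = [h[:, -num_queries:] for h in out.hidden_states[1:]]
        return raw_feats, aq_feats  # one (Raw, ActionQuery) pair per VLM layer

Each per-layer (Raw, ActionQuery) pair is then passed to the Bridge Attention layer of the same depth, as sketched in Section 1.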

- To answer Question 1.2:

   · Key Finding 4. ActionQuery features generally outperform Raw features. The advantages of ActionQuery features are particularly evident when all layers are used, achieving a 2.0% higher success rate.

- To answer Question 1.3:

   · Key Finding 5. Using too few ActionQuery tokens weakens multimodal aggregation and makes the Policy's task more challenging. Conversely, using too many ActionQuery tokens introduces redundancy that interferes with performance. We selected 64, which provides a good balance between performance and efficiency.

Figure 5. Comparison of different numbers of ActionQuery tokens. The blue line shows the results of using only the last-layer ActionQuery feature. Red stars show the results of the full VLA-Adapter with 64 and 256 ActionQuery tokens.

- To answer Question 1.4:

   · Key Finding 6. ActionQuery features can be fully injected, while Raw features require controlled injection. This result confirms that the learnable injection degree in the proposed Bridge Attention is effective.

Table 2. Ablation of the different injection degrees.

2.2 About Performance of The Proposed Bridge Paradigm

- Question 2.1. What Are The Advantages of The VLA-Adapter Compared to Other Bridge Paradigms?

Effectiveness.   To validate the effectiveness of our bridge paradigm, we compare three backbones: B1: Prismatic-VLMs (Qwen2.5-0.5B); B2: Prismatic-VLMs (LLaMA2-7B); B3: OpenVLA-7B. The first two are not pre-trained on robotic data. We adopt the OpenVLA-OFT bridging approach for comparison.

Table 3. Effectiveness comparison with OpenVLA-OFT on LIBERO-Long. "Fine-tuned" means fine-tuned
with LoRA. Bold represents the best performance. ∆ is the increment.

VLA-Adapter remains effective even when the backbone is frozen; only the ActionQuery tokens and the Policy are trained from scratch. SmolVLA is dedicated to studying the frozen-backbone setting, so we compare with it and with OpenVLA-OFT.
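
The following minimal sketch shows this frozen-backbone setting, reusing the placeholder names from the earlier sketches (backbone, action_query, BridgeAttentionLayer); the Policy construction, optimizer, and learning rate are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Hypothetical Policy: one Bridge Attention layer per VLM layer.
    policy = nn.ModuleList(
        BridgeAttentionLayer(dim=backbone.config.hidden_size)
        for _ in range(backbone.config.num_hidden_layers)
    )

    # Freeze every backbone parameter; train only ActionQuery tokens and the Policy.
    for p in backbone.parameters():
        p.requires_grad_(False)

    trainable = [action_query] + list(policy.parameters())
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # illustrative hyperparameters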

Table 4. Effectiveness comparison when the backbone is frozen. OpenVLA-OFT
does not work when the VLM is frozen. Examples are shown in Section 3.4.

    · Conclusion 1. VLA-Adapter's improvement is obvious for VLMs without robotic pre-training.
    · Conclusion 2. Even when the backbone is frozen, VLA-Adapter still performs strongly.

3. Results

3.1 Numerical Comparison on The Different Benchmarks

3.1.1 LIBERO Benchmark

Table 5. Comparison on the LIBERO benchmark. Bold represents the best performance. Italics* represents
suboptimal performance. † denotes non-VLM-based baselines. "Scratch" denotes work without
pre-training on robotic data. "Params" is the backbone scale in billions. The performance
of GR00T N1 is obtained by full-parameter fine-tuning.

3.1.2 CALVIN ABC→D Benchmark

Table 6. Comparison on the CALVIN ABC→D benchmark. Bold represents the best performance. Italics*
represents suboptimal performance. † denotes non-VLM-based methods. "Params" is the
backbone scale in billions.

3.1.3 Throughput Comparison

Table 7. Inference efficiency comparison with OpenVLA and OpenVLA-OFT. The action chunk size is 8, consistent with most VLAs. "OpenVLA-OFT (w/o Xg, P)" is the L1-based version whose input excludes
the gripper image and proprioceptive state. It is the fastest version of OpenVLA-OFT.

3.2 Execution Examples on The Different Benchmarks

3.2.1 LIBERO-Spatial (Avg. Success Rate: 97.8%)

Pick up the black bowl between the plate and the ramekin and place it on the plate
Pick up the black bowl next to the ramekin and place it on the plate
Pick up the black bowl from table center and place it on the plate
Pick up the black bowl on the cookie box and place it on the plate
Pick up the black bowl in the top drawer of the wooden cabinet and place it on the plate
Pick up the black bowl on the stove and place it on the plate
Pick up the black bowl next to the plate and place it on the plate

3.2.2 LIBERO-Object (Avg. Success Rate: 99.2%)

Pick up the alphabet soup and place it in the basket
Pick up the salad dressing and place it in the basket
Pick up the bbq sauce and place it in the basket
Pick up the ketchup and place it in the basket
Pick up the tomato sauce and place it in the basket
Pick up the milk and place it in the basket
Pick up the chocolate pudding and place it in the basket

3.2.3 LIBERO-Goal (Avg. Success Rate: 97.2%)

Open the middle drawer of the cabinet
Put the bowl on the stove
Put the wine bottle on top of the cabinet
Open the top drawer and put the bowl inside
Put the bowl on top of the cabinet
Push the plate to the front of the stove
Put the wine bottle on the rack

3.2.4 LIBERO-Long (Avg. Success Rate: 95.0%)

Put both the cream cheese box and the butter in the basket
Put the black bowl in the bottom drawer of the cabinet and close it
Turn on the stove and put the moka pot on it
Put the white mug on the left plate and put the yellow and white mug on the right plate
Put both moka pots on the stove
Put both the alphabet soup and the cream cheese box in the basket
Put the yellow and white mug in the microwave and close it

3.2.5 CALVIN ABC→D (Avg. length: 4.42)

Open drawer➝Lift pink block table➝Place in slider➝Turn on lightbulb➝Rotate blue block left
Lift red block table➝Place in drawer➝Rotate blue block right➝Lift pink block slider➝Stack block
Move slider right➝Turn on lightbulb➝Push pink block left➝Open drawer➝Push into drawer
Push pink block right➝Lift red block table➝Stack block➝Move slider left➝Unstack block
Turn off led➝Close drawer➝Move slider left➝Push pink block right➝Lift red block slider
Push into drawer➝Close drawer➝Move slider right➝Lift pink block slider➝Place in slider
Turn off lightbulb➝Move slider left➝Push blue block left➝Lift pink block slider➝Stack block

3.3 Keyframe Examples

We give two keyframe examples of the robot arm performing a task. You can drag the progress bar at the bottom middle to view them.

Example 1: Start RGB observation and Proprio. state ➝ End RGB observation and Proprio. state

Example 2: Start RGB observation and Proprio. state ➝ End RGB observation and Proprio. state

3.4 Frozen VLM Backbone

Fortunately, VLA-Adapter remains effective when the backbone is frozen. Only the ActionQuery latent and Policy are trained from scratch.

Instruction: Put both the alphabet soup and the tomato sauce in the basket

OpenVLA-OFT: failure (Avg. Success Rate on LIBERO-Long: 0.0%)
VLA-Adapter: success (Avg. Success Rate on LIBERO-Long: 86.4%)

BibTeX

      @article{Wang2025VLAAdapter,
        author  = {Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
        title   = {VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
        journal = {ArXiv},
        year    = {2025},
      }