Background. Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by
pre-training a large-scale vision-language model on robotic data. While this approach greatly enhances performance,
it also incurs significant training costs.
An intuitive idea. Can efficiency be improved while maintaining performance simply by shrinking the backbone? We illustrate this with the recent state-of-the-art OpenVLA-OFT method by comparing OpenVLA-7B+OFT, Prismatic-VLMs (LLaMA2-7B)+OFT, and Prismatic-VLMs (Qwen2.5-0.5B)+OFT; the results are shown in Table 1. They indicate that an effective paradigm for bridging VL to A is necessary to reduce the backbone scale while maintaining performance!
In this work. We investigate how to effectively bridge vision-language (VL) representations to the action (A) space. We then introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale vision-language models and extensive pre-training.
Performance. VLA-Adapter not only achieves state-of-the-art performance with only a 0.5B-parameter backbone, but also offers the fastest inference speed reported to date. Furthermore, VLA-Adapter enables training a powerful VLA model in just 8 hours on a single consumer-grade GPU, significantly lowering the barrier to deploying VLA models.
Brief Description. The Vision-Language Model (VLM) follows the Prismatic-VLMs architecture. At every layer, the Raw features (the VLM's vision and language representations) and the ActionQuery features (additional learnable tokens) are integrated with the corresponding-layer action latent via Bridge Attention. The degree to which Raw features are injected is learnable, which ensures both performance and training stability. The Policy network has the same number of layers as the VLM, and its parameters total only 97M (million) when the backbone is Qwen2.5-0.5B.
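To make the bridging concrete, below is a minimal PyTorch sketch of one Bridge Attention layer under our reading of the description above: the action latent attends to the same-depth ActionQuery features fully, while Raw features are injected through a learnable gate. The module and argument names (`BridgeAttentionLayer`, `raw_feats`, `query_feats`, `gate`) and the exact block layout are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class BridgeAttentionLayer(nn.Module):
    """One Policy layer conditioned on the same-depth VLM layer (sketch).

    ActionQuery features are injected fully; Raw (vision-language) features
    are injected to a learnable degree via a gate.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_query = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_raw = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable injection degree for Raw features (illustrative zero init).
        self.gate = nn.Parameter(torch.zeros(1))
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, action_latent, raw_feats, query_feats):
        # action_latent: (B, T, D) action tokens of this Policy layer
        # raw_feats:     (B, N_vl, D) Raw features from the matching VLM layer
        # query_feats:   (B, N_q, D) ActionQuery features from the matching VLM layer
        x = action_latent
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        # ActionQuery condition is injected fully.
        x = x + self.attn_query(x, query_feats, query_feats, need_weights=False)[0]
        # Raw condition is injected to a learnable, bounded degree.
        x = x + torch.tanh(self.gate) * self.attn_raw(x, raw_feats, raw_feats, need_weights=False)[0]
        return x + self.mlp(self.norm(x))
```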
- Question 1.1. Which Layer of Features within The VLM Is More Effective for The Policy Network?
- Question 1.2. Are The ActionQuery Features a Better Choice Than The Raw Features?
- Question 1.3. How Many ActionQueries Are Enough?
- Question 1.4. How Can The Policy Better Leverage The Conditions from The VLM?
· Key Finding 1. Regarding Raw features, the middle-layer latent performs better than the deep-layer latent. Deep-layer Raw features are biased towards semantic information and are less effective for action generation. Middle-layer Raw features effectively integrate image and text information, retain richer multimodal details, and facilitate action generation.
· Key Finding 2. Regarding ActionQuery features, the deep-layer latent performs better than other layers. Since ActionQuery is trained from scratch, its deep-layer features aggregate richer multimodal details and promote action generation more effectively than those of shallow layers.
· Key Finding 3. Multi-layer features perform better. We observed that using all-layer features outperforms any single layer. Not only does this improve performance, it also removes the need to search for the best layer during design (a sketch of gathering all-layer features follows this list).
· Key Finding 4. ActionQuery features generally outperform Raw features. The advantages of ActionQuery features are particularly evident when all layers are used, achieving a 2.0% higher success rate.
· Key Finding 5. Using too few ActionQueries weakens multimodal aggregation and makes learning harder for the Policy. Conversely, using too many ActionQueries introduces redundancy that interferes with performance. We selected 64, which provides a good balance between performance and efficiency.
· Key Finding 6. ActionQuery features can be fully injected, while Raw features require controlled injection. This result confirms that the learnable injection degree in the proposed Bridge Attention is effective.
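To make Key Findings 3 and 4 concrete, here is a rough sketch (referenced in Finding 3 above) of how all-layer Raw and ActionQuery features could be gathered from the VLM in a single forward pass and handed to the same-depth Policy layers. The Hugging Face-style `output_hidden_states` interface and the token layout (ActionQuery tokens appended after the vision and language tokens) are assumptions for illustration, not the released code.

```python
NUM_ACTION_QUERIES = 64  # per Key Finding 5

def collect_layerwise_conditions(vlm, inputs_embeds, attention_mask):
    """Run the VLM once and, for every transformer layer, split its hidden
    states into Raw features (vision + language tokens) and ActionQuery
    features (the appended learnable tokens)."""
    out = vlm(inputs_embeds=inputs_embeds,
              attention_mask=attention_mask,
              output_hidden_states=True)
    raw_feats, query_feats = [], []
    for h in out.hidden_states[1:]:  # skip the embedding-layer output
        raw_feats.append(h[:, :-NUM_ACTION_QUERIES, :])    # Raw condition for this layer
        query_feats.append(h[:, -NUM_ACTION_QUERIES:, :])  # ActionQuery condition for this layer
    return raw_feats, query_feats
```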
- Question 2.1. What Are The Advantages of The VLA-Adapter Compared to Other Bridge Paradigms?
Effectiveness. To validate the effectiveness of our bridge paradigm, we compare three backbones: B1: Prismatic-VLMs (Qwen2.5-0.5B); B2: Prismatic-VLMs (LLaMA2-7B); B3: OpenVLA-7B. The first two are not pre-trained on robotic data. We adopt the OpenVLA-OFT bridging approach as the comparison baseline.
VLA-Adapter remains effective when the backbone is frozen; only the ActionQuery and Policy are trained from scratch. SmolVLA is dedicated to studying the frozen-backbone setting, so we compare against it and OpenVLA-OFT.
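A minimal sketch of this frozen-backbone setup, assuming a PyTorch-style model with `vlm`, `action_query_embed`, and `policy` submodules (the names, optimizer, and learning rate are illustrative, not the released code):

```python
import torch

def configure_frozen_backbone(model):
    """Freeze the VLM backbone; train only the ActionQuery embeddings and the Policy."""
    model.vlm.requires_grad_(False)                 # backbone stays frozen
    model.action_query_embed.requires_grad_(True)   # learnable ActionQuery tokens
    model.policy.requires_grad_(True)               # Bridge-Attention Policy network
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)    # illustrative optimizer and learning rate
```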
· Conclusion 1. VLA-Adapter's improvement is obvious when the VLM has no robotic pre-training.
· Conclusion 2. Even when the backbone is frozen, VLA-Adapter still performs strongly.
We show keyframes from two example rollouts of the robot arm performing a task.
Each example shows the start and end RGB observations and proprioceptive (Proprio.) states.
Instruction: Put both the alphabet soup and the tomato sauce in the basket
@article{Wang2025VLAAdapter,
author = {Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
title = {VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
journal = {ArXiv},
year = {2025},
}