Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

Abstract

Generalist robots should be able to understand and follow user instructions, but current vision-language-action (VLA) models struggle to follow fine-grained commands despite providing a powerful architecture for mapping open-vocabulary natural language instructions to robot actions. We attribute this to a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method that augments existing robot datasets by leveraging vision-language models to create counterfactual labels. By generating counterfactual language and actions, our method increases the diversity and granularity of language grounding in robot datasets, improving the language-following capabilities of VLAs.

Approach

Counterfactual Augmentation with Synthetic Trajectories (CAST)

We propose CAST, Counterfactual Augmentation with Synthetic Trajectories, a method for both expanding and balancing the distribution of trajectories in a robot dataset.

CAST Method

We observed that VLA policies often learn to ignore language, even when trained on sufficiently diverse language instructions, because biases in the data allow the policy to predict the correct actions from the observation alone during training. The goal of CAST is therefore to expand the distribution of instructions and actions available at any given observation. We show that this forces the model to attend to language, producing a more steerable policy.

CAST takes an uncurated robot dataset and queries a VLM for counterfactual instructions that could have been executed at different points in the existing trajectories. The atomic command that aligns with each instruction is also determined and used to generate an action chunk. We then append these counterfactual actions to the existing trajectory, creating the CAST dataset, which consists of both the original data and these counterfactual trajectory-instruction pairs. We use the CAST dataset to finetune PaliGemma-3B with the typical VLA recipe.
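As a rough illustration, the Python sketch below shows one way such an augmentation loop could be structured. The Trajectory layout and the functions query_vlm_for_counterfactuals and atomic_command_to_action_chunk are hypothetical placeholders for the VLM prompting and action-generation steps described above, not the actual implementation.

# Hypothetical sketch of the CAST augmentation loop; the VLM query and the
# atomic-command-to-action mapping are placeholder stubs, not the real pipeline.
from dataclasses import dataclass

@dataclass
class Trajectory:
    observations: list   # e.g. camera frames along the trajectory
    actions: list        # low-level actions, one per observation
    instruction: str     # language annotation for the trajectory

def query_vlm_for_counterfactuals(observation):
    """Placeholder: prompt a VLM with the observation and ask which other
    instructions could plausibly have been executed from this state.
    Returns (instruction, atomic_command) pairs."""
    raise NotImplementedError

def atomic_command_to_action_chunk(atomic_command):
    """Placeholder: map an atomic command (e.g. 'turn left') to a short
    chunk of low-level actions."""
    raise NotImplementedError

def build_cast_dataset(trajectories):
    """Combine the original trajectories with counterfactual instruction-action pairs."""
    cast_dataset = list(trajectories)  # keep the original data
    for traj in trajectories:
        for t, obs in enumerate(traj.observations):
            for instruction, atomic_command in query_vlm_for_counterfactuals(obs):
                # Splice a counterfactual action chunk onto the trajectory prefix.
                cast_dataset.append(Trajectory(
                    observations=traj.observations[: t + 1],
                    actions=traj.actions[:t] + atomic_command_to_action_chunk(atomic_command),
                    instruction=instruction,
                ))
    return cast_dataset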

Experiments

Object Navigation

In object navigation tasks, the policy must recognize and move toward a target object. CAST increases the coverage of object-reaching behavior in the general navigation dataset we train CounterfactualVLA on, resulting in successful object navigation.

Referential Navigation

In referential navigation tasks, the policy must recognize a target object or structure in the scene and understand how to move with reference to that object, specifically moving to the "right" or "left". By explicitly generating counterfactual trajectories that reflect these referential instructions, CounterfactualVLA learns to differentiate these behaviors and perform them accordingly.

Continuous Navigation

In continuous navigation tasks, the policy must recognize a target structure and move with respect to that structure for a sustained period of time. CounterfactualVLA demonstrates the ability to perform these continuous behaviors in scenes where the training data is typically biased toward moving down the center of hallways or sticking to more open areas.

Chained Instructions

In chained instruction tasks, the policy is provided instructions in sequence and must execute each step correctly to move on to the next.
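The sketch below shows one way such a sequential evaluation could be run. The policy.act interface, the per-step completion signal, and the step budget are assumptions made for illustration, not part of the released system.

# Minimal sketch of chained-instruction execution: the policy receives one
# instruction at a time and only advances once the current step succeeds.
def run_chained_instructions(policy, env, instructions, max_steps_per_instruction=200):
    obs = env.reset()
    for instruction in instructions:
        for _ in range(max_steps_per_instruction):
            action = policy.act(obs, instruction)  # condition the policy on the current instruction
            obs, step_done = env.step(action)
            if step_done:                          # current step completed; advance to the next instruction
                break
        else:
            return False                           # step budget exhausted; the chain fails here
    return True                                    # every instruction executed in order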

Results

By training CounterfactualVLA with the CAST dataset, we can match or outperform state-of-the-art methods for open-set language following.


We also compare to alternative architectures, demonstrating that a high-capacity model is necessary to fully reap the benefits of diverse language labels.

BibTeX
@misc{glossop2025castcounterfactuallabelsimprove,
      title={CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models},
      author={Catherine Glossop and William Chen and Arjun Bhorkar and Dhruv Shah and Sergey Levine},
      year={2025},
      eprint={2508.13446},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.13446},
}

The website (source code) design was adapted from Nerfies.