Abstract

Generalist robots should be able to understand and follow user instructions. Despite providing a powerful architecture for mapping open-vocabulary language instructions to robot actions, current vision-language-action (VLA) models struggle to follow fine-grained commands. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision-language models to create counterfactual labels. By augmenting existing datasets with these labels, we increase the diversity and granularity of language grounding for robot datasets, ultimately improving the language-following capabilities of VLAs. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting vision-language navigation experiments in 3 different indoor and outdoor environments. Our experiments show that counterfactual relabeling (without additional data collection) significantly improves instruction-following in VLA policies, outperforming state-of-the-art methods and doubling the success rate compared to VLAs trained on unaugmented data. We also evaluate our method for manipulation VLAs and find a similar gain in performance on tasks with distractors.

Approach

Counterfactual Augmentation with Synthetic Trajectories (CAST)

We propose CAST, Counterfactual Augmentation with Synthetic Trajectories, a method for both expanding and balancing the distribution of trajectories in a robotics dataset.

We observed that VLA policies often learn to ignore language, especially when training on sufficiently diverse language instructions, because biases in the data mean the policy only needs to attend to the observation to predict the correct actions during training. The goal of CAST is therefore to expand the distribution of instructions and actions at any given observation. We show that this forces the model to attend to language, producing a more steerable policy.
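The bias described above can be illustrated with a toy calculation. The sketch below is not from the paper: the observation names, instructions, and discrete action labels are all hypothetical, and `language_blind_accuracy` is an illustrative helper. It computes the best training accuracy available to a policy that sees only the observation. When each observation appears with a single action, that policy is already perfect and language carries no training signal; adding counterfactual instruction-action pairs at the same observations removes the shortcut.

```python
from collections import Counter, defaultdict


def language_blind_accuracy(samples):
    """Best achievable accuracy for a policy that sees only the observation.

    Each sample is (observation, instruction, action). A language-blind
    policy can do no better than predicting the most common action
    recorded for each observation.
    """
    by_obs = defaultdict(Counter)
    for obs, _instruction, action in samples:
        by_obs[obs][action] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_obs.values())
    return correct / len(samples)


# Hypothetical unaugmented data: one instruction and one action per observation.
original = [
    ("hallway", "go down the hallway", "forward"),
    ("doorway", "go through the door", "forward"),
    ("corner", "turn at the corner", "left"),
]

# Hypothetical counterfactual labels for the same observations.
counterfactual = [
    ("hallway", "move toward the left wall", "left"),
    ("hallway", "move toward the right wall", "right"),
    ("doorway", "stop before the door", "stop"),
    ("corner", "turn right at the corner", "right"),
    ("corner", "keep going straight", "forward"),
]

print(language_blind_accuracy(original))                   # 1.0
print(language_blind_accuracy(original + counterfactual))  # 0.375
```

In this toy setting the language-blind policy drops from perfect accuracy to 3 of 8 samples once counterfactuals are added, so the instruction becomes necessary to predict the action.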

CAST takes an uncurated robot dataset and queries a VLM for counterfactual instructions that could have been executed at different points in the existing trajectories. The atomic command that aligns with each instruction is also determined and used to generate an action chunk. We then append these counterfactual actions to the existing trajectory, creating the CAST dataset, which consists of both the original data and the counterfactual trajectory-instruction pairs. We use the CAST dataset to finetune PaliGemma-3B with the typical VLA recipe, resulting in navigation and manipulation VLAs that better follow language instructions.
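The relabeling loop described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Sample` container, the `ATOMIC_COMMANDS` table and its velocity values, the constant-velocity `action_chunk`, and the `stub_vlm` stand-in for a real VLM query are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Action = Tuple[float, float]  # assumed (linear, angular) velocity

# Hypothetical atomic commands; the velocity values are illustrative only.
ATOMIC_COMMANDS = {
    "turn left": (0.3, 0.5),
    "turn right": (0.3, -0.5),
    "go forward": (0.5, 0.0),
    "stop": (0.0, 0.0),
}


@dataclass
class Sample:
    observation: str            # stand-in for a camera image
    instruction: str
    actions: List[Action]
    counterfactual: bool = False


def action_chunk(command: str, horizon: int = 4) -> List[Action]:
    """Assumed chunk generator: repeat the command's velocity for `horizon` steps."""
    return [ATOMIC_COMMANDS[command]] * horizon


def cast_augment(
    dataset: List[Sample],
    propose: Callable[[str], List[Tuple[str, str]]],
    horizon: int = 4,
) -> List[Sample]:
    """Return the original samples plus counterfactual instruction-action pairs.

    `propose` maps an observation to (instruction, atomic command) pairs,
    standing in for the VLM query.
    """
    augmented = list(dataset)
    for sample in dataset:
        for instruction, command in propose(sample.observation):
            if command not in ATOMIC_COMMANDS:
                continue  # skip proposals without a known atomic command
            augmented.append(
                Sample(sample.observation, instruction,
                       action_chunk(command, horizon), counterfactual=True)
            )
    return augmented


def stub_vlm(observation: str) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for a real VLM call; returns fixed proposals."""
    return [
        ("move to the left of the chair", "turn left"),
        ("move to the right of the chair", "turn right"),
    ]


original = [Sample("hallway_with_chair.jpg", "go down the hallway",
                   action_chunk("go forward"))]
cast_dataset = cast_augment(original, stub_vlm)
print(len(cast_dataset))  # 3
```

Keeping the original samples alongside the counterfactual ones mirrors the description of the CAST dataset as the union of both; how action chunks are actually produced from atomic commands is left to the paper.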

Navigation Experiments

Object Navigation

In object navigation tasks, the policy must recognize and move towards a target object. CAST increases the coverage of object reaching behavior in the general navigation dataset we train CounterfactualVLA on, resulting in successful object navigation.

Referential Navigation

In referential navigation tasks, the policy must recognize a target object or structure in the scene and understand how to move with reference to that object, specifically moving to the "right" or "left". By explicitly generating counterfactual trajectories that reflect these referential instructions, CounterfactualVLA learns to differentiate these behaviors and perform them accordingly.

In continuous navigation tasks, the policy must recognize a target structure and move with respect to that structure for a sustained amount of time. CounterfactualVLA demonstrates the ability to perform these continuous behaviors in scenes where the training data is typically biased toward moving down the center of hallways or sticking to more open areas.

Chained Instructions

In these examples, the policy is provided instructions in sequence. It must execute each step correctly to move on to the next step.

Results

By training CounterfactualVLA with the CAST dataset, we can match or outperform state-of-the-art methods for open-set language following. We show results for navigation (left) and manipulation (right) VLAs trained with CAST (CounterfactualVLA) and the standard VLA trained on unaugmented data.

We also compare to alternative architectures, which demonstrates that a high-capacity model is necessary to fully reap the benefits of diverse language labels.

We also evaluate the CAST dataset itself, asking human judges to evaluate the correctness of the objects, motion, and grounding of the hindsight and counterfactual labels.

@misc{glossop2025castcounterfactuallabelsimprove,
  title={CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models},
  author={Catherine Glossop and William Chen and Arjun Bhorkar and Dhruv Shah and Sergey Levine},
  year={2025},
  eprint={2508.13446},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2508.13446},
}