We propose CAST, Counterfactual Augmentation with Synthetic Trajectories, a method for both expanding and balancing the distribution of trajectories in a robot learning dataset.
We observe that VLA policies often learn to ignore language, even when trained on diverse language instructions: biases in the data mean the policy only needs to attend to the observation to predict the correct actions during training. The goal of CAST is therefore to expand the distribution of instructions and actions available at any given observation. We show that this forces the model to attend to language, producing a more steerable policy.
CAST takes an uncurated robot dataset and queries a VLM for counterfactual instructions that could have been executed at different points in the existing trajectories. For each instruction, we also determine the atomic command it aligns with and use that command to generate an action chunk. Appending these counterfactual actions to the existing trajectories yields the CAST dataset, which consists of both the original data and the counterfactual trajectory-instruction pairs. We use the CAST dataset to finetune PaliGemma-3B with the typical VLA recipe.
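The augmentation loop above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the VLM call is stubbed out, and all names (`query_vlm_counterfactuals`, `ATOMIC_ACTIONS`, the velocity-tuple action format) are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical atomic commands mapped to short action chunks,
# here sequences of (linear, angular) velocity commands.
ATOMIC_ACTIONS = {
    "turn left":   [(0.2,  0.5)] * 4,
    "turn right":  [(0.2, -0.5)] * 4,
    "go straight": [(0.5,  0.0)] * 4,
}

@dataclass
class Trajectory:
    observations: list
    actions: list
    instruction: str

def query_vlm_counterfactuals(observation):
    """Stand-in for the VLM query: returns (instruction, atomic command)
    pairs that could plausibly have been executed at this observation."""
    return [("head toward the red door on your left", "turn left")]

def augment(dataset):
    """Build the CAST dataset: the original trajectories plus
    counterfactual instruction / action-chunk pairs."""
    cast_dataset = list(dataset)  # keep the original data
    for traj in dataset:
        for t, obs in enumerate(traj.observations):
            for instr, atomic in query_vlm_counterfactuals(obs):
                chunk = ATOMIC_ACTIONS[atomic]
                # Prefix of the original trajectory up to time t,
                # followed by the counterfactual action chunk.
                cast_dataset.append(Trajectory(
                    observations=traj.observations[: t + 1],
                    actions=traj.actions[:t] + chunk,
                    instruction=instr,
                ))
    return cast_dataset
```

The resulting dataset pairs each counterfactual instruction with an observation prefix and a matching action chunk, so the same observation now appears under multiple instructions with different continuations.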
In object navigation tasks, the policy must recognize and move towards a target object. CAST increases the coverage of object reaching behavior in the general navigation dataset we train CounterfactualVLA on, resulting in successful object navigation.
In referential navigation tasks, the policy must recognize a target object or structure in the scene and understand how to move with reference to that object, specifically moving to its "right" or "left". By explicitly generating counterfactual trajectories that reflect these referential instructions, CounterfactualVLA learns to differentiate these behaviors and perform them accordingly.
In continuous navigation tasks, the policy must recognize a target structure and move with respect to it for a sustained amount of time. CounterfactualVLA demonstrates the ability to perform these continuous behaviors in scenes where the training data is typically biased toward moving down the center of hallways or sticking to more open areas.
In these examples, the policy is provided instructions in sequence. It must execute each step correctly to move on to the next step.
By training CounterfactualVLA with the CAST dataset, we can match and outperform state-of-the-art methods for open-set language following.
We also compare to alternative architectures, demonstrating that a high-capacity model is necessary to fully reap the benefits of diverse language labels.
@misc{glossop2025castcounterfactuallabelsimprove,
  title={CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models},
  author={Catherine Glossop and William Chen and Arjun Bhorkar and Dhruv Shah and Sergey Levine},
  year={2025},
  eprint={2508.13446},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2508.13446},
}