Latent Action Learning Requires Supervision in the Presence of Distractors

ICML 2025
1AIRI, 2MIPT, 3Skoltech, 4Innopolis University, 5Accenture, 6ETH Zurich

TL;DR

Despite its great promise, existing latent action learning approaches rely on distractor-free data, where changes between observations are primarily explained by the ground-truth actions. Unfortunately, real-world videos contain many activities that are unrelated to the task and cannot be controlled by the agent. We study how action-correlated distractors (dynamic backgrounds, camera shake, color changes, etc.) affect latent action learning and show that:

  • Latent action learning fails to recover useful latent actions when distractors are present.
  • With simple modifications (multi-step IDM, no quantization, latent-space FDM, augmentations), we can improve latent action quality by 2x.
  • But architectural changes alone are not enough. Without supervision, Latent Action Models (LAMs) cannot distinguish control-related motion from distractors.
  • Supervision with ground-truth actions for as little as 2.5% of the full dataset during latent action model pre-training boosts downstream performance by 4x.
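The small-budget supervision above can be sketched as an auxiliary loss computed only on the labeled subset of the data. Everything here (the function name, the MSE objective, plain numpy arrays in place of network tensors) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def supervised_latent_loss(latent_actions, true_actions, labeled_mask,
                           W_probe, lam=1.0):
    """Auxiliary supervision term: linearly predict ground-truth actions
    from latent actions, computed only on the small labeled subset
    (e.g. ~2.5% of the dataset). `W_probe` is the linear probe's weight
    matrix; all names and the MSE form are illustrative assumptions."""
    if not labeled_mask.any():
        # No labels in this batch: supervision contributes nothing
        return 0.0
    pred = latent_actions[labeled_mask] @ W_probe
    err = pred - true_actions[labeled_mask]
    return lam * float(np.mean(err ** 2))
```

In practice this term would be added to the unsupervised latent action objective, so the few available labels steer the latent space toward control-related motion.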

Our findings highlight a major limitation of existing latent action learning methods.
Latent action learning with supervision outperforms baselines across multiple environments and ground-truth action budgets.

Improving Quality

Ablation of the individual effect of each proposed change in LAOM, our modification of LAPO. Together, these changes improve latent action quality in the presence of distractors by 8x, almost closing the gap with the distractor-free setting.

Supervision is Necessary

Our improvements partially transfer to downstream performance: LAOM outperforms vanilla LAPO, improving performance by up to 2x. However, a large gap remains between final performance with and without distractors.


Method

Simplified architecture visualization of LAPO and LAOM, our proposed modification. LAPO consists of an IDM and an FDM, each with a separate encoder; it quantizes latent actions and predicts the next observation in image space. LAOM incorporates a multi-step IDM, removes quantization, and does not reconstruct images, relying instead on a latent temporal consistency loss. Images are encoded by a shared encoder, while the IDM and FDM operate in a compact latent space. When a small number of ground-truth action labels is available, we use them for supervision, linearly predicting true actions from latent actions.
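A minimal PyTorch-style sketch of the LAOM design described above. The MLPs stand in for the real image encoder and dynamics models, the dimensions and window length `k` are illustrative, and the detached target is a stand-in for the latent temporal consistency loss; none of this reproduces the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LAOM(nn.Module):
    """Sketch: shared encoder, multi-step IDM, latent-space FDM.
    obs_dim, latent_dim, act_dim, and k are illustrative hyperparameters."""
    def __init__(self, obs_dim=64, latent_dim=32, act_dim=8, k=4):
        super().__init__()
        self.k = k
        # Shared encoder used by both IDM and FDM (unlike LAPO's separate ones)
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Multi-step IDM: infers a latent action from a window of k+1 states
        self.idm = nn.Sequential(nn.Linear(latent_dim * (k + 1), 128), nn.ReLU(),
                                 nn.Linear(128, act_dim))
        # Latent-space FDM: predicts the next latent state, no pixel decoding
        self.fdm = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))

    def forward(self, obs_window):
        # obs_window: (batch, k+1, obs_dim) raw observations
        z = self.encoder(obs_window)             # (batch, k+1, latent_dim)
        latent_action = self.idm(z.flatten(1))   # no quantization applied
        z_next_pred = self.fdm(torch.cat([z[:, -2], latent_action], dim=-1))
        # Detached encoding of the last frame serves as the consistency target
        return latent_action, z_next_pred, z[:, -1].detach()
```

Training would minimize the distance between `z_next_pred` and the detached target, plus the supervised term when a few ground-truth action labels are available.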


Training Pipeline

Latent Action Model (LAM) is pre-trained to infer latent actions. It is then used to relabel the entire dataset with latent actions, which are subsequently used for behavioral cloning. Finally, a decoder is trained to map from latent to true actions. We do not modify this pipeline; we only examine the LAM architecture itself.
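The four-stage pipeline above can be sketched end to end with linear stand-ins for each learned component: a random projection of observation deltas plays the role of the pre-trained LAM, and least squares replaces network training. All names, sizes, and the 10% labeled fraction are illustrative scaffolding, not the actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

def lstsq_fit(X, Y):
    # Least-squares linear map X -> Y (stand-in for training a network)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Toy data: observation transitions and mostly-unused ground-truth actions
obs = rng.normal(size=(500, 16))
next_obs = obs + 0.1 * rng.normal(size=(500, 16))
true_actions = rng.normal(size=(500, 4))

# Stage 1 (stand-in): a "pre-trained LAM" mapping transitions to latent actions
P = rng.normal(size=(16, 8))
def lam(o, o_next):
    return (o_next - o) @ P

# Stage 2: relabel the entire dataset with latent actions
latent_actions = lam(obs, next_obs)

# Stage 3: behavioral cloning on latent actions (linear policy stand-in)
W_policy = lstsq_fit(obs, latent_actions)

# Stage 4: train a decoder from latent to true actions on a labeled subset
labeled = rng.random(500) < 0.1
W_dec = lstsq_fit(latent_actions[labeled], true_actions[labeled])

def act(o):
    # Deploy: policy outputs latent actions, decoder maps them to true actions
    return (o @ W_policy) @ W_dec
```

The point of the sketch is the data flow, not the models: only stage 4 (and, with supervision, stage 1) ever touches ground-truth actions.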

Environments

We collect datasets using the Distracting Control Suite (DCS), which adds dynamic background videos, camera shake, and agent color changes as distractors. We collect datasets of five thousand trajectories for four tasks.

Citation


@article{nikulin2025latent,
  title={Latent Action Learning Requires Supervision in the Presence of Distractors},
  author={Nikulin, Alexander and Zisman, Ilya and Tarasov, Denis and Lyubaykin, Nikita and Polubarov, Andrei and Kiselev, Igor and Kurenkov, Vladislav},
  journal={arXiv preprint arXiv:2502.00379},
  year={2025}
}