Resolving State Ambiguity in
Robot Manipulation
via Adaptive Working Memory Recoding

Qingda Hu1, Ziheng Qiu1, Zijun Xu1, Kaizhao Zhang1, Xizhou Bu1, Zuolei Sun2, Bo Zhang2, Jieru Zhao3, Zhongxue Gan1, and Wenchao Ding1
1 Fudan Academy for Engineering and Technology, Fudan University 2 Westwell
3 Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
Email: qdhu24@m.fudan.edu.cn, dingwenchao@fudan.edu.cn
Teaser Image

Left: State ambiguity is common in robotic manipulation, and the relevant state transitions violate the Markov assumption, so a long history window is essential. Here we illustrate a representative example of state ambiguity and its various scenarios.
Right: Existing methods versus our method for encoding long histories. Analogous to the continuous nature of human reasoning, PAM's inference is temporally dependent: it extends the history window by maintaining a long-term adaptive working memory, so the model only needs to encode the current observation at each inference step.
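To make this temporal dependence concrete, below is a minimal sketch of such an inference loop with a bounded working memory. All module names (encode_frame, route_context, decode_action) and the placeholder bodies are our illustrative stand-ins, not PAM's actual interfaces.

from collections import deque

import torch

MEMORY_LEN = 300  # history window in frames (roughly 10 s)

def encode_frame(obs: torch.Tensor) -> torch.Tensor:
    # Placeholder frame encoder; PAM's extractor recodes multimodal inputs.
    return obs.mean(dim=-1, keepdim=True)

def route_context(history: torch.Tensor) -> torch.Tensor:
    # Placeholder for the context router over the stored frame features.
    return history.mean(dim=0)

def decode_action(frame_feat: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    # Placeholder policy head combining current-frame and context features.
    return torch.cat([frame_feat, context], dim=-1)

memory = deque(maxlen=MEMORY_LEN)  # long-term adaptive working memory

def step(obs: torch.Tensor) -> torch.Tensor:
    frame_feat = encode_frame(obs)                      # encode the current frame only
    memory.append(frame_feat)                           # update working memory
    context = route_context(torch.stack(list(memory)))  # compact context features
    return decode_action(frame_feat, context)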

Real-World Deployment Videos (1x speed)

Abstract

State ambiguity is common in robotic manipulation: identical observations may correspond to multiple valid behavior trajectories. A visuomotor policy must therefore extract the appropriate types and granularity of information from its history to identify the current task phase. However, naively extending the history window is computationally expensive and can cause severe overfitting.

Inspired by the continuous nature of human reasoning and the recoding of working memory, we introduce PAM, a novel visuomotor Policy equipped with Adaptive working Memory. Trained in two stages with minimal additional cost, PAM supports a 300-frame history window while maintaining high inference speed. Specifically, a hierarchical frame feature extractor yields two distinct representations, one for motion primitives and one for temporal disambiguation. A context router with range-specific queries then compresses the history into compact context features spanning multiple history lengths, and an auxiliary objective of reconstructing historical information ensures that the router acts as an effective information bottleneck. We design seven real-world tasks and verify that PAM can handle multiple scenarios of state ambiguity simultaneously. With a history window of approximately 10 seconds, PAM still trains stably and maintains inference speeds above 20 Hz.
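As one illustration of how such a reconstruction objective can enforce a bottleneck, the sketch below decodes per-frame features back from the compact context tokens and adds the reconstruction error to the action loss. The module layout, dimensions, and loss weight are our assumptions, not PAM's published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HistoryReconstructor(nn.Module):
    # Decodes per-frame features back from the compact context tokens, so the
    # router is pushed to preserve task-relevant history (the bottleneck
    # objective). The attention-based layout here is an assumption.
    def __init__(self, dim: int = 256, heads: int = 8, max_len: int = 300):
        super().__init__()
        self.frame_queries = nn.Parameter(torch.randn(max_len, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, context: torch.Tensor, T: int) -> torch.Tensor:
        # context: (B, K, dim) compact context tokens from the router.
        q = self.frame_queries[:T].unsqueeze(0).expand(context.size(0), -1, -1)
        recon, _ = self.attn(q, context, context)
        return recon  # (B, T, dim) reconstructed frame features

def total_loss(action_loss, frame_feats, context, reconstructor, lam=0.1):
    # Auxiliary reconstruction term; lam is an illustrative weight. Targets
    # are detached so the term trains the router and decoder, not the targets.
    recon = reconstructor(context, frame_feats.size(1))
    return action_loss + lam * F.mse_loss(recon, frame_feats.detach())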

PAM Pipeline

PAM Pipeline
PAM uses two distinct types of features to guide action generation: motion primitives extracted from the current frame and compact context features drawn from the extended history window. The context features serve as working memory. (a) illustrates the frame feature extractor used to obtain these features. The extractor employs a query-based mechanism to recode multimodal inputs, enabling adaptive working memory recoding. (b) illustrates the context router, which receives the context features and utilizes a set of query tokens spanning different history lengths to produce compact context features. PAM is trained in a two-stage manner: as shown in the figure, different subsets of model parameters are progressively activated across the two stages.
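A minimal sketch of a context router in this spirit is shown below: learned query tokens, each tied to a history range, cross-attend to the per-frame context features and emit compact context tokens. The dimensions, range lengths, and shared attention module are illustrative assumptions, not PAM's exact configuration.

import torch
import torch.nn as nn

class ContextRouter(nn.Module):
    # Range-specific query tokens cross-attend to per-frame context features
    # and emit one compact context token per history range.
    def __init__(self, dim: int = 256, heads: int = 8,
                 ranges: tuple = (10, 50, 150, 300)):
        super().__init__()
        self.ranges = ranges  # history lengths covered by each query
        self.queries = nn.Parameter(torch.randn(len(ranges), dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (B, T, dim) per-frame context features, newest frame last.
        B, T, _ = history.shape
        tokens = []
        for q, r in zip(self.queries, self.ranges):
            window = history[:, max(T - r, 0):]            # last r frames
            out, _ = self.attn(q.expand(B, 1, -1), window, window)
            tokens.append(out)
        return torch.cat(tokens, dim=1)  # (B, len(ranges), dim)

For example, ContextRouter()(torch.randn(2, 300, 256)) returns four compact context tokens per sequence under these assumed settings.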

Real-world Robotic Tasks

Real-world Robotic Tasks
We carefully designed a set of real-world tasks, shown in the figure, with the primary types of state ambiguity indicated in the top-right corner. Each task comprises multiple subtasks; to provide a more fine-grained evaluation, the task success rate is computed as the average completion rate across its subtasks.
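Written out, with K the number of subtasks and c_k the completion rate of subtask k, this metric reads:

\[
\mathrm{SR}_{\text{task}} \;=\; \frac{1}{K} \sum_{k=1}^{K} c_k, \qquad c_k \in [0, 1].
\]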

Interpretability

Interpretation of PAM
Left: The attention maps of the context router reveal which portions of the history PAM references to resolve state ambiguity; in Wipe the Table Twice, they accurately identify key frames from preceding task stages.

Right: The attention maps of the extractor indicate which modalities PAM uses for working memory recoding. In Guessing Game, the context query extracts historical cues from visual observations of the block's position, while the motion-primitive query attends to joint states for posture maintenance. In Wipe the Table Twice, both visual observations and joint states provide effective contextual cues, demonstrating the effectiveness of our adaptive working memory recoding.
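For readers who want to reproduce this kind of visualization, the snippet below shows one way to pull per-head attention weights over history frames out of a PyTorch MultiheadAttention layer. The sizes and the standalone layer are placeholders, not PAM's released code.

import torch
import torch.nn as nn

dim, heads, T = 256, 8, 300          # illustrative sizes
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

history = torch.randn(1, T, dim)     # per-frame context features
query = torch.randn(1, 1, dim)       # one range-specific query token

# need_weights=True returns the attention distribution over history frames;
# average_attn_weights=False keeps a separate map per head.
_, weights = attn(query, history, history,
                  need_weights=True, average_attn_weights=False)
print(weights.shape)                 # torch.Size([1, 8, 1, 300])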

BibTeX

@article{hu2024pam,
  title={Resolving State Ambiguity in Robot Manipulation via Adaptive Working Memory Recoding},
  author={Hu, Qingda and Qiu, Ziheng and Xu, Zijun and Zhang, Kaizhao and Bu, Xizhou and Sun, Zuolei and Zhang, Bo and Zhao, Jieru and Gan, Zhongxue and Ding, Wenchao},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}