Resolving State Ambiguity in
Robot Manipulation
via Adaptive Working Memory Recoding

Qingda Hu1, Ziheng Qiu1, Zijun Xu1, Kaizhao Zhang1, Xizhou Bu1, Zuolei Sun2, Bo Zhang2, Jieru Zhao3, Zhongxue Gan1, and Wenchao Ding1
1 Fudan Academy for Engineering and Technology, Fudan University 2 Westwell
3 Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
Email: qdhu24@m.fudan.edu.cn, dingwenchao@fudan.edu.cn
Teaser Image

Left: State ambiguity is common in robotic manipulation, and the relevant state transitions violate the Markov assumption, so a long history window is essential. Here we illustrate a representative example of state ambiguity and its various scenarios.
Right: Existing methods versus our method for encoding long histories. Analogous to the continuous nature of human reasoning, PAM's inference is temporally dependent: it extends the history window by maintaining a long-term adaptive working memory, so the model only needs to encode the current observation at each inference step.
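To make this temporal dependence concrete, below is a minimal sketch of such an inference loop with a bounded working memory. All module names (encode_frame, route_context, decode_action) and the placeholder bodies are our illustrative stand-ins, not PAM's actual interfaces.

from collections import deque

import torch

MEMORY_LEN = 300  # history window in frames (roughly 10 s)

def encode_frame(obs: torch.Tensor) -> torch.Tensor:
    # Placeholder frame encoder; PAM's extractor recodes multimodal inputs.
    return obs.mean(dim=-1, keepdim=True)

def route_context(history: torch.Tensor) -> torch.Tensor:
    # Placeholder for the context router over the stored frame features.
    return history.mean(dim=0)

def decode_action(frame_feat: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    # Placeholder policy head combining current-frame and context features.
    return torch.cat([frame_feat, context], dim=-1)

memory = deque(maxlen=MEMORY_LEN)  # long-term adaptive working memory

def step(obs: torch.Tensor) -> torch.Tensor:
    frame_feat = encode_frame(obs)                      # encode the current frame only
    memory.append(frame_feat)                           # update working memory
    context = route_context(torch.stack(list(memory)))  # compact context features
    return decode_action(frame_feat, context)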

Real-World Deployment Videos (1x speed)

Abstract

State ambiguity is common in robotic manipulation: identical observations may correspond to multiple valid behavior trajectories. A visuomotor policy must therefore extract the appropriate types and granularity of information from its history to identify the current task phase. However, naively extending the history window is computationally expensive and can cause severe overfitting.

Inspired by the continuous nature of human reasoning and the recoding of working memory, we introduce PAM, a novel visuomotor Policy equipped with Adaptive working Memory. Trained in two stages with minimal additional cost, PAM supports a 300-frame history window while maintaining high inference speed. Specifically, a hierarchical frame feature extractor yields two distinct representations, one for motion primitives and one for temporal disambiguation. A context router with range-specific queries then compresses the history into compact context features spanning multiple history lengths, and an auxiliary objective of reconstructing historical information ensures that the router acts as an effective information bottleneck. We design seven real-world tasks and verify that PAM can handle multiple scenarios of state ambiguity simultaneously. With a history window of approximately 10 seconds, PAM still trains stably and maintains inference speeds above 20 Hz.
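As one illustration of how such a reconstruction objective can enforce a bottleneck, the sketch below decodes per-frame features back from the compact context tokens and adds the reconstruction error to the action loss. The module layout, dimensions, and loss weight are our assumptions, not PAM's published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HistoryReconstructor(nn.Module):
    # Decodes per-frame features back from the compact context tokens, so the
    # router is pushed to preserve task-relevant history (the bottleneck
    # objective). The attention-based layout here is an assumption.
    def __init__(self, dim: int = 256, heads: int = 8, max_len: int = 300):
        super().__init__()
        self.frame_queries = nn.Parameter(torch.randn(max_len, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, context: torch.Tensor, T: int) -> torch.Tensor:
        # context: (B, K, dim) compact context tokens from the router.
        q = self.frame_queries[:T].unsqueeze(0).expand(context.size(0), -1, -1)
        recon, _ = self.attn(q, context, context)
        return recon  # (B, T, dim) reconstructed frame features

def total_loss(action_loss, frame_feats, context, reconstructor, lam=0.1):
    # Auxiliary reconstruction term; lam is an illustrative weight. Targets
    # are detached so the term trains the router and decoder, not the targets.
    recon = reconstructor(context, frame_feats.size(1))
    return action_loss + lam * F.mse_loss(recon, frame_feats.detach())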

PAM Pipeline

PAM Pipeline
PAM uses two distinct types of features to guide action generation: motion primitives extracted from the current frame and compact context features drawn from the extended history window. The context features serve as working memory. (a) illustrates the frame feature extractor used to obtain these features. The extractor employs a query-based mechanism to recode multimodal inputs, enabling adaptive working memory recoding. (b) illustrates the context router, which receives the context features and utilizes a set of query tokens spanning different history lengths to produce compact context features. PAM is trained in a two-stage manner: as shown in the figure, different subsets of model parameters are progressively activated across the two stages.
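A minimal sketch of a context router in this spirit is shown below: learned query tokens, each tied to a history range, cross-attend to the per-frame context features and emit compact context tokens. The dimensions, range lengths, and shared attention module are illustrative assumptions, not PAM's exact configuration.

import torch
import torch.nn as nn

class ContextRouter(nn.Module):
    # Range-specific query tokens cross-attend to per-frame context features
    # and emit one compact context token per history range.
    def __init__(self, dim: int = 256, heads: int = 8,
                 ranges: tuple = (10, 50, 150, 300)):
        super().__init__()
        self.ranges = ranges  # history lengths covered by each query
        self.queries = nn.Parameter(torch.randn(len(ranges), dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (B, T, dim) per-frame context features, newest frame last.
        B, T, _ = history.shape
        tokens = []
        for q, r in zip(self.queries, self.ranges):
            window = history[:, max(T - r, 0):]            # last r frames
            out, _ = self.attn(q.expand(B, 1, -1), window, window)
            tokens.append(out)
        return torch.cat(tokens, dim=1)  # (B, len(ranges), dim)

For example, ContextRouter()(torch.randn(2, 300, 256)) returns four compact context tokens per sequence under these assumed settings.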

Real-world Robotic Tasks

Real-world Robotic Tasks
We carefully designed a set of real-world tasks, shown in the figure, with the primary types of state ambiguity indicated in the top-right corner. Each task comprises multiple subtasks; to provide a more fine-grained evaluation, the task success rate is computed as the average completion rate across its subtasks.
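Written out, with K the number of subtasks and c_k the completion rate of subtask k, this metric reads:

\[
\mathrm{SR}_{\text{task}} \;=\; \frac{1}{K} \sum_{k=1}^{K} c_k, \qquad c_k \in [0, 1].
\]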

Interpretability

Interpretation of PAM
Left: The attention maps of the context router reveal which portions of the history PAM references to resolve state ambiguity; in Wipe the Table Twice, they accurately identify key frames from preceding task stages.

Right: The attention maps of the extractor indicate which modalities PAM uses for working memory recoding. In Guessing Game, the context query extracts historical cues from visual observations of the block's position, while the motion-primitive query attends to joint states for posture maintenance. In Wipe the Table Twice, both visual observations and joint states provide effective contextual cues, demonstrating the effectiveness of our adaptive working memory recoding.
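For readers who want to reproduce this kind of visualization, the snippet below shows one way to pull per-head attention weights over history frames out of a PyTorch MultiheadAttention layer. The sizes and the standalone layer are placeholders, not PAM's released code.

import torch
import torch.nn as nn

dim, heads, T = 256, 8, 300          # illustrative sizes
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

history = torch.randn(1, T, dim)     # per-frame context features
query = torch.randn(1, 1, dim)       # one range-specific query token

# need_weights=True returns the attention distribution over history frames;
# average_attn_weights=False keeps a separate map per head.
_, weights = attn(query, history, history,
                  need_weights=True, average_attn_weights=False)
print(weights.shape)                 # torch.Size([1, 8, 1, 300])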

BibTeX

@article{hu2024pam,
  title={Resolving State Ambiguity in Robot Manipulation via Adaptive Working Memory Recoding},
  author={Hu, Qingda and Qiu, Ziheng and Xu, Zijun and Zhang, Kaizhao and Bu, Xizhou and Sun, Zuolei and Zhang, Bo and Zhao, Jieru and Gan, Zhongxue and Ding, Wenchao},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}