UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

¹Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), ²University of Chinese Academy of Sciences, *Corresponding author

UniPose can handle pose comprehension, generation and editing tasks under different instructions within a unified framework.

Abstract

Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios.

This paper presents UniPose, a framework that employs Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, including a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities.

This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.

Method Overview

UniPose comprises a Pose Tokenizer, a Visual Processor and a pose-aware LLM. By combining Pose Tokens learned by the pose tokenizer, Visual Embeddings from the visual processor and Text Tokens from the text tokenizer, UniPose enables joint modeling of pose comprehension, generation and editing within a unified visual-language backbone.
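To make the idea of discrete pose tokens in a unified vocabulary concrete, the following is a minimal sketch and not the UniPose implementation. It assumes a nearest-neighbour codebook lookup as the quantization step and assumes that pose token ids are appended after the text vocabulary; the codebook, the vocabulary size and all function names are hypothetical and chosen only for illustration. Visual embeddings, which are continuous rather than discrete, are left out of the sketch.

```python
import math

# Hypothetical values for illustration only; none of them come from UniPose.
TEXT_VOCAB_SIZE = 32000  # assumed size of the LLM's text vocabulary
CODEBOOK = [             # toy codebook: 4 codes, each a 3-D vector
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]


def quantize(feature):
    """Return the index of the codebook entry nearest to `feature`."""
    return min(range(len(CODEBOOK)),
               key=lambda i: math.dist(feature, CODEBOOK[i]))


def pose_to_token_ids(pose_features):
    """Map continuous pose features to ids in the shared vocabulary.

    Pose codes are placed after the text ids, so code k becomes
    TEXT_VOCAB_SIZE + k (an assumed layout).
    """
    return [TEXT_VOCAB_SIZE + quantize(f) for f in pose_features]


def is_pose_token(token_id):
    """True if the id falls in the pose range of the shared vocabulary."""
    return token_id >= TEXT_VOCAB_SIZE


def build_sequence(text_ids, pose_features):
    """Concatenate text token ids with pose token ids into one sequence."""
    return list(text_ids) + pose_to_token_ids(pose_features)


if __name__ == "__main__":
    features = [[0.9, 0.1, 0.0], [0.1, 0.0, 0.8]]
    print(build_sequence([5, 17, 42], features))
```

Because pose codes and text tokens share one id space in this sketch, a single language model head could in principle emit either kind of token, which is the property the unified vocabulary is meant to provide.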

Visualization

Image Difference

Comparison with GPT-4o on the Image-Diff task. We mark incorrect captions in red and correct ones in green. UniPose can accurately perceive a person’s orientation from images.

Image to Text

Comparison with Qwen-VL and GPT-4o on the Image-to-Text task.

Pose Estimation

Qualitative comparison on the pose estimation task. We compare a multi-modal LLM (ChatPose) and a traditional HMR method (TokenHMR) with our UniPose on the LSP dataset.

Related Links

For more related works, please check out the following links:

PoseScript for 3D human pose generation from text and pose description generation.

PoseFix for 3D human pose refinement and pose difference description generation.

TokenHMR and 4D-Humans for 3D human pose and shape estimation from images.

LLaVA, LISA and ChatPose for multi-modal LLMs.

BibTeX

@article{li2024unipose,
  title={UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing},
  author={Li, Yiheng and Hou, Ruibing and Chang, Hong and Shan, Shiguang and Chen, Xilin},
  journal={arXiv preprint arXiv:2411.16781},
  year={2024}
}