UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

¹Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), ²University of Chinese Academy of Sciences, *Corresponding author

Abstract

Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios.

This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance its fine-grained pose perception, we equip UniPose with a mixture of visual encoders, including a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities.

This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.

Method Overview

UniPose comprises a Pose Tokenizer, a Visual Processor, and a pose-aware LLM. By combining Pose Tokens learned by the pose tokenizer, Visual Embeddings from the visual processor, and Text Tokens from the text tokenizer, UniPose enables joint modeling of pose comprehension, generation, and editing within a unified visual-language backbone.
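
To make the idea of discrete pose tokens in a unified vocabulary concrete, here is a minimal sketch. It assumes a nearest-neighbor codebook lookup (a vector-quantization-style scheme, which this page does not specify) and assumes pose token IDs are placed directly after the text vocabulary. The function names `quantize_pose` and `to_unified_ids`, the codebook, the feature values, and the vocabulary size are all hypothetical and are not taken from the UniPose implementation.

```python
import numpy as np


def quantize_pose(features, codebook):
    """Map each pose feature vector to the index of its nearest codebook entry.

    features: array of shape (n, d); codebook: array of shape (k, d).
    This nearest-neighbor lookup is an assumption made for illustration.
    """
    # Squared Euclidean distance between every feature and every codebook entry.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def to_unified_ids(pose_indices, text_vocab_size):
    """Shift pose indices past the text vocabulary so both share one ID space."""
    return pose_indices + text_vocab_size


# Hypothetical sizes and values, chosen only for illustration.
TEXT_VOCAB_SIZE = 32000
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])

pose_indices = quantize_pose(features, codebook)
pose_token_ids = to_unified_ids(pose_indices, TEXT_VOCAB_SIZE)

# Placeholder text token IDs; a real system would get these from a text tokenizer.
text_token_ids = np.array([101, 2054, 2003])

# One flat sequence of IDs that a single language model could consume.
sequence = np.concatenate([text_token_ids, pose_token_ids])
print(sequence.tolist())
```

Under these assumptions, text and pose tokens never collide because pose IDs start at `TEXT_VOCAB_SIZE`, so a single embedding table and output head can cover both. Visual embeddings are continuous rather than discrete, so they are not part of this ID-based sketch.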

Visualization

Image Difference

Comparison with GPT-4o on the Image-Diff task. Incorrect captions are marked in red and correct ones in green. UniPose can accurately perceive a person’s orientation from images.

Image to Text

Comparison with Qwen-VL and GPT-4o on the Image-to-Text task.

Pose Estimation

Qualitative comparison on the pose estimation task. We compare a multimodal LLM (ChatPose) and a traditional HMR method (TokenHMR) with our UniPose on the LSP dataset.