Image Difference
Comparison with GPT-4o on the Image-Diff task. We mark incorrect captions in red and correct ones in green. UniPose can accurately perceive a person’s orientation from images.
Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios.
This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, including a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities.
This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
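The pose tokenization step described above can be illustrated with a minimal sketch. The text here does not specify how the pose tokenizer is built, so this example assumes a vector-quantization-style nearest-neighbor codebook lookup and an ID offset that places pose tokens after the text vocabulary. The sizes, the random codebook, and the function names `quantize` and `to_unified_ids` are hypothetical and are not taken from the UniPose implementation.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (not from the paper).
TEXT_VOCAB_SIZE = 32000  # assumed size of the LLM's text vocabulary
CODEBOOK_SIZE = 512      # assumed number of discrete pose codes
LATENT_DIM = 16          # assumed dimension of each pose latent vector


def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return the index of the nearest codebook entry for each latent vector."""
    # Squared Euclidean distance between every latent and every code: shape (N, K).
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(dists, axis=1)


def to_unified_ids(pose_codes: np.ndarray,
                   text_vocab_size: int = TEXT_VOCAB_SIZE) -> np.ndarray:
    """Shift pose code indices past the text vocabulary so both share one ID space."""
    return pose_codes + text_vocab_size


rng = np.random.default_rng(0)
# Random stand-in for a learned codebook.
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

# Stand-in for a pose encoder's output: known codes plus small noise.
true_codes = np.array([3, 41, 7, 500])
latents = codebook[true_codes] + 0.01 * rng.normal(size=(len(true_codes), LATENT_DIM))

pose_codes = quantize(latents, codebook)   # discrete pose tokens
unified_ids = to_unified_ids(pose_codes)   # IDs in the shared text + pose vocabulary
print(pose_codes, unified_ids)
```

Under these assumptions, a 3D pose becomes a short sequence of integer IDs that a language model can read and emit alongside ordinary text tokens, which is one way to realize a unified vocabulary.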
UniPose comprises a Pose Tokenizer, a Visual Processor, and a pose-aware LLM. By combining Pose Tokens learned by the pose tokenizer, Visual Embeddings from the visual processor, and Text Tokens from the text tokenizer, UniPose enables joint modeling of pose comprehension, generation, and editing within a unified visual-language backbone.
Comparison with Qwen-VL and GPT-4o on the Image-to-Text task.
Qualitative comparison on the pose estimation task. We compare a multi-modal LLM (ChatPose) and a traditional HMR method (TokenHMR) with our UniPose on the LSP dataset.
For more related works, please check out the following links:
PoseScript for 3D human pose generation from text and pose description generation.
PoseFix for 3D human pose refinement and pose difference description generation.
TokenHMR and 4D-Humans for 3D human pose and shape estimation from images.
@article{li2024unipose,
title={UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing},
author={Li, Yiheng and Hou, Ruibing and Chang, Hong and Shan, Shiguang and Chen, Xilin},
journal={arXiv preprint arXiv:2411.16781},
year={2024}
}