The integration of vision-language models (VLMs) into robotic systems is a significant step toward machines that interact with their surroundings more intuitively. While VLMs offer rich multimodal reasoning, existing approaches lack user-specific adaptability, often relying on generic interaction paradigms that fail to account for individual behavioral, contextual, or socio-emotional nuances. When personalization is attempted, ethical concerns arise from unmitigated biases in user data, risking exclusion or unfair treatment. To address these dual challenges, we propose User-VLM 360°, a holistic framework that integrates multimodal user modeling with bias-aware optimization. Our approach features: (1) user-aware tuning that adapts interactions in real time using visual-linguistic signals; (2) bias mitigation via preference optimization; and (3) curated 360° socio-emotive interaction datasets annotated with demographic, emotion, and relational metadata. Evaluations across eight benchmarks demonstrate state-of-the-art results: +35.3% F1 in personalized VQA, +47.5% F1 in facial-feature understanding, a 15% reduction in bias, and a 30× speedup over baselines. Ablation studies confirm the efficacy of each component, and deployment on the Pepper robot validates real-time adaptability across diverse users. We open-source parameter-efficient 3B/10B models and an ethical verification framework for responsible adaptation.
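To make the bias-mitigation component concrete, below is a minimal sketch of a preference-optimization objective in the style of Direct Preference Optimization (DPO), where the preferred response is the unbiased one and the dispreferred response is the biased one. The function name, tensor interface, and beta value are illustrative assumptions, not the paper's released implementation.

# Minimal DPO-style preference-optimization sketch for bias mitigation.
# Hypothetical interface: sequence-level log-probabilities of a preferred
# (unbiased) and a dispreferred (biased) response under the policy being
# tuned and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over batches of sequence log-probabilities."""
    # Log-ratios of policy vs. reference for each response
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to rank the unbiased response above the biased one
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

Because the loss is computed against a frozen reference model, the policy is steered away from biased completions without drifting far from its pretrained behavior.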
User-aware tuning mitigates the semantic gap caused by the misalignment between user queries and the scene observed from the robot's camera perspective. While instruction tuning of large VLMs could address this gap, it adds latency and degrades performance. User-VLM 360° instead aligns cross-modal representations natively, enabling robust real-time adaptation in dynamic robotic environments.
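For the released 3B/10B checkpoints, inference could look like the following sketch, assuming the models expose the standard Hugging Face vision-to-sequence interface. The hub id, image path, and prompt are placeholders, not confirmed repository names.

# Illustrative inference sketch; model id below is a hypothetical placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "user-vlm/User-VLM-360-3B"  # hypothetical hub id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# The robot's egocentric camera frame doubles as the user-observation signal:
# user attributes are derived from the image rather than a separate profile.
frame = Image.open("camera_frame.jpg")
inputs = processor(images=frame, text="How should I greet this user?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))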
@article{rahimi2025user,
  title={User-VLM: LLM Contextualization with Multimodal Pre-trained User Models},
  author={Rahimi, Hamed and Abrini, Mouad and Khoramshahi, Mahdi and Chetouani, Mohamed},
  year={2025}
}
@misc{rahimi2025uservlm360personalizedvision,
  title={USER-VLM 360: Personalized Vision Language Models with User-aware Tuning for Social Human-Robot Interactions},
  author={Rahimi, Hamed and Bahaj, Adil and Abrini, Mouad and Khoramshahi, Mahdi and Ghogho, Mounir and Chetouani, Mohamed},
  year={2025},
  eprint={2502.10636},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2502.10636},
}