User-VLM 360°

Personalized Vision Language Models with User-aware Tuning for Social Human-Robot Interactions

ISIR, Sorbonne University · International University of Rabat
Code · Models · Dataset · arXiv

Abstract

The integration of vision-language models (VLMs) into robotic systems constitutes a significant step toward enabling machines to interact with their surroundings more intuitively. While VLMs offer rich multimodal reasoning, existing approaches lack user-specific adaptability, often relying on generic interaction paradigms that fail to account for individual behavioral, contextual, or socio-emotional nuances. When customization is attempted, ethical concerns arise from unmitigated biases in user data, risking exclusion or unfair treatment. To address these dual challenges, we propose User-VLM 360°, a holistic framework integrating multimodal user modeling with bias-aware optimization. Our approach features: (1) user-aware tuning that adapts interactions in real time using visual-linguistic signals; (2) bias mitigation via preference optimization; and (3) curated 360° socio-emotive interaction datasets annotated with demographic, emotion, and relational metadata. Evaluations across eight benchmarks demonstrate state-of-the-art results: +35.3% F1 in personalized VQA, +47.5% F1 in facial feature understanding, a 15% bias reduction, and a 30× speedup over baselines. Ablation studies confirm component efficacy, and deployment on the Pepper robot validates real-time adaptability across diverse users. We open-source parameter-efficient 3B/10B models and an ethical verification framework for responsible adaptation.

Deployment on Pepper (videos coming soon...)

User-aware Tuning mitigates the semantic gap that arises when user queries are misaligned with the scene observed from the robot's camera perspective. While instruction tuning of large VLMs could address this, it adds latency and degrades performance. User-VLM 360° instead aligns cross-modal representations natively, enabling robust real-time adaptation in dynamic robotic environments.
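As a rough illustration of what a single user-aware inference step looks like at deployment time, the sketch below assumes a Hugging Face transformers interface; the repository name user-vlm/User-VLM-3B, the prompt, and the frame source are hypothetical placeholders rather than the released artifacts.

# Minimal inference sketch; the model id and frame source are hypothetical
# placeholders, and the exact prompt format may differ from the released models.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "user-vlm/User-VLM-3B"  # hypothetical repository name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

def answer(user_query: str, frame: Image.Image) -> str:
    """Condition the response on the user's appearance in the robot's camera frame."""
    inputs = processor(images=frame, text=user_query, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)
    # Keep only the newly generated tokens (the prompt is echoed in the output).
    new_tokens = output_ids[0, inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)

# In deployment, the frame would come from the robot's head camera stream.
frame = Image.open("user_frame.jpg")  # placeholder for a live camera capture
print(answer("What activity would you recommend for me right now?", frame))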

User-aware Tuning consists of three key steps. In the first step, Vision Alignment, the model learns to recognize and interpret a user's emotions, age, gender, and ethnicity from facial features and visual signals. In the second step, Instruction Tuning, the model undergoes supervised instruction tuning so that it can answer general-purpose questions while incorporating these visual cues. Finally, to curb over-personalization and prevent biased or unethical responses, the third step, Bias Mitigation, trains the model to generate ethical and contextually appropriate responses.
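To make the three stages concrete, here is a small sketch under stated assumptions: the first two stages are ordinary supervised fine-tuning on the corresponding datasets, and the third stage is illustrated with a generic DPO-style preference loss, one common way to realize preference optimization; the helper names and numbers are illustrative, not the project's actual training code.

# Illustrative outline of the three tuning stages; the DPO-style loss below is a
# generic form of preference optimization, not necessarily the exact objective used.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Preference loss over (chosen, rejected) responses.

    Inputs are summed token log-probabilities of each response under the tuned
    policy and under a frozen reference model (the pre-bias-mitigation checkpoint).
    """
    policy_margin = logp_chosen - logp_rejected
    reference_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()

# Stage 1 (Vision Alignment): supervised tuning on facial-attribute annotations
#   (emotion, age, gender, ethnicity) so user features are grounded visually.
# Stage 2 (Instruction Tuning): supervised tuning on personalized VQA pairs that
#   combine a user image, a query, and the desired personalized answer.
# Stage 3 (Bias Mitigation): preference optimization over curated pairs of
#   ethical (chosen) vs. over-personalized or biased (rejected) responses.

# Toy check with made-up log-probabilities for a batch of two preference pairs:
logp_c = torch.tensor([-12.3, -9.8])   # policy log p(chosen | image, query)
logp_r = torch.tensor([-10.1, -11.5])  # policy log p(rejected | image, query)
ref_c = torch.tensor([-11.9, -10.0])   # reference model, chosen
ref_r = torch.tensor([-10.0, -11.0])   # reference model, rejected
print(dpo_loss(logp_c, logp_r, ref_c, ref_r))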

Results

BibTeX

@article{rahimi2025user,
  title={User-VLM: LLM Contextualization with Multimodal Pre-trained User Models},
  author={Rahimi, Hamed and Abrini, Mouad and Khoramshahi, Mahdi and Chetouani, Mohamed},
  year={2025}
}

@misc{rahimi2025uservlm360personalizedvision,
  title={USER-VLM 360: Personalized Vision Language Models with User-aware Tuning for Social Human-Robot Interactions},
  author={Hamed Rahimi and Adil Bahaj and Mouad Abrini and Mahdi Khoramshahi and Mounir Ghogho and Mohamed Chetouani},
  year={2025},
  eprint={2502.10636},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2502.10636}
}