Social-LLaVA: Vision-Language Model for Robot Navigation
- Social-LLaVA is a vision-language model that fuses visual recognition and chain-of-thought reasoning to enable socially compliant robot navigation.
- It leverages a frozen CLIP ViT-L/14 encoder, a Vicuna 7B language model, and a lightweight connector module to jointly process visual and textual inputs.
- Fine-tuning on the SNEI dataset and low-latency deployment on robotics hardware demonstrate its superior performance in crowded, dynamic environments.
Social-LLaVA is a vision-language model (VLM) designed for socially compliant robot navigation, using language reasoning to translate perception of dynamic environments into interpretable actions. The model is fine-tuned from LLaVA-v1.5-7B and specializes in integrating vision-based cues with chain-of-thought (CoT) reasoning distilled from annotated human interactions. By coupling visual embeddings with language-model outputs, Social-LLaVA enables robots to reason about complex social scenarios in unstructured, crowded public spaces, outperforming existing VLMs on multiple human-judged metrics (Payandeh et al., 2024).
1. Model Architecture and Processing Pipeline
Social-LLaVA builds upon the LLaVA-v1.5-7B framework, comprising three core components:
- Vision Encoder: A frozen CLIP ViT-L/14 backbone encodes monocular RGB images captured by a robot’s camera into high-dimensional visual embeddings.
- LLM: Vicuna 7B serves as the language model, generating structured outputs and reasoning jointly over textual and visual input.
- Connector Module: This lightweight network projects visual embeddings into the LLM token space for joint attention.
The pipeline processes a pre-normalized 224×224 image through the CLIP encoder; output embeddings are bridged by the connector into the LLM. The model receives a prompt structured for the navigation scenario—encompassing perception, prediction, reasoning, action, and explanation. At inference, CoT prompting is used to elicit stepwise, interpretable reasoning before action selection.
Diagrammatic high-level flow:
- image → ViT encoder → visual tokens
- navigation prompt (text) → LLM
- joint attention → CoT reasoning → action output
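The projection step in this flow can be sketched as follows. The dimensions are assumptions consistent with the named components (1024-d CLIP ViT-L/14 patch embeddings, 4096-d Vicuna-7B token space), and the two-layer MLP connector follows LLaVA-v1.5 convention; random weights stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: CLIP ViT-L/14 emits 1024-d patch embeddings
# (a 224x224 image with 14px patches -> 16x16 = 256 visual tokens);
# Vicuna-7B uses a 4096-d token embedding space.
N_PATCHES, D_VISION, D_LLM = 256, 1024, 4096

# Lightweight connector: a two-layer MLP projecting vision -> LLM space.
W1 = rng.standard_normal((D_VISION, D_LLM)) * 0.02
W2 = rng.standard_normal((D_LLM, D_LLM)) * 0.02

def connect(visual_embeddings: np.ndarray) -> np.ndarray:
    """Project CLIP patch embeddings into the LLM token space."""
    h = np.maximum(visual_embeddings @ W1, 0.0)  # GELU in practice; ReLU here
    return h @ W2

visual = rng.standard_normal((N_PATCHES, D_VISION))  # vision encoder output
text = rng.standard_normal((32, D_LLM))              # embedded prompt tokens
sequence = np.concatenate([connect(visual), text])   # joint input to the LLM
```

The LLM then attends over the concatenated visual and text tokens as one sequence, which is what enables joint attention without modifying the frozen encoder.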
2. The SNEI Dataset: Structure and Annotation Regime
Social-LLaVA is fine-tuned on the Social robot Navigation via Explainable Interactions (SNEI) dataset:
- Source: 2,000 manually selected frames from the SCAND dataset, representing diverse, real-world navigation scenes.
- Annotated Samples: 40,000 Visual Question Answer (VQA) pairs (∼20 per frame) generated by 10 trained annotators.
- Annotation Categories:
- Perception ("What is the person in red doing?")
- Prediction ("Which direction is the person likely to walk next?")
- CoT Reasoning ("Given the person’s trajectory and the narrow passage...")
- Action ("Stop and wait for the person to pass.")
- Explanation ("To avoid blocking the pedestrian’s path in a narrow space.")
Both categorical variables (e.g., agent type, crowd density) and free-form natural language answers are included, yielding an average of 12 words per QA.
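A single annotated frame can be pictured as the following record. The field names and structure are illustrative, not the dataset's actual schema; the questions and answers echo the category examples above:

```python
# Hypothetical shape of one SNEI-style annotation record.
sample = {
    "frame_id": "scand_000123",  # illustrative identifier
    "categorical": {"crowd_density": "high", "agent_type": "pedestrian"},
    "vqa": [
        {"category": "perception",
         "q": "What is the person in red doing?",
         "a": "Walking toward the doorway while looking at a phone."},
        {"category": "prediction",
         "q": "Which direction is the person likely to walk next?",
         "a": "Straight ahead, through the narrow passage."},
        {"category": "cot_reasoning",
         "q": "Given the person's trajectory and the narrow passage, what should the robot consider?",
         "a": "Both cannot pass at once, so the robot should yield."},
        {"category": "action",
         "q": "What should the robot do?",
         "a": "Stop and wait for the person to pass."},
        {"category": "explanation",
         "q": "Why is that the right action?",
         "a": "To avoid blocking the pedestrian's path in a narrow space."},
    ],
}
```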
| Frames | VQA pairs | Categories per frame | Annotators | Avg. words per QA |
|---|---|---|---|---|
| 2,000 | 40,000 | 5 | 10 | 12 |
3. Training Regimen and Technical Implementation
Social-LLaVA adopts a lightweight fine-tuning protocol:
- Model Parameters: All LLaVA-v1.5 base weights frozen except for LoRA adapters (rank 8).
- Hardware: Trained on a single NVIDIA A100-40GB GPU.
- Hyperparameters:
- Batch size: 4 image-prompt pairs
- Epochs: 15
- Learning rate: 3e-4 (linear warmup, cosine annealing)
- Weight decay: 0.01
- Optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-8)
- Max sequence length: 512 tokens (including visual tokens)
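The stated schedule (linear warmup followed by cosine annealing from 3e-4) can be sketched as below; the warmup fraction is an assumed value, since the source specifies the schedule shape but not the warmup length:

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 3e-4,
          warmup_frac: float = 0.03) -> float:
    """Linear warmup to base_lr, then cosine annealing toward zero.

    warmup_frac is an assumption; the source does not state it.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```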
The model outputs two supervised sequences per training example: the full CoT trace and the final action text. Loss is standard token-level cross-entropy over the concatenated sequence, $\mathcal{L} = -\frac{1}{N}\sum_{t=1}^{N} \log p_\theta(y_t \mid y_{<t}, x)$, where $N$ is the total number of tokens in the CoT and action outputs. Prompt tokens are masked out of the loss.
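A minimal sketch of this masked cross-entropy, with prompt tokens excluded from the average so only CoT and action tokens are supervised:

```python
import numpy as np

def masked_xent(logits: np.ndarray, targets: np.ndarray,
                loss_mask: np.ndarray) -> float:
    """Token-level cross-entropy averaged over supervised tokens only.

    logits:    (T, V) unnormalized next-token scores
    targets:   (T,)   ground-truth token ids
    loss_mask: (T,)   1 for CoT/action tokens, 0 for prompt tokens
    """
    z = logits - logits.max(axis=-1, keepdims=True)               # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]            # per-token NLL
    return float((nll * loss_mask).sum() / loss_mask.sum())       # mean over N
```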
Data split:
- Training: 90%
- Validation: 5%
- Test: 5%

The 50-question evaluation set comprises held-out scenes excluded from model training.
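A deterministic frame-level split along these proportions might look like the following; the seed and shuffling method are assumptions, not the authors' procedure:

```python
import random

def split_frames(frame_ids, seed: int = 0):
    """Illustrative 90/5/5 split at the frame level."""
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle
    n_train = int(0.90 * len(ids))
    n_val = int(0.05 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_frames(range(2000))  # 1800 / 100 / 100 frames
```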
4. Evaluation Metrics and Benchmarking Results
Assessment is conducted by a panel of 15 expert human raters, who score the output of three models (GPT-4V, Gemini 1.5 Pro, Social-LLaVA) on a 1–5 scale across five tasks. The overall “Social-Reasoning Score” is the unweighted mean of the taskwise scores, $S = \frac{1}{5}\sum_{k=1}^{5} \bar{s}_k$, where $\bar{s}_k$ is the mean score for task $k$. Rater agreement is high (Krippendorff’s α).
| Task | GPT-4V | Gemini 1.5 Pro | Social-LLaVA |
|---|---|---|---|
| Perception | 3.11 | 3.45 | 4.00 |
| Prediction | 3.18 | 3.87 | 4.06 |
| Chain-of-Thought | 3.41 | 3.79 | 4.08 |
| Final Action | 2.77 | 3.46 | 4.19 |
| Explanation | 3.16 | 3.66 | 3.95 |
| Overall | 3.33 | 3.65 | 4.06 |
Social-LLaVA consistently scores higher across all categories.
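As a quick check, the unweighted mean over Social-LLaVA's five taskwise scores reproduces its overall entry in the table:

```python
# Taskwise mean scores for Social-LLaVA, from the table above.
taskwise = {
    "perception": 4.00,
    "prediction": 4.06,
    "chain_of_thought": 4.08,
    "final_action": 4.19,
    "explanation": 3.95,
}
overall = sum(taskwise.values()) / len(taskwise)  # unweighted mean, ~4.06
```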
5. Deployment on Robotic Hardware
Social-LLaVA is deployed onboard a mobile robot platform with the following specifications:
- Computer: NVIDIA Jetson AGX Orin (32 GB)
- Inference: LoRA adapters applied and model quantized to INT8 via TensorRT
- Processing Speed: 9.2 FPS from image acquisition to text generation
- Vision encoder + connector: 7 ms
- LLM forward pass: 95 ms
- Post-processing: 4 ms
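These stage latencies bound the achievable frame rate, and the measured 9.2 FPS sits just under that bound (the gap plausibly being capture and prompt-construction overhead):

```python
# Per-frame latency budget reported above (milliseconds).
stage_ms = {
    "vision_encoder_and_connector": 7,
    "llm_forward": 95,
    "post_processing": 4,
}
total_ms = sum(stage_ms.values())  # 106 ms per frame
max_fps = 1000 / total_ms          # ~9.4 FPS theoretical upper bound
```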
The integration loop captures RGB frames, constructs a navigation prompt, performs vision-language inference, translates the text output to velocity commands via rule-based mapping, and transmits to the motion controller at 10 Hz.
Pseudocode (excerpt):
```python
while True:
    img = camera.capture()
    prompt = make_navigation_prompt()
    cot, action_text = social_llava.infer(img, prompt)
    v, omega = map_action_to_velocity(action_text)
    send_velocity(v, omega)
    sleep(0.1)  # 10 Hz loop
```
- "stop" → ,
- "turn left"/"right" → m/s, rad/s
- "go straight" → m/s,
Demonstration scenarios include halting in a narrow corridor to yield to a pedestrian, and off-road detouring to avoid interrupting a group conversation.
6. Context and Implications
Social-LLaVA provides a direct translation from perception to action by generating interpretable chain-of-thought traces and explicit action rationales in natural language. Chain-of-thought prompting (following Wei et al., 2022) constitutes the reasoning mechanism, rather than a dedicated reasoning network. A plausible implication is that this structure may facilitate extensions to explainable robotics and policy grounding in social environments. The SNEI dataset's annotation regime and validation methodology (including categorical and free-form labels) support high diversity and fidelity of training supervision.
Comparatively, Social-LLaVA surpasses GPT-4V and Gemini 1.5 Pro in all measured dimensions of social reasoning within robot navigation tasks. The deployment demonstrates practical viability, achieving low-latency inference (roughly 9 FPS) with modest hardware and an INT8-quantized model.
7. Limitations and Future Research Directions
Social-LLaVA is dependent on frozen base model weights and LoRA adapters, suggesting limited representational flexibility relative to fully fine-tuned models. Action mapping utilizes rule-based parsing, introducing potential bottlenecks in complex contextual interpretation. The SNEI dataset, while annotated by expert raters, is constrained by scale (2,000 scenes), number of annotators, and domain specificity (SCAND environments).
Prospective research avenues include:
- Increasing dataset scale and diversity for broader generalization
- Integrating dynamic multimodal inputs (e.g., temporal sequences or sensor fusion)
- Developing adaptive action mapping policies
- Extending to multi-agent and cross-cultural scenarios for robust social compliance
Social-LLaVA exemplifies the use of language-driven reasoning to interface vision and action in social robotics, establishing new benchmarks for explainable navigation within populated, real-world spaces (Payandeh et al., 2024).