Social-LLaVA: Vision-Language Model for Robot Navigation

Updated 20 January 2026
  • Social-LLaVA is a vision-language model that fuses visual recognition and chain-of-thought reasoning to enable socially compliant robot navigation.
  • It leverages a frozen CLIP ViT-L/14 encoder, a Vicuna 7B language model, and a lightweight connector module to jointly process visual and textual inputs.
  • Fine-tuning on the SNEI dataset and low-latency deployment on robotics hardware demonstrate its superior performance in crowded, dynamic environments.

Social-LLaVA is a vision-language model (VLM) designed for socially compliant robot navigation, using language reasoning to translate perception of dynamic environments into interpretable actions. The model is fine-tuned from LLaVA-v1.5-7B and specializes in integrating vision-based cues with chain-of-thought (CoT) reasoning distilled from annotated human interactions. By coupling visual embeddings with the language model's generative reasoning, Social-LLaVA enables robots to reason about complex social scenarios in unstructured, crowded public spaces, outperforming existing VLMs on multiple human-judged metrics (Payandeh et al., 2024).

1. Model Architecture and Processing Pipeline

Social-LLaVA builds upon the LLaVA-v1.5-7B framework, comprising three core components:

  • Vision Encoder: A frozen CLIP ViT-L/14 backbone encodes monocular RGB images captured by a robot’s camera into high-dimensional visual embeddings.
  • Language Model: Vicuna-7B serves as the LLM, producing structured outputs and reasoning jointly over textual and visual tokens.
  • Connector Module: This lightweight network projects visual embeddings into the LLM token space for joint attention.

The pipeline processes a pre-normalized 224×224 image through the CLIP encoder; output embeddings are bridged by the connector into the LLM. The model receives a prompt structured for the navigation scenario—encompassing perception, prediction, reasoning, action, and explanation. At inference, CoT prompting is used to elicit stepwise, interpretable reasoning before action selection.

Diagrammatic high-level flow:

  • image → ViT encoder → visual tokens
  • navigation prompt (text) → LLM
  • joint attention → CoT reasoning → action output
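The connector's role in this flow can be sketched in plain Python. The toy dimensions and the single linear projection below are illustrative assumptions for exposition; the actual LLaVA-v1.5 connector is a small MLP bridging CLIP ViT-L/14 patch embeddings into Vicuna's token-embedding space.

```python
import random

def linear_connector(visual_embeddings, weights, bias):
    """Project each visual embedding into the LLM token space.

    A toy stand-in for LLaVA's connector: out[j] = sum_i x[i] * W[i][j] + b[j].
    """
    projected = []
    for x in visual_embeddings:
        token = [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
                 for j in range(len(bias))]
        projected.append(token)
    return projected

# Toy dimensions (illustrative only; the real encoder and LLM operate in
# much higher-dimensional spaces).
d_vis, d_llm = 4, 6
random.seed(0)
W = [[random.gauss(0, 0.1) for _ in range(d_llm)] for _ in range(d_vis)]
b = [0.0] * d_llm

visual_tokens = linear_connector([[1.0, 2.0, 3.0, 4.0]], W, b)
text_tokens = [[0.0] * d_llm]            # placeholder embedded prompt tokens
llm_input = visual_tokens + text_tokens  # joint sequence attended over by the LLM
```

The key design point is that the connector is the only trainable bridge between the frozen vision encoder and the LLM, so visual information enters the model as ordinary token embeddings.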

2. The SNEI Dataset: Structure and Annotation Regime

Social-LLaVA is fine-tuned on the Social robot Navigation via Explainable Interactions (SNEI) dataset:

  • Source: 2,000 manually selected frames from the SCAND dataset, representing diverse, real-world navigation scenes.
  • Annotated Samples: 40,000 Visual Question Answer (VQA) pairs (∼20 per frame) generated by 10 trained annotators.
  • Annotation Categories:
  1. Perception ("What is the person in red doing?")
  2. Prediction ("Which direction is the person likely to walk next?")
  3. CoT Reasoning ("Given the person’s trajectory and the narrow passage...")
  4. Action ("Stop and wait for the person to pass.")
  5. Explanation ("To avoid blocking the pedestrian’s path in a narrow space.")

Both categorical variables (e.g., agent type, crowd density) and free-form natural language answers are included, yielding an average of 12 words per QA.

Scenario count   VQA pairs   Categories per image   Human annotators   Avg. words/QA
2,000 frames     40,000      5                      10                 12
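A single SNEI-style VQA pair can be represented as a small record. The field names and layout below are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one SNEI VQA pair; the field names are
# assumptions for illustration, not the dataset's actual schema.
@dataclass
class SNEIRecord:
    frame_id: str   # source SCAND frame
    category: str   # one of the five annotation categories
    question: str
    answer: str     # free-form natural language, ~12 words on average

CATEGORIES = ("perception", "prediction", "cot_reasoning", "action", "explanation")

sample = SNEIRecord(
    frame_id="scand_000123",  # hypothetical identifier
    category="action",
    question="What should the robot do next?",
    answer="Stop and wait for the person to pass.",
)
assert sample.category in CATEGORIES
```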

3. Training Regimen and Technical Implementation

Social-LLaVA adopts a lightweight fine-tuning protocol:

  • Model Parameters: All LLaVA-v1.5 base weights frozen except for LoRA adapters (rank 8).
  • Hardware: Trained on a single NVIDIA A100-40GB GPU.
  • Hyperparameters:
    • Batch size: 4 image-prompt pairs
    • Epochs: 15
    • Learning rate: 3e-4 (linear warmup, cosine annealing)
    • Weight decay: 0.01
    • Optimizer: AdamW (β₁ = 0.9, β₂ = 0.999, ε = 1e-8)
    • Max sequence length: 512 tokens (including visual tokens)

The model outputs two supervised sequences per training example: the full CoT trace $y_{\text{CoT}}$ and the final action text $y_A$. Loss is applied via standard token-level cross-entropy over the concatenated sequence:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t}, \text{image}\right)$$

where $T$ is the total number of tokens in the CoT and action outputs. Prompt tokens are masked out of the loss.
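The masked loss can be sketched directly from this formula; the helper and the toy log-probabilities below are illustrative values, not model outputs.

```python
import math

def masked_token_nll(token_log_probs, loss_mask):
    """Token-level cross-entropy over a concatenated (CoT + action) sequence.

    token_log_probs[t] is log p_theta(y_t | y_<t, image); loss_mask[t] is 0
    for prompt tokens (excluded from the loss) and 1 for supervised tokens.
    """
    return -sum(lp * m for lp, m in zip(token_log_probs, loss_mask))

# Illustrative values: two prompt tokens (masked) then three supervised tokens.
log_probs = [math.log(0.9), math.log(0.8),
             math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [0, 0, 1, 1, 1]

loss = masked_token_nll(log_probs, mask)  # = -(ln 0.5 + ln 0.25 + ln 0.5) = ln 16
```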

Data split:

  • Training: 90%
  • Validation: 5%
  • Test: 5%

The 50-question evaluation set comprises held-out scenes excluded from model training.
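Assuming the 90/5/5 split is applied over the 2,000 annotated frames (the source does not state the unit of splitting), the resulting counts are:

```python
# Frame-level 90/5/5 split over the SNEI scenes (splitting unit assumed).
total_frames = 2000
train = int(total_frames * 0.90)    # 1800 frames
val = int(total_frames * 0.05)      # 100 frames
test = total_frames - train - val   # 100 frames
```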

4. Evaluation Metrics and Benchmarking Results

Assessment is conducted by a panel of 15 expert human raters, who score outputs from three models (GPT-4V, Gemini 1.5 Pro, Social-LLaVA) on a 1–5 scale across five tasks. The overall "Social-Reasoning Score" is the unweighted mean of the taskwise scores:

$$S = \frac{1}{5}\left(\bar{s}_{\mathrm{perc}} + \bar{s}_{\mathrm{pred}} + \bar{s}_{\mathrm{CoT}} + \bar{s}_{\mathrm{action}} + \bar{s}_{\mathrm{expl}}\right)$$

where $\bar{s}_{*}$ denotes the mean score for each task. Rater agreement is high (Krippendorff's $\alpha = 0.82$).

Task               GPT-4V   Gemini 1.5 Pro   Social-LLaVA
Perception         3.11     3.45             4.00
Prediction         3.18     3.87             4.06
Chain-of-Thought   3.41     3.79             4.08
Final Action       2.77     3.46             4.19
Explanation        3.16     3.66             3.95
Overall            3.33     3.65             4.06

Social-LLaVA consistently scores higher across all categories.
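The overall score for Social-LLaVA can be reproduced from the table as the unweighted mean of its five taskwise scores:

```python
# Social-LLaVA's per-task scores from the benchmark table above.
scores = {
    "perception": 4.00,
    "prediction": 4.06,
    "cot": 4.08,
    "action": 4.19,
    "explanation": 3.95,
}

# Unweighted mean, matching the formula for the Social-Reasoning Score S.
overall = sum(scores.values()) / len(scores)
assert round(overall, 2) == 4.06  # agrees with the reported overall score
```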

5. Deployment on Robotic Hardware

Social-LLaVA is deployed onboard a mobile robot platform with the following specifications:

  • Computer: NVIDIA Jetson AGX Orin (32 GB)
  • Inference: LoRA adapters applied and model quantized to INT8 via TensorRT
  • Processing Speed: 9.2 FPS from image acquisition to text generation
    • Vision encoder + connector: 7 ms
    • LLM forward pass: 95 ms
    • Post-processing: 4 ms
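Summing the stage latencies gives the per-frame compute budget. The small gap between the implied ~9.4 FPS and the reported 9.2 FPS presumably reflects capture and prompt-construction overhead not itemized above.

```python
# Per-stage latencies from the deployment figures above (milliseconds).
stage_ms = {
    "vision_encoder_connector": 7,
    "llm_forward": 95,
    "post_processing": 4,
}

total_ms = sum(stage_ms.values())  # 106 ms per frame
implied_fps = 1000 / total_ms      # ~9.4 FPS from the itemized stages alone
```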

The integration loop captures RGB frames, constructs a navigation prompt, performs vision-language inference, translates the text output to velocity commands via rule-based mapping, and transmits (v, ω) to the motion controller at 10 Hz.

Pseudocode (excerpt):

while True:
    img = camera.capture()
    prompt = make_navigation_prompt()
    cot, action_text = social_llava.infer(img, prompt)
    v, omega = map_action_to_velocity(action_text)
    send_velocity(v, omega)
    sleep(0.1)  # 10 Hz loop

Action mapping is keyword-driven:

  • "stop" → v=0v=0, ω=0\omega=0
  • "turn left"/"right" → v=0.2v=0.2 m/s, ω=±0.5\omega=\pm 0.5 rad/s
  • "go straight" → v=0.4v=0.4 m/s, ω=0\omega=0

Demonstration scenarios include halting in a narrow corridor to yield to a pedestrian, and off-road detouring to avoid interrupting a group conversation.

6. Context and Implications

Social-LLaVA provides a direct translation from perception to action by generating interpretable chain-of-thought traces and explicit action rationales in natural language. The model's use of chain-of-thought prompting (Wei et al., 2022) constitutes its reasoning mechanism, rather than a dedicated reasoning network. A plausible implication is that this structure may facilitate further extensions to explainable robotics and policy grounding in social environments. The SNEI dataset's annotation regime and validation methodology (including both categorical and free-form labels) provide diverse, high-fidelity training supervision.

Comparatively, Social-LLaVA surpasses GPT-4V and Gemini 1.5 Pro in all measured dimensions of social reasoning within robot navigation tasks. The deployment demonstrates practical viability, achieving low-latency inference (≈9 FPS) on modest hardware with a quantized model.

7. Limitations and Future Research Directions

Social-LLaVA is dependent on frozen base model weights and LoRA adapters, suggesting limited representational flexibility relative to fully fine-tuned models. Action mapping utilizes rule-based parsing, introducing potential bottlenecks in complex contextual interpretation. The SNEI dataset, while annotated by expert raters, is constrained by scale (2,000 scenes), number of annotators, and domain specificity (SCAND environments).

Prospective research avenues include:

  • Increasing dataset scale and diversity for broader generalization
  • Integrating dynamic multimodal inputs (e.g., temporal sequences or sensor fusion)
  • Developing adaptive action mapping policies
  • Extending to multi-agent and cross-cultural scenarios for robust social compliance

Social-LLaVA exemplifies the use of language-driven reasoning to interface vision and action in social robotics, establishing new benchmarks for explainable navigation within populated, real-world spaces (Payandeh et al., 2024).
