
SocialNav-MoE: Socially Compliant Navigation

Updated 20 December 2025
  • The paper introduces SocialNav-MoE, a model that integrates a sparse MoE backbone, a frozen SigLIP encoder, and small language models for socially compliant robotic navigation.
  • It employs a three-stage pipeline—SFT, reinforcement fine-tuning using a semantic similarity reward, and MoE fine-tuning—to optimize decision-making in complex human environments.
  • Empirical results demonstrate a 10.6% SMS increase and real-time performance (1.7 FPS) on resource-constrained platforms, underscoring its practical efficiency.

SocialNav-MoE is a Mixture-of-Experts (MoE) vision-LLM designed for efficient, socially compliant navigation of robots in human-populated environments. It addresses both safety and social compliance by leveraging a sparse MoE backbone, small LLMs, a frozen SigLIP vision encoder, and a semantic similarity reward function within reinforcement fine-tuning (RFT). The framework enables real-time deployment on resource-constrained robotics platforms without the computational and latency costs typical of large-scale vision-LLMs (Kawabata et al., 15 Dec 2025).

1. Architectural Overview

SocialNav-MoE utilizes a three-stage sequential pipeline:

  1. Supervised Fine-Tuning (SFT): The model is initialized from LLaVA-1.5-58K and further trained on the SNEI dataset to align multimodal (vision and language) observations to ground-truth navigation actions. Training applies a learning rate of $1\times10^{-3}$ for initialization and $2\times10^{-5}$ for SFT.
  2. Reinforcement Fine-Tuning (RFT): Policy learning uses Group Sequence Policy Optimization (GSPO)—a PPO variant—with a novel semantic similarity reward (SSR) designed to incentivize socially compliant decision-making over a group batch of 8 sampled answer sequences.
  3. MoE Fine-Tuning (MoEFT): The sparse MoE expert layers and the router are jointly trained using multi-turn conversational data, enhancing coherence over extended dialogues.

Inference operates via multi-turn prompting; each observation is encoded into visual (via vision encoder) and textual histories, which jointly form the model context. The core Transformer alternates conventional and MoE layers, invoking only a sparse subset of experts per token to maintain computational efficiency.
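As a concrete illustration of this multi-turn prompting loop, the sketch below assembles a model context from prior (observation, action) turns plus the current encoded observation. The dictionary structure and placeholder tokens are illustrative assumptions, not the paper's actual prompt format.

```python
# Hypothetical sketch of multi-turn context assembly for inference.
# Field names and image-token placeholders are illustrative only.

def build_context(history, image_tokens, user_prompt):
    """Concatenate prior dialogue turns with the current encoded observation."""
    turns = []
    for obs, reply in history:  # earlier (observation, action) turns
        turns.append({"role": "user", "content": obs})
        turns.append({"role": "assistant", "content": reply})
    # Current observation: visual tokens from the frozen encoder + text prompt
    turns.append({"role": "user", "content": image_tokens + " " + user_prompt})
    return turns

ctx = build_context(
    history=[("<img_t0> pedestrian ahead", "slow down and keep right")],
    image_tokens="<img_t1>",
    user_prompt="Choose the next navigation action.",
)
```

Each new observation extends the context rather than replacing it, which is what the multi-turn (MoEFT) training stage is meant to exploit.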

2. Mixture-of-Experts Design and Routing

MoE blocks are interleaved in half the Transformer layers. Each block contains $K$ small feed-forward experts $F_i$ and a router function $f$. For an input embedding $\alpha$, the router outputs logits $f(\alpha) \in \mathbb{R}^K$, producing weights via softmax:

$$W(\alpha)_i = \frac{\exp(f(\alpha)_i)}{\sum_{j=1}^{K} \exp(f(\alpha)_j)}$$

The output of the MoE block is computed by aggregating the selected top-$k$ experts:

$$\mathrm{MoE}(\alpha) = \sum_{i \in \text{Top-}k} W(\alpha)_i\, F_i(\alpha)$$

Empirical studies support $K=4$ experts with $k=1$; this configuration yields a 10.6% increase in sentence-mover's similarity (SMS) versus using only one expert (from 0.473 to 0.523). Increasing $k$ to higher values incurs additional latency and risks conflicting expert outputs given the limited training data (Kawabata et al., 15 Dec 2025).
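The routing equations above can be sketched in a few lines of numpy. This is a minimal stand-in, not the paper's implementation: the router and experts are random linear maps, and only the top-$k$ experts are evaluated per token.

```python
import numpy as np

# Minimal sketch of the sparse MoE block: K experts, a softmax router,
# and top-k aggregation. Router and expert weights are random stand-ins.
K, d = 4, 8                                  # K=4 experts, toy embedding size
rng = np.random.default_rng(0)
W_router = rng.normal(size=(d, K))           # router logits: f(alpha) = alpha @ W_router
experts = [rng.normal(size=(d, d)) for _ in range(K)]  # linear "experts" F_i

def moe(alpha, k=1):
    logits = alpha @ W_router
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over the K experts
    top = np.argsort(weights)[-k:]           # indices of the top-k experts
    # Evaluate and aggregate only the selected experts,
    # weighted by their router probabilities
    return sum(weights[i] * (alpha @ experts[i]) for i in top)

out = moe(rng.normal(size=d), k=1)
```

With $k=1$ only a single expert's feed-forward pass runs per token, which is the source of the compute savings discussed above.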

3. Vision Encoder and LLM Choices

The model compares CLIP and SigLIP as feature extractors, in both frozen and fine-tuned variants. With four experts (top-1 routing), a frozen SigLIP encoder achieves 0.523 SMS, outperforming both CLIP (0.514) and fine-tuned SigLIP (0.515). Freezing the encoder avoids overfitting, especially given the limited (530-image) augmented SNEI training set (Kawabata et al., 15 Dec 2025).

For the language backbone, three small LLMs (SLMs) were evaluated:

| SLM | Parameters | SMS |
|---|---|---|
| Phi-2-2.7B | 2.7 B | 0.523 |
| Qwen-1.8B | 1.8 B | 0.520 |
| StableLM-1.6B | 1.6 B | 0.461 |

Phi-2-2.7B demonstrated the highest semantic accuracy, showing that SLM choice is significant even among models smaller than 3B parameters.

4. Reinforcement Fine-Tuning and Semantic Similarity Reward

RFT employs GSPO with the SSR. The optimization objective:

$$J_{\mathrm{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \min\bigl(s_i(\theta)A_i,\ \mathrm{clip}(s_i(\theta),\,1-\epsilon,\,1+\epsilon)\,A_i\bigr) - \beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})\right]$$
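For one group of $G$ sampled sequences, the objective reduces to a clipped surrogate plus a KL penalty. The sketch below evaluates it for given sequence-level importance ratios $s_i$ and advantages $A_i$; the $\epsilon$ and $\beta$ values are assumed defaults, not taken from the paper.

```python
import numpy as np

# Sketch of the clipped GSPO objective for one group of G sampled
# sequences. s_i are sequence-level importance ratios, A_i advantages,
# kl is a precomputed KL(pi_theta || pi_ref) estimate.
# eps and beta are illustrative hyperparameter choices.
def gspo_objective(s, A, kl, eps=0.2, beta=0.01):
    s, A = np.asarray(s, float), np.asarray(A, float)
    unclipped = s * A
    clipped = np.clip(s, 1 - eps, 1 + eps) * A
    # Pessimistic (min) surrogate averaged over the group, minus KL penalty
    return np.minimum(unclipped, clipped).mean() - beta * kl

J = gspo_objective(s=[1.3, 0.9, 1.05, 0.7], A=[1.0, -0.5, 0.2, 0.8], kl=0.05)
```

The clip term caps how far a single favorable sample can pull the policy, while the KL term anchors it to the reference model.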

The SSR computes BERTScore-F1 between generated and ground-truth actions:

  • For a generated sequence $y$ (with $n$ token embeddings $e_y^j$) and ground truth $g$ (with $m$ token embeddings $e_g^k$), compute the cosine-similarity matrix $S_{jk} = \cos(e_y^j, e_g^k)$.
  • Compute recall $R$ and precision $P$:

$$R = \frac{1}{m} \sum_{k=1}^{m} \max_{1 \le j \le n} S_{jk}, \qquad P = \frac{1}{n} \sum_{j=1}^{n} \max_{1 \le k \le m} S_{jk}$$

  • The SSR reward is the resulting F1 score: $\mathrm{BERTScore\text{-}F1} = \frac{2PR}{P+R}$
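The greedy-matching computation above is straightforward to sketch. Note the embeddings here are random stand-ins; the real reward uses BERT token embeddings of the generated and ground-truth action text.

```python
import numpy as np

# Sketch of the semantic similarity reward (SSR): a BERTScore-style F1
# computed from the cosine-similarity matrix S between n generated-token
# embeddings (rows) and m ground-truth-token embeddings (columns).
def cosine_matrix(E_y, E_g):
    E_y = E_y / np.linalg.norm(E_y, axis=1, keepdims=True)
    E_g = E_g / np.linalg.norm(E_g, axis=1, keepdims=True)
    return E_y @ E_g.T                 # S[j, k] = cos(e_y^j, e_g^k)

def ssr(S):
    R = S.max(axis=0).mean()           # recall: best match for each g-token
    P = S.max(axis=1).mean()           # precision: best match for each y-token
    return 2 * P * R / (P + R)         # BERTScore-style F1

rng = np.random.default_rng(0)
# Random stand-ins for BERT embeddings: n=5 generated, m=4 ground-truth tokens
S = cosine_matrix(rng.normal(size=(5, 16)), rng.normal(size=(4, 16)))
reward = ssr(S)
```

Because every token's best match contributes, paraphrased but semantically equivalent actions still earn a high reward, unlike exact-match or character-level schemes.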

SSR outperforms hard (exact-match) and character-level rewards by clear margins: 0.551 SMS for SSR, versus 0.532 for the hard reward and 0.510 for the character-level reward on the augmented SNEI set. With top-1 routing and 4 experts, GSPO+SSR outperforms alternative RFT methods (Table 9) (Kawabata et al., 15 Dec 2025).

5. Empirical Evaluation and Ablation

Experiments were conducted on the SNEI dataset (325 scenes, augmented to 530 training images). Evaluation metrics include model size, throughput (FPS), BERTScore, SBERT cosine, and sentence-mover’s similarity (SMS). Key findings:

| Model | Params | FPS | SMS (orig.) | SMS (aug.) |
|---|---|---|---|---|
| GPT-4o | 200 B | 0.212 | — | 0.376 |
| Claude | 175 B | 0.087 | — | 0.417 |
| SocialNav-MoE | 5.74 B | 1.709 | 0.489 | 0.551 |

Ablation studies revealed:

  • Four-expert, top-1 routing increases SMS by 10.6% over a single expert.
  • Freezing SigLIP outperforms both frozen CLIP and fine-tuned SigLIP.
  • Multi-turn conversational training yields up to 3% performance gain over single-turn.
  • GSPO+SSR policy optimization produces materially higher semantic alignment than alternatives.

6. Real-Time Deployment and Efficiency Considerations

SocialNav-MoE’s architecture is explicitly engineered for onboard efficiency: its 5.74 B-parameter footprint allows 1.7 frames-per-second inference on a single NVIDIA T4, satisfying the 1 Hz control loop of indoor robots. The full pipeline, from image to action command, completes in under 600 ms. Key deployment factors include:

  • Sparse MoE with top-1 routing,
  • Frozen SigLIP encoder,
  • On-policy mini-batches of size 8 for RFT,
  • Reduced memory and energy requirements—over 20× fewer parameters and 8× higher FPS than 200B parameter baselines.
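The quoted efficiency figures are internally consistent, as a quick arithmetic check shows (assuming, per the evaluation table, that 0.212 FPS is the 200 B-parameter GPT-4o baseline's throughput):

```python
# Quick arithmetic check of the deployment numbers quoted above.
fps = 1.709
latency_ms = 1000 / fps             # per-frame image-to-action latency (~585 ms)
assert latency_ms < 600             # under the stated 600 ms budget
assert fps >= 1.0                   # satisfies the 1 Hz control loop

param_ratio = 200e9 / 5.74e9        # vs. a 200 B-parameter baseline (~34.8x)
assert param_ratio > 20             # "over 20x fewer parameters"

fps_ratio = fps / 0.212             # vs. the 200 B baseline's 0.212 FPS (~8.1x)
assert fps_ratio > 8                # "8x higher FPS"
```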

A plausible implication is that such sparsity and architectural choices position SocialNav-MoE as a practical solution for real-time social navigation on computationally constrained platforms (Kawabata et al., 15 Dec 2025).

7. Summary and Implications

SocialNav-MoE combines SLMs, sparse Mixture-of-Experts routing, a frozen SigLIP visual pipeline, and reinforcement learning with semantic similarity rewards to achieve state-of-the-art efficiency and semantic alignment in socially compliant robotic navigation. Its design enables substantial inference speedups and energy savings over competitive large-scale VLMs, without sacrificing reasoning or adherence to social navigation norms. The cumulative effect of expert layer sparsity, semantic-aware RFT, and aggressive encoder freezing explicitly bridges the gap between high-level social reasoning and low-latency robotic action generation (Kawabata et al., 15 Dec 2025).
