SocialNav-MoE: Socially Compliant Navigation
- The paper introduces SocialNav-MoE, a model that integrates a sparse MoE backbone, a frozen SigLIP encoder, and small language models for socially compliant robotic navigation.
- It employs a three-stage pipeline—SFT, reinforcement fine-tuning using a semantic similarity reward, and MoE fine-tuning—to optimize decision-making in complex human environments.
- Empirical results demonstrate a 10.6% SMS increase and real-time performance (1.7 FPS) on resource-constrained platforms, underscoring its practical efficiency.
SocialNav-MoE is a Mixture-of-Experts (MoE) vision-LLM designed for efficient, socially compliant navigation of robots in human-populated environments. It addresses both safety and social compliance by leveraging a sparse MoE backbone, small LLMs, a frozen SigLIP vision encoder, and a semantic similarity reward function within reinforcement fine-tuning (RFT). The framework enables real-time deployment on resource-constrained robotics platforms without the computational and latency costs typical of large-scale vision-LLMs (Kawabata et al., 15 Dec 2025).
1. Architectural Overview
SocialNav-MoE utilizes a three-stage sequential pipeline:
- Supervised Fine-Tuning (SFT): The model is initialized from LLaVA-1.5-58K and further trained on the SNEI dataset to align multimodal (vision and language) observations with ground-truth navigation actions. Training applies distinct learning rates for the initialization and SFT phases.
- Reinforcement Fine-Tuning (RFT): Policy learning uses Group Sequence Policy Optimization (GSPO)—a PPO variant—with a novel semantic similarity reward (SSR) designed to incentivize socially compliant decision-making over a group batch of 8 sampled answer sequences.
- MoE Fine-Tuning (MoEFT): The sparse MoE expert layers and the router are jointly trained using multi-turn conversational data, enhancing coherence over extended dialogues.
Inference operates via multi-turn prompting; each observation is encoded into visual (via vision encoder) and textual histories, which jointly form the model context. The core Transformer alternates conventional and MoE layers, invoking only a sparse subset of experts per token to maintain computational efficiency.
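The multi-turn context assembly described above can be sketched as follows. This is a rough illustration only; the function name, turn structure, and token handling are assumptions, since the paper's exact prompt format is not reproduced here.

```python
def build_context(turns, new_image_tokens, new_instruction):
    """Flatten prior (image_tokens, text_tokens) turns plus the new
    observation into one model context. Illustrative sketch, not the
    paper's actual prompt format."""
    ctx = []
    for image_tokens, text_tokens in turns:
        ctx.extend(image_tokens)   # visual history from the vision encoder
        ctx.extend(text_tokens)    # textual history (prior queries/answers)
    ctx.extend(new_image_tokens)   # current encoded observation
    ctx.extend(new_instruction)    # current navigation query
    return ctx
```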
2. Mixture-of-Experts Design and Routing
MoE blocks are interleaved in half the Transformer layers. Each block contains $N$ small feed-forward experts $E_1, \dots, E_N$ and a router function $R$. For an input embedding $x$, the router outputs logits $z = R(x) \in \mathbb{R}^N$, producing weights $g(x)$ via softmax:

$$g_i(x) = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}$$

The output of the MoE block is computed by aggregating the selected top-$k$ experts:

$$y = \sum_{i \in \mathrm{TopK}(g(x),\, k)} g_i(x)\, E_i(x)$$
Empirical studies support $N = 4$ experts with top-$1$ routing ($k = 1$); this configuration yields a 10.6% increase in sentence-mover’s similarity (SMS) versus using only one expert (from 0.473 to 0.523). Increasing $k$ beyond 1 incurs additional latency and risks conflicting expert outputs given the limited training data (Kawabata et al., 15 Dec 2025).
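A minimal NumPy sketch of the sparse routing above. Shapes and names are illustrative: the real experts are small feed-forward networks, stood in for here by single matrices.

```python
import numpy as np

def moe_block(x, expert_mats, router_mat, k=1):
    """Sparse MoE forward pass for one token embedding x of shape (d,).
    expert_mats: list of N (d, d) matrices standing in for small FFN experts.
    router_mat: (d, N) router producing one logit per expert."""
    logits = x @ router_mat                      # router logits z = R(x)
    w = np.exp(logits - logits.max())
    w /= w.sum()                                 # softmax gate weights g(x)
    top = np.argsort(w)[-k:]                     # indices of the top-k experts
    # aggregate only the selected experts, weighted by their gate values
    return sum(w[i] * (expert_mats[i] @ x) for i in top)
```

With $k = 1$, only a single expert's feed-forward pass runs per token, which is what keeps inference cost low even though all $N$ experts remain in memory.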
3. Vision Encoder and LLM Choices
The model compares CLIP and SigLIP as feature extractors, with both frozen and fine-tuned variants. With four experts (top-1 routing), a frozen SigLIP encoder achieves 0.523 SMS, outperforming both CLIP (0.514) and fine-tuned SigLIP (0.515). Freezing the encoder avoids overfitting, especially given the limited (530-image) augmented SNEI set (Kawabata et al., 15 Dec 2025).
For the language backbone, three small LLMs (SLMs) were evaluated:
| SLM | Parameters | SMS |
|---|---|---|
| Phi-2-2.7B | 2.7 B | 0.523 |
| Qwen-1.8B | 1.8 B | 0.520 |
| StableLM-1.6B | 1.6 B | 0.461 |
Phi-2-2.7B demonstrated the highest semantic accuracy, showing that SLM choice is significant even among models smaller than 3B parameters.
4. Reinforcement Fine-Tuning and Semantic Similarity Reward
RFT employs GSPO with the SSR as the scalar reward. Over a group of $G$ sampled answer sequences $\{y_i\}$ with rewards $\{r_i\}$, the optimization objective is

$$\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_i\big)\right],$$

where $s_i(\theta) = \big(\pi_\theta(y_i \mid x)/\pi_{\theta_{\text{old}}}(y_i \mid x)\big)^{1/|y_i|}$ is the length-normalized sequence-level importance ratio and $\hat{A}_i = \big(r_i - \operatorname{mean}(\{r_j\})\big)/\operatorname{std}(\{r_j\})$ is the group-normalized advantage.
The SSR computes BERTScore-F1 between generated and ground-truth actions:
- For ground-truth token embeddings $x_i$ and generated token embeddings $\hat{x}_j$, calculate cosine similarity $\operatorname{sim}(x_i, \hat{x}_j) = \dfrac{x_i^\top \hat{x}_j}{\|x_i\|\,\|\hat{x}_j\|}$.
- Compute recall $R_{\text{BERT}}$ and precision $P_{\text{BERT}}$:

$$R_{\text{BERT}} = \frac{1}{|x|}\sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \operatorname{sim}(x_i, \hat{x}_j), \qquad P_{\text{BERT}} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \operatorname{sim}(x_i, \hat{x}_j)$$

- The SSR reward is the resulting F1 score:

$$r_{\text{SSR}} = \frac{2\, P_{\text{BERT}}\, R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$
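Assuming token embeddings are available as rows of a matrix, the SSR computation can be sketched as below. This is a minimal greedy-matching version; the paper's pipeline uses actual BERT embeddings.

```python
import numpy as np

def ssr_reward(gen_emb, ref_emb):
    """BERTScore-F1-style reward between generated and reference token
    embeddings (one token per row). Minimal sketch of the SSR."""
    g = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = r @ g.T                       # cosine similarities, |ref| x |gen|
    recall = sim.max(axis=1).mean()     # best match for each reference token
    precision = sim.max(axis=0).mean()  # best match for each generated token
    return 2 * precision * recall / (precision + recall)
```

A perfectly matching answer scores 1.0; partial token overlap degrades the reward smoothly rather than all-or-nothing, which is the property that distinguishes SSR from exact-match rewards.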
SSR outperforms hard (exact-match) and character-level rewards, improving SMS by clear margins (0.551 for SSR vs. 0.532 for the hard reward and 0.510 for the character-level reward on the augmented SNEI set). With top-1 routing and 4 experts, GSPO+SSR outperforms alternative RFT methods (Table 9) (Kawabata et al., 15 Dec 2025).
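A minimal NumPy sketch of the group-relative update, assuming per-sequence summed log-probabilities are available; the clipping threshold `eps` and the variance floor are illustrative hyperparameters, not the paper's values.

```python
import numpy as np

def gspo_loss(logp_new, logp_old, seq_lens, rewards, eps=0.2):
    """Negative GSPO objective for one group of G sampled sequences.
    logp_new/logp_old: summed token log-probs per sequence under the
    current/old policy; seq_lens: lengths |y_i|; rewards: e.g. SSR values."""
    # length-normalized sequence-level importance ratios s_i
    s = np.exp((logp_new - logp_old) / seq_lens)
    # group-normalized advantages (variance floor avoids division by zero)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    clipped = np.clip(s, 1 - eps, 1 + eps)
    return -np.minimum(s * adv, clipped * adv).mean()
```

Because advantages are normalized within the group of 8 sampled answers, the update rewards answers that are semantically closer to ground truth than their group peers, without needing a learned value function.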
5. Empirical Evaluation and Ablation
Experiments were conducted on the SNEI dataset (325 scenes, augmented to 530 training images). Evaluation metrics include model size, throughput (FPS), BERTScore, SBERT cosine, and sentence-mover’s similarity (SMS). Key findings:
| Model | Params | FPS | SMS (orig.) | SMS (aug.) |
|---|---|---|---|---|
| GPT-4o | 200 B | 0.212 | 0.376 | — |
| Claude | 175 B | 0.087 | 0.417 | — |
| SocialNav-MoE | 5.74 B | 1.709 | 0.489 | 0.551 |
Ablation studies revealed:
- Four-expert, top-1 routing increases SMS by 10.6% over a single expert.
- Freezing SigLIP outperforms both frozen CLIP and fine-tuned SigLIP.
- Multi-turn conversational training yields up to 3% performance gain over single-turn.
- GSPO+SSR policy optimization produces materially higher semantic alignment than alternatives.
6. Real-Time Deployment and Efficiency Considerations
SocialNav-MoE’s architecture is explicitly engineered for onboard efficiency: its 5.74 B-parameter footprint allows 1.7 frames per second inference on a single NVIDIA T4, satisfying the 1 Hz control loop of indoor robots. The pipeline, from image to action command, completes in under 600 ms. Key deployment factors include:
- Sparse MoE with top-1 routing,
- Frozen SigLIP encoder,
- On-policy mini-batches of size 8 for RFT,
- Reduced memory and energy requirements—over 20× fewer parameters and 8× higher FPS than 200B parameter baselines.
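The reported throughput can be checked against the control-loop budget with simple arithmetic:

```python
# Back-of-envelope check: 1.709 FPS implies roughly 585 ms per frame,
# under both the ~600 ms pipeline figure and the 1000 ms budget
# of a 1 Hz control loop.
fps = 1.709
latency_ms = 1000.0 / fps
print(round(latency_ms))  # 585
```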
A plausible implication is that such sparsity and architectural choices position SocialNav-MoE as a practical solution for real-time social navigation on computationally constrained platforms (Kawabata et al., 15 Dec 2025).
7. Summary and Implications
SocialNav-MoE combines SLMs, sparse Mixture-of-Experts routing, a frozen SigLIP visual pipeline, and reinforcement learning with semantic similarity rewards to achieve state-of-the-art efficiency and semantic alignment in socially compliant robotic navigation. Its design enables substantial inference speedups and energy savings over competitive large-scale VLMs, without sacrificing reasoning or adherence to social navigation norms. The cumulative effect of expert layer sparsity, semantic-aware RFT, and aggressive encoder freezing explicitly bridges the gap between high-level social reasoning and low-latency robotic action generation (Kawabata et al., 15 Dec 2025).