Study 2 • benchmark results plus controller-side integration study

VLM Social Navigation

From motion control to human-aware decisions.

Research question: once the robot can move, how should it decide what to do around people? The reported benchmark is offline, while the broader study also includes online integration experiments and safety-projected decision paths. Neither part should be described as raw closed-loop VLM motor control.


Curated Social-Nav Bags

Primary Consensus Accuracy

Decision Labels

SOCIAL DECISION BENCHMARK OFFLINE

Benchmark Dashboard

This panel summarizes the offline decision benchmark. It is not a live robot-control dashboard.

Decision Explorer

STOP

Meaning: Yield or pause when a person is directly ahead or approaching.

Typical situation: Approaching or directly blocked person.

Why it exists: Keeps the policy conservative in socially occupied space.

Deployment note: High-level decision only, not a direct brake command.

Frames → VLM Prompt → Social Decision → Safety Projection

Scope: offline evaluation over curated Go1 bags

Boundary: high-level labels only, not raw motor commands

Why Imitation Learning Was Not Enough

The move to VLM evaluation came from a representation bottleneck. Motion control is not the same as social decision-making once people appear in the scene.

What the earlier interface was missing

STOP and FORWARD alone cannot express yielding left, yielding right, or deferring under uncertainty.
Crossing direction, receding motion, and late-entering people can collapse into the same low-level response.
The limitation was not only control quality. It was the lack of a clearer decision representation.

What changed in this study

The project uses pretrained VLMs as a semantic reasoning module rather than training a new controller from scratch.
The intervention is the decision interface: a richer output space and prompt structure over short image sequences.
The goal is not direct actuation. It is better decision-level interpretation around people.

Decision Representation

The action space is deliberately higher level than motor control. These labels describe what the robot should decide socially, not how each joint should move.

Action Space

STOP
FORWARD
LEFT
RIGHT
REVIEW
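A minimal sketch of how the five labels could be held in code. The class name `SocialDecision` and the helper `is_committed` are hypothetical and are not taken from the study's codebase; only the label names come from the action space above.

```python
from enum import Enum


class SocialDecision(str, Enum):
    """The five decision-level labels listed in the action space.

    Hypothetical container: the study does not say how labels are stored.
    """

    STOP = "STOP"
    FORWARD = "FORWARD"
    LEFT = "LEFT"
    RIGHT = "RIGHT"
    REVIEW = "REVIEW"


def is_committed(decision: SocialDecision) -> bool:
    """Return True when the label commits to a behaviour.

    REVIEW is the only label that defers instead of committing.
    """
    return decision is not SocialDecision.REVIEW
```

Keeping the labels in a closed enum means any string outside the five-label space raises a `ValueError` on construction instead of flowing further down the pipeline.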

Why REVIEW matters

Ambiguous social scenes should not always be forced into a hard binary action.
REVIEW makes uncertainty visible instead of hiding it.
It is best understood as an uncertainty interface, not as proof that calibration is solved.
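One way to make that uncertainty interface concrete is at the point where a free-text VLM reply is turned into a label. The sketch below is an assumption, not the study's parser: the function name `parse_decision` is hypothetical, and the rule "anything other than exactly one recognised label becomes REVIEW" is chosen here for illustration.

```python
import re

# Label names come from the action space; the regex-based parsing is an assumption.
_LABEL_PATTERN = re.compile(r"\b(stop|forward|left|right|review)\b", re.IGNORECASE)


def parse_decision(reply: str) -> str:
    """Map a free-text reply to one label, deferring to REVIEW when unclear.

    Zero recognised labels, or more than one distinct label, is treated as
    ambiguity and surfaced as REVIEW rather than forced into an action.
    """
    found = {match.upper() for match in _LABEL_PATTERN.findall(reply)}
    if len(found) == 1:
        return found.pop()
    return "REVIEW"
```

This is deliberately conservative: a reply such as "person on the right, so STOP" mentions two label words and is deferred. A stricter output format in the prompt would reduce such false deferrals, but that format is not specified in the source.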

Benchmark Setup

The reported benchmark is evaluated offline over extracted front-camera frames from curated Go1 rosbags. The broader study also includes controller-side integration experiments and safety-projected online decision tests.

Data

Curated Go1 social-navigation bags
Decision-level labels rather than low-level control labels
Crossing, approaching, receding, entering-late, and review-oriented cases

Sequence Setting

Final setting: 10-frame windows
At most 5 images sent per VLM call
Capped temporal subsampling from extracted frame windows
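The capped subsampling step can be sketched as follows. The source only states the window length (10 frames) and the cap (5 images per call); the evenly spaced selection used here, and the name `subsample_window`, are assumptions for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_window(frames: Sequence[T], max_images: int = 5) -> List[T]:
    """Pick at most `max_images` frames from a window, keeping first and last.

    Assumption: frames are chosen at evenly spaced positions. The study only
    says the subsampling is capped, not how positions are chosen.
    """
    if max_images < 1:
        raise ValueError("max_images must be at least 1")
    n = len(frames)
    if n <= max_images:
        return list(frames)
    if max_images == 1:
        return [frames[-1]]  # keep the most recent frame
    step = (n - 1) / (max_images - 1)
    indices = [round(i * step) for i in range(max_images)]
    return [frames[i] for i in indices]


# A 10-frame window with the default cap of 5 selects positions 0, 2, 4, 7, 9.
selected = subsample_window(list(range(10)))
```

Spacing the picks across the whole window preserves the start and end of the motion, at the cost of possibly skipping the single frame where a person first enters the view.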

Models

Qwen3-VL-30B
InternVL-3.5-14B
Single-image and sequence-based prompting

Observations from the Saved Run

The dashboard above contains the benchmark summary. The points below are the parts that mattered most when interpreting the run.

Where sequence helped

Sequence reasoning helped in the receding-person case, where both sequence methods correctly returned FORWARD.
The richer action space made lateral intent and uncertainty visible instead of collapsing everything into stop/go.
REVIEW gave the evaluator a more honest place to put ambiguity.

What remained difficult

Crossing and entering-late cases still failed at the consensus level in several bags.
Sequence context did not automatically fix ambiguity; sometimes it diluted the person cue that mattered.
The interface improved faster than perception reliability did under the current sampling regime.
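To make "failed at the consensus level" concrete, here is a hedged sketch of a per-bag consensus and the accuracy built on it. The source does not define how consensus is computed; majority vote over per-window predictions, with ties deferred to REVIEW, is an assumption, and the names `bag_consensus` and `consensus_accuracy` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def bag_consensus(predictions: Sequence[str], fallback: str = "REVIEW") -> str:
    """Majority label across one bag's predictions (assumed definition).

    An empty bag or a tie for first place returns `fallback`.
    """
    if not predictions:
        return fallback
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return fallback
    return ranked[0][0]


def consensus_accuracy(bags: Iterable[Tuple[str, Sequence[str]]]) -> float:
    """Fraction of bags whose consensus label matches the expected label.

    Each item is (expected_label, per_window_predictions).
    """
    bags = list(bags)
    if not bags:
        raise ValueError("no bags to score")
    hits = sum(bag_consensus(preds) == expected for expected, preds in bags)
    return hits / len(bags)
```

Under this definition a bag can contain several correct windows and still fail, which matches the pattern described above where the relevant person cue is present in some frames but does not carry the vote.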

Diagnostic Signals

Review Bags Without Review Consensus: 1 / 3

InternVL Sequence Unsafe-Forward: 0.610

Qwen Sequence Unsafe-Forward: 0.432

Unsafe-forward mattered more than raw accuracy alone because these are the cases most likely to violate a conservative social policy.
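A sketch of how an unsafe-forward rate could be computed. The exact definition behind the figures above is not given in the source; this version assumes it is the share of cases whose reference label is not FORWARD but whose prediction is FORWARD. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def unsafe_forward_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of non-FORWARD reference cases that were predicted FORWARD.

    Each item is (expected_label, predicted_label). The denominator choice
    (only cases where FORWARD is not the reference) is an assumption.
    """
    non_forward = [(exp, pred) for exp, pred in pairs if exp != "FORWARD"]
    if not non_forward:
        return 0.0
    unsafe = sum(pred == "FORWARD" for _, pred in non_forward)
    return unsafe / len(non_forward)
```

Unlike raw accuracy, this measure ignores errors in the cautious direction (for example STOP predicted where LEFT was expected) and counts only the errors that would move the robot toward a person.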

Interpretation

The result is more useful as a diagnostic signal than as a success metric.
Sequence reasoning did not automatically solve ambiguity.
The main open problem remained perception consistency and activation on the relevant person or motion cue.

Deployment Boundary

The reported benchmark is offline, while the broader study also includes wrapper integration, control-path experiments, and safety-projected online decision studies. It should still not be described as deployed raw VLM robot control.

WHAT WAS EVALUATED

🧪 Offline

Single-image and sequence-based social-navigation policies
Decision labels over curated Go1 rosbag scenarios

WHAT WAS NOT DONE

⚠️ Not Deployed

No full closed-loop VLM deployment on Go1
No raw VLM motor control path

PROPOSED SYSTEM VIEW

✔ Safety-Projected Path

Fast controller handles real-time motion and local safety
Slower VLM layer provides semantic guidance

Why the split matters

VLM latency and missing geometric guarantees make raw low-level control a poor fit. The more defensible use is a slower semantic layer whose outputs are projected through a safety boundary before anything executable is considered.
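The split can be illustrated with a small projection step. Everything in this sketch is hypothetical: the `LocalSafety` fields stand in for whatever the fast controller actually reports, and the rule "anything not explicitly cleared becomes STOP" is an assumed policy, not the study's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalSafety:
    """Hypothetical clearance flags supplied by the fast local controller."""

    front_clear: bool
    left_clear: bool
    right_clear: bool


def project_decision(label: str, safety: LocalSafety) -> str:
    """Project a semantic label onto what the safety layer currently allows.

    Returns a high-level intent only, never a motor command. STOP, REVIEW,
    unknown labels, and any motion whose direction is not clear all fall
    back to STOP (an assumed conservative default).
    """
    if label == "FORWARD" and safety.front_clear:
        return "FORWARD"
    if label == "LEFT" and safety.left_clear:
        return "LEFT"
    if label == "RIGHT" and safety.right_clear:
        return "RIGHT"
    return "STOP"
```

The point of the design is that the VLM label can only narrow what the robot does: the fast controller's view of local clearance always has the final say, so a late or wrong semantic label cannot authorise motion the safety layer has not cleared.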

Social Navigation Video

This clip is included as study media for the later Go1 social-navigation phase. It belongs with the broader controller-side integration work and still should not be read as evidence of raw closed-loop VLM motor control.

VLM Social Navigation Clip

Local mp4 render of the VLM social-navigation study media.

Future Work

The next steps are mostly about making the semantic layer more reliable and tightening how it hands decisions to the fast safety loop.

Near-term directions

Reduce latency in the decision loop.
Improve REVIEW calibration under ambiguous scenes.
Strengthen the safety-projected handoff from semantic label to executable action.

Open scenario gaps

Improve crossing detection and entering-late robustness.
Expand scenario coverage beyond the current benchmark mix.
Make online decision loops more stable without turning the VLM into a raw motor controller.

Related Notes

The full archive now lives in the shared Research Notebook. The links here stay focused on the notes most relevant to the VLM social-navigation study.


VLM Social Navigation Overview

Decision Representation

Benchmark Design

Deployment Boundary

Future Work