Study 2 • benchmark results plus controller-side integration study
VLM Social Navigation
From motion control to human-aware decisions.
Research question: once the robot can move, how should it decide what to do around people? The reported benchmark is offline, while the broader study also includes online integration experiments and safety-projected decision paths. It should not be described as raw closed-loop VLM motor control.
Boundary: high-level labels only, not raw motor commands
Why Imitation Learning Was Not Enough
The move to VLM evaluation came from a representation bottleneck. Motion control is not the same as social decision-making once people appear in the scene.
What the earlier interface was missing
STOP and FORWARD alone cannot express yielding left, yielding right, or deferring under uncertainty.
Crossing direction, receding motion, and late-entering people can collapse into the same low-level response.
The limitation was not only control quality. It was the lack of a clearer decision representation.
What changed in this study
The project uses pretrained VLMs as a semantic reasoning module rather than training a new controller from scratch.
The intervention is the decision interface: a richer output space and prompt structure over short image sequences.
The goal is not direct actuation. It is better decision-level interpretation around people.
Decision Representation
The action space is deliberately higher level than motor control. These labels describe what the robot should decide socially, not how each joint should move.
Action Space
STOP
FORWARD
LEFT
RIGHT
REVIEW
Why REVIEW matters
Ambiguous social scenes should not always be forced into a hard binary action.
REVIEW makes uncertainty visible instead of hiding it.
It is best understood as an uncertainty interface, not as proof that calibration is solved.
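The label set above can be sketched as a small enumeration with a parser for the model's text reply. This is a minimal illustration, not the study's code: `Decision` and `parse_decision` are hypothetical names, and the rule that an unclear reply falls back to REVIEW is an assumption made here to match the idea of keeping uncertainty visible.

```python
from enum import Enum


class Decision(Enum):
    """The five decision-level labels listed in the action space."""
    STOP = "STOP"
    FORWARD = "FORWARD"
    LEFT = "LEFT"
    RIGHT = "RIGHT"
    REVIEW = "REVIEW"


def parse_decision(raw_text: str) -> Decision:
    """Map free-form model text to exactly one label (hypothetical helper).

    Assumption: a reply that names zero labels or several labels is
    treated as REVIEW instead of being forced into a hard action.
    """
    # Normalize each whitespace-separated token: strip punctuation, uppercase.
    tokens = {tok.strip(".,:;!?\"'()[]").upper() for tok in raw_text.split()}
    matches = [d for d in Decision if d.value in tokens]
    return matches[0] if len(matches) == 1 else Decision.REVIEW
```

With this fallback, a reply such as "stop or forward?" names two labels and is surfaced as REVIEW instead of being resolved silently.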
Benchmark Setup
The reported benchmark is evaluated offline over extracted front-camera frames from curated Go1 rosbags. The broader study also includes controller-side integration experiments and safety-projected online decision tests.
Data
Curated Go1 social-navigation bags
Decision-level labels rather than low-level control labels
Crossing, approaching, receding, entering-late, and review-oriented cases
Sequence Setting
Final setting: 10-frame windows
At most 5 images sent per VLM call
Capped temporal subsampling from extracted frame windows
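One way to read "capped temporal subsampling" is to pick evenly spaced frames from each window while never exceeding the per-call image cap. The sketch below assumes that reading; the even spacing, the function name `subsample_window`, and the choice to always keep the first and last frame are illustrative assumptions, not details confirmed by the study.

```python
def subsample_window(frames, max_images=5):
    """Return at most max_images frames, evenly spaced in time.

    Assumptions (not from the study): spacing is uniform and the first
    and last frames of the window are always kept.
    """
    if max_images < 1:
        raise ValueError("max_images must be at least 1")
    n = len(frames)
    if n <= max_images:
        return list(frames)
    if max_images == 1:
        return [frames[-1]]  # keep only the most recent frame
    # Integer arithmetic keeps the index choice deterministic.
    indices = [(i * (n - 1)) // (max_images - 1) for i in range(max_images)]
    return [frames[i] for i in indices]
```

For a 10-frame window and a cap of 5, this selects the frames at positions 0, 2, 4, 6, and 9.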
Models
Qwen3-VL-30B
InternVL-3.5-14B
Single-image and sequence-based prompting
Observations from the Saved Run
The dashboard above contains the benchmark summary. The points below are the parts that mattered most when interpreting the run.
Where sequence helped
Sequence reasoning helped in the receding-person case, where both sequence methods correctly returned FORWARD.
The richer action space made lateral intent and uncertainty visible instead of collapsing everything into stop/go.
REVIEW gave the evaluator a more honest place to put ambiguity.
What remained difficult
Crossing and entering-late cases still failed at the consensus level in several bags.
Sequence context did not automatically fix ambiguity; sometimes it diluted the person cue that mattered.
The interface improved faster than perception reliability did under the current sampling regime.
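The page does not say how bag-level consensus is computed. One plausible reading is a majority vote over the per-window labels within a bag, sketched below. The function name `bag_consensus` and the tie rule (ties resolve to REVIEW) are assumptions made for illustration only.

```python
from collections import Counter


def bag_consensus(window_labels, tie_label="REVIEW"):
    """Majority vote over per-window labels for one bag (assumed rule).

    Assumption: an empty bag or a tie for first place yields tie_label,
    so disagreement is reported rather than resolved arbitrarily.
    """
    if not window_labels:
        return tie_label
    ranked = Counter(window_labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]
```

Under this rule, a bag fails at the consensus level when the voted label differs from the bag's reference label, even if some individual windows were labeled correctly.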
Diagnostic Signals
Review Bags Without Review Consensus: 1 / 3
InternVL Sequence Unsafe-Forward: 0.610
Qwen Sequence Unsafe-Forward: 0.432
Unsafe-forward mattered more than raw accuracy alone because these are the cases most likely to violate a conservative social policy.
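The exact definition of the unsafe-forward rate is not given here. A common way to define such a metric is the share of cases whose reference label is anything other than FORWARD in which the model still predicted FORWARD. The sketch below assumes that definition; `unsafe_forward_rate` is a hypothetical name and the reported values above may have been computed differently.

```python
def unsafe_forward_rate(predicted, reference):
    """Fraction of non-FORWARD reference cases predicted as FORWARD.

    Assumption: "unsafe forward" means predicting FORWARD when the
    reference label is STOP, LEFT, RIGHT, or REVIEW.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    # Keep only predictions for cases where going forward was not correct.
    at_risk = [p for p, r in zip(predicted, reference) if r != "FORWARD"]
    if not at_risk:
        return 0.0
    return sum(p == "FORWARD" for p in at_risk) / len(at_risk)
```

Unlike raw accuracy, this ignores harmless confusions such as LEFT versus RIGHT and counts only errors that would move the robot toward a person.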
Interpretation
The result is more useful as a diagnostic signal than as a success metric.
Sequence reasoning did not automatically solve ambiguity.
The main open problem remained perception consistency and activation on the relevant person or motion cue.
Deployment Boundary
The reported benchmark is offline, while the broader study also includes wrapper integration, control-path experiments, and safety-projected online decision studies. It should still not be described as deployed raw VLM robot control.
WHAT WAS EVALUATED
🧪 Offline
Single-image and sequence-based social-navigation policies
Decision labels over curated Go1 rosbag scenarios
WHAT WAS NOT DONE
⚠️ Not Deployed
No full closed-loop VLM deployment on Go1
No raw VLM motor control path
PROPOSED SYSTEM VIEW
Safety-Projected Path
Fast controller handles real-time motion and local safety
Slower VLM layer provides semantic guidance
Why the split matters
VLM latency and missing geometric guarantees make raw low-level control a poor fit. The more defensible use is a slower semantic layer whose outputs are projected through a safety boundary before anything executable is considered.
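The projection step can be illustrated with a small gate that sits between the semantic label and the fast controller. Everything in this sketch is hypothetical: the `LocalSafetyState` fields, the 0.8 m stop distance, and the choice to map REVIEW and unknown labels to STOP are assumptions chosen to show the shape of the idea, not the study's actual safety logic.

```python
from dataclasses import dataclass


@dataclass
class LocalSafetyState:
    """Geometry the fast loop is assumed to know (hypothetical fields)."""
    min_obstacle_distance_m: float
    left_clear: bool
    right_clear: bool


def project_to_safe_action(label, state, stop_distance_m=0.8):
    """Return the label the fast controller may act on.

    Assumptions: REVIEW and unknown labels are treated as STOP, and any
    motion label is overridden when local geometry says it is unsafe.
    """
    if label in ("STOP", "REVIEW"):
        return "STOP"
    if state.min_obstacle_distance_m < stop_distance_m:
        return "STOP"  # local safety overrides semantic guidance
    if label == "LEFT" and not state.left_clear:
        return "STOP"
    if label == "RIGHT" and not state.right_clear:
        return "STOP"
    if label in ("FORWARD", "LEFT", "RIGHT"):
        return label
    return "STOP"  # unrecognized label: fail closed
```

The gate can only keep a label or downgrade it to STOP, so the slower semantic layer never gains more authority than the fast loop's local checks allow.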
Social Navigation Video
This clip is included as study media for the later Go1 social-navigation phase. It belongs with the broader controller-side integration work and still should not be read as evidence of raw closed-loop VLM motor control.
VLM Social Navigation Clip
Local mp4 render of the VLM social-navigation study media.
Future Work
The next steps are mostly about making the semantic layer more reliable and tightening how it hands decisions to the fast safety loop.
Near-term directions
Reduce latency in the decision loop.
Improve REVIEW calibration under ambiguous scenes.
Strengthen the safety-projected handoff from semantic label to executable action.
Open scenario gaps
Improve crossing detection and entering-late robustness.
Expand scenario coverage beyond the current benchmark mix.
Make online decision loops more stable without turning the VLM into a raw motor controller.
Related Notes
The full archive now lives in the shared Research Notebook. The links here stay focused on the notes most relevant to the VLM social-navigation study.
VLM Social Navigation Overview
Short summary of the VLM study, its reported benchmark, and its controller-side integration boundary.