Study 2 • benchmark results plus controller-side integration study

VLM Social Navigation

From motion control to human-aware decisions.

Research question: once the robot can move, how should it decide what to do around people? The reported benchmark is offline, while the broader study also includes online integration experiments and safety-projected decision paths. It should not be described as raw closed-loop VLM motor control.

Dashboard counters: Curated Social-Nav Bags, Primary Consensus Accuracy, Decision Labels (values not captured in this text version).
SOCIAL DECISION BENCHMARK OFFLINE
Benchmark Dashboard

This panel summarizes the offline decision benchmark. It is not a live robot-control dashboard.

Decision Explorer
STOP

Meaning: Yield or pause when a person is directly ahead or approaching.

Typical situation: Approaching or directly blocked person.

Why it exists: Keeps the policy conservative in socially occupied space.

Deployment note: High-level decision only, not a direct brake command.

Frames → VLM Prompt → Social Decision → Safety Projection
Scope: offline evaluation over curated Go1 bags
Boundary: high-level labels only, not raw motor commands

Why Imitation Learning Was Not Enough

The move to VLM evaluation came from a representation bottleneck. Motion control is not the same as social decision-making once people appear in the scene.

What the earlier interface was missing

  • STOP and FORWARD alone cannot express yielding left, yielding right, or deferring under uncertainty.
  • Crossing direction, receding motion, and late-entering people can collapse into the same low-level response.
  • The limitation was not only control quality. It was the lack of a clearer decision representation.

What changed in this study

  • The project uses pretrained VLMs as a semantic reasoning module rather than training a new controller from scratch.
  • The intervention is the decision interface: a richer output space and prompt structure over short image sequences.
  • The goal is not direct actuation. It is better decision-level interpretation around people.

Decision Representation

The action space is deliberately higher level than motor control. These labels describe what the robot should decide socially, not how each joint should move.

Action Space

  • STOP
  • FORWARD
  • LEFT
  • RIGHT
  • REVIEW

Why REVIEW matters

  • Ambiguous social scenes should not always be forced into a hard binary action.
  • REVIEW makes uncertainty visible instead of hiding it.
  • It is best understood as an uncertainty interface, not as proof that calibration is solved.
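The five-label action space and the role of REVIEW can be sketched as a small parser over free-form model text. This is a minimal illustration, not the study's parser: the `Decision` enum, the `parse_decision` helper, and the rule that empty or conflicting output falls back to REVIEW are all assumptions made for the example.

```python
import re
from enum import Enum


class Decision(Enum):
    """High-level social decision labels (not motor commands)."""
    STOP = "STOP"
    FORWARD = "FORWARD"
    LEFT = "LEFT"
    RIGHT = "RIGHT"
    REVIEW = "REVIEW"


# Match a label only as a whole word, so "forwarding" is not read as FORWARD.
_LABEL_RE = re.compile(r"\b(STOP|FORWARD|LEFT|RIGHT|REVIEW)\b")


def parse_decision(raw_text: str) -> Decision:
    """Map model text to one label.

    Assumption for this sketch: if the text names no label, or names
    more than one distinct label, the result is REVIEW rather than a guess.
    """
    found = set(_LABEL_RE.findall(raw_text.upper()))
    if len(found) == 1:
        return Decision(found.pop())
    return Decision.REVIEW
```

Routing unparseable output to REVIEW keeps ambiguity visible at the interface, which matches the intent described above. It says nothing about whether the model's own uncertainty is well calibrated.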

Benchmark Setup

The reported benchmark is evaluated offline over extracted front-camera frames from curated Go1 rosbags. The broader study also includes controller-side integration experiments and safety-projected online decision tests.

Data

  • Curated Go1 social-navigation bags
  • Decision-level labels rather than low-level control labels
  • Crossing, approaching, receding, entering-late, and review-oriented cases

Sequence Setting

  • Final setting: 10-frame windows
  • At most 5 images sent per VLM call
  • Capped temporal subsampling from extracted frame windows
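One way to read "capped temporal subsampling" is evenly spaced index selection that always keeps the first and last frame of the window. The sketch below assumes that reading; the function name and the exact spacing rule are hypothetical, and only the 10-frame window and 5-image cap come from the setting above.

```python
def subsample_indices(window_len: int = 10, max_images: int = 5) -> list[int]:
    """Pick at most `max_images` frame indices from a window of `window_len` frames.

    Indices are evenly spaced with integer arithmetic and include the first
    and last frame. The spacing rule is an assumption for illustration.
    """
    if window_len <= 0 or max_images <= 0:
        return []
    if window_len <= max_images:
        # Window already fits under the cap: keep every frame.
        return list(range(window_len))
    if max_images == 1:
        # With a single slot, keep the most recent frame.
        return [window_len - 1]
    last = window_len - 1
    return [(i * last) // (max_images - 1) for i in range(max_images)]
```

With the defaults, this selects frames 0, 2, 4, 6, and 9 of each 10-frame window.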

Models

  • Qwen3-VL-30B
  • InternVL-3.5-14B
  • Single-image and sequence-based prompting

Observations from the Saved Run

The dashboard above contains the benchmark summary. The points below are the parts that mattered most when interpreting the run.

Where sequence helped

  • Sequence reasoning helped in the receding-person case, where both sequence methods correctly returned FORWARD.
  • The richer action space made lateral intent and uncertainty visible instead of collapsing everything into stop/go.
  • REVIEW gave the evaluator a more honest place to put ambiguity.

What remained difficult

  • Crossing and entering-late cases still failed at the consensus level in several bags.
  • Sequence context did not automatically fix ambiguity; sometimes it diluted the person cue that mattered.
  • The interface improved faster than perception reliability did under the current sampling regime.
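The page does not state how bag-level consensus is computed. As a hedged illustration only, the sketch below assumes a strict majority vote over per-window labels within a bag, with REVIEW returned when no label wins outright. Both function names and the bag data layout are hypothetical.

```python
from collections import Counter


def consensus_label(window_labels: list[str]) -> str:
    """Return the strict-majority label for one bag, else REVIEW.

    Assumption: consensus means more than half of the windows agree.
    """
    if not window_labels:
        return "REVIEW"
    label, count = Counter(window_labels).most_common(1)[0]
    return label if count * 2 > len(window_labels) else "REVIEW"


def consensus_accuracy(bags: dict[str, tuple[list[str], str]]) -> float:
    """Fraction of bags whose consensus label equals the reference label.

    `bags` maps a bag name to (per-window predictions, reference label).
    """
    if not bags:
        return 0.0
    hits = sum(consensus_label(preds) == truth for preds, truth in bags.values())
    return hits / len(bags)
```

Under this assumed rule, a bag can fail at the consensus level even when some individual windows are correct, which is one way the crossing and entering-late failures described above could arise.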

Diagnostic Signals

Review Bags Without Review Consensus: 1 / 3
InternVL Sequence Unsafe-Forward: 0.610
Qwen Sequence Unsafe-Forward: 0.432

Unsafe-forward mattered more than raw accuracy alone because these are the cases most likely to violate a conservative social policy.
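The exact definition behind the unsafe-forward figures is not given here. A minimal sketch, assuming the metric is the share of cases with a non-FORWARD reference label for which the model still predicted FORWARD (the function name and input layout are hypothetical):

```python
def unsafe_forward_rate(pairs: list[tuple[str, str]]) -> float:
    """Compute an unsafe-forward rate from (predicted, reference) label pairs.

    Assumption: only cases whose reference label is not FORWARD count in the
    denominator, and a prediction of FORWARD on such a case is "unsafe".
    """
    non_forward = [(pred, ref) for pred, ref in pairs if ref != "FORWARD"]
    if not non_forward:
        return 0.0
    unsafe = sum(pred == "FORWARD" for pred, _ in non_forward)
    return unsafe / len(non_forward)
```

Defined this way, the metric ignores correct FORWARD predictions entirely. That is why it can diverge sharply from raw accuracy and why it is the more relevant number for a conservative policy.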

Interpretation

  • The result is more useful as a diagnostic signal than as a success metric.
  • Sequence reasoning did not automatically solve ambiguity.
  • The main open problem remained perception consistency and activation on the relevant person or motion cue.

Deployment Boundary

The reported benchmark is offline, while the broader study also includes wrapper integration, control-path experiments, and safety-projected online decision studies. It should still not be described as deployed raw VLM robot control.

WHAT WAS EVALUATED

🧪 Offline

  • Single-image and sequence-based social-navigation policies
  • Decision labels over curated Go1 rosbag scenarios
WHAT WAS NOT DONE

โš ๏ธ Not Deployed

  • No full closed-loop VLM deployment on Go1
  • No raw VLM motor control path
PROPOSED SYSTEM VIEW

✔ Safety-Projected Path

  • Fast controller handles real-time motion and local safety
  • Slower VLM layer provides semantic guidance

Why the split matters

VLM latency and missing geometric guarantees make raw low-level control a poor fit. The more defensible use is a slower semantic layer whose outputs are projected through a safety boundary before anything executable is considered.
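To make the idea of a safety boundary concrete, the sketch below shows one possible projection step. A semantic label proposes a nominal velocity, and a geometric clearance check from the fast loop can override it with a stop. It is not the study's implementation. The `VelocityCommand` type, the nominal speeds, the stop distance, and the sign convention (positive lateral means left) are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VelocityCommand:
    forward_mps: float
    lateral_mps: float  # assumed convention: positive = left


# Hypothetical nominal commands per label; REVIEW is treated as a hold.
NOMINAL = {
    "STOP": (0.0, 0.0),
    "FORWARD": (0.3, 0.0),
    "LEFT": (0.2, 0.15),
    "RIGHT": (0.2, -0.15),
    "REVIEW": (0.0, 0.0),
}


def project(label: str, min_clearance_m: float,
            stop_distance_m: float = 0.8) -> VelocityCommand:
    """Project a semantic label through a simple clearance-based safety check.

    The clearance value is assumed to come from the fast local controller.
    The semantic layer can never override a clearance violation.
    """
    if min_clearance_m < stop_distance_m:
        return VelocityCommand(0.0, 0.0)
    # Unknown labels fall back to a stop rather than any motion.
    forward, lateral = NOMINAL.get(label, (0.0, 0.0))
    return VelocityCommand(forward, lateral)
```

The design point is the direction of authority. The slower semantic layer only selects among bounded options, and the geometric check retains the final say over whether anything moves.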

Social Navigation Video

This clip is included as study media for the later Go1 social-navigation phase. It belongs with the broader controller-side integration work and still should not be read as evidence of raw closed-loop VLM motor control.

VLM Social Navigation Clip
Local mp4 render of the VLM social-navigation study media.

Future Work

The next steps are mostly about making the semantic layer more reliable and tightening how it hands decisions to the fast safety loop.

Near-term directions

  • Reduce latency in the decision loop.
  • Improve REVIEW calibration under ambiguous scenes.
  • Strengthen the safety-projected handoff from semantic label to executable action.

Open scenario gaps

  • Improve crossing detection and entering-late robustness.
  • Expand scenario coverage beyond the current benchmark mix.
  • Make online decision loops more stable without turning the VLM into a raw motor controller.

Related Notes

The full archive now lives in the shared Research Notebook. The links here stay focused on the notes most relevant to the VLM social-navigation study.