A Review of Robbyant’s Early-2026 Work

Robbyant is a company under Ant Group, dedicated to building the foundational platform for Embodied AI, bridging the gap between digital intelligence and the physical world.

Since the company is still relatively new, I want to quickly review its recent work. In particular, I will study four embodied intelligence models: a spatial perception model, a VLA model, a world model, and a video action model.

The diagram on Robbyant’s homepage reflects its vision for embodied intelligence: starting from sensory input, the system first builds spatial intelligence to understand the physical world, then relies on an action model to make decisions and interact with the environment, and finally improves through environmental reward.

In this sense, embodied AI is not a single model, but a complete closed loop of perception, understanding, action, and feedback.

Based on this vision, Robbyant’s current work can be organized into four representative model directions: LingBot-Depth for spatial perception, LingBot-VLA for vision-language-action, LingBot-World for world modeling, and LingBot-VA for video action modeling.

LingBot-Depth

Problem

A core challenge in embodied intelligence is obtaining accurate, dense, and metric 3D perception in real time. For robots, autonomous vehicles, and other physical agents, depth is not just an auxiliary signal: it is the basis for localization, scene understanding, and reliable action.

In practice, a useful depth system should satisfy three conditions at once: it should provide absolute metric scale, pixel-aligned dense geometry, and real-time sensing without heavy post-processing. Among existing approaches, RGB-D sensors are one of the few practical choices that can meet these requirements in real-world deployment.

However, RGB-D cameras are far from perfect. Even strong commercial sensors often fail in difficult scenes (as illustrated on the left of the following figure), especially on textureless surfaces, on reflective materials, and under complex lighting.

This creates a gap between the promise of sensor-based depth and the actual perception quality required by downstream tasks such as manipulation, tracking, and navigation.

LingBot-Depth is motivated by the idea that these sensor failures should not simply be treated as useless noise. Instead, the paper views missing or inaccurate depth as a meaningful signal that reveals where geometry is ambiguous and where perception is hardest.

Based on this view, the problem becomes recovering reliable dense depth from imperfect RGB-D observations. The approach also extends naturally to the monocular setting, where no depth sensor input is available and depth is predicted from RGB alone.

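In other words, the paper addresses a practical and fundamental perception problem: how to recover reliable, dense, metric depth when the sensor's own measurements are incomplete or wrong.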

Model Details

Architecture.

LingBot-Depth adopts a Vision Transformer (Large) as the encoder to jointly process RGB and depth inputs. RGB tokens provide full visual context, while depth tokens encode sparse or corrupted sensor depth. These two modalities are fused in the transformer latent space to learn geometry-aware representations. On top of the encoder, the model uses a multi-scale decoder to progressively recover dense depth structure, and the final prediction head performs depth regression.

Encoder: Vision Transformer (Large) with RGB-D fusion
Decoder: Multi-scale feature pyramid with specialized heads
Heads: Depth regression

Input Format.

RGB image. The RGB input is a tensor of shape [B, 3, H, W], normalized to [0, 1] and stored in float32 format. It provides the dense visual appearance cues that guide the recovery of missing or ambiguous depth regions.
Depth map. The depth input has shape [B, H, W] and is represented in meters. Invalid or unreliable regions are marked as 0 or NaN. In the depth completion setting, this input is partial and noisy; during training, masked regions indicate where the model must infer missing geometry.
Camera intrinsics. The model can also take camera intrinsics of shape [B, 3, 3] in normalized form. These parameters preserve the metric relationship between image coordinates and 3D geometry, which is important for producing physically meaningful depth and point cloud outputs.
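As a rough illustration of the three inputs above, the following sketch packs one frame into the described layout. The helper name prepare_inputs is hypothetical, the millimeter depth source is an assumption about the sensor, and the intrinsics normalization (x row divided by W, y row divided by H) is an assumed convention that should be checked against the released code.

```python
import numpy as np

def prepare_inputs(rgb_u8, depth_mm, fx, fy, cx, cy):
    """Pack one frame into the batched layout described above (B = 1)."""
    h, w = depth_mm.shape
    # RGB: uint8 HWC -> float32 [1, 3, H, W] with values in [0, 1].
    rgb = (rgb_u8.astype(np.float32) / 255.0).transpose(2, 0, 1)[None]
    # Depth: millimeters -> meters, with invalid readings encoded as 0.
    depth = depth_mm.astype(np.float32) / 1000.0
    depth[~np.isfinite(depth) | (depth <= 0)] = 0.0
    depth = depth[None]
    # Intrinsics: ASSUMED normalization (x row / W, y row / H).
    K = np.array([[fx / w, 0.0, cx / w],
                  [0.0, fy / h, cy / h],
                  [0.0, 0.0, 1.0]], dtype=np.float32)[None]
    return rgb, depth, K
```

The returned arrays have shapes [1, 3, H, W], [1, H, W], and [1, 3, 3], matching the formats listed above; converting them to torch tensors is left to the caller.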

Output Format.

LingBot-Depth outputs a refined dense depth map of shape [B, H, W]. It can also produce a 3D point cloud of shape [B, H, W, 3] in camera coordinates.

{
    'depth': torch.Tensor,   # Refined depth [B, H, W]
    'points': torch.Tensor,  # Point cloud [B, H, W, 3] in camera space
}

This makes the model useful not only for depth refinement itself, but also as a perception frontend for downstream 3D tasks.
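To make the relation between the two outputs concrete, here is a minimal sketch of unprojecting a depth map into a camera-space point cloud with the standard pinhole model. It is not the released implementation: depth_to_points is a hypothetical helper, it expects intrinsics in pixel units (normalized intrinsics would first need to be scaled by W and H), and it assumes pixel centers sit at integer coordinates.

```python
import numpy as np

def depth_to_points(depth, K):
    """depth: [B, H, W] in meters; K: [B, 3, 3] in pixel units -> [B, H, W, 3]."""
    b, h, w = depth.shape
    # Pixel grid: v indexes rows (y), u indexes columns (x).
    v, u = np.meshgrid(np.arange(h, dtype=np.float32),
                       np.arange(w, dtype=np.float32), indexing="ij")
    fx, fy = K[:, 0, 0, None, None], K[:, 1, 1, None, None]
    cx, cy = K[:, 0, 2, None, None], K[:, 1, 2, None, None]
    # Pinhole unprojection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    x = (u[None] - cx) * depth / fx
    y = (v[None] - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)
```

Pixels whose depth is 0 map to the camera origin, so downstream code would normally filter them out before using the points.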

Training

Pretraining objective.

LingBot-Depth is pretrained with Masked Depth Modeling (MDM): the model sees the full RGB image together with the unmasked depth tokens, and learns to reconstruct the full target depth map. This turns corrupted or missing sensor depth into a learning signal rather than discarding it.

Masking strategy.

The masking is applied only to the depth tokens, while the RGB image remains fully visible. Regions where depth is completely missing are always masked. Regions with partially valid depth are more likely to be masked as well. If the overall masked area is still too small, the model further masks some valid depth regions at random, so that the final masking ratio stays between 60% and 90%.
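The masking rules can be sketched at patch level as follows. This is an illustration under stated assumptions, not the paper's code: sample_depth_mask is a hypothetical name, the probability p_partial for partially valid patches is a placeholder (the text only says such patches are "more likely" to be masked), and the target ratio is assumed to be drawn uniformly from the 60% to 90% range.

```python
import numpy as np

def sample_depth_mask(valid, patch=14, ratio_range=(0.6, 0.9),
                      p_partial=0.75, rng=None):
    """valid: bool [H, W], True where sensor depth is usable.
    Returns a bool patch grid, True where the depth token is masked."""
    rng = np.random.default_rng() if rng is None else rng
    gh, gw = valid.shape[0] // patch, valid.shape[1] // patch
    # Fraction of valid depth pixels inside each patch.
    frac = valid[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    masked = frac == 0.0                      # fully missing: always masked
    partial = (frac > 0.0) & (frac < 1.0)     # partially valid: masked with ASSUMED prob.
    masked |= partial & (rng.random(frac.shape) < p_partial)
    # Top up with random unmasked patches until the sampled target ratio is reached.
    target = int(round(rng.uniform(*ratio_range) * masked.size))
    need = max(0, target - int(masked.sum()))
    if need > 0:
        free = np.flatnonzero(~masked)
        extra = rng.choice(free, size=min(need, free.size), replace=False)
        masked.flat[extra] = True
    return masked
```

Note that this sketch only tops the mask up; if sensor failures already cover more than the sampled target, the resulting ratio can exceed it, and how the paper handles that case is not covered here.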

Encoder-decoder setup.

The model uses a 24-layer ViT-L/14 encoder initialized from DINOv2, while the convolutional decoder is randomly initialized. After encoding, latent depth tokens are discarded, and the retained contextual tokens are decoded into dense depth with a ConvStack decoder.

Optimization.

Training uses AdamW with different learning rates for the pretrained encoder and the newly initialized decoder, plus warm-up for the encoder and step decay for both. The authors also use gradient clipping, BF16 mixed precision, and standard augmentations such as random resized crop, flip, color jitter, JPEG artifacts, motion blur, and shot noise.
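The schedule described above (warm-up for the encoder, step decay for both groups) can be sketched as a plain function. Every number below is a placeholder chosen for illustration, since this review does not list the actual learning rates, warm-up length, or decay steps; lr_at, ENCODER, and DECODER are hypothetical names.

```python
def lr_at(step, base_lr, warmup_steps=0, decay_every=50_000, gamma=0.5):
    """Linear warm-up followed by step decay. All defaults are placeholders."""
    warm = min(1.0, (step + 1) / warmup_steps) if warmup_steps > 0 else 1.0
    return base_lr * warm * gamma ** (step // decay_every)

# Hypothetical per-group settings: a smaller, warmed-up rate for the
# pretrained encoder and a larger rate for the freshly initialized decoder.
ENCODER = dict(base_lr=1e-5, warmup_steps=2_000)
DECODER = dict(base_lr=1e-4, warmup_steps=0)

encoder_lr = lr_at(10_000, **ENCODER)
decoder_lr = lr_at(10_000, **DECODER)
```

In PyTorch, a setup like this would typically be expressed as two parameter groups passed to torch.optim.AdamW, each with its own learning rate.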

Data.

Pretraining uses a total of about 10M RGB-D samples: roughly 3.2M self-curated samples (real + synthetic) plus several public RGB-D datasets. For public datasets without natural missing depth, the paper applies artificial masking to match the target mask ratio.

Supervision.

The prediction is supervised with an L1 depth loss, computed only on pixels with valid ground-truth depth. Full training runs for 250k iterations with a global batch size of 1024 on 128 GPUs.
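A masked L1 loss of this kind can be written in a few lines. The sketch below assumes that invalid ground-truth pixels follow the same convention as the input depth (0 or NaN); masked_l1 is a hypothetical helper, and the real training code would operate on torch tensors rather than NumPy arrays.

```python
import numpy as np

def masked_l1(pred, target):
    """Mean absolute error over pixels whose ground-truth depth is valid."""
    # ASSUMED validity convention: finite and strictly positive.
    valid = np.isfinite(target) & (target > 0)
    if not valid.any():
        return 0.0  # no supervised pixels in this sample
    return float(np.abs(pred[valid] - target[valid]).mean())
```

As a side note on the batch arithmetic: if the global batch of 1024 is split evenly across 128 GPUs with no gradient accumulation, each GPU processes 1024 / 128 = 8 samples per step.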

Inference

Depth completion mode.

At inference, the standard setting is RGB + incomplete / noisy depth as input. The model uses RGB context together with the remaining valid depth observations to predict a dense refined depth map.

What happens inside.

The encoder processes all RGB tokens and only the unmasked depth tokens. After encoding, the latent depth tokens are discarded, the contextual RGB-side tokens are kept, and the ConvStack decoder reconstructs the final dense depth, which is then resized back to the original image resolution.
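The token flow in this step can be sketched with plain arrays. This is a schematic only: encode_and_select is a hypothetical name, the encoder is passed in as an opaque callable, and details such as positional embeddings or any extra special tokens are omitted.

```python
import numpy as np

def encode_and_select(rgb_tokens, depth_tokens, depth_keep, encoder):
    """rgb_tokens, depth_tokens: [N, C]; depth_keep: bool [N], True where
    the depth token is unmasked. Returns the RGB-side latents, shape [N, C']."""
    n = rgb_tokens.shape[0]
    # The encoder sees every RGB token but only the unmasked depth tokens.
    seq = np.concatenate([rgb_tokens, depth_tokens[depth_keep]], axis=0)
    latent = encoder(seq)
    # Depth-side latents are dropped; the RGB-side context goes to the decoder.
    return latent[:n]
```

Because the output always has one token per image patch regardless of how many depth tokens were kept, the decoder can reshape it into a regular feature grid before producing dense depth.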

Output.

The main inference output is a refined depth map; the released implementation can also convert this into a 3D point cloud in camera space for downstream use.

Monocular inference.

For pure monocular depth estimation, the paper does not keep the original two-input pipeline unchanged. Instead, it removes the depth embedding branch and the ConvStack decoder, and uses the pretrained LingBot-Depth encoder as an RGB-only backbone to initialize MoGe. This shows that the geometry learned during MDM pretraining transfers to RGB-only depth prediction.

Practical takeaway.

So in deployment there are really two inference modes:

RGB + partial depth → depth completion, and
RGB only → monocular depth estimation, with the pretrained encoder used as the backbone of a monocular model such as MoGe.

Experiments