Spatial, Visual, and Embodied: The Convergence of Edge-Native Foundation Models
How WildDet3D, MolmoWeb, and HY-Embodied are building the foundation of sovereign intelligence.
At Atman, our vision has always been clear: Intelligence should be sovereign. It shouldn’t be a rented utility residing in a corporate cloud; it should be a decentralized, resilient resource that lives on your hardware.
To achieve this, we need more than just "chatbots." We need models that can see in 3D, navigate the digital world visually, and act in physical space—all while being small enough to run on your phone or edge device. Today, we are looking at three groundbreaking papers that prove this "sovereign edge" is no longer a distant dream, but a technical reality.
1. Seeing the World in 3D: WildDet3D
For an AI to truly be "embedded in your world," it needs spatial intelligence. WildDet3D introduces a unified, geometry-aware architecture for open-vocabulary monocular 3D object detection.
"Understanding objects in 3D from a single image is a cornerstone of spatial intelligence... It must understand where they are, how large they are, and how they are oriented in 3D space."
— WildDet3D (Huang et al., 2026)
Where prior open-vocabulary detectors stop at 2D bounding boxes, WildDet3D can "lift" a single RGB image into a metric 3D scene, estimating the depth, orientation, and size of objects across more than 13,000 categories.
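The core geometric step behind this "lifting" is back-projection: given a pixel location and a predicted metric depth, the camera intrinsics turn a 2D point into a 3D one. A minimal sketch of that operation (the function name and the example intrinsics are illustrative, not from the paper):

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift a pixel (u, v) with metric depth into a 3D camera-frame point.

    K is the 3x3 camera intrinsics matrix; the result is depth * K^-1 @ [u, v, 1].
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return depth * ray

# Illustrative intrinsics: focal length 1000 px, principal point (640, 360).
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])

# A detection centered at pixel (840, 460) with predicted depth 5 m lifts to
# a 3D center about 1 m right of and 0.5 m below the optical axis, 5 m ahead.
center_3d = backproject(840.0, 460.0, 5.0, K)
```

Combine this 3D center with the model's predicted dimensions and yaw and you have the full oriented 3D box, which is exactly what the AP3D metric below scores.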
WildDet3D-Bench Performance (AP3D)
| Prompt Type | Monocular (RGB) | With GT Depth |
|---|---|---|
| Text Prompt | 22.6 | 41.6 |
| Box Prompt | 24.8 | 47.2 |
2. Navigating the Digital World: MolmoWeb
If Atman is to be "woven into your device," it must be able to act on your behalf across the web. MolmoWeb is a family of open multimodal web agents that operate purely on visual screenshots, mirroring how humans use the internet.
"MolmoWeb agents operate as instruction-conditioned visual-language action policies... requiring no access to HTML, accessibility trees, or specialized APIs."
— MolmoWeb (Gupta et al., 2026)
By ignoring the underlying markup and acting purely on rendered pixels, MolmoWeb is more resilient to website redesigns and far more practical for on-device deployment.
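The screenshot-only design reduces the agent to a simple perceive-act loop. A sketch of that loop is below; `policy` and the action schema are hypothetical stand-ins for whatever interface the released models expose, and only the loop structure mirrors the paper's description:

```python
def run_episode(policy, browser, instruction, max_steps=20):
    """Drive a browser from pixels alone: screenshot -> action -> repeat.

    Note there is no DOM, accessibility tree, or API access anywhere in
    the loop; the policy sees raw pixels and emits pixel-space actions.
    """
    for _ in range(max_steps):
        screenshot = browser.screenshot()            # raw pixels only
        action = policy.act(instruction, screenshot)
        if action["type"] == "done":
            return action.get("answer")
        elif action["type"] == "click":
            browser.click(action["x"], action["y"])  # pixel coordinates
        elif action["type"] == "type":
            browser.type_text(action["text"])
        elif action["type"] == "scroll":
            browser.scroll(action["dy"])
    return None  # step budget exhausted without finishing
```

Because every action is expressed in screen coordinates, the same loop works unchanged when a site ships a redesign, as long as the page still renders.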
Comparison on Web-Use Benchmarks (Success Rate %)
| Model | Size | WebVoyager | DeepShop |
|---|---|---|---|
| Fara-7B | 7B | 73.5 | 26.2 |
| MolmoWeb | 4B | 75.2 | 35.6 |
| MolmoWeb | 8B | 78.2 | 42.3 |
* MolmoWeb-8B achieves 94.7% success with parallel rollouts (Pass@4).
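A quick sanity check on that Pass@4 number: if the four rollouts were independent, a 78.2% single-run success rate would imply a Pass@4 near 99.8%. The reported 94.7% sits below that independence estimate, suggesting failures are correlated across rollouts (some tasks are simply hard for the model). The estimate itself is one line:

```python
def pass_at_k_independent(p, k):
    """Probability that at least one of k independent rollouts succeeds."""
    return 1.0 - (1.0 - p) ** k

# Independence-assumption estimate for MolmoWeb-8B's Pass@4:
estimate = pass_at_k_independent(0.782, 4)  # about 0.998, vs. the reported 0.947
```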
3. Real-World Action: HY-Embodied-0.5
The final pillar is the ability to plan and act. HY-Embodied-0.5 presents a family of foundation models purpose-built for real-world agents. It bridges the gap between general VLMs and the physical demands of robotics.
Average Score (22 Embodied Benchmarks)
| Model Variant | Activated Params | Avg. Score (%) |
|---|---|---|
| Qwen3-VL | 4B | 47.8 |
| RoboBrain 2.5 | 4B | 49.4 |
| HY-Embodied MoT-2B | 2B | 58.0 |
The HY-Embodied team’s key ingredient is a Mixture-of-Transformers (MoT) architecture, which lets a 2B-activated-parameter model punch above its weight by routing each token through modality-specific transformer weights (vision or language) while still attending globally across the whole sequence.
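The routing idea can be sketched in a few lines: each modality gets its own feed-forward weights, tokens are dispatched by a fixed modality tag rather than a learned router, and global attention (not shown here) still mixes information across all tokens. A toy NumPy sketch with shapes and layer choices of our own, not the paper's:

```python
import numpy as np

def mot_ffn(x, modality, ffn_params):
    """Mixture-of-Transformers feed-forward sub-layer.

    Each token is routed to the weight set of its modality. Routing is
    deterministic (by modality tag), unlike a Mixture-of-Experts, where
    a learned router picks experts per token.
    """
    out = np.empty_like(x)
    for m, (W1, W2) in ffn_params.items():
        idx = modality == m                   # tokens belonging to modality m
        h = np.maximum(x[idx] @ W1, 0.0)      # modality-specific up-proj + ReLU
        out[idx] = h @ W2                     # modality-specific down-proj
    return out

rng = np.random.default_rng(0)
d, d_ff = 8, 16
ffn_params = {
    "vision": (rng.standard_normal((d, d_ff)), rng.standard_normal((d_ff, d))),
    "text":   (rng.standard_normal((d, d_ff)), rng.standard_normal((d_ff, d))),
}

# Interleaved sequence: 3 image-patch tokens followed by 2 text tokens.
x = rng.standard_normal((5, d))
modality = np.array(["vision"] * 3 + ["text"] * 2)
y = mot_ffn(x, modality, ffn_params)  # shape (5, 8)
```

Only one modality's weights are active per token, which is why the "activated params" column above can read 2B even though the total parameter count is larger.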
The Convergence: A Decentralized Future
When we combine these three advancements, the Atman vision becomes a technical reality:
- WildDet3D gives your device the eyes to see and measure your physical surroundings.
- MolmoWeb gives it the agency to navigate and act in the digital world on your behalf.
- HY-Embodied gives it the compact, edge-ready brain to plan and execute tasks in real-time.
The intelligence that belongs to you is finally arriving. By focusing on edge-native, vision-centric models, we are cutting the strings that tie intelligence to the cloud and putting Sovereign Intelligence in your hands.
Ready to own your intelligence?
Join the Waitlist