Introducing SPEAR-1
SPEAR-1 is the first European-made open-weight robotic AI foundation model of its kind. SPEAR-1 outperforms existing state-of-the-art models while using 20x less robotic data, a breakthrough that matters because robotic data is hard, slow and expensive to collect. The key innovation behind SPEAR-1 is a new training approach that enables learning from both robotic data and non-robotic 3D data, which is far more abundant and easier to obtain. The resulting ability to understand and locate objects in 3D space massively boosts SPEAR-1's robot manipulation capabilities. SPEAR-1 is a major advance toward solving the data bottleneck in robotics and a key step toward real-world adoption of AI-enabled robots.
General Robotic Intelligence
Single open-source foundation model for multiple tasks on multiple robots
Amplified by Non-Robotic Data
Outperforms the state of the art while using 20x fewer robot demonstrations
Robotic Foundation Model made in Europe
Cutting-edge robotic AI, developed in Europe
Robots hold the promise of making our day-to-day lives easier: cleaning our homes, doing our laundry, preparing our food, all the way to automating manufacturing and working in hazardous environments. Robot hardware has already taken a massive leap forward, with humanoid robots today becoming more affordable than cars. However, building the software to control these systems remains an open problem.
Today's robots remain narrow specialists operating in strictly controlled environments such as warehouses and factories. These systems require months or years of manual development to produce a specialized solution - one that performs the same repetitive motion, on the same robot, against the same visual background, with specific object types, sizes, shapes and positions. Such solutions fail to generalize beyond that particular environment: a robot packing food boxes in a warehouse would fail miserably to do the same in your house.
Robotic foundation models (RFMs) hold the promise of solving these limitations. One can think of RFMs as the 'ChatGPT equivalent' for robots - a single model that can perform a wide range of tasks, in any type of environment, no matter the specific robot. Given images of the environment and an instruction in human language, an RFM outputs the low-level motor commands for the robot to move and complete the required task. Crucially, such models would generalize to new environments and eliminate the need to manually program robots for every new task.
Over the past year we have developed SPEAR-1: a robotic foundation model that significantly improves the generalization capabilities of robots while using 20x less robot data, bringing robots closer to real-world deployment.
Just as a person learns to pour water into a glass through repeated trials, RFMs learn from data containing thousands of demonstrations of a particular task and skill. However, most RFMs are trained on 2D images and are then asked to control robots that move in 3D space. This discrepancy shows up in practice: RFMs can look impressive in short videos yet fail when objects, positions, layouts, or backgrounds change.
The discrepancy is deeply rooted in the architecture of Vision-Language-Action models (VLAs), a type of RFM. Standing on the shoulders of giants, most VLAs combine three major components (sketched in code after this list):
- A large language model (LLM) that brings reasoning capabilities and knowledge about the world
- A vision module that 'understands' 2D images and together with the LLM forms a Vision-Language Model (VLM)
- A control module that takes the VLM understanding of the world and outputs 3D coordinates for the robot to move to
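To make the division of labor concrete, here is a minimal, runnable sketch of how these three components typically fit together; the toy backbone, layer sizes, and action chunk length are illustrative assumptions, not SPEAR-1's actual implementation:

```python
import torch
import torch.nn as nn

class TinyVLMBackbone(nn.Module):
    """Stand-in for a pretrained VLM: a toy 'vision module' plus a toy 'language model'."""
    hidden_size = 128

    def __init__(self):
        super().__init__()
        self.vision = nn.Linear(3 * 32 * 32, self.hidden_size)  # toy vision encoder
        self.text = nn.Embedding(1000, self.hidden_size)         # toy instruction encoder

    def forward(self, images, tokens):
        img_feat = self.vision(images.flatten(1))   # "what do I see?"
        txt_feat = self.text(tokens).mean(dim=1)    # "what am I asked to do?"
        return img_feat + txt_feat                  # fused scene-and-task features

class VLA(nn.Module):
    """VLM backbone plus a control module that outputs a short chunk of robot actions."""
    def __init__(self, backbone, action_dim=7, horizon=16):
        super().__init__()
        self.backbone = backbone
        self.control = nn.Linear(backbone.hidden_size, horizon * action_dim)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, images, tokens):
        features = self.backbone(images, tokens)
        return self.control(features).view(-1, self.horizon, self.action_dim)

# Toy usage: one 32x32 camera image and a 6-token instruction yield a 16-step action chunk.
vla = VLA(TinyVLMBackbone())
actions = vla(torch.randn(1, 3, 32, 32), torch.randint(0, 1000, (1, 6)))
print(actions.shape)  # torch.Size([1, 16, 7])
```

In a real VLA the backbone is a multi-billion-parameter pretrained VLM and the control module is far more sophisticated, but the data flow - images and an instruction in, a chunk of actions out - is the same.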
Training each of these components is a challenging task on its own, so robot VLAs rely on pretrained VLMs to 'understand' the world around them. These state-of-the-art VLMs recognize and describe scenes with remarkable conceptual fidelity, yet they remain essentially two-dimensional. They do not know the physical distance between two objects in an image or the physical sizes of those objects. In contrast, robots need to reason about 3D geometry and precise 3D object positions in order to perform a task successfully. For example, a surgical robot unaware of precise 3D distances and sizes could easily end up making a deadly cut. We bridge this gap with SPEAR-1.
We train SPEAR-1 in two stages:
- We extend the capabilities of a VLM from the flat 2D pixels to the physical 3D world using openly accessible large-scale non-robotic data (SPEAR-VLM)
- Afterwards we combine SPEAR-VLM with a control module and train it to 'mimic' robot movements collected with human teleoperation (SPEAR-1)
Importantly, by embedding 3D geometric knowledge into SPEAR-VLM, we give the model the spatial understanding needed to drive robot movements in the 3D world.
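At a high level, the recipe looks like the toy sketch below: first adapt the VLM with non-robotic 3D supervision, then attach a control module and imitate robot demonstrations. The models, data, and losses here are placeholders, not SPEAR-1's training code; the two stages are described in detail in the next sections.

```python
import torch
import torch.nn as nn

# Stand-ins: a tiny "VLM" and a tiny control head (the real models are billions of parameters).
vlm = nn.Linear(16, 8)
control = nn.Linear(8, 7)

# Stage 1: teach the VLM 3D geometry using non-robotic data (here: random toy tensors).
opt1 = torch.optim.AdamW(vlm.parameters(), lr=1e-4)
for image_and_question, spatial_answer in [(torch.randn(4, 16), torch.randn(4, 8))]:
    loss = ((vlm(image_and_question) - spatial_answer) ** 2).mean()
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: attach the control module and imitate teleoperated robot demonstrations.
opt2 = torch.optim.AdamW(list(vlm.parameters()) + list(control.parameters()), lr=1e-4)
for observation, expert_action in [(torch.randn(4, 16), torch.randn(4, 7))]:
    loss = ((control(vlm(observation)) - expert_action) ** 2).mean()
    opt2.zero_grad(); loss.backward(); opt2.step()
```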
Stage 1: VLM training (SPEAR-VLM)
Vision-Language-Action models (VLAs) inherit their visual and language knowledge about the world from Vision-Language Models (VLMs), e.g. GPT-5, Gemini, Claude. VLMs are pre-trained on data gathered from the web and on tasks such as answering questions about an image or locating an object within it. However, their capabilities remain inherently limited to the flat 2D world of pixels, and they are unable to understand the 3D relationships between objects.
We take a smaller, 3-billion-parameter VLM as a starting point and expand its knowledge to the physical 3D world. We first integrate a specialized depth model that estimates the distance from the camera to every pixel in the image. Then we train SPEAR-VLM to answer 3D geometric questions such as "Provide the 3D bounding box coordinates of the bowl" or "Estimate the x, y, z distance between the mug and the spoon". To answer these questions successfully, SPEAR-VLM must learn to locate objects in the image from their language descriptions, estimate their 3D coordinates, and compute the distances between them.
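For illustration, the snippet below shows one way such a question-answer pair can be assembled from a depth estimate by back-projecting object pixels into 3D; the camera intrinsics, pixel coordinates, and depths are made-up numbers, not our actual data pipeline:

```python
import numpy as np

# Assumed pinhole camera intrinsics (focal lengths and principal point, in pixels).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

def pixel_to_3d(u: float, v: float, depth_m: float) -> np.ndarray:
    """Back-project an image pixel with metric depth into camera coordinates (meters)."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return np.array([x, y, depth_m])

# Suppose a depth model gives the depth at the center pixel of two detected objects.
mug_xyz = pixel_to_3d(u=350, v=260, depth_m=0.82)
spoon_xyz = pixel_to_3d(u=210, v=300, depth_m=0.76)

dx, dy, dz = mug_xyz - spoon_xyz
qa_pair = {
    "question": "Estimate the x, y, z distance between the mug and the spoon.",
    "answer": f"dx={dx:+.2f} m, dy={dy:+.2f} m, dz={dz:+.2f} m",
}
print(qa_pair)
```

Producing such pairs needs only images, a depth model, and 2D object annotations - no robot in the loop - which is why this kind of data is so much cheaper to scale than teleoperated demonstrations.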
Importantly, such training data is significantly cheaper to obtain and far more widely available than expensive-to-collect robotic data. In fact, with only 200k examples of 3D data, SPEAR-1 outperforms state-of-the-art robotic models trained with 900M more examples of robotic data.
Stage 2: VLA training (SPEAR-1)
Once 3D understanding is embedded in SPEAR-VLM, we attach a specialized control module, producing SPEAR-1, which can generate trajectories that the robot can follow.
While SPEAR-VLM is trained to output language, the control module of SPEAR-1 is trained to output motor control commands and 3D target positions for the robot in real time. We train SPEAR-1 via flow matching, a variant of diffusion models: SPEAR-1 learns to 'imitate' robot movements and trajectories collected by humans teleoperating the robot to perform specific tasks multiple times. Thanks to its 3D understanding capabilities, SPEAR-1 requires approximately 20x fewer robot demonstrations.
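As a rough, self-contained illustration of the flow-matching objective (not SPEAR-1's actual control module), the toy training step below regresses a velocity field that carries Gaussian noise toward a demonstrated action, conditioned on the observation; the feature and action dimensions and the network are assumptions:

```python
import torch
import torch.nn as nn

OBS_DIM, ACTION_DIM = 64, 7  # assumed VLM feature size and robot action size

class VelocityNet(nn.Module):
    """Predicts the velocity from a noisy action toward the expert action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACTION_DIM + 1, 256), nn.GELU(),
            nn.Linear(256, ACTION_DIM),
        )

    def forward(self, obs, noisy_action, t):
        return self.net(torch.cat([obs, noisy_action, t], dim=-1))

model = VelocityNet()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def flow_matching_step(obs, expert_action):
    """One training step: regress the constant velocity of a straight path
    from Gaussian noise to the demonstrated action."""
    noise = torch.randn_like(expert_action)
    t = torch.rand(expert_action.shape[0], 1)            # random time in [0, 1]
    noisy_action = (1 - t) * noise + t * expert_action   # point along the path at time t
    target_velocity = expert_action - noise              # velocity of that straight path
    pred = model(obs, noisy_action, t)
    loss = ((pred - target_velocity) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Toy usage: random tensors stand in for VLM features and teleoperated expert actions.
flow_matching_step(torch.randn(8, OBS_DIM), torch.randn(8, ACTION_DIM))
```

At inference time one starts from noise and integrates the predicted velocity field for a few steps to obtain an action chunk, which is what makes flow matching a close cousin of diffusion-based action generation.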
Results
We evaluate the performance of SPEAR-1 on various manipulation tasks on the DROID platform and compare it to the current state-of-the-art open-source robotic foundation models, Physical Intelligence's π0-FAST and π0.5. These models have been trained on significantly larger amounts of robot data, collected in more diverse environments, but their training recipe does not include 3D pre-training. Since we are interested in assessing the models' performance in zero-shot control scenarios, we do not fine-tune SPEAR-1 or any of the baseline models for the specific tasks or environments.
Robotic foundation models often struggle when target object positions change - for example, when the carrot is placed to the left or to the right of the robot in the task 'put the carrot on the plate'. To make things particularly challenging, we vary the positions of the target objects both horizontally and vertically. SPEAR-1 completes these tasks successfully regardless of the 3D positions of the target objects.
- Wipe the stain
- Put the carrot on the blue plate
- Close the drawer
- Put the marker in the cup
- Put the spoon in the drawer
- Cover the pot
We include tasks that require fine 3D positioning and calibration skills such as ‘Put the marker in the cup’. This task is quite challenging because picking a whiteboard marker from a flat surface requires very precise positioning, both vertically and horizontally. The task is successful only if the marker is correctly inserted in the cup.
Even though π0-FAST has been trained on 20× more robotic data, SPEAR-1 achieves 57% higher performance. SPEAR-1 also matches the performance of π0.5 - a state-of-the-art model that integrates additional architectural improvements and a 'basic form of reasoning', and that has been trained in at least 5× more environments. This indicates that acquiring 3D knowledge about the world can significantly reduce the amount of robotic data needed for reliable robot performance.
- Put the piece on the chessboard
- Put the corn between the cups
- Put the eggplant in the pot
- Put the pink cup on the blue cup
- Put the pink cup on the blue plate
We also evaluate SPEAR-1 on the WidowX robot - a different robot arm. Since the π0-FAST and π0.5 models are not available for that platform, we compare against another state-of-the-art model for that robot - OpenVLA.
SPEAR-1 also outperforms OpenVLA - a model twice its size - demonstrating SPEAR-1's ability to generalize across different robot embodiments.
What's next?
Our goal at INSAIT robotics is to build foundation models such that any robot can autonomously perform any task in any environment - from hospitals to factories to ordinary homes. Our work so far demonstrates how 3D data, which is much cheaper to obtain, can reduce the number of robot demonstrations needed to teach a robot a task while significantly increasing the model's reliability. However, robotic foundation models still have a long way to go - from long-horizon planning to autonomous self-improvement to completing unseen tasks. Following our 3D work, we believe similar improvements can be obtained from other sources of data, pushing the limits and making robotic foundation models even more capable.
Succeeding at this will require not only technological improvements at every level, from low-level hardware to high-level software, but also collective efforts spanning institutions and industries. If you are interested in collaborating with us, please reach out.
Team
Authors
Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel
Acknowledgments
We thank our INSAIT colleagues for helpful discussions and support: Martin Vechev, Alexander-Marc Spiridonov, Anna-Maria Halacheva, Hristo Venev, Kamen Pavlov, Borislav Petrov.