While we sometimes refer to chatbots such as Gemini and ChatGPT as “robots,” generative AI also plays a growing role in physical robots. Google DeepMind, which announced Gemini Robotics earlier this year, has now revealed an on-device VLA (vision language action) model to control robots. Unlike the previous release, there is no cloud component, allowing robots to operate with full autonomy.
Carolina Parada, head of robotics at Google DeepMind, says this approach to AI robotics could make robots more reliable in challenging situations. It is also the first version of Google’s robotics model that developers can customize for their own needs.
Robotics is a unique problem for AI because a robot not only exists in the physical world but also changes its surroundings. Whether you’re asking it to move blocks or tie your shoelaces, it’s difficult to predict every situation a robot might encounter. The traditional approach of teaching a robot individual actions with reinforcement learning was slow, but generative AI allows for much greater generalization.
“It’s drawing from Gemini’s multimodal world understanding in order to do a completely new task,” explains Parada. “What that enables is, in that same way Gemini can produce text, write poetry, just summarize an article, you can also write code, and you can also generate images. It also can generate robot actions.”
No cloud needed for general robots
The previous Gemini Robotics release (which remains the most capable version of Google’s robotics technology) used a hybrid system, with a small model running on the robot itself and a larger one running in the cloud. Chatbots can afford to “think” for a few seconds before producing output, but robots must react much faster; you don’t want a robot to pause every time it generates a step. The local model is designed for quick adaptation, while the server-based version can assist with complex reasoning tasks. With this release, Google DeepMind has made the local model available as a standalone VLA, and it turns out to be surprisingly robust on its own.