ChatGPT for Robots - Impact of Large Language Models

Category Artificial Intelligence

tldr #

Google DeepMind's robotic transformer model RT-2 almost doubles the success rate of robots performing unfamiliar tasks, from 32 percent to 62 percent. The model learns new skills from text and image data from the internet and builds on the PaLI-X and PaLM-E vision-language models. Robot actions are represented as tokens, the units normally used to represent natural language text in machine learning, so that assigned tasks can be processed quickly.

content #

Ever since ChatGPT exploded onto the tech scene in November of last year, it’s been helping people write all kinds of material, generate code, and find information. It and other large language models (LLMs) have facilitated tasks from fielding customer service calls to taking fast food orders. Given how useful LLMs have been for humans in the short time they’ve been around, how might a ChatGPT for robots impact their ability to learn and do new things? Researchers at Google DeepMind decided to find out and published their findings in a blog post and paper released last week.

Robotic transformer models are trained on enormous datasets containing both language and visual data.

They call their system RT-2. It’s short for robotics transformer 2, and it’s the successor to robotics transformer 1, which the company released at the end of last year. RT-1 was based on a small language and vision program and specifically trained to do many tasks. The software was used in Alphabet X’s Everyday Robots, enabling them to do over 700 different tasks with a 97 percent success rate. But when prompted to do new tasks they weren’t trained for, robots using RT-1 were only successful 32 percent of the time.

Robots using RT-2 successfully perform new tasks 62 percent of the time, compared with 32 percent for robots using RT-1.

RT-2 almost doubles this rate, successfully performing new tasks 62 percent of the time it’s asked to. The researchers call RT-2 a vision-language-action (VLA) model. It uses text and images it sees online to learn new skills. That’s not as simple as it sounds; it requires the software to first "understand" a concept, then apply that understanding to a command or set of instructions, then carry out actions that satisfy those instructions.
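As a quick check of the "almost doubles" claim, the ratio of the two reported success rates can be computed directly. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Success rates on unfamiliar tasks, as reported in the article.
rt1_success = 0.32
rt2_success = 0.62

# Relative improvement: how many times as often RT-2 succeeds.
ratio = rt2_success / rt1_success

# Absolute improvement in percentage points.
point_gain = (rt2_success - rt1_success) * 100

print(f"RT-2 succeeds {ratio:.2f}x as often as RT-1 (+{point_gain:.0f} points)")
```

The ratio comes out just under 2, which is why the improvement is described as "almost" double rather than double.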

RT-2 is built on the PaLI-X and PaLM-E vision-language models.

One example the paper’s authors give is disposing of trash. In previous models, the robot’s software would have to first be trained to identify trash. For example, if there’s a peeled banana on a table with the peel next to it, the bot would be shown that the peel is trash while the banana isn’t. It would then be taught how to pick up the peel, move it to a trash can, and deposit it there.

RT-2 works a little differently, though. Since the model has trained on loads of information and data from the internet, it has a general understanding of what trash is, and though it’s not trained to throw trash away, it can piece together the steps to complete this task.

RT-2 represents robot actions as tokens, the same units used to represent natural language text, so that robots can quickly process the tasks assigned to them.

The LLMs the researchers used to train RT-2 are PaLI-X (a vision and language model with 55 billion parameters) and PaLM-E (what Google calls an embodied multimodal language model, developed specifically for robots, with 12 billion parameters). A "parameter" is a value a machine learning model learns from its training data. In LLMs, parameters encode the relationships between words in a sentence and weigh how likely it is that a given word will be preceded or followed by another word.
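To make the idea of weighing how likely one word is to follow another more concrete, here is a toy bigram model in Python. It is only an illustration under simplifying assumptions: real LLMs such as PaLI-X and PaLM-E learn billions of neural-network weights rather than a table of counts, and the tiny corpus and the function names below are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word is directly followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts


def next_word_probability(counts, current_word, next_word):
    """Estimate P(next_word | current_word) from the bigram counts."""
    total = sum(counts[current_word].values())
    if total == 0:
        return 0.0  # current_word was never seen with a following word
    return counts[current_word][next_word] / total


# Hypothetical three-sentence corpus, for illustration only.
corpus = [
    "throw the peel in the trash",
    "put the banana on the table",
    "the peel is trash",
]
model = train_bigram_model(corpus)
print(next_word_probability(model, "the", "peel"))
```

Here the "parameters" are simply the stored counts. A large model replaces them with learned weights, but the goal of scoring which word is likely to come next is the same.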

Google DeepMind's robotic transformer model RT-2 has the potential to significantly improve the success rate of robots in performing unfamiliar tasks.

Through finding the relationships and patterns between words in a giant dataset, the models learn from their own inferences. They can eventually figure out how different concepts relate to each other and discern context. In RT-2’s case, it translates that knowledge into generalized instructions for robotic actions.

Those actions are represented for the robot as tokens, which are usually used to represent natural language text in machine learning for faster processing. RT-2 converts new tasks it’s asked to do into tokenized robot commands that can be understood by robotic software.
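A minimal sketch of how a continuous robot action might be turned into tokens and back is shown below. It assumes each action dimension is clipped to a known range and split into 256 uniform bins, so that every dimension becomes one integer token. The bin count, the value range, the example command, and the function names are all assumptions made for illustration, not details given in this article.

```python
NUM_BINS = 256  # assumed number of discrete bins per action dimension


def action_to_tokens(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each continuous action value to an integer token in [0, num_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)       # keep the value inside the range
        fraction = (clipped - low) / (high - low)  # rescale to [0, 1]
        token = min(int(fraction * num_bins), num_bins - 1)
        tokens.append(token)
    return tokens


def tokens_to_action(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each token back to the centre of its bin."""
    bin_width = (high - low) / num_bins
    return [low + (token + 0.5) * bin_width for token in tokens]


# Hypothetical 4-dimensional command (for example x, y, z displacement and gripper).
arm_command = [0.10, -0.25, 0.0, 1.0]
tokens = action_to_tokens(arm_command)

# The tokens can be written out as a plain string, like words in a sentence.
print(" ".join(str(token) for token in tokens))

recovered = tokens_to_action(tokens)
```

Because the action is now a short sequence of integers, a language model can emit it the same way it emits text, and the robot's control software only needs to decode the integers back into motor commands. The round trip loses at most half a bin width of precision per dimension.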

Researchers continue to develop more advanced robotic transformer models beyond RT-2, aiming for greater accuracy in robotic tasks.
