How PlayStation’s Latest Research Teaches AI to Read Human Actions in 3D

Feature image showing a PlayStation-themed AI graphic with a 3D motion cube, directional arrows, and a camera icon, illustrating how PlayStation’s research teaches AI to read human actions in 3D.

By Jon Scarr

PlayStation has been releasing a steady stream of AI research lately, and a lot of it focuses on one thing: helping models understand how humans act, move, and react to the world around them. Not long ago, I covered this in my article How PlayStation Is Teaching AI To Play Games Smarter With Supervised Contrastive Imitation Learning, where PlayStation’s researchers showed how AI can learn the reasons behind gameplay decisions. That work looked at how an agent can watch gameplay footage and connect each move to the reason behind it.

This new paper explores something different, but still connected. Instead of studying gameplay, it focuses on real-world actions and how objects move in 3D space. The research uses egocentric vision, which is basically first-person video, to teach AI models how people hold, rotate, pick up, and place everyday objects. It also uses third-person footage and text prompts to give the model more context about what’s happening.

The idea is simple: show the AI a lot of examples of real actions, describe those actions in plain text, and let it learn the full 3D motion needed to perform them. Sony’s researchers built this system using large-scale video datasets and extracted thousands of hand-object interactions to create training material. The result is a model that can read an action description and generate a detailed 6DoF movement path (meaning full position and rotation in 3D space) that lines up with what a person would do.
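The paper’s exact data format isn’t reproduced here, but a 6DoF trajectory is commonly represented as a time-ordered list of poses, each pairing a 3D position with a rotation (a unit quaternion in the sketch below). A minimal Python sketch, with illustrative field names of my own:

```python
from dataclasses import dataclass

@dataclass
class Pose6DoF:
    """One sample of an object's full 6DoF state: 3D position plus rotation."""
    x: float
    y: float
    z: float
    # Rotation stored as a unit quaternion (w, x, y, z); identity = no rotation.
    qw: float
    qx: float
    qy: float
    qz: float

# A trajectory is just a time-ordered sequence of poses.
# Toy example: a cup lifted 20 cm straight up, with no rotation.
trajectory = [
    Pose6DoF(0.0, 0.0, 0.02 * t, 1.0, 0.0, 0.0, 0.0)
    for t in range(11)
]

assert len(trajectory) == 11
assert abs(trajectory[-1].z - 0.2) < 1e-9
```

Position alone would only be 3DoF; it’s the added rotation component that captures details like a wrist turning a cup while lifting it.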

It’s a very different problem than the gameplay-focused work I covered earlier, but it still fits into the same bigger picture. Sony is exploring how AI can learn from human behavior in more natural ways, whether that’s in games or everyday interactions.

What the Research Is About

The paper is titled Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision, and it brings together researchers from Sony Interactive Entertainment, Kyoto University, and the National Institute of Informatics. At its core, the goal is to teach an AI model how real objects move when handled by a person. Instead of looking at gameplay, this work focuses on everyday physical actions.

To do that, the team uses something called egocentric vision. It’s a fancy way of saying the model learns from video recorded from a first-person perspective, similar to a head-mounted camera. This gives the AI a clear view of how hands approach an object, how it rotates, and how the full movement plays out. The researchers pair this first-person footage with third-person video to give the model more context about the scene.

They also include simple text descriptions of each action, like “pick up the bowl” or “place the cup on the table.” Each description is tied to a full 6DoF trajectory, which includes the object’s position and rotation in 3D space. The model uses this combination of video and text to learn what each action looks like physically.

To train everything, the team pulls examples from large datasets such as Ego4D and Ego-Exo4D, which include thousands of clips showing people interacting with real objects. They then extract the exact motion paths from these clips, building a massive collection of demonstrations without having to record each one manually. According to the paper, this approach creates a diverse set of examples across many objects and verbs.
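To picture how such a collection stays diverse across objects and verbs, here is a minimal sketch of grouping annotated clips by verb-object pairs. The records and field names below are illustrative stand-ins, not the actual annotation schema of these datasets:

```python
from collections import defaultdict

# Hypothetical annotation records, loosely mimicking narration-style labels
# from a large egocentric video dataset.
clips = [
    {"id": "c1", "verb": "pick-up", "noun": "bowl"},
    {"id": "c2", "verb": "place", "noun": "cup"},
    {"id": "c3", "verb": "pick-up", "noun": "cup"},
    {"id": "c4", "verb": "rotate", "noun": "bottle"},
]

# Group clips by (verb, noun) so coverage across actions and objects is visible.
by_action = defaultdict(list)
for clip in clips:
    by_action[(clip["verb"], clip["noun"])].append(clip["id"])

assert len(by_action) == 4  # four distinct verb-object pairs
```

Grouping like this makes it easy to check that the mined demonstrations aren’t dominated by a single action or object type before training.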

The end result is a model that can read an action description and generate a complete motion trajectory that matches how a person would perform that task. It’s about capturing the fine details of human-object interaction in a way that can be reused for research, robotics, or future studies in AI movement understanding.

TL;DR: Sony’s research teaches AI to understand how real objects move by pairing first-person video with simple text descriptions. The model learns full 3D motion paths for everyday actions using thousands of real-world examples.

Workflow from the research paper showing how the model extracts 3D object trajectories from first-person video: action localization, object segmentation and tracking, point cloud generation, and rotation sequence extraction.
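The steps named in that workflow can be sketched as a per-frame loop. Every function below is a hypothetical stand-in running on dummy data, not the paper’s implementation, but it shows how the stages chain together:

```python
# Hypothetical sketch of the extraction workflow: segment the manipulated
# object in each frame, lift it into 3D, then estimate a 6DoF pose per frame.

def segment_object(frame):
    # Real pipelines run a segmentation/tracking model; here the "mask" is
    # just the frame value itself.
    return frame

def lift_to_point_cloud(frame, mask):
    # Real pipelines combine the mask with depth estimates; here we fabricate
    # a one-point cloud from the dummy mask value.
    return [(float(mask), 0.0, 0.0)]

def estimate_pose(cloud):
    # Real pipelines fit a full 6DoF pose to the cloud; here we take the
    # first point as the position and assume an identity rotation.
    x, y, z = cloud[0]
    return {"pos": (x, y, z), "rot": (1.0, 0.0, 0.0, 0.0)}

def extract_trajectory(frames):
    """Chain the stages per frame to produce a 6DoF pose sequence."""
    return [
        estimate_pose(lift_to_point_cloud(f, segment_object(f)))
        for f in frames
    ]

path = extract_trajectory([0, 1, 2])
assert len(path) == 3
```

The key takeaway is structural: each video frame flows through segmentation, 3D lifting, and pose estimation, and the per-frame poses stack up into the trajectory the model trains on.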

How It Connects to PlayStation’s Previous AI Work

This research sits in a different space than the gameplay-focused work I covered earlier, but the two projects line up in an interesting way. In that earlier article, PlayStation’s engineers explored how AI can learn why certain gameplay decisions happen by watching how humans play. The idea was to help an agent understand the reasoning behind each move instead of just copying inputs. It focused on reading on-screen situations and connecting them to the actions a human player chose to take.

This new paper tackles a different type of understanding. Instead of looking at gameplay choices, it studies how physical actions unfold in the real world. The AI model isn’t watching a character dodge an attack or time a jump. Instead, it’s watching how a person reaches for a cup, rotates their wrist, or places an object on a surface. The goal is to learn the full 3D movement behind that action.

Together, the two projects show a broader theme in Sony’s research. One model learns human intent inside a game. The other learns human motion in everyday tasks. They deal with different problems, but both try to build AI systems that learn from human demonstrations in a more natural way.

This connection makes the research easy to follow even if the subjects sound technical at first. Whether it’s understanding a risky dodge in Returnal or a simple “pick up the bowl” action in first-person video, both projects try to help AI move beyond raw imitation. They each look for the context behind what a person is doing, which is what gives the learning process more meaning.

Side-by-side comparison from Astro Bot showing baseline AI versus Supervised Contrastive Imitation Learning, illustrating how the SCIL model reaches the checkpoint more effectively.

Quick check-in:

If you’re still with me, awesome. This stuff can sound heavy at first, but the big picture is simple. Sony is trying to teach AI to understand what we do, whether we’re playing a game or just picking up a cup. If anything feels unclear so far, you’re not alone. I had to read a few parts twice myself.

Breaking Down the Model

At a high level, the model takes in three things: a short text description, a video example of the action, and the extracted 3D movement path. From there, it learns how the motion plays out in full detail. The text gives the AI a clear label for what’s happening, while the video shows how the hands and objects move. The 6DoF trajectory ties everything together by mapping the complete position and rotation of the object throughout the action.

Once the model has seen enough examples, it can generate a new motion path from the text description alone. So if you feed it something like “pick up the cup,” it creates a trajectory showing how the object moves through space based on everything it learned. The point isn’t to create anything fancy. It’s simply trying to match the physical movement that a person would make in similar situations.
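That input/output contract can be sketched as a single function: text in, pose sequence out. The function below is a stub that emits a placeholder straight-line path, standing in for the learned model rather than implementing it:

```python
# Hypothetical interface sketch: the real system is a trained network; this
# stub only illustrates the contract (action description in, 6DoF path out).
def generate_trajectory(description: str, steps: int = 10):
    """Return a list of (position, quaternion) poses for an action description.

    A trained model would condition on the text (and optionally video
    context); this placeholder just emits a short straight-line lift.
    """
    return [
        ((0.0, 0.0, 0.01 * t), (1.0, 0.0, 0.0, 0.0))  # position, rotation
        for t in range(steps)
    ]

path = generate_trajectory("pick up the cup")
assert len(path) == 10
```

Framing generation this way also makes evaluation straightforward: a predicted path can be compared pose-by-pose against the trajectory extracted from a real demonstration.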

The researchers trained two versions of the model. One focuses on visual information from the video clips, while the other uses point cloud data to understand the scene in 3D. Both approaches help the AI understand how objects change orientation or move through space. They then tested the models using the HOT3D dataset, which contains detailed examples of human-object interactions in a first-person view.

The results showed that the models could reliably generate realistic motion paths that lined up with the action descriptions. It’s still research-level work, but the performance is consistent enough to act as a baseline for future studies. The main accomplishment is creating a system that ties together text, video, and 3D trajectories in a way that allows the AI to recreate those motions from simple descriptions.

TL;DR: The model learns real object movements by combining text, first-person video, and 3D trajectory data. Once trained, it can generate full 6DoF motion paths from simple action descriptions.

First-person images showing Sony’s AI generating 3D object manipulation trajectories, including picking up a phone, stirring a bowl, and lifting a plate, with colored motion paths visualized over each frame.
Here’s what the model actually “sees” and how it maps out real object movements from short action descriptions.

Why PlayStation Would Be Interested in This Research

While this paper focuses on real object handling instead of gameplay, it still fits the kind of long-term research Sony often does in interaction, animation, and spatial understanding. AI models that can read human-object movement are useful in many areas because they help systems understand how actions unfold in 3D space. They also make it easier to study how people perform everyday tasks without having to record each motion by hand.

For example, research like this is commonly used in fields such as robotics and VR, where models need to understand how an object should move when a person picks it up, rotates it, or sets it down. Having a detailed 6DoF trajectory makes it easier to analyze those movements or generate new ones for testing. It also helps with studying natural motion patterns, since the examples come directly from first-person video instead of scripted animations.

Another reason work like this matters is that it builds large training datasets automatically. By extracting trajectories from massive video collections, the researchers avoid the need for manual motion capture sessions or hand-labeled examples. That makes the research more flexible and opens the door to a wider range of possible actions.

It’s a very different type of study than the imitation learning project I covered earlier, but both point toward the same idea. Sony is looking at how AI can understand different aspects of human behavior, whether that means reading decisions in a fast-paced game or tracking how objects move in real life. Each project shows a different angle on the same broader goal of learning from human demonstrations.

How This Study Complements PlayStation’s Other AI Projects

What makes this research interesting is how much it manages to capture from simple inputs. By combining first-person video, third-person views, text descriptions, and extracted 3D trajectories, the team builds a training setup that doesn’t rely on scripted animations or hand-recorded motion data. Instead of manually crafting examples, they pull thousands of real actions from large video datasets and turn them into a flexible source of demonstrations. This scale gives the model a wide range of object types, motion patterns, and natural variations to learn from.

It also stands out because it creates a baseline for a task that hasn’t been explored much at this level. Generating full 6DoF object movements from text alone is a specific challenge, and the paper sets up a clear foundation for others to build on. It shows that pairing language with physical motion data is not only possible but reliable enough for structured testing.

When you compare this to the imitation learning project I covered earlier, it becomes easier to see a theme. One paper studies how AI can understand why a gameplay decision happens, while this one focuses on how a real-world action unfolds. Each project looks at a different part of human behavior, but both aim to teach AI through natural demonstrations rather than strict rules.

It’s not about predicting where the technology goes. It’s about watching how Sony’s researchers explore the pieces that help AI understand context, motion, and intent. Even though the topics are different, both studies highlight the same idea: learning from how people actually act, whether that’s dodging an attack or placing a cup on a table.