By Jon Scarr
PlayStation has been releasing a steady stream of AI research lately, and a lot of it focuses on one thing: helping models understand how humans act, move, and react to the world around them. Not long ago, I covered this in my article How PlayStation Is Teaching AI To Play Games Smarter With Supervised Contrastive Imitation Learning, where PlayStation’s researchers showed how AI can learn the reasons behind gameplay decisions. That work looked at how an agent can watch gameplay footage and connect each move to the reason behind it.
This new paper explores something different, but still connected. Instead of studying gameplay, it focuses on real-world actions and how objects move in 3D space. The research uses egocentric vision, which is basically first-person video, to teach AI models how people hold, rotate, pick up, and place everyday objects. It also uses third-person footage and text prompts to give the model more context about what’s happening.
The idea is simple: show the AI a lot of examples of real actions, describe those actions in plain text, and let it learn the full 3D motion needed to perform them. Sony’s researchers built this system using large-scale video datasets and extracted thousands of hand-object interactions to create training material. The result is a model that can read an action description and generate a detailed 6DoF movement path (meaning full position and rotation in 3D space) that lines up with what a person would do.
It’s a very different problem from the gameplay-focused work I covered earlier, but it still fits into the same bigger picture. Sony is exploring how AI can learn from human behavior in more natural ways, whether that’s in games or everyday interactions.
What the Research Is About
The paper is titled Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision, and it brings together researchers from Sony Interactive Entertainment, Kyoto University, and the National Institute of Informatics. At its core, the goal is to teach an AI model how real objects move when handled by a person. Instead of looking at gameplay, this work focuses on everyday physical actions.
To do that, the team uses something called egocentric vision. It’s a fancy way of saying the model learns from video recorded from a first-person perspective, similar to a head-mounted camera. This gives the AI a clear view of how hands approach an object, how it rotates, and how the full movement plays out. The researchers pair this first-person footage with third-person video to give the model more context about the scene.
They also include simple text descriptions of each action, like “pick up the bowl” or “place the cup on the table.” Each description is tied to a full 6DoF trajectory, which includes the object’s position and rotation in 3D space. The model uses this combination of video and text to learn what each action looks like physically.
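To make “6DoF trajectory” a bit more concrete, here is a minimal sketch of how such a trajectory could be represented in code. The class name, the quaternion convention, and the toy “lift” motion are my own illustrative choices, not anything taken from the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Pose6DoF:
    """One sample of an object's pose: 3 position + 3 rotation DoF."""
    x: float
    y: float
    z: float
    # Rotation stored as a unit quaternion (w, qx, qy, qz); Euler angles
    # or rotation matrices are equally valid parameterizations.
    w: float
    qx: float
    qy: float
    qz: float

def lift_object(start: Pose6DoF, height: float, steps: int) -> List[Pose6DoF]:
    """Toy trajectory: raise an object straight up, keeping rotation fixed.
    A real generated trajectory would vary both position and rotation."""
    return [
        Pose6DoF(start.x, start.y, start.z + height * i / steps,
                 start.w, start.qx, start.qy, start.qz)
        for i in range(steps + 1)
    ]

# One pose per video frame: 11 samples tracing a 0.2 m vertical lift.
traj = lift_object(Pose6DoF(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0),
                   height=0.2, steps=10)
print(len(traj), traj[-1].z)  # 11 0.2
```

The point is simply that a trajectory is a time-ordered list of these pose samples, which is what lets a text prompt like “pick up the bowl” map to an actual 3D motion.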
To train everything, the team pulls examples from large datasets such as Ego4D and Ego-Exo4D, which include thousands of clips showing people interacting with real objects. They then extract the exact motion paths from these clips, building a massive collection of demonstrations without having to record each one manually. According to the paper, this approach creates a diverse set of examples across many objects and verbs.
The end result is a model that can read an action description and generate a complete motion trajectory that matches how a person would perform that task. It’s about capturing the fine details of human-object interaction in a way that can be reused for research, robotics, or future studies in AI movement understanding.
TL;DR: Sony’s research teaches AI to understand how real objects move by pairing first-person video with simple text descriptions. The model learns full 3D motion paths for everyday actions using thousands of real-world examples.
*Workflow from the research paper showing how the model extracts 3D object trajectories from first-person video using segmentation, tracking, point clouds, and projection.*
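The segmentation → tracking → point-cloud workflow described above can be illustrated with a toy version of the extraction step. Everything here is a simplifying assumption on my part: the helper names, the made-up camera intrinsics, and the centroid-as-position estimate stand in for the paper’s far more involved pipeline, which also recovers rotation:

```python
from statistics import mean

def backproject(mask_pixels, depths, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Turn segmented 2D pixels plus per-pixel depth into a 3D point
    cloud using a pinhole camera model (intrinsics are placeholders)."""
    return [((u - cx) * d / fx, (v - cy) * d / fy, d)
            for (u, v), d in zip(mask_pixels, depths)]

def object_position(points):
    """Toy pose estimate: take the point-cloud centroid as the object's
    3D position for that frame."""
    xs, ys, zs = zip(*points)
    return (mean(xs), mean(ys), mean(zs))

# Segmentation masks for the same object tracked across two frames
# (hand-written toy data: the object moves up and toward the camera).
frames = [
    {"pixels": [(320, 240), (322, 240)], "depths": [1.0, 1.0]},
    {"pixels": [(320, 230), (322, 230)], "depths": [0.9, 0.9]},
]

# segmentation -> tracking -> point cloud -> per-frame 3D position
trajectory = [object_position(backproject(f["pixels"], f["depths"]))
              for f in frames]
print(trajectory[0][2], trajectory[1][2])  # 1.0 0.9
```

Stringing those per-frame positions (and, in the real system, rotations) together is what produces the 6DoF motion paths the model trains on.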
How It Connects to PlayStation’s Previous AI Work
This research sits in a different space than the gameplay-focused work I covered earlier, but the two projects line up in an interesting way. In that earlier article, PlayStation’s engineers explored how AI can learn why certain gameplay decisions happen by watching how humans play. The idea was to help an agent understand the reasoning behind each move instead of just copying inputs. It focused on reading on-screen situations and connecting them to the actions a human player chose to take.
This new paper tackles a different type of understanding. Instead of looking at gameplay choices, it studies how physical actions unfold in the real world. The AI model isn’t watching a character dodge an attack or time a jump. Instead, it’s watching how a person reaches for a cup, rotates their wrist, or places an object on a surface. The goal is to learn the full 3D movement behind that action.
Together, the two projects show a broader theme in Sony’s research. One model learns human intent inside a game. The other learns human motion in everyday tasks. They deal with different problems, but both try to build AI systems that learn from human demonstrations in a more natural way.
This connection makes the research easy to follow even if the subjects sound technical at first. Whether it’s understanding a risky dodge in Returnal or a simple “pick up the bowl” action in first-person video, both projects try to help AI move beyond raw imitation. They each look for the context behind what a person is doing, which is what gives the learning process more meaning.
Quick check-in:
If you’re still with me, awesome. This stuff can sound heavy at first, but the big picture is simple. Sony is trying to teach AI to understand what we do, whether we’re playing a game or just picking up a cup. If anything feels unclear so far, you’re not alone. I had to read a few parts twice myself.
Breaking Down the Model
*Here’s what the model actually “sees” and how it maps out real object movements from short action descriptions.*



