Learning Visual Parkour from Generated Images


We generate physically accurate video sequences to train a visual parkour policy for a quadruped robot that uses a single RGB camera and no depth sensor. The robot generalizes to diverse, real-world scenes despite never having seen real-world data.

Abstract

Fast and accurate physics simulation is an essential component of robot learning, where robots can explore failure scenarios that are difficult to produce in the real world and learn from unlimited on-policy data. Yet it remains challenging to incorporate RGB color perception into the sim-to-real pipeline with the richness and realism of the real world. In this work, we train a robot dog in simulation for visual parkour. We propose a way to use generative models to synthesize diverse and physically accurate image sequences of the scene from the robot's ego-centric perspective. We demonstrate zero-shot transfer to real-world, RGB-only observations on a robot equipped with a low-cost, off-the-shelf color camera.

Technical Summary Video

Select a task from the bottom row to view an example unroll from LucidSim, where we show the conditioning images, optical flow, and the resulting training images. Use the slider to scrub through the trajectory.

WARNING! The generated images may be very flashy.

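The rollouts above pair each generated keyframe with the optical flow rendered from the simulator, which is then used to carry the appearance forward across subsequent frames. Below is a minimal sketch of that flow-based warping step, assuming dense flow arrays and OpenCV's `cv2.remap`; the file paths and array layout are illustrative assumptions, not the project's actual pipeline.

```python
# Minimal sketch: warp a generated keyframe with dense optical flow to
# obtain temporally consistent training frames. Illustrative assumptions
# only; file names and array shapes are not from the actual codebase.
import cv2
import numpy as np


def warp_with_flow(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp `image` (H, W, 3) by a dense flow field `flow` (H, W, 2).

    `flow` is treated as the backward flow: the output pixel at (x, y)
    samples the source image at (x + dx, y + dy).
    """
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(image, map_x, map_y,
                     interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)


# Example: propagate one generated keyframe across a short rollout.
keyframe = cv2.imread("generated_keyframe.png")   # hypothetical path
flows = np.load("rendered_flow.npy")              # hypothetical (T, H, W, 2) array
frames = [keyframe]
for t in range(flows.shape[0]):
    frames.append(warp_with_flow(frames[-1], flows[t]))
```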

Parkour Benchmark Environments

In addition to evaluating in the real world, we also evaluate in simulation. This is challenging because building a photorealistic simulator is exactly the problem we set out to solve!

Using 3D Gaussian Splatting, we built a suite of visually complex and diverse evaluation environments designed to test our policy's ability to generalize from simulation to real-world scenarios. Each scene required approximately five hundred color and depth image pairs, plus manual alignment between the reconstructed mesh and the splats, enabling us to render the robot's ego view with the gsplat library.
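As a rough illustration of the rendering step, the sketch below produces a single ego-view RGB image from a reconstructed splat. It assumes gsplat's `rasterization` entry point (gsplat ≥ 1.0); the checkpoint path, parameter keys, and camera values are illustrative assumptions rather than the project's actual code.

```python
# Minimal sketch: render an ego-view RGB image from a reconstructed
# Gaussian splat with the gsplat library. The checkpoint layout and
# camera parameters below are assumptions for illustration.
import torch
from gsplat import rasterization

device = "cuda"
splat = torch.load("scene_splat.pt", map_location=device)  # hypothetical checkpoint

H, W = 480, 640
fx = fy = 320.0
K = torch.tensor([[fx, 0.0, W / 2],
                  [0.0, fy, H / 2],
                  [0.0, 0.0, 1.0]], device=device)

# World-to-camera transform for the robot's ego camera (identity as a placeholder).
viewmat = torch.eye(4, device=device)

colors, alphas, meta = rasterization(
    means=splat["means"],          # (N, 3) Gaussian centers
    quats=splat["quats"],          # (N, 4) orientations
    scales=splat["scales"],        # (N, 3) per-axis scales
    opacities=splat["opacities"],  # (N,) activated opacities
    colors=splat["colors"],        # (N, 3) per-Gaussian RGB
    viewmats=viewmat[None],        # (1, 4, 4) world-to-camera
    Ks=K[None],                    # (1, 3, 3) intrinsics
    width=W,
    height=H,
)
ego_rgb = colors[0].clamp(0, 1)    # (H, W, 3) rendered ego view
```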

Citation

@inproceedings{yu2024lucidsim,
  title={LucidSim: Learning Agile Visual Locomotion from Generated Images},
  author={Alan Yu and Ge Yang and Ran Choi and Yajvan Ravan and John Leonard and Phillip Isola},
  booktitle={8th Annual Conference on Robot Learning},
  year={2024},
}